# Exploring schedules for incremental and annealing quantization algorithms

## Latest revision as of 20:22, 24 September 2021

## Introduction

Empirical evidence suggests that quantisation-aware training algorithms are necessary when targeting aggressive quantisation schemes (1-bit, 2-bit). Most algorithms in this family interpret the quantisation problem as an approximation task, where the quantised network is obtained either by projecting a full-precision network onto a constrained space [1, 2], or by progressively “hardening” a relaxed version of a quantised neural network (QNN) towards its natural discrete definition [3, 4]. A central idea in some of these algorithms is to achieve quantisation progressively. The rationale is to allow the full-precision part of the network to compensate for the error introduced at each step of the quantisation process.

For instance, the incremental network quantisation algorithm (INQ) [2] defines a partition P = {p_{1}, …, p_{T}} of the weight space of a given network topology, and assigns to each of its elements p_{i} an integer t_{i} representing its quantisation epoch. The algorithm starts by training a full-precision network; whenever a quantisation epoch t_{i} is reached, the weights in p_{i} are projected onto the corresponding quantised weight space. These weights are then frozen, whereas the weights which have not yet been quantised can still adapt to compensate for the error introduced. In this way, once the last of the quantisation epochs has been reached, all the network’s weights are quantised.
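
The INQ procedure described above can be sketched as a toy loop in plain Python. This is only an illustration under stated assumptions, not the reference implementation of [2]: the ternary level set `LEVELS`, the helper names `project`, `inq_train` and `noisy_update`, and the random perturbation standing in for a real training epoch are all hypothetical.

```python
import random

LEVELS = (-1.0, 0.0, 1.0)  # illustrative ternary weight space (assumption)


def project(w, levels=LEVELS):
    """Map a full-precision weight to the nearest quantisation level."""
    return min(levels, key=lambda q: abs(q - w))


def inq_train(weights, partition, quant_epochs, num_epochs, update):
    """Toy INQ-style loop.

    partition[i] lists the weight indices in p_i and quant_epochs[i] is t_i.
    `update(weights, trainable)` stands in for one training epoch and must
    only modify the indices listed in `trainable`.
    """
    weights = list(weights)
    frozen = set()
    for epoch in range(1, num_epochs + 1):
        # Only the weights that are not yet quantised keep training.
        trainable = [j for j in range(len(weights)) if j not in frozen]
        update(weights, trainable)
        # Project and freeze every group whose quantisation epoch is reached.
        for group, t_i in zip(partition, quant_epochs):
            if t_i == epoch:
                for j in group:
                    weights[j] = project(weights[j])
                frozen.update(group)
    return weights, frozen


rng = random.Random(0)


def noisy_update(weights, trainable):
    """Placeholder for a gradient step: a small random perturbation."""
    for j in trainable:
        weights[j] += rng.uniform(-0.05, 0.05)


final, frozen = inq_train(
    weights=[0.9, -0.2, 0.4, -1.3],
    partition=[[0, 3], [1, 2]],   # p_1 and p_2
    quant_epochs=[2, 4],          # t_1 and t_2
    num_epochs=4,
    update=noisy_update,
)
print(final, sorted(frozen))
```

The scheduling hyper-parameters in this sketch are exactly `partition` and `quant_epochs`: the training loop itself has no other knob that decides when each group is hardened.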

Another example is the additive noise annealing algorithm (ANA) [3]. In this case, the target QNN is regularised by adding noise to its parameters, which allows gradients and updates to be computed. To each parameter w_{j}, j ∊ {1, …, N}, an annealing schedule f_{j}(t) is attached, governing the amount of noise applied to that parameter. By ensuring that all the schedules {f_{j}(t)}_{j ∊ {1, …, N}} are decreasing and that they reach zero by the final training epoch, each regularised layer is sharpened until it converges to the quantised counterpart that will be implemented on the real hardware. The annealing prioritises the lower layers and then proceeds to the upper layers. The original intuitive rationale was that the features of the lower layers should stabilise before those of the upper layers, allowing the latter to adapt to the hardening of the former. This strategy was later also motivated by theoretical results [5].
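
A minimal sketch of such layer-wise annealing schedules follows. It rests on several assumptions that are not taken from [3]: the noise is uniform on [-beta, beta], the schedules are piecewise linear (non-increasing rather than strictly decreasing), all parameters of a layer share one schedule, and the names `linear_schedule`, `layer_schedules` and `expected_sign` are hypothetical. Under the uniform-noise assumption, the expected value of sign(w + noise) is w / beta clipped to [-1, 1], which shows how the relaxed function hardens towards the sign function as beta goes to zero.

```python
def linear_schedule(t, t_start, t_end, beta0=1.0):
    """Noise amplitude: beta0 until t_start, then linear decay to 0 at t_end."""
    if t <= t_start:
        return beta0
    if t >= t_end:
        return 0.0
    return beta0 * (t_end - t) / (t_end - t_start)


def layer_schedules(num_layers, num_epochs):
    """One schedule per layer; lower layers finish annealing earlier."""
    schedules = []
    for layer in range(num_layers):
        t_start = num_epochs * layer / num_layers
        t_end = num_epochs * (layer + 1) / num_layers
        # Bind the window as default arguments so each lambda keeps its own.
        schedules.append(
            lambda t, a=t_start, b=t_end: linear_schedule(t, a, b)
        )
    return schedules


def expected_sign(w, beta):
    """E[sign(w + nu)] for nu ~ Uniform(-beta, beta); sign(w) if beta == 0."""
    if beta == 0.0:
        return (w > 0) - (w < 0)
    return max(-1.0, min(1.0, w / beta))


schedules = layer_schedules(num_layers=3, num_epochs=30)
for epoch in (0, 5, 15, 25, 30):
    betas = [f(epoch) for f in schedules]
    print(epoch, betas, [expected_sign(0.25, b) for b in betas])
```

In this parametrisation, the tunable quantities are the start and end epoch of each layer's annealing window and the initial amplitude `beta0`.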

Although these algorithms have shown promising results, the tuning of their scheduling hyper-parameters is not yet well understood and requires tedious, time-consuming iterative searches. When a learning algorithm is governed by hyper-parameters (e.g., the learning rate and the momentum in stochastic gradient descent), an appealing possibility is to use machine learning systems to learn these hyper-parameters automatically, a process called meta-learning. For instance, this approach has recently been applied to learning the hyper-parameters governing stochastic gradient descent [6].

## Project description

In this project, we will start by designing suitable parametric models to describe the scheduling processes of the INQ and ANA algorithms. Then, you will be in charge of implementing the experiments and collecting performance data on hand-coded approaches. Subsequently, we will analyse your findings and, if possible, derive heuristic rules and design meta-learning algorithms that improve the effectiveness and efficiency of the INQ and ANA algorithms.
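
To give an idea of what a parametric model of a schedule could look like, the sketch below maps two numbers to a full list of INQ quantisation epochs. The class `PowerSchedule`, its fields and the power-law form are hypothetical choices made for illustration only; they are not the models that the project will necessarily adopt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSchedule:
    """Hypothetical two-parameter family of INQ quantisation epochs."""

    first_epoch: int  # epoch at which the first group is quantised
    gamma: float      # > 1 clusters epochs near first_epoch, < 1 near the end

    def epochs(self, num_groups, num_epochs):
        """Return one quantisation epoch per group, in non-decreasing order."""
        span = num_epochs - self.first_epoch
        out = []
        for i in range(num_groups):
            # Relative position of group i in [0, 1], warped by gamma.
            if num_groups > 1:
                frac = (i / (num_groups - 1)) ** self.gamma
            else:
                frac = 1.0
            out.append(self.first_epoch + round(span * frac))
        return out


uniform = PowerSchedule(first_epoch=10, gamma=1.0)
print(uniform.epochs(num_groups=5, num_epochs=50))
```

With such a model, a manual search or a meta-learning algorithm only has to explore `first_epoch` and `gamma` instead of choosing every quantisation epoch independently.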

If time remains, we could also consider deploying some of the models trained with the improved algorithms on ternary network accelerators developed by other members of the IIS team.

### References

[1] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” ECCV 2016.

[2] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: towards lossless CNNs with low-precision weights,” ICLR 2017.

[3] M. Spallanzani, L. Cavigelli, G. P. Leonardi, M. Bertogna, and L. Benini, “Additive noise annealing and approximation properties of quantized neural networks,” arXiv:1905.10452, 2019.

[4] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling, “Relaxed quantization for discretized neural networks,” ICLR 2019.

[5] G. P. Leonardi and M. Spallanzani, “Analytical aspects of non-differentiable neural networks,” arXiv:2011.01858, 2020.

[6] N. Maheswaranathan, D. Sussillo, L. Metz, R. Sun, and J. Sohl-Dickstein, “Reverse engineering learned optimizers reveals known and novel mechanisms,” arXiv:2011.02159, 2020.

## Skills and project character

### Skills

Required:

- Algorithms & data structures
- Python programming
- Fundamental concepts of deep learning (convolutional neural networks, backpropagation, computational graphs)

Optional:

- Knowledge of the PyTorch deep learning framework
- C/C++ programming

### Project character

- 20% Theory
- 80% Deep learning

## Logistics

The student and the advisor will meet on a weekly basis to check the progress of the project, clarify doubts, and decide the next steps. The schedule of this weekly update meeting will be agreed at the beginning of the project by both parties. Of course, additional meetings can be organised to address urgent issues.

At the end of the project, you will have to present your work in a 15-minute talk in front of the IIS team and defend it during the following 5-minute discussion.

## Professor

## Status: Completed

Student: Sebastian Bensland

Supervisors: Matteo Spallanzani spmatteo@iis.ee.ethz.ch, Lukas Cavigelli (Huawei RC Zuerich)