# Quantization

This document outlines the design of the MLIR quantization system. While the
term "quantization" is highly overloaded, in this case, it refers to a fairly
narrow scope of techniques in use to enable conversion of floating-point
computations to corresponding and plausible variants expressed in integer math
for inference, as has historically been supported by low-bit depth inference
engines such as TFLite, various accelerator hardware, and many DSPs.

Much of this is inspired by the approach taken
[in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
adaptations folded in. It specifically documents the positions that MLIR has
taken on the topic, and is not a general reference.

[TOC]

## Uniform quantization

The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on the
[Real](https://en.wikipedia.org/wiki/Real_number) number line.

Further, the scheme can be applied:

*   *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each index
    along a specific axis of a tensor type.
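
As a rough illustration of the difference, the following Python sketch quantizes a small nested list either with one shared scale or with one scale per row. The helper names (`quantize_per_layer`, `quantize_per_axis`), the choice of axis 0, and the sample numbers are assumptions made for this example and are not part of MLIR.

```python
def quantize_per_layer(rows, scale):
    """Divide every element by one shared scale and round to the nearest integer."""
    return [[round(x / scale) for x in row] for row in rows]


def quantize_per_axis(rows, scales):
    """Use one scale per index along axis 0 (that is, one scale per row)."""
    if len(rows) != len(scales):
        raise ValueError("need exactly one scale per row")
    return [[round(x / s) for x in row] for row, s in zip(rows, scales)]


# Hypothetical weights whose rows have very different magnitudes.
weights = [[0.5, -1.0], [8.0, 16.0]]

per_layer = quantize_per_layer(weights, scale=0.5)
per_axis = quantize_per_axis(weights, scales=[0.5, 8.0])

print(per_layer)  # [[1, -2], [16, 32]]
print(per_axis)   # [[1, -2], [1, 2]]
```

With per-axis scales, the second row fits into a much smaller integer range than it does under the single shared scale.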

### Fixed point values

A [fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) value is a
[Real](https://en.wikipedia.org/wiki/Real_number) number divided by a *scale*.
We will call the result of this division the *scaled value*.

$$ real\_value = scaled\_value * scale $$

The scale can be interpreted as the distance, in real units, between neighboring
scaled values. For example, if the scale is $$ \pi $$, then fixed point values
with this scale can only represent multiples of $$ \pi $$, and nothing in
between. The maximum rounding error to convert an arbitrary Real to a fixed
point value with a given $$ scale $$ is $$ \frac{scale}{2} $$. Continuing the
previous example, when $$ scale = \pi $$, the maximum rounding error will be $$
\frac{\pi}{2} $$.
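
The rounding bound can be checked numerically. The Python sketch below is only an illustration: the helper names and the sample inputs are assumptions, and Python floats (double precision) stand in for Real numbers.

```python
import math


def to_fixed_point(real_value, scale):
    """Return the scaled value (an integer) nearest to real_value / scale."""
    return round(real_value / scale)


def from_fixed_point(scaled_value, scale):
    """Recover the real value represented by a scaled value."""
    return scaled_value * scale


scale = math.pi
samples = (0.0, 1.0, 1.5, 4.0, 10.0)  # arbitrary sample inputs

# Absolute rounding error for each sample after a round trip.
errors = [
    abs(x - from_fixed_point(to_fixed_point(x, scale), scale))
    for x in samples
]

# Every error stays within half a scale step.
print(max(errors) <= scale / 2)
```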

Multiplication can be performed on scaled values with different scales, using
the same algorithm as multiplication of real values (note that the product scaled
value has $$ scale_{product} = scale_{left \mbox{ } operand} * scale_{right
\mbox{ } operand} $$). Addition can be performed on scaled values, so long as
they have the same scale, using the same algorithm as addition of real values.
This makes it convenient to represent scaled values on a computer as signed
integers, and perform arithmetic on those signed integers, because the results
will be correct scaled values.
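
A minimal Python sketch of these two rules follows. The function names and the operand values are hypothetical, and the scales were chosen as powers of two so that the floating point checks are exact.

```python
def multiply(left, left_scale, right, right_scale):
    """Multiply two scaled values; the product scale is the product of the scales."""
    return left * right, left_scale * right_scale


def add(left, right, scale):
    """Add two scaled values that share one scale; the scale is unchanged."""
    return left + right, scale


# 1.5 stored with scale 0.5 is 3; 0.75 stored with scale 0.25 is also 3.
product, product_scale = multiply(3, 0.5, 3, 0.25)
print(product * product_scale)  # 1.125, which equals 1.5 * 0.75

# 1.5 and 2.5, both stored with scale 0.5, are 3 and 5.
total, total_scale = add(3, 5, 0.5)
print(total * total_scale)  # 4.0, which equals 1.5 + 2.5
```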

### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point*, to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Alternatively (and equivalently), subtracting a zero point from an affine value results in a
scaled value:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Essentially, affine values are a shift of the scaled values by some constant
amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
cannot, in general, be directly performed on affine values; they must first be
[converted](#affine-to-fixed-point) to the equivalent scaled values.

As alluded to above, the motivation for using affine values is to more
efficiently represent real values that will actually be encountered during
computation. Frequently, real values that will be encountered are not
symmetric around the real zero. We also make the assumption that the real zero
is encountered during computation, and should thus be represented.

In this case, it is inefficient to store scaled values represented by signed
integers, as some of the signed integers will never be used. In effect, the bit patterns
corresponding to those signed integers are going to waste.

In order to exactly represent the real zero with an integral-valued affine
value, the zero point must be an integer between the minimum and maximum affine
value (inclusive). For example, given an affine value represented by an 8 bit
unsigned integer, we have: $$ 0 \leq zero\_point \leq 255$$. This is important,
because in convolution-like operations of deep neural networks, we frequently
need to zero-pad inputs and outputs, so zero must be exactly representable, or
the result will be biased.
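
The effect of an integral zero point can be seen in a small Python sketch. Everything here is an assumption made for illustration: the helper names, the 8-bit unsigned storage range, the scale of 0.1, and the deliberately non-integral zero point of 127.5 used as a counterexample.

```python
def quantize(real_value, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to the nearest affine value in [qmin, qmax]."""
    affine_value = round(real_value / scale + zero_point)
    return max(qmin, min(qmax, affine_value))


def dequantize(affine_value, scale, zero_point):
    """Map an affine value back to the real value it represents."""
    return (affine_value - zero_point) * scale


scale = 0.1

# Integral zero point: real zero maps to affine 128 and back to exactly 0.0.
exact_zero = dequantize(quantize(0.0, scale, 128), scale, 128)

# Non-integral zero point: real zero cannot be stored exactly, so padding
# with "zero" would inject a small non-zero value into the computation.
biased_zero = dequantize(quantize(0.0, scale, 127.5), scale, 127.5)

print(exact_zero)
print(biased_zero)
```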

### Relation

Real values, fixed point values, and affine values relate through the following
equation, which demonstrates how to convert one type of number to another:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Note that computers generally store mathematical values using a finite number of
bits. Thus, while the above conversions are exact, to store the result in a
finite number of bits, we must, in general, round the result of the conversion
(this applies to both cases: storing using floating point and storing using
fixed point). A full discussion of rounding behavior is outside the
scope of this document; unless otherwise stated, it is safe to assume that
rounding follows the IEEE 754 default of round-to-nearest, ties-to-even (RNE),
where hardware permits.

### Converting between real and fixed point or affine

To convert a real value to a fixed point value, we must know the scale. To
convert a real value to an affine value, we must know the scale and the zero point.

#### Real to affine

To convert an input tensor of real-valued elements (usually represented by a
floating point format, frequently
[Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
to a tensor of affine elements represented by an integral type (e.g. 8-bit
unsigned integer), the following conversion can be performed (note that it is
not required that all representable values of the integral type are used):

$$
\begin{align*}
af&fine\_value_{uint8 \, or \, uint16} \\
      &= clampToTargetSize(roundToNearestInteger( \frac{real\_value_{Single}}{scale_{Single}})_{sint32} + zero\_point_{uint8 \, or \, uint16})
\end{align*}
$$

In the above, we assume that $$real\_value$$ is a Single, $$scale$$ is a Single,
$$roundToNearestInteger$$ returns a signed 32-bit integer, and $$zero\_point$$
is an unsigned 8-bit or 16-bit integer. Note that the bit depths and numbers of
fixed point values shown here are indicative of common types on typical
hardware; the scheme is not constrained to particular bit depths, nor does it
require that the entire range of an N-bit integer be used.
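
The conversion above can be sketched in plain Python. This is an illustration under stated assumptions rather than MLIR code: `real_to_affine` is a hypothetical helper, Python floats are double precision rather than Single, Python integers are unbounded rather than 32-bit, and the built-in `round` is used because it rounds ties to even.

```python
def real_to_affine(real_values, scale, zero_point, storage_bits=8):
    """Quantize real values to unsigned affine values of the given bit width."""
    qmin, qmax = 0, (1 << storage_bits) - 1
    if not qmin <= zero_point <= qmax:
        raise ValueError("zero_point must be representable in the storage type")
    affine_values = []
    for real_value in real_values:
        rounded = round(real_value / scale)            # roundToNearestInteger
        shifted = rounded + zero_point                 # add the zero point
        affine_values.append(max(qmin, min(qmax, shifted)))  # clampToTargetSize
    return affine_values


affine = real_to_affine([-1.0, 0.0, 0.5, 100.0], scale=0.5, zero_point=128)
print(affine)  # [126, 128, 129, 255]
```

The last input lies outside the representable range for this scale and zero point, so it saturates at 255.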

#### Affine to real

To convert an output tensor of affine elements represented by uint8
or uint16 to a tensor of real-valued elements (usually represented with a
floating point format, frequently Single precision), the following conversion
can be performed:

$$
\begin{align*}
re&al\_value_{Single} \\
      &= roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single}
\end{align*}
$$

In the above, we assume that the result of subtraction is in 32-bit signed
integer format, and that $$roundToNearestFloat$$ returns a Single.
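
A matching Python sketch of this conversion is shown below. The helper name `affine_to_real` and the sample values are assumptions, and Python floats (double precision) stand in for Single.

```python
def affine_to_real(affine_values, scale, zero_point):
    """Dequantize affine values: subtract the zero point, then multiply by the scale."""
    real_values = []
    for affine_value in affine_values:
        difference = affine_value - zero_point        # signed integer subtraction
        real_values.append(float(difference) * scale)  # convert to float, then scale
    return real_values


real = affine_to_real([126, 128, 129, 255], scale=0.5, zero_point=128)
print(real)  # [-1.0, 0.0, 0.5, 63.5]
```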

#### Affine to fixed point

When the affine and fixed point scales are the same, subtract the zero point
from the affine value to get the equivalent fixed point value.

$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$

#### Fixed point to affine

When the affine and fixed point scales are the same, add the zero point to the
fixed point value to get the equivalent affine value.

$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$
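
The two conversions are inverses of each other, and they are what makes arithmetic on affine values possible. The Python sketch below uses hypothetical helper names and an assumed zero point of 128 to show that adding affine values directly gives a different result from converting to scaled values first.

```python
def affine_to_fixed_point(affine_value, zero_point):
    """Subtract the zero point to obtain the scaled value."""
    return affine_value - zero_point


def fixed_point_to_affine(scaled_value, zero_point):
    """Add the zero point to obtain the affine value."""
    return scaled_value + zero_point


zero_point = 128
a, b = 130, 131  # affine values whose scaled values are 2 and 3

# Convert to scaled values, add, then convert the sum back to an affine value.
scaled_sum = (affine_to_fixed_point(a, zero_point)
              + affine_to_fixed_point(b, zero_point))
affine_sum = fixed_point_to_affine(scaled_sum, zero_point)

print(scaled_sum)  # 5
print(affine_sum)  # 133
print(a + b)       # 261: adding affine values directly counts the zero point twice
```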

## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:

*   *Quantization* dialect containing:

    *   A family of [QuantizedTypes](#quantized-type) which represent the
        mapping between *expressed* values (typically of a floating point
        computer type) and *storage* values (typically of an integral computer
        type).
    *   [Type conversion ops](#quantized-type-conversion-operations) for converting
        between types based on a QuantizedType and its *expressed* and *storage*
        sub-types.
    *   [Instrumentation ops](#instrumentation-and-constraint-operations) for assigning
        instrumentation points within the computation where runtime statistics
        may help guide the quantization process.

*   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)

*   [TFLite native quantization](#tflite-native-quantization)

    *   The TFLite op-set natively supports uniform-quantized variants.
    *   Passes and tools exist to convert directly from the *TensorFlow* dialect
        to the TFLite quantized operation set.

Not every application of quantization will use all of these facilities. Specifically, the
TensorFlow to TensorFlow Lite conversion uses the QuantizedTypes but has its own
operations for type conversion and expression of the supporting math.

## Quantization Dialect

### Quantized type

TODO : Flesh this section out.

*   QuantizedType base class
*   UniformQuantizedType

### Quantized type conversion operations

*   qcast : Convert from an expressed type to QuantizedType
*   dcast : Convert from a QuantizedType to its expressed type
*   scast : Convert between a QuantizedType and its storage type

### Instrumentation and constraint operations

*   const_fake_quant : Emulates the logic of the historic TensorFlow
    fake_quant_with_min_max_args operation.
*   stats_ref : Declares that statistics should be gathered at this point with a
    unique key and made available to future passes of the solver.
*   stats : Declares inline statistics (per layer and per axis) for the point in
    the computation. stats_ref ops are generally converted to stats ops once
    trial runs have been performed.
*   coupled_ref : Declares points in the computation to be coupled from a type
    inference perspective based on a unique key.

## Integration with simulated quantization at training time

TensorFlow has historically used the
[tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such
operations at inference time. When quantized inference was enabled, if every
eligible tensor passed through an appropriate fake_quant node (the rules of
which tensors can have fake_quant applied are somewhat involved), then
TensorFlow Lite would use the attributes of the fake_quant operations to make a
judgment about how to convert to use kernels from its quantized operations subset.

In MLIR-based quantization, fake_quant_\* operations are handled by converting them to
a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with an
appropriate *UniformQuantizedType* as the target of the qcast operation.

This allows subsequent compiler passes to preserve the knowledge that
quantization was simulated in a certain way, while giving the compiler
flexibility to move the casts as it simplifies the computation and converts it
to a form based on integral arithmetic.

This scheme also naturally allows computations that are *partially quantized*
where the parts which could not be reduced to integral operations are still carried out
in floating point with appropriate conversions at the boundaries.
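
The quantize-then-dequantize idea can be sketched numerically in Python. The functions below are hypothetical stand-ins named after the casts for readability; they are not the MLIR operations themselves, and the 8-bit unsigned storage range, scale, and zero point are assumptions chosen for the example.

```python
def qcast(real_value, scale, zero_point, qmin=0, qmax=255):
    """Stand-in for quantize: expressed (float) value to storage (integer) value."""
    return max(qmin, min(qmax, round(real_value / scale) + zero_point))


def dcast(storage_value, scale, zero_point):
    """Stand-in for dequantize: storage (integer) value back to expressed value."""
    return (storage_value - zero_point) * scale


def simulated_quant(real_value, scale, zero_point):
    """Quantize and immediately dequantize, leaving a float with quantization error."""
    return dcast(qcast(real_value, scale, zero_point), scale, zero_point)


scale, zero_point = 0.25, 128
print(simulated_quant(0.3, scale, zero_point))    # 0.25: snapped to the nearest step
print(simulated_quant(100.0, scale, zero_point))  # 31.75: saturated at storage value 255
```

The result is still a floating point value, but it can only take values that the quantized type is able to represent.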

## TFLite native quantization

TODO : Flesh this out

### General algorithm

1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ -> tfl.Q
    -> tfl.DQ).
1.  Hardcode logic/propagation needs to happen here.
1.  Run TF constant folding.
1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run a quantization pass that takes (tfl.DQ (for both input and weights) -> op
    -> tfl.Q) and replaces it with (op). The pass also replaces
    (constant_float -> tfl.Q) with (constant_quant).