
PyTorch Model Quantization

Published: 2023/12/10


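In deep learning, quantization means storing tensors with fewer bits than the floating-point format they originally used, and performing computations with fewer bits than the floating-point arithmetic they originally used. The main benefits are:

A smaller model, close to a 4x reduction in size (for example, going from fp32 to int8);
Faster computation: with fewer memory accesses and faster int8 arithmetic, inference can run 2 to 4 times faster.

In a quantized model, some or all tensor operations are computed with integer types instead of the float types used before quantization. Quantization also needs support from the underlying hardware; mainstream hardware such as x86 CPUs (with AVX2), ARM CPUs, Google TPU, Nvidia Volta/Turing/Ampere, and Qualcomm DSPs all provide it.

PyTorch currently supports quantization in three ways:

Post Training Dynamic Quantization: dynamic quantization after the model has been trained;
Post Training Static Quantization: static quantization after the model has been trained;
QAT (Quantization Aware Training): quantization enabled during training.

Before covering these three, let's first introduce the most basic building block: tensor quantization.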

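Tensor Quantization

Quantization (Equation 1): $$xq = \mathrm{round}\left(\frac{x}{scale} + zero\_point\right)$$

Dequantization (Equation 2): $$x = (xq - zero\_point) \times scale$$

Here scale is the scaling factor and zero_point is the zero reference, that is, the value that the fp32 zero maps to in the quantized tensor.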
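As a worked example of Equations 1 and 2, the arithmetic can be checked by hand in plain Python. This is only a sketch: the helper names `quantize` and `dequantize` are made up for illustration, the input values are arbitrary, and the clamp to [0, 255] is an added assumption that mirrors an unsigned 8-bit storage type.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Equation 1, followed by clamping to the representable range [qmin, qmax]."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(xq, scale, zero_point):
    """Equation 2."""
    return (xq - zero_point) * scale


# Illustrative values; scale and zero_point are arbitrary choices.
scale, zero_point = 0.5, 8
x = [[0.9872, -1.6833], [-0.9345, 0.6531]]

xq = [[quantize(v, scale, zero_point) for v in row] for row in x]
xdq = [[dequantize(q, scale, zero_point) for q in row] for row in xq]

print(xq)   # [[10, 5], [6, 9]]
print(xdq)  # [[1.0, -1.5], [-1.0, 0.5]]
```

For instance, 0.9872 / 0.5 + 8 = 9.9744 rounds to 10, and (10 - 8) * 0.5 = 1.0, so this value picks up an error of about 0.013 on the round trip.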

To implement quantization, PyTorch introduced the Quantized Tensor, which can represent quantized data. It can store int8/uint8/int32 data and carries parameters such as scale and zero_point. Converting a standard float Tensor into a quantized Tensor works as follows:

import torch

x = torch.randn(2, 2, dtype=torch.float32)
# tensor([[ 0.9872, -1.6833],
#         [-0.9345,  0.6531]])

# Equation 1 (quantization): xq = round(x / scale + zero_point)
# Convert a float tensor into a quantized tensor using the given scale and zero_point
xq = torch.quantize_per_tensor(x, scale=0.5, zero_point=8, dtype=torch.quint8)
# tensor([[ 1.0000, -1.5000],
#         [-1.0000,  0.5000]], size=(2, 2), dtype=torch.quint8,
#        quantization_scheme=torch.per_tensor_affine, scale=0.5, zero_point=8)

print(xq.int_repr())  # given a quantized tensor, return a tensor whose data type is uint8_t
# tensor([[10,  5],
#         [ 6,  9]], dtype=torch.uint8)

# Equation 2 (dequantization): xdq = (xq - zero_point) * scale
# Convert a quantized tensor back into a float tensor using its scale and zero_point
xdq = xq.dequantize()
# tensor([[ 1.0000, -1.5000],
#         [-1.0000,  0.5000]])

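The fact that xdq already deviates from x tells us two things:

Quantization loses precision.
The scale and zero_point (zp) we picked arbitrarily are poor; choosing suitable values reduces the precision loss considerably. For inputs in [0, 1), such as those produced by torch.rand, try scale = 0.0036 and zero_point = 0. For the randn values above, which include negatives that quint8 cannot represent with zero_point = 0, a scale of about 0.0105 with a zero_point of about 161 covers the observed range.

In PyTorch, the job of choosing a suitable scale and zp is done by the various observers.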
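The idea behind a min/max style observer can be sketched in a few lines of plain Python. This is not PyTorch's implementation; the function names `qparams_from_range` and `roundtrip_error` are hypothetical, the target type is assumed to be unsigned 8-bit, and the sample values are illustrative.

```python
def qparams_from_range(min_val, max_val, qmin=0, qmax=255):
    """Derive (scale, zero_point) so that [min_val, max_val] maps onto [qmin, qmax]."""
    # Make sure 0.0 lies inside the range so that it is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin) or 1.0
    zero_point = qmin - round(min_val / scale)
    return scale, max(qmin, min(qmax, zero_point))


def roundtrip_error(values, scale, zero_point, qmin=0, qmax=255):
    """Largest absolute error after quantizing (Equation 1) and dequantizing (Equation 2)."""
    worst = 0.0
    for v in values:
        q = max(qmin, min(qmax, round(v / scale + zero_point)))
        worst = max(worst, abs((q - zero_point) * scale - v))
    return worst


values = [0.9872, -1.6833, -0.9345, 0.6531]
scale, zero_point = qparams_from_range(min(values), max(values))

error_arbitrary = roundtrip_error(values, scale=0.5, zero_point=8)
error_observed = roundtrip_error(values, scale, zero_point)
print(scale, zero_point)
print(error_arbitrary, error_observed)
```

With parameters derived from the observed range, the worst round-trip error falls from roughly 0.18 to well under 0.01 for these values, because the 256 available levels are spread over the range the data actually occupies.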

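Tensor quantization supports two modes: per tensor and per channel.

Per tensor: all values in a tensor are scaled and offset in the same way;
Per channel: the values along one dimension of the tensor (usually the channel dimension) are each scaled and offset in their own way, so a single tensor carries several different scales and offsets (forming a vector). As a result, quantization introduces less error than in the per tensor mode. PyTorch currently supports per channel quantization for conv2d(), conv3d(), and linear().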
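The difference between the two modes can be illustrated with a small plain-Python sketch. It assumes symmetric signed 8-bit quantization (zero_point = 0) for simplicity, and the weight matrix and helper names are hypothetical; it only demonstrates why a separate scale per channel reduces error when channels differ in magnitude.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    return (max(abs(v) for v in values) / qmax) or 1.0


def fake_quantize(values, scale, qmax=127):
    """Quantize to signed 8-bit with zero_point = 0, then dequantize."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def total_error(original, restored):
    """Sum of absolute differences between two matrices."""
    return sum(abs(a - b)
               for row_a, row_b in zip(original, restored)
               for a, b in zip(row_a, row_b))


# Hypothetical weight matrix: one row per output channel, with very different magnitudes.
weights = [[0.010, -0.020, 0.015],
           [1.000, -2.000, 1.500]]

# Per tensor: one scale shared by every value.
shared = symmetric_scale([v for row in weights for v in row])
per_tensor = [fake_quantize(row, shared) for row in weights]

# Per channel: one scale per row.
per_channel = [fake_quantize(row, symmetric_scale(row)) for row in weights]

err_tensor = total_error(weights, per_tensor)
err_channel = total_error(weights, per_channel)
print(err_tensor, err_channel)
```

With a shared scale, the small-magnitude row is squeezed into one or two integer levels; with its own scale it uses the full int8 range, so the per channel error comes out lower.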

在我們正式了解pytorch模型量化前我們?cè)賮?lái)檢查一下pytorch的官方量化是否能滿足我們的需求，如果不能，后面的都不需要看了

?	靜態(tài)量化	動(dòng)態(tài)量化
nn.linear	Y	Y
nn.Conv1d/2d/3d	Y	N (因?yàn)閜ytorch認(rèn)為卷積參數(shù)來(lái)了個(gè)太小了，對(duì)卷積核進(jìn)行量化會(huì)造成更多損失，所以pytorch選擇不量化)
nn.LSTM	N(LSTM的好像又可以了，官方給出了一個(gè)例子，傳送門)	Y
nn.GRU	N	Y
nn.RNNCell	N	Y
nn.GRUCell	N	Y
nn.LSTMCell	N	Y
nn.EmbeddingBag	Y(激活在fp32)	Y
nn.Embedding	Y	N
nn.MultiheadAttention	N	N
Activations	大部分支持	不變，計(jì)算停留在fp32中

第二點(diǎn)：pytorch模型的動(dòng)態(tài)量化只量化權(quán)重，不量化偏置

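Post Training Dynamic Quantization

Dynamic quantization is applied to the weights of a trained model, converting a float model into a dynamically quantized one. Only the model weights are quantized ahead of time; the biases are not quantized, and activations stay in floating point between layers and are quantized on the fly inside the quantized ops. By default only Linear and RNN variants are quantized (these layers have many parameters, so the payoff is larger).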

torch.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)

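Parameters:

model: the float model
qconfig_spec:
- either of the following
  - a set, e.g. qconfig_spec={nn.LSTM, nn.Linear}, listing the NN modules to quantize
  - a dict, e.g. qconfig_spec = {nn.Linear : default_dynamic_qconfig, nn.LSTM : default_dynamic_qconfig}
dtype: float16 or qint8
mapping: maps the type of a submodule to the type of the dynamically quantized version that should replace it
inplace: performs the model conversion in place, so the original module is mutated

Returns: the dynamically quantized model

Let's look at an example: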

# -*- coding:utf-8 -*-
# Author:凌逆戰 | Never
# Date: 2022/10/17
"""
Only the weights are quantized ahead of time; activations are not
"""
import torch
from torch import nn


class DemoModel(torch.nn.Module):
    def __init__(self):
        super(DemoModel, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=1)
        self.relu = nn.ReLU()
        self.fc = torch.nn.Linear(2, 2)

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        x = self.fc(x)
        return x


if __name__ == "__main__":
    model_fp32 = DemoModel()
    # Create a quantized model instance
    model_int8 = torch.quantization.quantize_dynamic(
        model=model_fp32,                # the original model
        qconfig_spec={torch.nn.Linear},  # NN operators to quantize dynamically
        dtype=torch.qint8)               # quantize the weights to float16 or qint8
    print(model_fp32)
    print(model_int8)

    # Run the models
    input_fp32 = torch.randn(1, 1, 2, 2)
    output_fp32 = model_fp32(input_fp32)
    print(output_fp32)

    output_int8 = model_int8(input_fp32)
    print(output_int8)

Output

DemoModel(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
  (relu): ReLU()
  (fc): Linear(in_features=2, out_features=2, bias=True)
)
DemoModel(
  (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
  (relu): ReLU()
  (fc): DynamicQuantizedLinear(in_features=2, out_features=2, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
tensor([[[[-0.5361,  0.0741],
          [-0.2033,  0.4149]]]], grad_fn=<AddBackward0>)
tensor([[[[-0.5371,  0.0713],
          [-0.2040,  0.4126]]]])
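To make the mechanics concrete, here is a plain-Python sketch of what a dynamically quantized linear layer does conceptually. It is a simplification under stated assumptions, not PyTorch's kernel: the class name `DynamicLinearSketch` is hypothetical, and both weights and activations use symmetric signed 8-bit quantization here, whereas the real implementation may use a different scheme for activations.

```python
def quantize_symmetric(values, qmax=127):
    """Return (integer values, scale) using signed 8-bit symmetric quantization."""
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


class DynamicLinearSketch:
    def __init__(self, weight, bias):
        # Weights are quantized once, ahead of time, with a single per-tensor scale.
        flat, self.w_scale = quantize_symmetric([w for row in weight for w in row])
        n = len(weight[0])
        self.w_q = [flat[i:i + n] for i in range(0, len(flat), n)]
        self.bias = bias  # the bias stays in floating point

    def __call__(self, x):
        # The activation scale is computed from the actual input at call time.
        x_q, x_scale = quantize_symmetric(x)
        out = []
        for row, b in zip(self.w_q, self.bias):
            acc = sum(xi * wi for xi, wi in zip(x_q, row))  # integer multiply-accumulate
            out.append(acc * x_scale * self.w_scale + b)    # rescale to float, add float bias
        return out


weight = [[0.5, -0.25], [0.1, 0.8]]
bias = [0.1, -0.2]
x = [1.0, -2.0]

layer = DynamicLinearSketch(weight, bias)
y_int8 = layer(x)
y_fp32 = [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]
print(y_fp32, y_int8)
```

The two results agree to within a small tolerance, much like output_fp32 and output_int8 in the example above. Because the activation scale is only known once the input arrives, no calibration data is needed for this mode.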

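Post Training Static Quantization

Static quantization quantizes both the weights and the activations of the model. It requires feeding the training set, or data with a similar distribution, through the model (note: without backpropagation), and then computing the quantization parameters (scale and zp) of the activations from the distribution of each op's input. This is called calibration. Because the forward pass of a statically quantized model is integer computation from start to finish, the activation parameters must be fixed in advance so that the output of one op arrives in the form the next op expects as its input.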
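The requirement that one op's output match the next op's input can be illustrated with a sketch of requantization for a single dot product. All scales and zero points below are hypothetical stand-ins for values that calibration would produce, the helper names are made up, and floating-point rescaling is used for readability; production kernels typically implement this step more carefully.

```python
def quantize(values, scale, zero_point, qmin, qmax):
    """Equation 1 with clamping to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]


def quantized_dot(x_q, x_scale, x_zp, w_q, w_scale, out_scale, out_zp):
    """Integer dot product whose result is requantized to uint8 for the next op."""
    # Integer accumulation (weights use zero_point = 0 in this sketch).
    acc = sum((xq - x_zp) * wq for xq, wq in zip(x_q, w_q))
    # Requantize: move from the accumulator's scale (x_scale * w_scale) to out_scale.
    q = round(acc * (x_scale * w_scale / out_scale) + out_zp)
    return max(0, min(255, q))


# Hypothetical parameters, as if they had been fixed during calibration.
x_scale, x_zp = 0.01, 128      # input activation: uint8
w_scale = 0.005                # weight: int8, zero_point 0
out_scale, out_zp = 0.01, 128  # output activation: uint8

x = [0.5, -1.0, 0.25]
w = [0.2, 0.4, -0.6]

x_q = quantize(x, x_scale, x_zp, 0, 255)
w_q = quantize(w, w_scale, 0, -128, 127)
y_q = quantized_dot(x_q, x_scale, x_zp, w_q, w_scale, out_scale, out_zp)

y_int8 = (y_q - out_zp) * out_scale        # Equation 2, only for comparison
y_fp32 = sum(a * b for a, b in zip(x, w))
print(y_q, y_int8, y_fp32)
```

The output y_q is already a uint8 value with a known scale and zero point, so the next quantized op can consume it directly without returning to floating point. This is why the output scale must be known before inference, which is what calibration provides.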

PyTorch performs static quantization of a model in the following 5 steps:

1. fuse_model

Merge the layers that can be merged. The purpose of this step is to improve speed and accuracy:

fuse_modules(model, modules_to_fuse, inplace=False, fuser_func=fuse_known_modules, fuse_custom_config_dict=None)

For example, passing the following arguments to fuse_modules merges fc and relu in the network:

torch.quantization.fuse_modules(F32Model, [['fc', 'relu']], inplace=True)

Once the merge succeeds, fc in the original network is replaced by the new fused module (because it is the first element of the list), while relu (the remaining element of the list) is replaced by nn.Identity(), a placeholder module that returns its input unchanged. As an example, take the following small network:

import torch
from torch import nn


class F32Model(nn.Module):
    def __init__(self):
        super(F32Model, self).__init__()
        self.fc = nn.Linear(3, 2, bias=False)
        self.relu = nn.ReLU(inplace=False)

    def forward(self, x):
        x = self.fc(x)
        x = self.relu(x)
        return x


model_fp32 = F32Model()
print(model_fp32)
# F32Model(
#   (fc): Linear(in_features=3, out_features=2, bias=False)
#   (relu): ReLU()
# )
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['fc', 'relu']])
print(model_fp32_fused)
# F32Model(
#   (fc): LinearReLU(
#     (0): Linear(in_features=3, out_features=2, bias=False)
#     (1): ReLU()
#   )
#   (relu): Identity()
# )

The modules_to_fuse list may contain several item lists, and the names may refer to ops inside submodules, for example: [ ['conv1', 'bn1', 'relu1'], ['submodule.conv', 'submodule.relu']]. What if the modules to fuse are wrapped in a Sequential? Pass their indices as names, as in the code below:

torch.quantization.fuse_modules(a_sequential_module, ['0', '1', '2'], inplace=True)

At the time of writing, only the following ops, in the following orders, can be fused (the mapping is defined in DEFAULT_OP_LIST_TO_FUSER_METHOD):

Convolution, BatchNorm
Convolution, BatchNorm, ReLU
Convolution, ReLU
Linear, ReLU
BatchNorm, ReLU
ConvTranspose, BatchNorm
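One reason the Convolution, BatchNorm pair can be fused is that BatchNorm at inference time is a fixed affine transform, so it can be folded into the weight and bias of the preceding convolution. The sketch below shows this arithmetic for a single channel of a 1x1 convolution, which reduces to a scalar multiply. The function names and parameter values are hypothetical, and this illustrates the folding idea rather than PyTorch's fusion code.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused: a 1x1 single-channel convolution (w * x + b) followed by BatchNorm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BatchNorm statistics into the convolution's weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta


# Hypothetical parameters for one channel.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.2, 4.0

w_fused, b_fused = fold_bn(w, b, gamma, beta, mean, var)

inputs = [-2.0, 0.0, 0.5, 3.0]
unfused = [conv_then_bn(x, w, b, gamma, beta, mean, var) for x in inputs]
fused = [w_fused * x + b_fused for x in inputs]
print(unfused)
print(fused)
```

The fused layer gives the same outputs with one op instead of two. For quantization this also means one fewer intermediate activation that has to be rounded to 8 bits.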

2. Set the qconfig

The qconfig is set on the model or on a Module.

# If deploying on an x86 server
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# If deploying on ARM
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

Targets other than x86 and ARM are not currently supported.

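3. prepare

prepare inserts an Observer into each submodule to collect the data used for calibration.

Take the activation observer as an example: it watches the input data to obtain min_val and max_val of the 4-tuple (min_val, max_val, qmin, qmax), ideally over at least a few hundred iterations of data, and then derives the two parameters scale and zp from that 4-tuple.

model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)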

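4. Feed data

This step is not training. Its purpose is to capture the distribution of the data so that the activation scale and zp can be computed well. Feed at least a few hundred iterations of data.

# Observe at least a few hundred iterations
for data in data_loader:
    model_fp32_prepared(data)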
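Why several iterations matter can be seen from a sketch of an observer that accumulates its range across batches. The class `RunningMinMax` and the batch values are hypothetical, and the parameter formula is one common min/max scheme assumed for illustration; PyTorch's observers are more elaborate.

```python
class RunningMinMax:
    """Tracks the smallest and largest value seen across all calibration batches."""

    def __init__(self, qmin=0, qmax=255):
        self.min_val, self.max_val = float("inf"), float("-inf")
        self.qmin, self.qmax = qmin, qmax

    def observe(self, batch):
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def calculate_qparams(self):
        # Extend the range to include 0.0 so that zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (self.qmax - self.qmin) or 1.0
        zero_point = self.qmin - round(lo / scale)
        return scale, max(self.qmin, min(self.qmax, zero_point))


# Hypothetical calibration batches of activation values.
batches = [[0.1, 0.4, -0.2], [1.3, -0.7, 0.0], [0.6, 2.5, -1.1]]

observer = RunningMinMax()
for step, batch in enumerate(batches, start=1):
    observer.observe(batch)
    print(step, observer.min_val, observer.max_val, observer.calculate_qparams())

scale, zero_point = observer.calculate_qparams()
```

After the first batch the observed range is only [-0.2, 0.4]; had calibration stopped there, later values such as 2.5 would saturate at the top of the uint8 range. Each additional batch widens the range until it reflects the real distribution.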

5. Convert the model

After step 4, min_val and max_val in the 4-tuple (min_val, max_val, qmin, qmax) of each op's weights are known, and min_val and max_val in the 4-tuple of each op's activations have been observed as well. In this step we call the convert API:

model_prepared_int8 = torch.quantization.convert(model_fp32_prepared)

Let's look at a complete example:

# -*- coding:utf-8 -*-
# Author:凌逆戰 | Never
# Date: 2022/10/17
"""
Both weights and activations are quantized
"""
import torch
from torch import nn


# Define a float model in which some layers can be statically quantized
class F32Model(torch.nn.Module):
    def __init__(self):
        super(F32Model, self).__init__()
        self.quant = torch.quantization.QuantStub()  # QuantStub: converts tensors from float to quantized
        self.conv = nn.Conv2d(1, 1, 1)
        self.fc = nn.Linear(2, 2, bias=False)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # DeQuantStub: converts tensors from quantized to float

    def forward(self, x):
        x = self.quant(x)  # manually mark where the tensor goes from float to quantized
        x = self.conv(x)
        x = self.fc(x)
        x = self.relu(x)
        x = self.dequant(x)  # manually mark where the tensor goes from quantized to float
        return x


model_fp32 = F32Model()
model_fp32.eval()  # the model must be in eval mode for the static quantization logic to work

# 1. Use 'qnnpack' when deploying on ARM, 'fbgemm' when deploying on an x86 server.
#    The qconfig should match the backend selected in torch.backends.quantized.engine.
model_fp32.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# 2. Where applicable, fuse some layers to speed things up.
#    Common fusions are listed in DEFAULT_OP_LIST_TO_FUSER_METHOD
model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['fc', 'relu']])

# 3. Prepare the model: insert observers that watch the activations and weights
model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)

# 4. Run a representative dataset to capture the data distribution,
#    so that the activation scale and zp can be computed well
input_fp32 = torch.randn(1, 1, 2, 2)  # (batch_size, channels, H, W)
model_fp32_prepared(input_fp32)

# 5. Quantize the model
model_int8 = torch.quantization.convert(model_fp32_prepared)

# Run the models; the relevant computation is done in int8
output_fp32 = model_fp32(input_fp32)
output_int8 = model_int8(input_fp32)
print(output_fp32)
# tensor([[[[0.6315, 0.0000],
#           [0.2466, 0.0000]]]], grad_fn=<ReluBackward0>)
print(output_int8)
# tensor([[[[0.3886, 0.0000],
#           [0.2475, 0.0000]]]])

Quantization Aware Training

I don't need this part for now; I'll fill it in when I do.

Saving and Loading Quantized Models

First, let's quantize a model

import torch
from torch import nn


class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5, bias=True)
        self.gru = nn.GRU(input_size=5, hidden_size=5, bias=True)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x, _ = self.gru(x)  # GRU returns (output, h_n); keep only the output
        x = self.relu(x)
        return x


m = M().eval()
model_int8 = torch.quantization.quantize_dynamic(
    model=m,                           # the original model
    qconfig_spec={nn.Linear, nn.GRU},  # NN operators to quantize dynamically
    dtype=torch.qint8,                 # quantize the weights to float16 or qint8
    inplace=True)

Saving/loading the quantized model's state_dict

torch.save(model_int8.state_dict(), "./state_dict.pth")
model_int8.load_state_dict(torch.load("./state_dict.pth"))
print(model_int8)

Saving/loading the quantized model as TorchScript with torch.jit.save and torch.jit.load

# Example input shaped (seq_len, batch, input_size) so that the GRU receives a 3-D tensor
traced_model = torch.jit.trace(model_int8, torch.rand(5, 1, 5))
torch.jit.save(traced_model, "./traced_quant.pt")
quantized_model = torch.jit.load("./traced_quant.pt")
print(quantized_model)

Getting the Parameters of a Quantized Model

Getting the parameters of a quantized model in PyTorch is actually rather awkward. Let's use the quantized model above to read out the parameter values.

print(model_int8)
# M(
#   (linear): DynamicQuantizedLinear(in_features=5, out_features=5, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
#   (gru): DynamicQuantizedGRU(5, 5)
#   (relu): ReLU()
# )
print(model_int8.linear)
print(model_int8.gru)
print(model_int8.relu)

Let's try to get the weight and bias of the linear layer

# print(dir(model_int8.linear))  # list all attributes and methods of the object
print(model_int8.linear.weight().int_repr())
# tensor([[ 104,  127,   70,  -94,  121],
#         [  98,   53,  124,   74,   38],
#         [-103, -112,   38,  117,   64],
#         [ -46,  -36,  115,   82,  -75],
#         [ -14,  -94,   42,  -25,   41]], dtype=torch.int8)
print(model_int8.linear.bias())
# tensor([ 0.2437,  0.2956,  0.4010, -0.2818,  0.0950], requires_grad=True)

Oh my God, the bias is still floating point; only the weight was quantized to integers.

OK, now let's get the weights and biases of the GRU

print(dir(model_int8.gru))
print(model_int8.gru.get_weight()["weight_ih_l0"].int_repr())  # int8
print(model_int8.gru.get_weight()["weight_hh_l0"].int_repr())  # int8
print(model_int8.gru.get_bias()["bias_ih_l0"])  # float
print(model_int8.gru.get_bias()["bias_hh_l0"])  # float

First, don't ask me why reading these values is so cumbersome; it was not my choice.

Second, never mind that static quantization does not support GRU; dynamic quantization will not even quantize the bias. PyTorch quantization clearly still has a long way to go!

