PyTorch BERT Text Classification


Reposted from the original author weixin_40001805, for learning and reference only.


I had previously always used BERT through the keras-bert wrapper library, which is very convenient (see Su Jianlin's blog post 當Bert遇上Keras:這可能是Bert最簡單的打開姿勢). This time I wanted to try a PyTorch-based BERT workflow.

PyTorch has become very popular recently, yet few blog posts walk through a complete pytorch-bert application. Starting from the simplest case, Chinese text classification, this post presents every piece of the code step by step. (The code is short and clear; interested readers can easily try it themselves.)

  • First install the pytorch-bert library: pip install pytorch_pretrained_bert;
  • Then download the pre-trained weights. Here chinese_roberta_wwm_ext_pytorch is used, available from the 中文BERT-wwm series of models (several variants can be chosen there);
  • The dataset is THUCNews, trimmed down to 180,000 samples for a 10-class Chinese news-text classification task, with an equal 18,000 samples per class. Only the train.txt file from that source is used; its per-line format is sketched right below, after which we move on to the code. (The training environment is Google Colab.)
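Each line of train.txt holds a news headline, a tab, and an integer class label in 0-9. This layout is inferred from the parsing code in step 2; the headlines below are invented illustrations, not real dataset entries:

新闻标题示例文本一	0
新闻标题示例文本二	3
新闻标题示例文本三	9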

1. Import the necessary libraries

# coding: UTF-8
import torch
import time
import torch.nn as nn
import torch.nn.functional as F
from pytorch_pretrained_bert import BertModel, BertTokenizer, BertConfig, BertAdam
import pandas as pd
import numpy as np
from tqdm import tqdm
from torch.utils.data import *

path = "data/"
bert_path = "chinese_roberta_wwm_ext_pytorch/"
tokenizer = BertTokenizer(vocab_file=bert_path + "vocab.txt")  # initialize the tokenizer
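Before preprocessing the whole file, it can help to see what the tokenizer does to a single sentence. The snippet below is just a sanity-check sketch (the headline is invented); it uses the same two calls the preprocessing loop in step 2 relies on.

sentence = "今天股市大幅上涨"                                   # hypothetical example headline
tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]   # BERT expects [CLS] ... [SEP]
ids = tokenizer.convert_tokens_to_ids(tokens)                   # map tokens to vocabulary ids
print(tokens)
print(ids)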

2. Preprocess the dataset

input_ids = []    # input char ids
input_types = []  # segment ids
input_masks = []  # attention mask
label = []        # labels
pad_size = 32     # also known as max_len (a preliminary analysis showed the longest text is 38 characters; 32 covers ~99% of samples)

with open(path + "train.txt", encoding='utf-8') as f:
    for i, l in tqdm(enumerate(f)):
        x1, y = l.strip().split('\t')
        x1 = tokenizer.tokenize(x1)
        tokens = ["[CLS]"] + x1 + ["[SEP]"]
        # build input_id, seg_id, att_mask
        ids = tokenizer.convert_tokens_to_ids(tokens)
        types = [0] * len(ids)
        masks = [1] * len(ids)
        # pad short sequences, truncate long ones
        if len(ids) < pad_size:
            types = types + [1] * (pad_size - len(ids))  # segment id set to 1 for the padded part
            masks = masks + [0] * (pad_size - len(ids))
            ids = ids + [0] * (pad_size - len(ids))
        else:
            types = types[:pad_size]
            masks = masks[:pad_size]
            ids = ids[:pad_size]
        input_ids.append(ids)
        input_types.append(types)
        input_masks.append(masks)
        # print(len(ids), len(masks), len(types))
        assert len(ids) == len(masks) == len(types) == pad_size
        label.append([int(y)])

Output: 180000it [00:26, 6728.85it/s] (about 26 seconds, quite fast)

3. Split into training and test sets

# shuffle the indices
random_order = list(range(len(input_ids)))
np.random.seed(2020)  # fix the random seed
np.random.shuffle(random_order)
print(random_order[:10])

# 4:1 train/test split
input_ids_train = np.array([input_ids[i] for i in random_order[:int(len(input_ids) * 0.8)]])
input_types_train = np.array([input_types[i] for i in random_order[:int(len(input_ids) * 0.8)]])
input_masks_train = np.array([input_masks[i] for i in random_order[:int(len(input_ids) * 0.8)]])
y_train = np.array([label[i] for i in random_order[:int(len(input_ids) * 0.8)]])
print(input_ids_train.shape, input_types_train.shape, input_masks_train.shape, y_train.shape)

input_ids_test = np.array([input_ids[i] for i in random_order[int(len(input_ids) * 0.8):]])
input_types_test = np.array([input_types[i] for i in random_order[int(len(input_ids) * 0.8):]])
input_masks_test = np.array([input_masks[i] for i in random_order[int(len(input_ids) * 0.8):]])
y_test = np.array([label[i] for i in random_order[int(len(input_ids) * 0.8):]])
print(input_ids_test.shape, input_types_test.shape, input_masks_test.shape, y_test.shape)

Result: with 180,000 samples and an 80/20 split, the training arrays should come out with shape (144000, 32) and the test arrays with shape (36000, 32); the label arrays have shape (144000, 1) and (36000, 1).
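As an aside, the same shuffle-and-split can be done in one call with scikit-learn's train_test_split. This is only an alternative sketch, not part of the original pipeline, and assumes scikit-learn is installed:

from sklearn.model_selection import train_test_split

(input_ids_train, input_ids_test,
 input_types_train, input_types_test,
 input_masks_train, input_masks_test,
 y_train, y_test) = train_test_split(
    np.array(input_ids), np.array(input_types), np.array(input_masks), np.array(label),
    test_size=0.2, random_state=2020)  # 4:1 split with a fixed seed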

4. Load the data into efficient DataLoaders

BATCH_SIZE = 16
train_data = TensorDataset(torch.LongTensor(input_ids_train),
                           torch.LongTensor(input_types_train),
                           torch.LongTensor(input_masks_train),
                           torch.LongTensor(y_train))
train_sampler = RandomSampler(train_data)
train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(torch.LongTensor(input_ids_test),
                          torch.LongTensor(input_types_test),
                          torch.LongTensor(input_masks_test),
                          torch.LongTensor(y_test))
test_sampler = SequentialSampler(test_data)
test_loader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)
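A small optional check, sketched below: pull one batch from train_loader and confirm the tensor shapes before building the model.

ids, types, masks, y = next(iter(train_loader))
print(ids.shape, types.shape, masks.shape, y.shape)
# expected: torch.Size([16, 32]) for ids/types/masks and torch.Size([16, 1]) for y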

5. Define the BERT model

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.bert = BertModel.from_pretrained(bert_path)
        for param in self.bert.parameters():
            param.requires_grad = True  # fine-tune every BERT parameter
        self.fc = nn.Linear(768, 10)    # 768 -> 10 classes

    def forward(self, x):
        context = x[0]  # token ids of the input sentences
        types = x[1]    # segment ids
        mask = x[2]     # attention mask over padding, same size as the sentence, 0 for padded positions, e.g. [1, 1, 1, 1, 0, 0]
        _, pooled = self.bert(context, token_type_ids=types, attention_mask=mask,
                              output_all_encoded_layers=False)  # controls whether all encoder layers are returned
        out = self.fc(pooled)  # 10-way classification
        return out

Thanks to the convenient wrapper library, defining the BERT model takes very little code. If you want to add CNN/RNN layers on top of BERT, this is where they would be defined; a hedged sketch of one such variant follows.
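For instance, a variant that runs the token-level BERT output through a bidirectional LSTM before the classifier could look like the sketch below. This is only an illustration of where such layers would go, not part of the original model, and the hidden size of 256 is an arbitrary choice.

class ModelWithLSTM(nn.Module):
    def __init__(self):
        super(ModelWithLSTM, self).__init__()
        self.bert = BertModel.from_pretrained(bert_path)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.lstm = nn.LSTM(768, 256, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256 * 2, 10)

    def forward(self, x):
        context, types, mask = x[0], x[1], x[2]
        # with output_all_encoded_layers=False the first return value is the last
        # encoder layer, shape [batch, seq_len, 768]
        seq_out, _ = self.bert(context, token_type_ids=types, attention_mask=mask,
                               output_all_encoded_layers=False)
        lstm_out, _ = self.lstm(seq_out)   # [batch, seq_len, 512]
        out = self.fc(lstm_out[:, -1, :])  # classify from the last time step (mask-aware pooling is another option)
        return out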

6. Instantiate the BERT model

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model().to(DEVICE)
print(model)

Result: print(model) lists BERT's internal structure (not reproduced in full here); reading the output is a good way to learn how BERT is built up layer by layer.

7. Define the optimizer

param_optimizer = list(model.named_parameters())  # list of (name, parameter) pairs
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

NUM_EPOCHS = 3
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=0.05,
                     t_total=len(train_loader) * NUM_EPOCHS)

# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # for simplicity, this single line also works
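To double-check how the parameters were divided between the two groups, a quick inspection like the following can help (just a sketch):

decay_names = [n for n, p in param_optimizer if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n, p in param_optimizer if any(nd in n for nd in no_decay)]
print(len(decay_names), "tensors get weight decay,", len(no_decay_names), "are exempt")
print(no_decay_names[:3])  # biases and LayerNorm weights should appear here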

8. Define the training and evaluation functions

def train(model, device, train_loader, optimizer, epoch):  # train the model
    model.train()
    best_acc = 0.0
    for batch_idx, (x1, x2, x3, y) in enumerate(train_loader):
        start_time = time.time()
        x1, x2, x3, y = x1.to(device), x2.to(device), x3.to(device), y.to(device)
        y_pred = model([x1, x2, x3])  # forward pass
        model.zero_grad()             # clear gradients
        loss = F.cross_entropy(y_pred, y.squeeze())  # compute the loss
        loss.backward()
        optimizer.step()
        if (batch_idx + 1) % 100 == 0:  # print the loss periodically
            print('Train Epoch: {} [{}/{} ({:.2f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(x1), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))  # remember to use loss.item()


def test(model, device, test_loader):  # evaluate the model on the test set
    model.eval()
    test_loss = 0.0
    acc = 0
    for batch_idx, (x1, x2, x3, y) in enumerate(test_loader):
        x1, x2, x3, y = x1.to(device), x2.to(device), x3.to(device), y.to(device)
        with torch.no_grad():
            y_ = model([x1, x2, x3])
        test_loss += F.cross_entropy(y_, y.squeeze())
        pred = y_.max(-1, keepdim=True)[1]  # .max() returns (values, indices); take the indices
        acc += pred.eq(y.view_as(pred)).sum().item()  # remember to call .item()
    test_loss /= len(test_loader)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)'.format(
        test_loss, acc, len(test_loader.dataset),
        100. * acc / len(test_loader.dataset)))
    return acc / len(test_loader.dataset)

9. Run training and testing

best_acc = 0.0
PATH = 'roberta_model.pth'  # model save path
for epoch in range(1, NUM_EPOCHS + 1):  # 3 epochs
    train(model, DEVICE, train_loader, optimizer, epoch)
    acc = test(model, DEVICE, test_loader)
    if best_acc < acc:
        best_acc = acc
        torch.save(model.state_dict(), PATH)  # save the best model
    print("acc is: {:.4f}, best acc is {:.4f}\n".format(acc, best_acc))

Output: training takes a fair amount of time; here only one epoch was run, and the model already reaches 0.9407 accuracy on the test set.
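If you plan to resume training later, it can be worth checkpointing the optimizer state alongside the model weights. A possible sketch (roberta_checkpoint.pth is just an illustrative file name):

torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'best_acc': best_acc},
           'roberta_checkpoint.pth')  # a fuller checkpoint than the weights-only save above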

10. Load the best model and evaluate it

model.load_state_dict(torch.load("roberta_model.pth"))
acc = test(model, DEVICE, test_loader)

# The following may also be useful if you are submitting predictions to a competition:
"""
# generate predictions for an unlabeled submission test set
PATH = "roberta_model.pth"
model.load_state_dict(torch.load(PATH))

def test_for_submit(model, device, test_loader):  # run the model on unlabeled data
    model.eval()
    preds = []
    for batch_idx, (x1, x2, x3) in tqdm(enumerate(test_loader)):
        x1, x2, x3 = x1.to(device), x2.to(device), x3.to(device)
        with torch.no_grad():
            y_ = model([x1, x2, x3])
        pred = y_.max(-1, keepdim=True)[1].squeeze().cpu().tolist()  # .max() returns (values, indices); take the indices
        preds.extend(pred)
    return preds

preds = test_for_submit(model, DEVICE, test_loader)
"""

Result: the reloaded model reproduces the best test-set accuracy from the training run.
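For ad-hoc checks on individual headlines, a small helper that mirrors the preprocessing in step 2 can be handy. This is a minimal sketch (the example text is invented); it reuses tokenizer, pad_size, model and DEVICE from above and returns the predicted class index.

def predict(text):
    model.eval()
    tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    types = [0] * len(ids)
    masks = [1] * len(ids)
    if len(ids) < pad_size:  # same padding scheme as step 2
        types = types + [1] * (pad_size - len(ids))
        masks = masks + [0] * (pad_size - len(ids))
        ids = ids + [0] * (pad_size - len(ids))
    else:
        types, masks, ids = types[:pad_size], masks[:pad_size], ids[:pad_size]
    with torch.no_grad():
        y_ = model([torch.LongTensor([ids]).to(DEVICE),
                    torch.LongTensor([types]).to(DEVICE),
                    torch.LongTensor([masks]).to(DEVICE)])
    return y_.argmax(dim=-1).item()

print(predict("一条用来测试的新闻标题"))  # prints an integer class id in 0-9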


With these ten steps you have a fairly complete pytorch-bert text classification pipeline, and the code stays simple and readable. If this post helped you, remember to give it a like~

(End)
