Torchtext Tutorial: Text Data Processing

Torchtext

A toolkit for preprocessing text data.

Field

A Field defines how raw data is processed and converted into a tensor.

Using Field

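from torchtext import data  # in torchtext 0.9 to 0.11 these classes live under torchtext.legacy

# Whitespace tokenizer; any callable that maps a string to a list of tokens works here
tokenize = lambda x: x.split()

TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
LABEL = data.Field(sequential=False, use_vocab=False)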
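
To make the text-to-numbers step concrete, here is a minimal plain-Python sketch of what a Field with lower=True and a vocabulary does conceptually: tokenize, lowercase, build a vocabulary, then map tokens to indices. It does not use torchtext; build_vocab and numericalize are hypothetical helper names, and the ordering rule (special tokens first, then descending frequency with alphabetical tie-breaking) is an assumption of this sketch.

```python
from collections import Counter


def tokenize(text):
    # Whitespace tokenizer, mirroring tokenize = lambda x: x.split()
    return text.split()


def build_vocab(texts, specials=("<unk>", "<pad>")):
    # Count lowercased tokens, then order by descending frequency (ties alphabetical)
    counter = Counter(tok for t in texts for tok in tokenize(t.lower()))
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]  # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}        # token -> index
    return itos, stoi


def numericalize(text, stoi, unk_token="<unk>"):
    # Tokens missing from the vocabulary fall back to the index of unk_token
    return [stoi.get(tok, stoi[unk_token]) for tok in tokenize(text.lower())]


corpus = ["The cat sat", "the dog sat down"]
itos, stoi = build_vocab(corpus)
ids = numericalize("The bird sat", stoi)
print(itos)  # ['<unk>', '<pad>', 'sat', 'the', 'cat', 'dog', 'down']
print(ids)   # [3, 0, 2]
```

The word "bird" never appears in the corpus, so it maps to index 0, the unknown token.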

Field parameters

Parameter	Description
sequential	Whether the data is sequential. If False, no tokenization is applied. Default: True.
use_vocab	Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token	A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token	A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length	A fixed length that every sequence in this field is padded or truncated to, e.g. 100. Default: None.
dtype	The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing	The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing	A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
lower	Whether to convert the text to lowercase. Default: False.
tokenize	The function used to split the raw string into tokens, e.g. tokenize = lambda x: x.split(). Default: string.split.
tokenizer_language	The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
include_lengths	Whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch. Default: False.
batch_first	Whether to produce tensors with the batch dimension first. Default: False.
pad_token	The string token used as padding. Default: "<pad>".
unk_token	The string token used to represent OOV words. Default: "<unk>".
pad_first	Do the padding of the sequence at the beginning. Default: False.
truncate_first	Do the truncating of the sequence at the beginning. Default: False.
stop_words	Tokens to discard during the preprocessing step. Default: None.
is_target	Whether this field is a target variable. Affects iteration over batches. Default: False.
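
The interaction between fix_length, pad_first, truncate_first, init_token and eos_token is easier to see on a small example. The following is a simplified plain-Python sketch, not torchtext's actual code; pad_example is a hypothetical helper, and it assumes that the start and end tokens count toward fix_length.

```python
def pad_example(tokens, fix_length, pad_token="<pad>", init_token=None,
                eos_token=None, pad_first=False, truncate_first=False):
    # Assumption of this sketch: init/eos tokens are included in fix_length
    head = [init_token] if init_token is not None else []
    tail = [eos_token] if eos_token is not None else []
    max_len = max(0, fix_length - len(head) - len(tail))
    # Keep the last max_len tokens (truncate_first) or the first max_len tokens
    if truncate_first:
        body = tokens[max(0, len(tokens) - max_len):]
    else:
        body = tokens[:max_len]
    seq = head + body + tail
    padding = [pad_token] * (fix_length - len(seq))
    # Padding goes in front (pad_first) or at the end
    return padding + seq if pad_first else seq + padding


tokens = ["a", "b", "c"]
print(pad_example(tokens, 5))                       # ['a', 'b', 'c', '<pad>', '<pad>']
print(pad_example(tokens, 5, pad_first=True))       # ['<pad>', '<pad>', 'a', 'b', 'c']
print(pad_example(tokens, 2))                       # ['a', 'b']
print(pad_example(tokens, 2, truncate_first=True))  # ['b', 'c']
print(pad_example(tokens, 5, init_token="<sos>", eos_token="<eos>"))
# ['<sos>', 'a', 'b', 'c', '<eos>']
```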

Dataset

A Dataset uses Fields to define how each example is structured, and builds the dataset from them.

Using Dataset

A custom Dataset class

import random

import numpy as np
import pandas as pd
from tqdm import tqdm
from torchtext import data


class MyDataset(data.Dataset):
    def __init__(self, csv_path, text_field, label_field, test=False, aug=False, **kwargs):
        csv_data = pd.read_csv(csv_path)

        # (name, Field) pairs in column order; a column whose Field is None is skipped
        fields = [("id", None), ("text", text_field), ("label", label_field)]

        examples = []
        if test:
            # The test set has no labels, so only the text is loaded
            for text in tqdm(csv_data['text']):
                examples.append(data.Example.fromlist([None, text, None], fields))
        else:
            for text, label in tqdm(zip(csv_data['text'], csv_data['label']), total=len(csv_data)):
                # Data augmentation: randomly apply either dropout or shuffle
                if aug:
                    if random.random() > 0.5:
                        text = self.dropout(text)
                    else:
                        text = self.shuffle(text)
                examples.append(data.Example.fromlist([None, text, label], fields))

        # Preprocessing is done; call the parent constructor to build a standard Dataset
        super(MyDataset, self).__init__(examples, fields, **kwargs)

    def shuffle(self, text):
        # Randomly reorder the tokens of the sequence
        tokens = np.random.permutation(text.strip().split())
        return ' '.join(tokens)

    def dropout(self, text, p=0.5):
        # Randomly delete a fraction p of the tokens
        tokens = text.strip().split()
        n = len(tokens)
        # replace=False so that exactly int(n * p) distinct positions are dropped
        drop = set(np.random.choice(n, int(n * p), replace=False).tolist())
        return ' '.join(tok for i, tok in enumerate(tokens) if i not in drop)
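
To see what the shuffle and dropout augmentations do without loading a CSV file, the sketch below reimplements them as standalone functions using only the standard library. The names shuffle_tokens and dropout_tokens and the explicit rng argument are assumptions made for this illustration; a seeded random.Random makes the run repeatable.

```python
import random


def shuffle_tokens(text, rng):
    # Randomly reorder the whitespace-separated tokens
    tokens = text.strip().split()
    rng.shuffle(tokens)
    return " ".join(tokens)


def dropout_tokens(text, rng, p=0.5):
    # Remove int(len * p) distinct tokens chosen at random, keeping the order of the rest
    tokens = text.strip().split()
    drop = set(rng.sample(range(len(tokens)), int(len(tokens) * p)))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in drop)


rng = random.Random(0)  # fixed seed for a repeatable demonstration
sentence = "the quick brown fox jumps over the lazy dog"
shuffled = shuffle_tokens(sentence, rng)
dropped = dropout_tokens(sentence, rng, p=0.5)
print(shuffled)
print(dropped)
```

Shuffling keeps every token but changes the order, while dropout keeps the order but removes 4 of the 9 tokens here.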

Iterator

Iterators: Iterator and BucketIterator

Iterator

Builds batches while keeping the examples in their original order.

BucketIterator

Automatically groups examples of similar length into the same batch, which minimizes the amount of padding required.
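
The padding saved by grouping similar lengths can be checked with a few lines of plain Python. This is a simplified sketch that sorts the whole dataset by length before batching; the real BucketIterator also shuffles, so treat it only as an illustration of the idea. The helper names and the example lengths are made up.

```python
def make_batches(lengths, batch_size):
    # Split a list of sequence lengths into consecutive batches
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_needed(batches):
    # Every sequence is padded up to the longest sequence in its batch
    return sum(max(batch) - n for batch in batches for n in batch)


lengths = [3, 50, 4, 48, 5, 52, 2, 49]

in_order = make_batches(lengths, batch_size=4)          # original order, as with Iterator
bucketed = make_batches(sorted(lengths), batch_size=4)  # similar lengths together

print(padding_needed(in_order))  # 195
print(padding_needed(bucketed))  # 15
```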

import torch
from torchtext import data


def data_iter(train_path, valid_path, test_path, TEXT, LABEL):
    # Augment only the training data; validation and test data are left untouched
    train = MyDataset(train_path, text_field=TEXT, label_field=LABEL, test=False, aug=True)
    valid = MyDataset(valid_path, text_field=TEXT, label_field=LABEL, test=False, aug=False)
    test = MyDataset(test_path, text_field=TEXT, label_field=None, test=True, aug=False)

    # Build the vocabulary from the training set
    # TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
    TEXT.build_vocab(train)
    # None unless pretrained vectors were passed to build_vocab
    weight_matrix = TEXT.vocab.vectors

    # Newer torchtext versions expect a torch.device (or a string) rather than -1 or a GPU index
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # To build an iterator for the training set only:
    # train_iter = data.BucketIterator(dataset=train, batch_size=8, shuffle=True, sort_within_batch=False, repeat=False)

    # Build iterators for the training and validation sets together
    train_iter, val_iter = data.BucketIterator.splits(
        (train, valid),
        batch_sizes=(8, 8),
        device=device,
        # Key used to sort examples by length
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        repeat=False
    )
    test_iter = data.Iterator(test, batch_size=8, device=device, sort=False, sort_within_batch=False, repeat=False)
    return train_iter, val_iter, test_iter, weight_matrix

Word Embedding

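When using a neural network framework such as PyTorch or TensorFlow for NLP tasks, word vectors are handled by the framework's Embedding layer. Pretrained word vectors often improve performance. The following shows how to use pretrained word vectors in torchtext and pass them to a neural network model for training.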

Pretrained word vectors supported by torchtext by default

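torchtext automatically downloads the corresponding pretrained vector file into the .vector_cache directory under the current folder. .vector_cache is the default directory for word vector files and cache files.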

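from torchtext.vocab import GloVe
from torchtext import data

TEXT = data.Field(sequential=True)
# `train` is a Dataset built as shown in the Dataset section.
# The following two ways of specifying pretrained vectors are equivalent:
# TEXT.build_vocab(train, vectors="glove.6B.300d")
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
# This downloads glove.6B.zip and extracts glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt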

Externally pretrained word vectors

Specify the pretrained vector file with the name argument and the directory that contains it with the cache argument.

from torchtext.vocab import Vectors

cache = '.vector_cache'
vectors = Vectors(name='myvector/glove/glove.6B.200d.txt', cache=cache)
TEXT.build_vocab(train, vectors=vectors)
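
A GloVe-style vector file is plain text: each line holds a token followed by its space-separated float components. The sketch below parses such a file with the standard library only, to show what the file contains. It is an illustration, not how torchtext's Vectors class is implemented; load_vectors, the file name toy.3d.txt and the values in it are made up.

```python
import os
import tempfile


def load_vectors(path):
    # Parse a GloVe-style text file: a token, then its float components, on each line
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Write a tiny 3-dimensional example file (made-up values) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.3d.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    vectors = load_vectors(path)

print(vectors["dog"])  # [0.4, 0.5, 0.6]
```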

Setting the Embedding layer weights in the model

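import torch.nn as nn

# The weights are stored in the vectors attribute of the vocabulary
weight_matrix = TEXT.vocab.vectors
# Create the PyTorch Embedding layer with a shape that matches the pretrained matrix
vocab_size, embedding_dim = weight_matrix.shape
embedding = nn.Embedding(vocab_size, embedding_dim)
# Initialize the embedding weights with the pretrained vectors
embedding.weight.data.copy_(weight_matrix)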
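
Copying the matrix works because row i of the weight matrix belongs to the token with vocabulary index i. The plain-Python sketch below shows that alignment with nested lists instead of tensors. build_weight_matrix and embed are hypothetical helpers, the vectors are made up, and filling missing tokens with zeros is an assumption of this sketch.

```python
def build_weight_matrix(itos, pretrained, dim):
    # Row i holds the vector of token itos[i]; tokens without a pretrained vector get zeros
    return [list(pretrained.get(tok, [0.0] * dim)) for tok in itos]


def embed(token_ids, weights):
    # An embedding lookup is just row selection by index
    return [weights[i] for i in token_ids]


itos = ["<unk>", "<pad>", "cat", "dog", "bird"]
stoi = {tok: i for i, tok in enumerate(itos)}
pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}  # made-up 2-dimensional vectors

weights = build_weight_matrix(itos, pretrained, dim=2)
print(embed([stoi["dog"], stoi["bird"]], weights))  # [[0.3, 0.4], [0.0, 0.0]]
```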

Posted 2020-07-10 09:15 by 林震宇
