NLP: A Summary of Text Data Augmentation Methods

Reposted from: https://blog.csdn.net/Flying_sfeng/article/details/121691380

Reposted from: https://blog.csdn.net/u012744245/article/details/123378152

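1. Easy Data Augmentation (EDA)

EDA is a simple but highly effective method. It consists of synonym replacement, random insertion, random swap, and random deletion.

EDA is an unsupervised method introduced in the paper "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks", which presents a set of simple data augmentation techniques for improving performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On the five text classification tasks in the paper's experiments, EDA improved the performance of both convolutional and recurrent neural networks. EDA is particularly effective on smaller datasets: on average across the five datasets, training with EDA on only 50% of the available training set reached the same accuracy as normal training on all available data.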

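The four EDA data augmentation operations:

Synonym Replacement (SR): randomly choose n words from the sentence that are not stop words, and replace each of them with a randomly chosen synonym.
Random Insertion (RI): randomly pick a word in the sentence that is not a stop word, find a random synonym of it, and insert that synonym at a random position in the sentence. Repeat n times.
Random Swap (RS): randomly choose two words in the sentence and swap their positions. Repeat n times.
Random Deletion (RD): randomly remove each word in the sentence with probability p.

When using EDA, take care to limit the number of augmented samples: augment in small amounts and do not expand the dataset too much, because applying EDA operations too often can change the meaning of a sentence and thus hurt model performance.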
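One way to keep the amount of change under control is to tie n to the sentence length. The EDA paper sets n = αl for SR, RI, and RS, where l is the number of words in the sentence and α is the fraction of words to change. The sketch below illustrates that rule; the helper name `num_changes`, the default value of `alpha`, and the floor of one change per sentence are assumptions of this example, not part of the text above.

```python
def num_changes(words, alpha=0.1):
    """Number of words to change for SR, RI, or RS: n = alpha * l.

    The floor of 1 is an assumption so that short sentences still change.
    """
    return max(1, int(alpha * len(words)))


sentence = ("the quick brown fox jumps over the lazy dog "
            "near the quiet river bank today").split()
n = num_changes(sentence, alpha=0.1)
print(len(sentence), n)  # 15 1
```

A small `alpha` keeps most of each sentence intact, which matches the warning above about changing the meaning.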

Speaking of EDA: when I previously interviewed for an NLP algorithm engineer position, I was asked to write out these four functions.

Synonym Replacement (SR):

########################################################################
# Synonym replacement
# Replace n words in the sentence with synonyms from wordnet
########################################################################

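#the first time you use wordnet, download it:
#import nltk
#nltk.download('wordnet')
import random

from nltk.corpus import wordnet

#note: stop_words (a collection of stop words) and get_synonyms(word)
#(a wordnet synonym lookup) are not defined in this excerpt; they are
#part of the full EDA code referenced at the end of this section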

def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            #print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n: #only replace up to n words
            break

    #re-join and re-split so that any multi-word synonym becomes separate single-word tokens
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

    return new_words
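The function above relies on two names that this excerpt does not define: `stop_words` and `get_synonyms`. The sketch below shows the shape they are expected to have, using a tiny hand-written table instead of WordNet so that it runs without any downloads. The table `TOY_SYNONYMS` and its entries are hypothetical stand-ins for illustration only.

```python
# Hypothetical stand-ins; a real setup would use a full stop-word
# list and a WordNet-backed lookup instead.
stop_words = {"the", "a", "an", "is", "of", "on"}

TOY_SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}


def get_synonyms(word):
    """Return candidate synonyms for `word`, excluding the word itself."""
    return [s for s in TOY_SYNONYMS.get(word.lower(), []) if s != word]


words = "the quick fox is happy".split()
# Words that are not stop words and have at least one synonym
candidates = [w for w in words if w not in stop_words and get_synonyms(w)]
print(candidates)  # ['quick', 'happy']
```

With these two names in scope, `synonym_replacement(words, 1)` can be called on the same token list.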

Random Deletion (RD):

########################################################################
# Random deletion
# Randomly delete words from the sentence with probability p
########################################################################

def random_deletion(words, p):

    #if there is at most one word, there is nothing to delete
    if len(words) <= 1:
        return words

    #randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    #if you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return [words[rand_int]]

    return new_words

Random Swap (RS):

########################################################################
# Random swap
# Randomly swap two words in the sentence n times
########################################################################

def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words

def swap_word(new_words):
    #nothing to swap when there are fewer than two words
    if len(new_words) < 2:
        return new_words
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        if counter > 3:
            return new_words
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 
    return new_words

Random Insertion (RI):

########################################################################
# Random insertion
# Randomly insert n words into the sentence
########################################################################

def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words

def add_word(new_words):
    #an empty sentence has no word to draw a synonym from
    if len(new_words) == 0:
        return
    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
    random_synonym = synonyms[0]
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)

Full code:

EDA code

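2. An Easier Data Augmentation (AEDA)

AEDA is a very simple method: it augments data by inserting punctuation marks into the middle of sentences. The core of the paper fits in a single paragraph, a case of great truths being simple. It was accepted to Findings of EMNLP 2021.

The code is equally simple: AEDA code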
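As a rough illustration of the idea, the sketch below inserts punctuation tokens at random positions in a tokenized sentence. It is not the linked AEDA code. The punctuation set and the cap of one third of the sentence length on the number of insertions follow my reading of the AEDA paper, and the function name `insert_punctuation` is hypothetical.

```python
import random

# Punctuation marks to insert (assumed set, see the note above)
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def insert_punctuation(words):
    """Return a copy of `words` with random punctuation tokens inserted."""
    # Insert between 1 and len(words) // 3 marks, and always at least 1
    max_inserts = max(1, len(words) // 3)
    num_inserts = random.randint(1, max_inserts)
    new_words = list(words)
    for _ in range(num_inserts):
        # randint is inclusive, so len(new_words) means "append at the end"
        position = random.randint(0, len(new_words))
        new_words.insert(position, random.choice(PUNCTUATION))
    return new_words


random.seed(0)
augmented = insert_punctuation("the movie was surprisingly good".split())
print(" ".join(augmented))
```

Unlike EDA's deletion and replacement, this keeps every original word and its order, so the sentence can always be recovered by dropping the inserted marks.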

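3. UDA (Unsupervised Data Augmentation)

UDA is a semi-supervised learning method that reduces the need for labeled data and makes greater use of unlabeled data. It comes from the paper "Unsupervised Data Augmentation for Consistency Training". A common practice in semi-supervised learning is to apply consistency training to a large amount of unlabeled data, constraining the model's predictions to be invariant to input noise. The paper offers a new perspective on how to noise unlabeled examples effectively, and argues that the quality of the noise, especially noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.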
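The consistency constraint can be made concrete with a small numeric sketch: the model's predicted distribution on an unlabeled input is compared with its prediction on a noised version of the same input, and the divergence between the two is penalized. The code below uses KL divergence for this comparison in plain Python, without any training framework or gradients. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_original, logits_augmented):
    """Penalty that grows as predictions on x and its noised version diverge."""
    p_original = softmax(logits_original)    # treated as the fixed target
    p_augmented = softmax(logits_augmented)  # prediction on the noised input
    return kl_divergence(p_original, p_augmented)


same = consistency_loss([2.0, 0.5], [2.0, 0.5])
different = consistency_loss([2.0, 0.5], [0.5, 2.0])
print(same, different)
```

In training, this term would be computed on unlabeled examples and added to the usual supervised loss on the labeled examples.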

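By replacing simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, UDA brings substantial improvements on six language tasks and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, UDA achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On the standard semi-supervised learning benchmark CIFAR-10, UDA outperforms all previous approaches and achieves an error rate of 5.43 with only 250 labeled examples. UDA also combines well with transfer learning, for example when fine-tuning from BERT, and yields improvements in the high-data regime, such as ImageNet, both when only 10% of the labeled data is used and when the full labeled set is used together with 1.3 million extra unlabeled examples.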

The text augmentation technique used by UDA is back-translation, which can generate diverse sentence forms while keeping the meaning unchanged.
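The structure of back-translation is simply two translation steps chained together. The sketch below shows that structure with pluggable translator functions. A real pipeline would call a machine translation system at both steps; here two tiny word-level lookup tables stand in for it so the example runs on its own. The tables, `toy_translate`, and `back_translate` are all hypothetical.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Translate into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(sentence))


# Toy word-level lookup tables, used only to make the sketch runnable
EN_TO_PIVOT = {"the": "der", "movie": "film", "film": "film",
               "was": "war", "great": "grossartig"}
PIVOT_TO_EN = {"der": "the", "film": "film",
               "war": "was", "grossartig": "great"}


def toy_translate(table):
    """Build a translator that maps each word through `table`."""
    def translate(sentence):
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate


paraphrase = back_translate(
    "the movie was great",
    toy_translate(EN_TO_PIVOT),
    toy_translate(PIVOT_TO_EN),
)
print(paraphrase)  # the film was great
```

The round trip changes the wording ("movie" becomes "film") while the meaning stays the same, which is the property the paragraph above describes.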

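The key question UDA addresses is how to make greater use of unlabeled data when only a small amount of labeled data is available. The augmentation noise it relies on has three properties:

Valid noise: ensures that predictions for the original unlabeled data and the augmented unlabeled data are consistent.
Diverse noise: modifies the input substantially without changing its label, increasing sample diversity, rather than making only local changes with Gaussian noise.
Targeted inductive biases: different tasks require different inductive biases.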

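The UDA paper runs experiments on image classification and text classification tasks, using a different data augmentation strategy for each:

Image Classification: RandAugment, a data augmentation method inspired by AutoAugment (Cubuk et al., 2018). AutoAugment uses a search method to combine all the image processing transformations in the Python Image Library (PIL) and find a good augmentation policy. RandAugment does not use search; instead it samples uniformly from the same set of augmentation transformations in PIL. In other words, RandAugment is simpler and requires no labeled data, because there is no need to search for an optimal policy.
Text Classification: Back-translation, which preserves meaning while using machine translation systems to translate between languages and back, increasing sentence diversity.
Text Classification: Word replacing with TF-IDF. Back-translation keeps the global meaning unchanged but gives no control over which words are retained. For topic classification tasks, certain keywords carry more important information for determining the topic. A further augmentation method is therefore used: replace uninformative words that have low TF-IDF scores while keeping words with high TF-IDF values.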
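The TF-IDF strategy can be sketched as follows: score each word in a sentence, then replace words with a probability that grows as their score falls below the highest score in that sentence. This is a simplified illustration, not the UDA implementation. The smoothed IDF formula, the probability rule, the uniform sampling of replacements from `vocab`, and all names and data here are assumptions of the sketch.

```python
import math
import random
from collections import Counter


def tfidf_scores(doc, corpus):
    """TF-IDF score for each distinct word in `doc`, with IDF from `corpus`."""
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for d in corpus if word in d)
        # Smoothed IDF (an assumption of this sketch); always positive
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0
        scores[word] = (count / len(doc)) * idf
    return scores


def replace_uninformative(doc, corpus, vocab, p=0.3):
    """Replace low TF-IDF words in `doc`; the top-scoring words are kept."""
    scores = tfidf_scores(doc, corpus)
    top = max(scores.values())
    gaps = [top - scores[w] for w in doc]  # 0 for the most informative words
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:  # all words equally informative: change nothing
        return list(doc)
    out = []
    for word, gap in zip(doc, gaps):
        prob = min(p * gap / mean_gap, 1.0)
        out.append(random.choice(vocab) if random.random() < prob else word)
    return out


corpus = [
    "the match ended with a late goal".split(),
    "the market fell after the report".split(),
    "the weather was cold and the sky was grey".split(),
]
doc = corpus[0]
vocab = ["this", "that", "some"]  # hypothetical pool of replacement words
random.seed(0)
print(" ".join(replace_uninformative(doc, corpus, vocab)))
```

In this toy corpus only "the" appears in every document, so it is the only word whose score falls below the maximum and the only one that gets replaced; topic words such as "match" and "goal" are always kept.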