HF Fine-tuning (Part 2)
Fine-tuning a Language Model with HF: Question Answering¶
Note: the fine-tuned model still answers questions by extracting a span from the context, rather than generating new text.
# Adjust the key parameters below according to your model and GPU resources
squad_v2 = False
model_checkpoint = "/models/distilbert-base-uncased"
batch_size = 128
from datasets import load_dataset
datasets = load_dataset("squad_v2" if squad_v2 else "squad")
The datasets object itself is a DatasetDict, with one key per split (for SQuAD: train and validation).
Comparing the Yelp and SQuAD datasets¶
# Unlike Yelp, the SQuAD dataset includes context passages
datasets
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
YelpReviewFull Dataset:
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})
datasets["train"][0]
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
Locating the answer within the context¶
The question is answered from the context; 'answer_start': [515] means the answer begins at character 515 of the context.
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
show_random_elements(datasets["train"])
 | id | title | context | question | answers |
---|---|---|---|---|---|
0 | 570c78d9fed7b91900d459de | FC_Barcelona | Prior to the 2011–2012 season, Barcelona had a long history of avoiding corporate sponsorship on the playing shirts. On 14 July 2006, the club announced a five-year agreement with UNICEF, which includes having the UNICEF logo on their shirts. The agreement had the club donate €1.5 million per year to UNICEF (0.7 percent of its ordinary income, equal to the UN International Aid Target, cf. ODA) via the FC Barcelona Foundation. The FC Barcelona Foundation is an entity set up in 1994 on the suggestion of then-chairman of the Economical-Statutory Committee, Jaime Gil-Aluja. The idea was to set up a foundation that could attract financial sponsorships to support a non-profit sport company. In 2004, a company could become one of 25 "Honorary members" by contributing between £40,000–60,000 (£54,800–82,300) per year. There are also 48 associate memberships available for an annual fee of £14,000 (£19,200) and an unlimited number of "patronages" for the cost of £4,000 per year (£5,500). It is unclear whether the honorary members have any formal say in club policy, but according to the author Anthony King, it is "unlikely that Honorary Membership would not involve at least some informal influence over the club". | Who recommended setting up the FC Barcelona Foundation? | {'text': ['Jaime Gil-Aluja'], 'answer_start': [560]} |
1 | 5726b245dd62a815002e8d25 | Crimean_War | The war opened in the Balkans when Russian troops occupied provinces in modern Romania and began to cross the Danube. Led by Omar Pasha, the Ottomans fought a strong defensive battle and stopped the advance at Silistra. A separate action on the fort town of Kars in eastern Turkey led to a siege, and a Turkish attempt to reinforce the garrison was destroyed by a Russian fleet at Sinop. Fearing an Ottoman collapse, France and the UK rushed forces to Gallipoli. They then moved north to Varna in June, arriving just in time for the Russians to abandon Silistra. Aside from a minor skirmish at Constanța there was little for the allies to do. Karl Marx quipped that "there they are, the French doing nothing and the British helping them as fast as possible". | Russian troops took over which provinces first? | {'text': ['modern Romania'], 'answer_start': [72]} |
2 | 572ba8ea111d821400f38f3e | Education | Research into LCPS (low cost private schools) found that over 5 years to July 2013, debate around LCPSs to achieving Education for All (EFA) objectives was polarised and finding growing coverage in international policy. The polarisation was due to disputes around whether the schools are affordable for the poor, reach disadvantaged groups, provide quality education, support or undermine equality, and are financially sustainable. The report examined the main challenges encountered by development organisations which support LCPSs. Surveys suggest these types of schools are expanding across Africa and Asia. This success is attributed to excess demand. These surveys found concern for: | What does LCPS stand for? | {'text': ['low cost private schools'], 'answer_start': [20]} |
3 | 56e8d6da99e8941900975ec3 | Westminster_Abbey | The only extant depiction of Edward's abbey, together with the adjacent Palace of Westminster, is in the Bayeux Tapestry. Some of the lower parts of the monastic dormitory, an extension of the South Transept, survive in the Norman undercroft of the Great School, including a door said to come from the previous Saxon abbey. Increased endowments supported a community increased from a dozen monks in Dunstan's original foundation, up to a maximum about eighty monks, although there was also a large community of lay brothers who supported the monastery's extensive property and activities. | Where is the only existant depiction of Edward's abbey? | {'text': ['Bayeux Tapestry'], 'answer_start': [105]} |
4 | 5705f66d52bb89140068974c | The_Times | The following year, when Philip Graves, the Constantinople (modern Istanbul) correspondent of The Times, exposed The Protocols as a forgery, The Times retracted the editorial of the previous year. | How did The Times respond to the exposing of anti-Semitic documents as forgery? | {'text': ['retracted the editorial'], 'answer_start': [151]} |
5 | 56ea90465a205f1900d6d342 | Political_corruption | In politics, corruption undermines democracy and good governance by flouting or even subverting formal processes. Corruption in elections and in the legislature reduces accountability and distorts representation in policymaking; corruption in the judiciary compromises the rule of law; and corruption in public administration results in the inefficient provision of services. It violates a basic principle of republicanism regarding the centrality of civic virtue. | What does corruption undermine in politics? | {'text': ['democracy'], 'answer_start': [35]} |
6 | 56f89ee99b226e1400dd0cd9 | Guinea-Bissau | Guinea-Bissau (i/ˈɡɪni bɪˈsaʊ/, GI-nee-bi-SOW), officially the Republic of Guinea-Bissau (Portuguese: República da Guiné-Bissau, pronounced: [ʁeˈpublikɐ dɐ ɡiˈnɛ biˈsaw]), is a country in West Africa. It covers 36,125 square kilometres (13,948 sq mi) with an estimated population of 1,704,000. | How many kilometers does Guinea-Bissau cover? | {'text': ['36,125'], 'answer_start': [211]} |
7 | 570d0581b3d812140066d39f | Macintosh | In 1985, the combination of the Mac, Apple's LaserWriter printer, and Mac-specific software like Boston Software's MacPublisher and Aldus PageMaker enabled users to design, preview, and print page layouts complete with text and graphics—an activity to become known as desktop publishing. Initially, desktop publishing was unique to the Macintosh, but eventually became available for other platforms. Later, applications such as Macromedia FreeHand, QuarkXPress, and Adobe's Photoshop and Illustrator strengthened the Mac's position as a graphics computer and helped to expand the emerging desktop publishing market. | What three things were combined to develop desktop publishing? | {'text': ['Mac, Apple's LaserWriter printer, and Mac-specific software like Boston Software's MacPublisher'], 'answer_start': [32]} |
8 | 573024cc04bcaa1900d7721e | Roman_Republic | Caesar held both the dictatorship and the tribunate, and alternated between the consulship and the proconsulship. In 48 BC, Caesar was given permanent tribunician powers. This made his person sacrosanct, gave him the power to veto the senate, and allowed him to dominate the Plebeian Council. In 46 BC, Caesar was given censorial powers, which he used to fill the senate with his own partisans. Caesar then raised the membership of the Senate to 900. This robbed the senatorial aristocracy of its prestige, and made it increasingly subservient to him. While the assemblies continued to meet, he submitted all candidates to the assemblies for election, and all bills to the assemblies for enactment. Thus, the assemblies became powerless and were unable to oppose him. | What power could Caesar use against the senate should he choose? | {'text': ['power to veto'], 'answer_start': [217]} |
9 | 56df7e625ca0a614008f9b41 | Plymouth | The mid-19th century burial ground at Ford Park Cemetery was reopened in 2007 by a successful trust and the City council operate two large early 20th century cemeteries at Weston Mill and Efford both with crematoria and chapels. There is also a privately owned cemetery on the outskirts of the city, Drake Memorial Park which does not allow headstones to mark graves, but a brass plaque set into the ground. | When did Ford Park Cemetery reopen? | {'text': ['2007'], 'answer_start': [73]} |
Preprocessing the data¶
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
The assertion below ensures that our tokenizer is a fast tokenizer (Rust implementation), which brings advantages in speed and functionality.
import transformers
# transformers.PreTrainedTokenizerFast is the faster (Rust-backed) tokenizer class
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
tokenizer("What is your name?", "My name is Sylvain.")
{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Advanced tokenizer operations¶
A specific issue for QA preprocessing is how to handle very long documents.
For other tasks we usually truncate documents that exceed the model's maximum sentence length, but here dropping part of the context could discard the very answer we are looking for.
To work around this, we allow one (long) example in the dataset to produce several input features, each shorter than the model's maximum length (or the hyperparameter we set).
# The maximum length of a feature (question and context)
max_length = 384
# The authorized overlap between two parts of the context when splitting is needed,
# i.e. the number of tokens shared between consecutive windows
doc_stride = 128
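The sliding-window idea behind doc_stride can be sketched framework-free (a toy model only; the real tokenizer additionally prepends the question and special tokens to every window):

```python
# Split a long sequence into chunks of at most `max_length` elements,
# where consecutive chunks overlap by `stride` elements -- the same role
# doc_stride plays when the tokenizer splits a long context.
def sliding_windows(seq, max_length, stride):
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + max_length])
        if start + max_length >= len(seq):
            break
        start += max_length - stride
    return windows

chunks = sliding_windows(list(range(10)), max_length=6, stride=2)
# chunks == [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

Note how elements 4 and 5 appear in both windows: that overlap is what prevents an answer sitting near a split boundary from being cut in half in every feature.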
Handling texts that exceed the maximum length¶
Find a training example whose tokenized length exceeds our max_length of 384 (note that 384 is the hyperparameter chosen above; the distilbert-base-uncased architecture itself accepts up to 512 positions):
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
# Pick out the sample that exceeds 384 tokens (our max_length)
example = datasets["train"][i]
len(tokenizer(example["question"], example["context"])["input_ids"])
396
Truncating the context and discarding the overflow¶
len(tokenizer(example["question"],
example["context"],
max_length=max_length,
truncation="only_second")["input_ids"])
384
Parameter explanations¶
- example["question"]
  - Meaning: the question text.
  - Note: example is a dict; the value under the "question" key is the user's question.
- example["context"]
  - Meaning: the context passage related to the question.
  - Note: the value under the "context" key is the paragraph that contains the answer.
- max_length
  - Meaning: the maximum length (in tokens) of the tokenizer output.
  - Note: inputs longer than max_length are truncated or split; this value is usually dictated by the model's input-length limit.
- truncation="only_second"
  - Meaning: how to truncate the text.
  - Note: "only_second" truncates only the second input (example["context"]) and leaves the first (example["question"]) untouched, keeping the question intact while the context is cut as needed.
- return_overflowing_tokens=True
  - Meaning: whether to return the overflowing part.
  - Note: when the input exceeds max_length, setting this to True makes the tokenizer return the overflow as extra samples. This is very useful for long texts, which can then be processed as several chunks.
- stride=doc_stride
  - Meaning: the overlap length when splitting the text.
  - Note: with return_overflowing_tokens=True, stride controls how much consecutive chunks overlap. For example, with doc_stride set to 50, each overflow chunk shares 50 tokens with the previous one, which helps avoid losing information at split boundaries.
tokenized_example = tokenizer(
example["question"],
example["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
stride=doc_stride
)
After truncating with this strategy, the tokenizer returns several input_ids lists.
[len(x) for x in tokenized_example["input_ids"]]
[384, 157]
Decoding the two input features shows the overlapping part:
for x in tokenized_example["input_ids"][0:2]:
    print(tokenizer.decode(x))
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notre dame, has achieved a 332 - 165 record. in 2009 they were invited to the nit, where they advanced to the semifinals but were beaten by penn state who went on and beat baylor in the championship. the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were [SEP] [CLS] how many wins does the notre dame men's basketball team have? [SEP] championship. 
the 2010 – 11 team concluded its regular season ranked number seven in the country, with a record of 25 – 5, brey's fifth straight 20 - win season, and a second - place finish in the big east. during the 2014 - 15 season, the team went 32 - 6 and won the acc conference tournament, later advancing to the elite 8, where the fighting irish lost on a missed buzzer - beater against then undefeated kentucky. led by nba draft picks jerian grant and pat connaughton, the fighting irish beat the eventual national champion duke blue devils twice during the season. the 32 wins were the most by the fighting irish team since 1908 - 09. [SEP]
return_offsets_mapping: mapping input_ids back to character positions in the original text¶
tokenized_example = tokenizer(
example["question"],
example["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
return_offsets_mapping=True,
stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])
[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374, 379), (379, 380), (381, 384), (385, 389), (390, 393), (394, 406), (407, 408), (409, 415), (416, 418)]
Using the mapping above, each token can be located in the original text:
"""
Each pair in offset_mapping is a (start_char, end_char) span, measured inside
the question for question tokens and inside the context for context tokens.
Distinguishing question tokens from context tokens:
this tokenizer (DistilBERT) does not return token_type_ids (see the output above),
so use the sequence_ids() method shown below: 0 marks the question, 1 marks the context.
"""
second_token_id = tokenized_example["input_ids"][0][5]  # id of the token at index 5
offsets = tokenized_example["offset_mapping"][0][5]
second_token_id, offsets
(1996, (19, 22))
print(tokenizer.convert_ids_to_tokens([second_token_id])[0], example["question"][offsets[0]:offsets[1]])
the the
example["question"]
"How many wins does the Notre Dame men's basketball team have?"
With the sequence_ids method of tokenized_example, we can easily tell which sequence each token came from:
- special tokens: returns None,
- regular tokens: returns the index of the sentence they belong to (numbered from 0).
With this, we can conveniently locate the answer's start and end tokens within an input feature.
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)
[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]
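Since the context tokens are exactly those whose sequence id is 1, the bounds of the context can be read off directly. A pure-Python sketch of the same scan used in the cell below, on a toy list:

```python
# Toy sequence_ids: None marks special tokens, 0 the question, 1 the context.
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]
token_start = sequence_ids.index(1)                              # first context token
token_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)  # last context token
# (token_start, token_end) == (5, 8)
```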
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect whether the answer is outside the span (if so, the feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last token if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")
23 26
Print to check that the start position was found correctly:
# Decode the answer from the context via the offset-mapping positions
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
# Print the ground-truth answer from the dataset (answers["text"])
print(answers["text"][0])
over 1, 600 over 1,600
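The recovery above relies only on the offset mapping: each token carries a (start_char, end_char) pair into the original string, so any token span maps back to a substring. A minimal sketch with hypothetical offsets:

```python
text = "over 1,600 wins"
offsets = [(0, 4), (5, 10), (11, 15)]  # hypothetical per-token character offsets
start_tok, end_tok = 0, 1              # a span covering the first two tokens
answer = text[offsets[start_tok][0] : offsets[end_tok][1]]
# answer == "over 1,600"
```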
Padding strategy¶
- Texts shorter than the maximum length are padded to full length.
- For models that pad on the left, swap the order of question and context.
pad_on_right = tokenizer.padding_side == "right"
Putting it all together¶
For impossible answers (the context is too long, so the answer lies in another feature), we set both the start and end positions to the CLS index.
If the allow_impossible_answers flag were False, we could instead simply drop such examples from the training set.
def prepare_train_features(examples):
    # Some questions have lots of whitespace on the left, which is not useful and can make
    # truncation of the context fail (the tokenized question takes a lot of space).
    # So we strip the left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflow using a stride.
    # When the context is long, one example may yield several features, each with a context
    # that overlaps a bit with the previous feature's context.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example may give us several features (if it has a long context), we need a map
    # from a feature to its originating example. This key provides that mapping.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mapping gives us a map from token to character position in the original context.
    # This helps us compute the start and end positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label these examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS special token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to this example (to know which part is context and which is question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answer is given, set cls_index as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start and end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, move token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
Advanced use of datasets.map¶
Use the datasets.map method to apply prepare_train_features to all of the training, validation and test data:
- batched: process the data in batches.
- remove_columns: since preprocessing changes the number of samples, the old columns must be removed when applying it.
- load_from_cache_file: whether to use the datasets library's automatic caching.
The datasets library implements an efficient caching mechanism for large-scale data and automatically detects whether the function passed to map has changed (in which case the cached data must not be reused). Passing load_from_cache_file=False to map forces the preprocessing to be reapplied.
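The cache-invalidation idea can be illustrated with a toy fingerprint (an analogy only, not the library's actual fingerprinting algorithm): deriving the cache key from the processing function's source means any change to the function produces a new key, so stale cached results are not reused.

```python
import hashlib

# Toy fingerprint: hash the source text of a processing function to get a cache key.
def fingerprint(src: str) -> str:
    return hashlib.sha256(src.encode()).hexdigest()[:8]

fp1 = fingerprint("def f(x): return x + 1")
fp2 = fingerprint("def f(x): return x + 2")
assert fp1 != fp2  # a changed function yields a new cache key
```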
tokenized_datasets = datasets.map(prepare_train_features,
batched=True,
remove_columns=datasets["train"].column_names)
Fine-tuning the model¶
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at /models/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training hyperparameters (TrainingArguments)¶
batch_size=128
model_dir = f"models/{model_checkpoint}-finetuned-squad"
args = TrainingArguments(
output_dir=model_dir,
evaluation_strategy = "epoch",
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=2,
weight_decay=0.01,
)
/root/anaconda3/envs/jupylab/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead warnings.warn(
Data Collator¶
The data collator assembles training samples into batches for the model to process during training. This tutorial uses the default default_data_collator.
from transformers import default_data_collator
data_collator = default_data_collator
Instantiating the Trainer¶
trainer = Trainer(
model,
args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
trainer.train()
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | 2.330000 | 1.393700 |
2 | 1.403000 | 1.305678 |
TrainOutput(global_step=1384, training_loss=1.7100247509906747, metrics={'train_runtime': 1470.2387, 'train_samples_per_second': 120.421, 'train_steps_per_second': 0.941, 'total_flos': 1.7348902540849152e+16, 'train_loss': 1.7100247509906747, 'epoch': 2.0})
Saving the model weights¶
trainer.save_model(model_dir)  # save_model returns None, so there is nothing useful to assign
Model evaluation¶
Evaluating the model's output requires some extra processing: mapping the model's predictions back to spans of the context.
The model directly outputs logits for the start position and end position of the predicted answer.
import torch
for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()
odict_keys(['loss', 'start_logits', 'end_logits'])
Shape of the logits:
output.start_logits.shape, output.end_logits.shape
(torch.Size([128, 384]), torch.Size([128, 384]))
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)
(tensor([ 46, 57, 78, 43, 118, 162, 72, 42, 162, 34, 73, 41, 80, 91, 156, 35, 83, 91, 80, 58, 37, 31, 42, 53, 41, 35, 42, 77, 11, 44, 27, 133, 66, 40, 87, 44, 43, 83, 127, 26, 28, 33, 87, 127, 95, 25, 43, 132, 42, 29, 44, 46, 24, 44, 65, 58, 81, 14, 59, 72, 25, 36, 57, 43, 77, 64, 60, 75, 84, 59, 74, 40, 47, 44, 61, 56, 38, 47, 61, 78, 16, 34, 55, 65, 76, 15, 31, 62, 62, 64, 12, 36, 67, 97, 9, 31, 56, 66, 10, 55, 62, 68, 29, 60, 65, 63, 17, 29, 26, 18, 15, 18, 16, 16, 38, 42, 21, 25, 25, 25, 27, 47, 57, 35, 42, 24, 25, 29], device='cuda:0'), tensor([ 47, 58, 81, 44, 171, 109, 75, 37, 109, 36, 76, 53, 83, 94, 158, 35, 83, 94, 83, 60, 80, 31, 43, 54, 53, 35, 43, 80, 13, 45, 28, 133, 66, 41, 89, 45, 44, 85, 127, 27, 30, 34, 89, 127, 97, 26, 44, 132, 43, 30, 45, 47, 25, 45, 65, 59, 81, 14, 60, 72, 25, 36, 58, 43, 77, 65, 60, 75, 84, 60, 70, 40, 47, 44, 62, 56, 38, 47, 62, 74, 16, 36, 56, 68, 80, 15, 33, 65, 65, 67, 12, 38, 70, 97, 9, 33, 57, 69, 10, 56, 65, 78, 31, 63, 68, 66, 19, 29, 26, 20, 16, 20, 18, 18, 38, 42, 21, 52, 39, 52, 41, 50, 60, 46, 44, 51, 25, 43], device='cuda:0'))
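Note that taking the argmax of the start and end logits independently, as above, does not guarantee a valid span: the best end can land before the best start. A toy NumPy sketch:

```python
import numpy as np

start_logits = np.array([0.1, 3.0, 0.2])  # toy logits
end_logits = np.array([2.5, 0.1, 0.3])
start, end = int(start_logits.argmax()), int(end_logits.argmax())
# start == 1, end == 0: end < start, so this is not a usable answer span,
# which is why the candidates must be searched in combination instead.
```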
Combining the logits¶
After checking that each combination is valid, we sort them by score and keep the best answers. An example of doing this on the first feature:
n_best_size = 20

import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start and end positions:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

valid_answers = []
# Iterate over combinations of start and end indices
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index:  # We still need a check that the answer lies inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "",  # We need a way to recover the original substring of the context
                }
            )
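The argsort slice used above, [-1 : -n_best_size - 1 : -1], selects the indices of the n largest logits in descending score order; on toy scores:

```python
import numpy as np

scores = np.array([0.1, 2.5, 1.7, 0.3])
n = 2
top = np.argsort(scores)[-1 : -n - 1 : -1].tolist()
# top == [1, 2]: positions of the two largest scores, best first
```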
def prepare_validation_features(examples):
    # Some questions have lots of whitespace on the left, which is not useful and can make
    # truncation of the context fail (the tokenized question takes a lot of space).
    # So we strip the left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, keeping the overflow via a stride.
    # A long-context example may thus yield several features, each with a context that slightly
    # overlaps the previous feature's context.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example may yield several features when the context is long, we need a map
    # from a feature to its originating example. This key serves that purpose.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the id of the example that produced each feature, and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to this example (to know which part is context and which is question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set offset mappings that are not part of the context to None, so it is easy to tell
        # whether a token position belongs to the context.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
Apply prepare_validation_features to the whole validation set:
validation_features = datasets["validation"].map(
prepare_validation_features,
batched=True,
remove_columns=datasets["validation"].column_names
)
Now we can grab the predictions for all features by using the Trainer.predict method:
raw_predictions = trainer.predict(validation_features)
The Trainer hides the columns the model does not use (here example_id and offset_mapping, which we need for postprocessing), so we need to set them back:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))
Now we can refine our earlier test:
Since offsets corresponding to the question were set to None in the offset mapping, it is easy to check whether an answer lies entirely inside the context. We can also exclude very long answers from consideration (a tunable hyperparameter).
The concrete implementation:
- First, take the start and end logits from the model output; they indicate where in the text the answer is likely to begin and end.
- Then use the offset mapping (offset_mapping) to map those logit positions back to concrete locations in the original text.
- Next, iterate over combinations of candidate start and end indices, excluding answers that fall outside the context or have an invalid length.
- For each valid answer, compute a score (the sum of the start and end logits) and store the answer together with its score.
- Finally, sort the answers by score and return the top-scoring ones.
max_answer_length = 30

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. In the general case we would need to
# match the example_id to an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start and end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers: either the indices are out of bounds, or they
        # correspond to input_ids that are not part of the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a negative length or longer than max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char:end_char],
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers
valid_answers
[{'score': 14.278795, 'text': 'Denver Broncos'}, {'score': 12.163998, 'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 11.60188, 'text': 'American Football Conference (AFC) champion Denver Broncos'}, {'score': 9.877912, 'text': 'Broncos'}, {'score': 9.813787, 'text': 'The American Football Conference (AFC) champion Denver Broncos'}, {'score': 9.765268, 'text': 'Carolina Panthers'}, {'score': 9.487082, 'text': 'American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 8.052693, 'text': 'Denver'}, {'score': 7.8227386, 'text': 'champion Denver Broncos'}, {'score': 7.763113, 'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 7.6989894, 'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 7.3739133, 'text': 'Denver Broncos defeated the National Football Conference'}, {'score': 6.8350153, 'text': 'National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos'}, {'score': 6.655664, 'text': 'National Football Conference (NFC) champion Carolina Panthers'}, {'score': 6.5654125, 'text': 'American Football Conference'}, {'score': 6.100212, 'text': 'Denver Broncos defeated the National Football Conference (NFC'}, {'score': 5.7320304, 'text': 'AFC) champion Denver Broncos'}, {'score': 5.70794, 'text': 'champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 5.600646, 'text': 'the National Football Conference (NFC) champion Carolina Panthers'}, {'score': 5.5450945, 'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.'}]
Print the model's outputs and compare them against the ground-truth answer:
datasets["validation"][0]["answers"]
{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
The model's highest-scoring output matches the ground truth.
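Each candidate's score above is simply the sum of its start logit and end logit, and the n-best candidates come from the top-k indices of each logit vector. A toy sketch of that ranking (the logits here are made up, not real model output):

```python
import numpy as np

# Hypothetical start/end logits over 6 token positions (not real model output).
start_logits = np.array([0.1, 5.0, 0.3, 0.2, 0.0, 0.1])
end_logits = np.array([0.2, 0.1, 4.0, 0.3, 0.1, 0.0])

n_best = 2
# Indices of the n_best highest start/end logits, best first.
start_idx = np.argsort(start_logits)[-1 : -n_best - 1 : -1]
end_idx = np.argsort(end_logits)[-1 : -n_best - 1 : -1]

candidates = [
    {"start": int(s), "end": int(e), "score": float(start_logits[s] + end_logits[e])}
    for s in start_idx
    for e in end_idx
    if e >= s  # a span must end at or after its start
]
best = max(candidates, key=lambda c: c["score"])
print(best)  # {'start': 1, 'end': 2, 'score': 9.0}
```

The real post-processing below adds two more filters on top of this: the span must map back into the context via `offset_mapping`, and it must be no longer than `max_answer_length` tokens.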
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
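This mapping is needed because long contexts are split into several overlapping features, so one example id can own multiple feature indices. A minimal sketch of the grouping, with toy ids rather than SQuAD data:

```python
import collections

# Toy data: two examples; the second was split into two overlapping features.
example_ids = ["ex-a", "ex-b"]
features = [{"example_id": "ex-a"}, {"example_id": "ex-b"}, {"example_id": "ex-b"}]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

print(dict(features_per_example))  # {0: [0], 1: [1, 2]}
```

When post-processing, all features of an example are scored together and the best span across them wins.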
from tqdm.auto import tqdm
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionary we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # These are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None  # Only used when squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Loop through all the features associated with the current example.
        for feature_index in feature_indices:
            # Grab the model's predictions for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This lets us map positions in the logits back to spans of text in the original context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update the minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all combinations of the `n_best_size` highest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip out-of-scope answers: indices beyond the input length,
                    # or positions that do not belong to the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Skip answers with negative length or longer than max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare case where we have no non-null prediction,
            # create a fake one to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Pick the final answer: the best span, or the null answer (squad_v2 only).
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
Apply the post-processing to the raw predictions:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
Post-processing 10570 example predictions split into 10784 features.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10570/10570 [00:17<00:00, 592.48it/s]
Load the evaluation metric with `datasets.load_metric` (SQuAD v2 when `squad_v2` is set, plain SQuAD otherwise):
from datasets import load_metric
metric = load_metric("squad_v2" if squad_v2 else "squad", trust_remote_code=True)
/tmp/ipykernel_554838/2330875496.py:3: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate metric = load_metric("squad_v2" if squad_v2 else "squad") Downloading builder script: 4.50kB [00:00, 9.65MB/s] Downloading extra modules: 3.29kB [00:00, 11.2MB/s]
Next, format the predictions and references, then call `metric.compute` to evaluate:
if squad_v2:
    formatted_predictions = [
        {"id": k, "prediction_text": v, "no_answer_probability": 0.0}
        for k, v in final_predictions.items()
    ]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
{'exact_match': 70.09460737937559, 'f1': 79.66499307708769}
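For intuition about what these two numbers mean: `exact_match` is the share of predictions that match an answer string exactly, and `f1` measures token overlap. A simplified re-implementation of the F1 part (whitespace tokenization only, without the lowercasing, punctuation, and article stripping the official SQuAD script performs):

```python
import collections

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1, as in the SQuAD metric but without text normalization."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = collections.Counter(pred_tokens) & collections.Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("Denver Broncos", "Denver Broncos"))           # 1.0
print(f1_score("champion Denver Broncos", "Denver Broncos"))  # 0.8
```

When an example has several reference answers, the metric takes the maximum score over them, which is why partially-correct spans still earn a higher `f1` than `exact_match`.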
Homework: load the locally saved model, evaluate it, and retrain for a higher F1 score¶
trained_model = AutoModelForQuestionAnswering.from_pretrained(model_dir)

trained_trainer = Trainer(
    trained_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)