LLM训练bug
LLM 编码:
tokenizer = AutoTokenizer.from_pretrained(modelpath)
text="你好"
tokenizer.tokenize(text) # 直接编码
chat_text = tokenizer.apply_chat_template(text,
tokenize=False # 是否直接编码或返回字符串
add_generation_prompt=False)
tokenizer.tokenize(chat_text ) # 编码对话板
tokenizer.decode([**toke id list**]) # 直接解码
来源:嵌套tensor的重定义
'''
tokenizer.apply_chat_template(conversation=example['messages'], tokenize=True, add_generation_prompt=True, return_tensors='pt')
or tokenizer(input, return_tensors="pt").input_ids
返回的都是tensor格式
...
return { 'input_ids': torch.LongTensor([input_ids]), }
'''
1.ValueError: only one element tensors can be converted to Python scalars
'''
a = []
b = torch.tensor([1,2], dtype=torch.float32)
a.append(b)
a.append(b)
torch.tensor(a)
'''
2.TypeError: only integer tensors of a single element can be converted to an index
'''
a = []
b = torch.tensor([1,2], dtype=torch.int32)
a.append(b)
a.append(b)
a = torch.tensor(a)
print(a)
torch.tensor(a)
'''