Fork me on GitHub

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

Coding Poineer

LLM训练bug

LLM 编码:

tokenizer = AutoTokenizer.from_pretrained(modelpath)
text="你好"
tokenizer.tokenize(text)  # 直接编码
chat_text = tokenizer.apply_chat_template(text,
                          tokenize=False  # 是否直接编码或返回字符串
                          add_generation_prompt=False)
tokenizer.tokenize(chat_text )  # 编码对话板
tokenizer.decode([**toke id list**])  # 直接解码

来源:嵌套tensor的重定义
'''
tokenizer.apply_chat_template(conversation=example['messages'], tokenize=True, add_generation_prompt=True, return_tensors='pt')
or tokenizer(input, return_tensors="pt").input_ids

返回的都是tensor格式

...
return { 'input_ids': torch.LongTensor([input_ids]), }
'''

1.ValueError: only one element tensors can be converted to Python scalars
'''
a = []
b = torch.tensor([1,2], dtype=torch.float32)
a.append(b)
a.append(b)
torch.tensor(a)
'''

2.TypeError: only integer tensors of a single element can be converted to an index
'''
a = []
b = torch.tensor([1,2], dtype=torch.int32)
a.append(b)
a.append(b)
a = torch.tensor(a)
print(a)
torch.tensor(a)
'''

posted @   365/24/60  阅读(47)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)
点击右上角即可分享
微信分享提示