
Implementing Basic LLM Capabilities — InternLM (书生·浦语) Training Camp Notes 2 & Large Language Models 4

This post covers Lesson 2 of the second season of the training camp. I had originally planned to use these notes to supplement the official tutorial, but the tutorial turned out to be of quite high quality — following it step by step hits no pitfalls. So these notes instead dig into details of the code in the official demos.

In the title, "Large Language Models" refers to the blog series I write while studying LLMs; "InternLM Training Camp Notes" are the notes I take while attending the second season of the InternLM (书生·浦语) training camp.

Implementing chat in an LLM: the official InternLM2-Chat-1.8B demo as an example

Let's look at the official demo's code. It first imports the necessary libraries and stores the model path in a variable:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "/root/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b"

Next come AutoTokenizer and AutoModelForCausalLM. Anyone familiar with large models will recognize these Hugging Face classes: they automatically load a pretrained model and its matching tokenizer.
When loading the model, trust_remote_code=True allows running the custom modeling code shipped with the checkpoint (and pulling any missing files from Hugging Face), torch.bfloat16 loads the weights in half precision to save memory, and device_map='cuda:0' pins everything to the first GPU. model.eval() switches the model to evaluation mode (disabling dropout and the like); note that gradients are disabled separately, with torch.no_grad(), during inference.

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, device_map='cuda:0')
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='cuda:0')
model = model.eval()

Now for the core logic: set the system_prompt, read user input, and call model.stream_chat():

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
"""

messages = [(system_prompt, '')]

print("=============Welcome to InternLM chatbot, type 'exit' to exit.=============")

while True:
    input_text = input("\nUser  >>> ")
    input_text = input_text.replace(' ', '')  # strip spaces from the user input
    if input_text == "exit":  # type "exit" to quit
        break

    length = 0
    # Iterate over model.stream_chat, a generator that yields the partial
    # response accumulated so far plus a placeholder we ignore.
    for response, _ in model.stream_chat(tokenizer, input_text, messages):
        # Print only the part generated since the last iteration (starting at
        # offset `length`) and flush the output buffer.
        if response is not None:
            print(response[length:], flush=True, end="")
            # Remember how much has been printed so the next chunk starts there.
            length = len(response)

So the core of the chat capability really comes down to the call to model.stream_chat().
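For intuition, here is a minimal sketch of how this kind of token-by-token streaming can be built on top of plain transformers generation. This is not InternLM2's actual stream_chat implementation: the prompt template below is a simplified assumption, and the real method also applies InternLM2's chat template and threads the (query, response) history through every call.

from threading import Thread
from transformers import TextIteratorStreamer

def naive_stream_chat(model, tokenizer, query, max_new_tokens=512):
    # Assumed, simplified chat template; the real one uses InternLM2's special tokens.
    prompt = f"user\n{query}\nassistant\n"
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread while the streamer yields decoded chunks.
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)).start()
    for new_text in streamer:
        yield new_text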

The model's run result is shown as a screenshot in the original post (omitted here), as is the screenshot for the basic assignment.

Implementing visual question answering with a multimodal model

Implementing visual question answering is essentially the same as implementing chat; only the API changes, from model.stream_chat() to model.chat(). The code analysis follows.

First, initialize the model and tokenizer, with the same steps as before. This demo additionally chooses the load precision based on args.dtype. Half precision markedly cuts the resources needed for inference, which is practically essential here.

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True).eval()
if args.dtype == 'fp16':
    model.half().cuda()
elif args.dtype == 'fp32':
    model.cuda()
if args.num_gpus > 1:
    from accelerate import dispatch_model
    device_map = auto_configure_device_map(args.num_gpus)
    model = dispatch_model(model, device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True)

Then specify the input text and image, and call the model.chat method:

text = '<ImageHere>Please describe this image in detail.'
image = 'examples/image1.webp'
with torch.cuda.amp.autocast():
    with torch.no_grad():
        # This is really the key line: calling the wrapped `model.chat` method
        response, _ = model.chat(tokenizer, query=text, image=image, history=[], do_sample=False)
print(response)

Then you get the result.
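Incidentally, the second return value that the demo discards as _ appears to be the updated conversation history, so a multi-turn follow-up would presumably be wired like this (an untested sketch; the follow-up question is made up):

with torch.cuda.amp.autocast():
    with torch.no_grad():
        # First turn, as in the demo; `history` is assumed to carry the dialogue state.
        response, history = model.chat(tokenizer, query=text, image=image, history=[], do_sample=False)
        # Second turn reuses the history so the model keeps the image context.
        follow_up, history = model.chat(tokenizer, query='What colors dominate the image?',
                                        image=image, history=history, do_sample=False)
print(follow_up)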

Implementing interleaved text-and-image generation with a multimodal model

Generating an article that interleaves text with images is a fairly big job for today's models. Internally it is broken into many steps, roughly:

  1. Expand the short prompt into a long article
  2. Generate captions for the illustrations to be inserted
  3. Generate images from each caption
  4. Pick one image out of the four candidates generated
  5. Merge the text and the chosen images

Steps 1 through 5 all require model inference, so the demo carries real engineering weight by itself. It demonstrates a workflow for producing an interleaved text-and-image article with a single multimodal model.
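Before reading the code, a rough orchestration sketch of those five steps may help. Every helper name here is a hypothetical placeholder for illustration; none maps one-to-one onto the demo's real methods:

def generate_illustrated_article(ui, instruction):
    sections = ui.generate_article_text(instruction)            # step 1: short prompt -> long article
    locs = ui.pick_image_locations(sections)                    # step 2a: lines suited for illustrations
    captions = ui.caption_locations(sections, locs)             # step 2b: a caption for each spot
    for loc, cap in captions.items():
        candidates = text_to_image(cap, n=4)                    # step 3: caption -> 4 candidate images
        best = ui.select_best_image(sections, loc, candidates)  # step 4: the model picks one
        attach_image(sections, loc, best)                       # step 5: merge text and image
    return sections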

Part of the demo code is shown below; code unrelated to article generation has been removed for readability.

class ImageProcessor:
    """用于对图片进行预处理.包括resize和normalize."""
    def __init__(self, image_size=224):
        mean = (0.48145466, 0.4578275, 0.40821073)
        std = (0.26862954, 0.26130258, 0.27577711)
        self.normalize = transforms.Normalize(mean, std)

        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size),
                              interpolation=InterpolationMode.BICUBIC),
            transforms.ToTensor(),
            self.normalize,
        ])

    def __call__(self, item):
        if isinstance(item, str):
            item = Image.open(item).convert('RGB')
        return self.transform(item)

class Demo_UI:
    """用于生成文章的UI界面."""
    def __init__(self, code_path, num_gpus=1):
        self.code_path = code_path
        self.reset()

        tokenizer = AutoTokenizer.from_pretrained(code_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(code_path, device_map='cuda', trust_remote_code=True).half().eval()
        self.model.tokenizer = tokenizer
        self.model.vit.resize_pos()

        self.vis_processor = ImageProcessor()

        stop_words_ids = [92397]
        #stop_words_ids = [92542]
        self.stopping_criteria = get_stopping_criteria(stop_words_ids)
        set_random_seed(1234)
        self.r2 = re.compile(r'<Seg[0-9]*>')
        self.withmeta = False
        self.database = Database()


    def text2instruction(self, text):
        """
        将文本转换为instruction.如果withmeta为True,则添加meta信息.
        Args:
            text: 文本内容.
        Returns:
            instruction.如f"[UNUSED_TOKEN_146]user\n{text}[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
        """
        if self.withmeta:
            return f"[UNUSED_TOKEN_146]system\n{meta_instruction}[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]user\n{text}[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
        else:
            return f"[UNUSED_TOKEN_146]user\n{text}[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"

    def generate(self, text, random, beam, max_length, repetition):
        """生成文章."""
        with torch.no_grad():
            with torch.cuda.amp.autocast():  # use mixed precision
                input_ids = self.model.tokenizer(text, return_tensors="pt")['input_ids']  # tokenize the input text
                len_input_tokens = len(input_ids[0])  # get the length of the input tokens

                generate = self.model.generate(input_ids.cuda(),
                                                do_sample=random,
                                                num_beams=beam,
                                                temperature=1.,
                                                repetition_penalty=float(repetition),
                                                stopping_criteria=self.stopping_criteria,
                                                max_new_tokens=max_length,
                                                top_p=0.8,
                                                top_k=40,
                                                length_penalty=1.0)
        response = generate[0].tolist()
        response = response[len_input_tokens:]
        response = self.model.tokenizer.decode(response, skip_special_tokens=True)  # decode the response
        response = response.replace('[UNUSED_TOKEN_145]', '')  # remove the special tokens
        response = response.replace('[UNUSED_TOKEN_146]', '')  # remove the special tokens
        return response

    def generate_with_emb(self, emb, random, beam, max_length, repetition, im_mask=None):
        """Like generate, but starts from precomputed input embeddings (for interleaved image-text input)."""
        with torch.no_grad():
            with torch.cuda.amp.autocast():
                generate = self.model.generate(inputs_embeds=emb,
                                                do_sample=random,
                                                num_beams=beam,
                                                temperature=1.,
                                                repetition_penalty=float(repetition),
                                                stopping_criteria=self.stopping_criteria,
                                                max_new_tokens=max_length,
                                                top_p=0.8,
                                                top_k=40,
                                                length_penalty=1.0,
                                                im_mask=im_mask)
        response = generate[0].tolist()
        response = self.model.tokenizer.decode(response, skip_special_tokens=True)
        response = response.replace('[UNUSED_TOKEN_145]', '')
        response = response.replace('[UNUSED_TOKEN_146]', '')
        return response

    def extract_imgfeat(self, img_paths):
        """Extract image features with the model's vision encoder."""
        if len(img_paths) == 0:
            return None
        images = []
        for j in range(len(img_paths)):
            image = self.vis_processor(img_paths[j])  # preprocess the image with ImageProcessor
            images.append(image)
        images = torch.stack(images, dim=0)
        with torch.no_grad():
            with torch.cuda.amp.autocast():
                img_embeds = self.model.encode_img(images)  # extract image features; a built-in model method
        return img_embeds

    def generate_loc(self, text_sections, upimages, image_num):
        """Pick the lines where images should be inserted.
        Args:
            text_sections: the article text sections.
            upimages: user-uploaded images.
            image_num: number of images to place.
        Returns:
            The raw model answer and the parsed insertion positions.
        """
        full_txt = ''.join(text_sections)
        input_text = '<image> ' * len(upimages) + f'给定文章"{full_txt}" 根据上述文章,选择适合插入图像的{image_num}行'
        instruction = self.text2instruction(input_text) + '适合插入图像的行是'
        print(instruction)  # log the prompt

        if len(upimages) > 0:
            img_embeds = self.extract_imgfeat(upimages)
            input_embeds, im_mask, _ = self.interleav_wrap(instruction, img_embeds)  # interleave text and image embeddings
            output_text = self.generate_with_emb(input_embeds, True, 1, 200, 1.005, im_mask=im_mask)
        else:
            # without uploaded images, generate the insertion lines from text alone
            output_text = self.generate(instruction, True, 1, 200, 1.005)

        inject_text = '适合插入图像的行是' + output_text
        print(inject_text)

        locs = [int(m[4:-1]) for m in self.r2.findall(inject_text)]  # parse positions from the '<SegN>' tags
        print(locs)
        return inject_text, locs

    def generate_cap(self, text_sections, pos, progress):
        """Generate an image caption for each insertion position by prompting self.generate.
        Args:
            text_sections: the article text sections.
            pos: image insertion positions.
        Returns:
            Captions keyed by position.
        """
        pasts = ''
        caps = {}
        for idx, po in progress.tqdm(enumerate(pos), desc="image captioning"):  # iterate over insertion positions
            full_txt = ''.join(text_sections[:po + 2])
            if idx > 0:
                past = pasts[:-2] + '。'
            else:
                past = pasts

            #input_text = f' <|User|>: 给定文章"{full_txt}" {past}给出适合在<Seg{po}>后插入的图像对应的标题。' + ' \n<TOKENS_UNUSED_0> <|Bot|>: 标题是"'
            input_text = f'给定文章"{full_txt}" {past}给出适合在<Seg{po}>后插入的图像对应的标题。'
            instruction = self.text2instruction(input_text) + '标题是"'
            print(instruction)
            cap_text = self.generate(instruction, True, 1, 200, 1.005)  # generate the caption
            cap_text = cap_text.split('"')[0].strip()
            print(cap_text)
            caps[po] = cap_text  # po is the insertion position, cap_text the caption

            if idx == 0:
                pasts = f'现在<Seg{po}>后插入图像对应的标题是"{cap_text}", '
            else:
                pasts += f'<Seg{po}>后插入图像对应的标题是"{cap_text}", '

        print(caps)
        return caps

    def interleav_wrap(self, text, image, max_length=4096):
        """
        将文本和图像交织在一起.\n
        通过tokenizer将文本转换为tokens,然后获取tokens的embeddings.
        再将图像的embeddings和文本的embeddings拼接(torch.cat dim=1)在一起.
        """
        device = image.device
        im_len = image.shape[1]
        image_nums = len(image)
        parts = text.split('<image>')
        wrap_embeds, wrap_im_mask = [], []
        temp_len = 0
        need_bos = True

        for idx, part in enumerate(parts):
            if len(part) > 0:
                # tokenize the text
                part_tokens = self.model.tokenizer(part,
                                                    return_tensors='pt',
                                                    padding='longest',
                                                    add_special_tokens=need_bos).to(device)
                if need_bos:
                    need_bos = False
                # get the embeddings of the tokens
                part_embeds = self.model.model.tok_embeddings(part_tokens.input_ids)
                wrap_embeds.append(part_embeds)
                wrap_im_mask.append(torch.zeros(part_embeds.shape[:2]))
                temp_len += part_embeds.shape[1]
            if idx < image_nums:
                wrap_embeds.append(image[idx].unsqueeze(0))
                wrap_im_mask.append(torch.ones(1, image[idx].shape[0]))
                temp_len += im_len

            if temp_len > max_length:
                break

        wrap_embeds = torch.cat(wrap_embeds, dim=1)
        wrap_im_mask = torch.cat(wrap_im_mask, dim=1)
        wrap_embeds = wrap_embeds[:, :max_length].to(device)
        wrap_im_mask = wrap_im_mask[:, :max_length].to(device).bool()
        return wrap_embeds, wrap_im_mask, temp_len

    def model_select_image(self, output_text, locs, images_paths, progress):
        """让模型自己选择图片.通过使用self.model.generate方法生成图片标题."""
        print('model_select_image')
        pre_text = ''
        pre_img = []
        pre_text_list = []
        ans2idx = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
        selected = {k: 0 for k in locs}
        for i, text in enumerate(output_text):
            pre_text += text + '\n'
            if i in locs:
                images = copy.deepcopy(pre_img)
                for j in range(len(images_paths[i])):
                    image = self.vis_processor(images_paths[i][j])
                    images.append(image)
                images = torch.stack(images, dim=0)

                pre_text_list.append(pre_text)
                pre_text = ''

                images = images.cuda()
                text = '根据给定上下文和候选图像,选择合适的配图:' + '<image>'.join(pre_text_list) + '候选图像包括: ' + '\n'.join([chr(ord('A') + j) + '.<image>' for j in range(len(images_paths[i]))])
                input_text = self.text2instruction(text) + '最合适的图是'
                print(input_text)
                with torch.no_grad():
                    with torch.cuda.amp.autocast():
                        img_embeds = self.model.encode_img(images)
                        input_embeds, im_mask, len_input_tokens = self.interleav_wrap(input_text, img_embeds)

                with torch.no_grad():
                    outputs = self.model.generate(
                                            inputs_embeds=input_embeds,
                                            do_sample=True,
                                            temperature=1.,
                                            max_new_tokens=10,
                                            repetition_penalty=1.005,
                                            top_p=0.8,
                                            top_k=40,
                                            length_penalty=1.0,
                                            im_mask=im_mask
                                            )
                response = outputs[0][2:].tolist()   #<s>: C
                #print(response)
                out_text = self.model.tokenizer.decode(response, add_special_tokens=True)
                print(out_text)

                try:
                    # Rather crude: take only the first character of the output;
                    # if it is not a valid option letter, fall back to the first image.
                    answer = out_text.lstrip()[0]  # get the first character
                    pre_img.append(images[len(pre_img) + ans2idx[answer]].cpu())
                except:
                    print('Select fail, use first image')
                    answer = 'A'
                    pre_img.append(images[len(pre_img) + ans2idx[answer]].cpu())
                selected[i] = ans2idx[answer]
        return selected

    def model_select_imagebase(self, output_text, locs, imagebase, progress):
        """让模型自己选择图片.通过使用self.model.generate方法生成图片标题.和另一个方法没啥区别"""
        print('model_select_imagebase')
        pre_text = ''
        pre_img = []
        pre_text_list = []
        selected = []

        images = []
        for j in range(len(imagebase)):
            image = self.vis_processor(imagebase[j])
            images.append(image)
        images = torch.stack(images, dim=0).cuda()
        with torch.no_grad():
            with torch.cuda.amp.autocast():
                img_embeds = self.model.encode_img(images)

        for i, text in enumerate(output_text):
            pre_text += text + '\n'
            if i in locs:
                pre_text_list.append(pre_text)
                pre_text = ''
                print(img_embeds.shape)
                cand_embeds = torch.stack([item for j, item in enumerate(img_embeds) if j not in selected], dim=0)
                ans2idx = {}
                count = 0
                for j in range(len(img_embeds)):
                    if j not in selected:
                        ans2idx[chr(ord('A') + count)] = j
                        count += 1

                if cand_embeds.shape[0] > 1:
                    text = '根据给定上下文和候选图像,选择合适的配图:' + '<image>'.join(pre_text_list) + '候选图像包括: ' + '\n'.join([chr(ord('A') + j) + '.<image>' for j in range(len(cand_embeds))])
                    input_text = self.text2instruction(text) + '最合适的图是'
                    print(input_text)

                    all_img = cand_embeds if len(pre_img) == 0 else torch.cat(pre_img + [cand_embeds], dim=0)
                    input_embeds, im_mask, len_input_tokens = self.interleav_wrap(input_text, all_img)

                    with torch.no_grad():
                        outputs = self.model.generate(
                                                inputs_embeds=input_embeds,
                                                do_sample=True,
                                                temperature=1.,
                                                max_new_tokens=10,
                                                repetition_penalty=1.005,
                                                top_p=0.8,
                                                top_k=40,
                                                length_penalty=1.0,
                                                im_mask=im_mask
                                                )
                    response = outputs[0][2:].tolist()   #<s>: C
                    #print(response)
                    out_text = self.model.tokenizer.decode(response, add_special_tokens=True)
                    print(out_text)

                    try:
                        answer = out_text.lstrip()[0]
                    except:
                        print('Select fail, use first image')
                        answer = 'A'
                else:
                    answer = 'A'

                pre_img.append(img_embeds[ans2idx[answer]].unsqueeze(0))
                selected.append(ans2idx[answer])
        selected = {loc: j for loc, j in zip(locs, selected)}
        print(selected)
        return selected

    def show_article(self, show_cap=False):
        """展示文章.主要是操作UI组件."""
        md_shows = []
        imgs_show = []
        edit_bts = []
        for i in range(len(self.texts_imgs)):
            text, img = self.texts_imgs[i]
            md_shows.append(gr.Markdown(visible=True, value=text))
            edit_bts.append(gr.Button(visible=True, interactive=True, ))
            imgs_show.append(gr.Image(visible=False) if img is None else gr.Image(visible=True, value=img.paths[img.pts]))

        print(f'show {len(md_shows)} text sections')
        for _ in range(max_section - len(self.texts_imgs)):
            md_shows.append(gr.Markdown(visible=False, value=''))
            edit_bts.append(gr.Button(visible=False))
            imgs_show.append(gr.Image(visible=False))

        return md_shows + edit_bts + imgs_show

    def generate_article(self, instruction, upimages, beam, repetition, max_length, random, seed):
        """生成文章."""
        self.reset()
        set_random_seed(int(seed))
        self.hash_folder = hashlib.sha256(instruction.encode()).hexdigest()
        self.instruction = instruction
        if upimages is None:
            upimages = []
        else:
            upimages = [t.image.path for t in upimages.root]
        img_instruction = '<image> ' * len(upimages)
        instruction = img_instruction.strip() + instruction  # add the image instruction
        text = self.text2instruction(instruction)  # convert the text to instruction
        print('random generate:{}'.format(random))
        if article_stream_output:
            if len(upimages) == 0:
                input_ids = self.model.tokenizer(text, return_tensors="pt")['input_ids']
                input_embeds = self.model.model.tok_embeddings(input_ids.cuda())  # get the embeddings of the tokens
                im_mask = None
            else:
                images = []
                for j in range(len(upimages)):
                    image = self.vis_processor(upimages[j])  # preprocess the image with ImageProcessor
                    images.append(image)
                images = torch.stack(images, dim=0)
                with torch.no_grad():
                    with torch.cuda.amp.autocast():
                        img_embeds = self.model.encode_img(images)  # extract image features; a built-in model method

                text = self.text2instruction(instruction)  # convert the text to instruction

                input_embeds, im_mask, len_input_tokens = self.interleav_wrap(text, img_embeds)  # interleave text and image embeddings

            print(text)
            generate_params = dict(
                inputs_embeds=input_embeds,
                do_sample=random,
                stopping_criteria=self.stopping_criteria,
                repetition_penalty=float(repetition),
                max_new_tokens=max_length,
                top_p=0.8,
                top_k=40,
                length_penalty=1.0,
                im_mask=im_mask,
            )
            output_text = "▌"
            with self.generate_with_streaming(**generate_params) as generator:
                # From here on it's all UI-component plumbing; the review stops here.
                for output in generator:
                    decoded_output = self.model.tokenizer.decode(output[1:])
                    if output[-1] in [self.model.tokenizer.eos_token_id, 92542]:
                        break
                    output_text = decoded_output.replace('\n', '\n\n') + "▌"
                    yield (output_text,) + (gr.Markdown(visible=False),) * (max_section - 1) + (
                            gr.Button(visible=False),) * max_section + (gr.Image(visible=False),) * max_section
                    time.sleep(0.01)
            output_text = output_text[:-1]
            yield (output_text,) + (gr.Markdown(visible=False),) * (max_section - 1) + (
                            gr.Button(visible=False),) * max_section + (gr.Image(visible=False),) * max_section
        else:
            output_text = self.generate(text, random, beam, max_length, repetition)

        output_text = re.sub(r'(\n\s*)+', '\n', output_text.strip())
        print(output_text)

        output_text = output_text.split('\n')[:max_section]

        self.texts_imgs = [[t, None] for t in output_text]
        self.database.addtitle(text, self.hash_folder, params={'beam':beam, 'repetition':repetition, 'max_length':max_length, 'random':random, 'seed':seed})

        if article_stream_output:
            yield self.show_article()
        else:
            return self.show_article()

In actual use, you invoke the article-generation function through the UI and get the article you want.

Hands-on

For the hands-on part, see the section "InternLM2-Chat-1.8B 智能对话(使用 InternLM2-Chat-1.8B 模型生成 300 字的小故事)" of the blog post 书生浦语大模型实战营第二期第二节作业, which is written in great detail. That section walks through three steps: setting up the environment, downloading the model, and running inference — and these three steps are exactly how you try out any LLM demo today. Although the tutorial is based on InternLM2, the steps are identical for Qwen (通义千问) or ChatGLM; only minor details differ.
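As a hedged summary of those three steps, the skeleton looks roughly the same whatever the model family. In this sketch the model ID is only an example, and model.chat is the InternLM-specific remote-code API; other families expose slightly different chat helpers:

# Step 1, environment (shell): pip install torch transformers huggingface_hub
# Step 2: download the weights (the model ID is an example)
from huggingface_hub import snapshot_download
local_dir = snapshot_download('internlm/internlm2-chat-1_8b')

# Step 3: load the model and run inference, the same pattern as the demo above
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(local_dir, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16, device_map='cuda:0').eval()
response, _ = model.chat(tokenizer, 'Tell me a 300-word story.', history=[])
print(response)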

posted @ 2024-03-30 23:03 vanilla阿草