Datawhale AI Summer Camp - Tianchi Better Synth Multimodal LLM Data Synthesis Challenge - Task 3: Continuing to Climb the Leaderboard (ongoing)
In the era of big data and large models, internet data is gradually being exhausted and requires heavy processing and annotation, so efficiently synthesizing high-quality data for training new models has become an emerging problem. The "Tianchi Better Synth - Multimodal LLM Data Synthesis Challenge" was created to study how synthetic data affects multimodal LLM training and to explore efficient synthesis methods and strategies, driving innovation in data synthesis for multimodal LLMs. The competition focuses on image understanding: given a seed dataset and a compute budget, participants must generate better data through efficient methods to train the model. The competition uses the Data-Juicer system to support participants, and NVIDIA's related open-source libraries let contestants explore synthesizing large amounts of high-quality data efficiently. "Better Synth" is the fourth event in this series, offering a stage for practitioners and advancing open sharing around multimodal LLMs.
Tianchi Better Synth Multimodal LLM Data Synthesis Challenge
Work in progress; teaching assistants and fellow participants, stay tuned...
Task 2 Recap
A summary of the current baseline plus the details explored so far:
- Data: the 10k seed dataset
../input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
../input/pretrain_stage_1_10k/stage_1.json
- Device & environment: a single A10 GPU on Alibaba Cloud; the environment is pulled directly as a Docker image
# Pull the image -- from Docker Hub
docker pull datajuicer/dj-competition:better-synth-v0.2
# Run inside the toolkit directory
docker run --privileged --shm-size 256g --network host --gpus all -v $(pwd):$(pwd) -w $(pwd) -it datajuicer/dj-competition:better-synth-v0.2
# If your machine cannot reach Docker Hub, pull from the public Alibaba Cloud registry instead
# Pull the image -- from the Alibaba Cloud registry
docker pull registry.cn-shanghai.aliyuncs.com/pai-ai-test/pai-eas:data-juicer-better-synth-v0.2
# Run inside the toolkit directory
docker run --privileged --shm-size 256g --network host --gpus all -v $(pwd):$(pwd) -w $(pwd) -it registry.cn-shanghai.aliyuncs.com/pai-ai-test/pai-eas:data-juicer-better-synth-v0.2
- Data synthesis and processing strategy: use dj-process with the BLIP-2 model to generate image captions;
dataset_path: input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
export_path: output/image_captioning_output/res_10k.jsonl
np: 1
process:
  - image_captioning_mapper:
      hf_img2seq: '/mnt/workspaces/better_synth_challenge_baseline/models/goldsj/blip2-opt-2___7b'  # You can replace this path with a locally downloaded HF model
      keep_original_sample: false  # we only need the recaptioned captions
Targeted Optimizations
Now let's make some targeted optimizations.
Scaling up the data
- Download the full dataset
Only the 10k subset has been downloaded so far, not the full data. To get the full data, run the script below (from the solutionxxx/ directory; copy and paste it into a terminal):
echo "Downloading full seed datasets..."
cd /root/autodl-tmp/better_synth_baseline_autoDL/input
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/pretrain_stage_1.tar.gz
tar zxvf pretrain_stage_1.tar.gz && rm -rf pretrain_stage_1.tar.gz
cd pretrain_stage_1
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/mgm_pretrain_stage_1.jsonl
axel -n 5 http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/dj-competition/better_synth/data/stage_1/stage_1.json
echo "Done downloading full seed datasets..."
Once the download finishes, the dataset should be roughly 40k samples.
Hardware Upgrade and Environment Setup
Hardware upgrade
The runtime environment is upgraded to 8x V100 (32 GB) GPUs.
Environment setup
cd better_synth_challenge_baseline
bash install.sh
After installing the environment above and starting a trial run, the following problem appeared:
Searching for the error shows that it is mainly caused by the following:
flash attention is an optional component for accelerating model training and inference, and it only works on NVIDIA GPUs with the Turing, Ampere, Ada, or Hopper architectures (e.g. H100, A100, RTX 3090, T4, RTX 2080); you can still run model inference normally without installing flash attention.
For this error, two ways of handling it come to mind:
- 1. Downgrade transformers to transformers==4.31.0, i.e. the version before the flash-attention integration was merged into transformers.
- 2. Modify the transformers library directly via an editable source install. The benefit is that we can drop the flash-attention installation entirely and remove the dependency on it; the possible costs are slower inference and higher GPU memory usage (with the 8x V100 setup and the larger number of worker processes it allows, the overall impact should be small in theory).
After trying both, option 2 was chosen, because with option 1 transformers still had to be upgraded again when installing MGM.
So the transformers v4.38.0 source was downloaded from GitHub and modified.
The file to modify is: transformers-4.38.0/src/transformers/modeling_utils.py
Locate the function _autoset_attn_implementation.
The main change is to replace the original code:
if use_flash_attention_2:
    logger.warning_once(
        'The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.'
    )
    config._attn_implementation = "flash_attention_2"

if config._attn_implementation == "flash_attention_2":
    cls._check_and_enable_flash_attn_2(
        config,
        torch_dtype=torch_dtype,
        device_map=device_map,
        hard_check_only=False,
        check_device_map=check_device_map,
    )
elif requested_attn_implementation in [None, "sdpa"]:
    # use_flash_attention_2 takes priority over SDPA, hence SDPA treated in this elif.
    config = cls._check_and_enable_sdpa(
        config,
        hard_check_only=False if requested_attn_implementation is None else True,
    )
else:
    config._attn_implementation = "eager"
with:
if use_flash_attention_2:
    logger.warning_once(
        'The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.'
    )
    config._attn_implementation = "flash_attention_2"

# if config._attn_implementation == "flash_attention_2":
#     cls._check_and_enable_flash_attn_2(
#         config,
#         torch_dtype=torch_dtype,
#         device_map=device_map,
#         hard_check_only=False,
#         check_device_map=check_device_map,
#     )

if requested_attn_implementation in [None, "sdpa"]:
    # elif requested_attn_implementation in [None, "sdpa"]:
    # use_flash_attention_2 takes priority over SDPA, hence SDPA treated in this elif.
    config = cls._check_and_enable_sdpa(
        config,
        hard_check_only=False if requested_attn_implementation is None else True,
    )
else:
    config._attn_implementation = "eager"
Then install it as follows:
cd transformers-4.38.0
pip install -e .
Because the source itself was modified, this editable install will override any existing transformers installation, and flash-attention no longer needs to be installed at all. The recommended installation steps and order are therefore:
- 1. Install data-juicer from source first
cd ../better_synth_challenge_baseline/data-juicer
pip install -v -e .
This step takes a while to run.
- 2. Install transformers-4.38.0 from source
cd ../better_synth_challenge_baseline/transformers-4.38.0
pip install -e .
- 3. Install MGM
cd MGM
pip install -e .
- Other dependencies
pip install simhash-pybind
pip install fire
pip install jsonlines
- Extra (optional)
To make it easier to tune parameters and monitor GPU usage, you can install the following tool:
pip install gpustat
Then use the command below to watch GPU usage in real time (this helps judge and adjust the batch size):
watch --color -n 1 gpustat -cpu
It looks like this:
At this point, all the dependencies on the new machine should be installed.
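As a quick, optional sanity check that the patched transformers is the one actually being used, you can run the minimal sketch below (nothing here is required by the baseline):
# Verify the editable install points at the patched 4.38.0 source tree and that
# the missing flash-attention is not a problem (the patched code falls back to SDPA).
import transformers
from transformers.utils import is_flash_attn_2_available

print(transformers.__version__)      # expect 4.38.0
print(transformers.__file__)         # should point into the local transformers-4.38.0 checkout
print(is_flash_attn_2_available())   # False is fine here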
- Other pitfalls
One issue that may still show up: if the current image runs a Python 3.8 kernel, pdfplumber needs to be downgraded:
pip install pdfplumber==0.11.1
And if an OpenSSL error appears, just downgrade urllib3:
pip install urllib3==1.26.7
Data Processing and Synthesis
Assuming all of the environment setup above has completed successfully, let's get back to the core part: data processing.
Method 1: BLIP-2 optimization
dataset_path: input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
export_path: output/image_captioning_output/res_10k.jsonl
np: 8
process:
  - image_captioning_mapper:
      hf_img2seq: '/mnt/workspaces/better_synth_challenge_baseline/models/goldsj/blip2-opt-2___7b'  # You can replace this path with a locally downloaded HF model
      caption_num: 4  # how many candidate captions to generate for each image
      keep_candidate_mode: 'similar_one_simhash'  # strategy for keeping the $caption_num$ generated candidates; one of ["random_any", "similar_one_simhash", "all"]
      keep_original_sample: true  # whether to keep the original sample; if False, only the generated captions remain in the final dataset and the original captions are removed. Default: True
      prompt: null  # a global string prompt to guide blip2 generation for all samples; default None, i.e. no prompt
      prompt_key: null  # key of the field in samples that stores a per-sample prompt, to set different prompts for different samples; if None, the `prompt` argument is used. Default: None
      # mem_required: '32GB'  # this op uses a deep neural network model that consumes a lot of memory, so the system's available memory may limit the maximum number of processes that can be launched
Since 8 GPUs are available, np is set to 8. The original BLIP-2 model is kept unchanged; the main changes are caption_num (default 1, here 4) and keep_candidate_mode set to similar_one_simhash. The prompt is left unchanged for now; setting it to something like "a photo of xxx" might be worth trying, but has not been tested yet.
The run output is shown below:
As you can see, with num_proc set to 8 the processing speed also increases noticeably.
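As an aside, the similar_one_simhash strategy keeps, among the caption_num candidates, the caption closest to the original one. data-juicer computes this with simhash; the sketch below only illustrates the selection idea with a plain character n-gram Jaccard similarity and made-up captions, so it is not the library's actual implementation.
# Illustration only: keep the candidate caption most similar to the original caption.
# data-juicer's similar_one_simhash uses simhash distance; a simple character n-gram
# Jaccard similarity is used here just to show the selection logic.
def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def pick_most_similar(original, candidates):
    return max(candidates, key=lambda c: jaccard(original, c))

original_caption = "a dog playing in the park"   # made-up original caption
candidates = [                                   # made-up blip2 candidates (caption_num = 4)
    "a cat sleeping on a couch",
    "a dog running on the grass in a park",
    "a city skyline at night",
    "two dogs playing with a ball",
]
print(pick_most_similar(original_caption, candidates))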
Method 2: replacing the caption model with OFA
OFA (One-For-All) is a general-purpose multimodal pretrained model that uses a simple sequence-to-sequence learning framework to unify modalities (cross-modal, vision, language, etc.) and tasks (image generation, visual grounding, image captioning, image classification, text generation, etc.). See the ICML 2022 paper OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework and the official GitHub repository https://github.com/OFA-Sys/OFA.
OFA handles many tasks with a single unified architecture, including visual grounding, image captioning, image-text matching, visual question answering, object detection, image infilling, and text infilling. Below is a brief analysis of the model's architecture and how it works, especially as applied in its pretraining tasks:
- Architecture
- Unified model architecture:
- OFA uses one architecture to handle different kinds of tasks. The input can be an image, text, or a combination of both, while the output varies by task.
- Multi-task pretraining:
- During pretraining the model is trained on multiple tasks, including visual grounding, image captioning, image-text matching, visual question answering, object detection, image infilling, and text infilling. These tasks help the model learn rich multimodal representations.
- How it works
- Input processing:
- The model takes images and/or text as input. For example, image captioning takes an image, while visual question answering takes an image plus a question.
- Task processing:
- Based on the input and task type, the model uses its internal multimodal representations and attention mechanisms to produce the corresponding output. For example, in visual question answering it generates an answer; in image captioning it generates text describing the image.
- Output generation:
- The output format depends on the task: image captioning produces a text description, while object detection produces bounding boxes and class labels.
- Create a download_ofa.py with the following content and run it directly:
from modelscope import snapshot_download
model_dir = snapshot_download('iic/ofa_image-caption_coco_large_en',
cache_dir='/mnt/workspace/better_synth_challenge_baseline/models',
revision='master')
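Optionally, you can smoke-test the downloaded model locally with the ModelScope pipeline API before wiring it into data-juicer. The sketch below assumes the standard image-captioning pipeline works for this checkpoint; the image path is a placeholder and the exact input/output format may vary slightly across modelscope versions.
# Quick local check of the OFA caption model via the ModelScope pipeline API.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

model_dir = '/mnt/workspace/better_synth_challenge_baseline/models/iic/ofa_image-caption_coco_large_en'
img_captioning = pipeline(Tasks.image_captioning, model=model_dir)

result = img_captioning({'image': 'input/pretrain_stage_1_10k/images/example.jpg'})  # hypothetical image path
print(result)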
The corresponding yaml in the solution is changed to:
dataset_path: input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
export_path: output/image_captioning_output/res_10k.jsonl
np: 8
process:
  - image_captioning_mapper:
      hf_img2seq: '/mnt/workspace/better_synth_challenge_baseline/models/iic/ofa_image-caption_coco_large_en'  # You can replace this path with a locally downloaded HF model
      keep_original_sample: false  # we only need the recaptioned captions
Method 3: replacing the caption model with mPLUG
This is the mPLUG model fine-tuned on the English MS COCO Caption dataset for the image captioning downstream task. mPLUG is a multimodal foundation model that unifies understanding and generation, built on an efficient cross-modal fusion framework based on skip-connections. When the paper was published, mPLUG achieved SOTA on MS COCO Caption; see: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections.
- Architecture
- Visual Encoder:
- Processes image data and converts images into visual feature representations (v).
- Text Encoder:
- Processes text data and converts text into text feature representations (l).
- Connected Attention:
- Contains a feed-forward network (FFN) and self-attention (Self-Attn), used for the initial fusion of visual and text features.
- Asymmetric Co-Attn:
- Contains an FFN, cross-attention (Cross-Attn), and self-attention (Self-Attn), used for deeper fusion of visual and text features.
- Prefix LM:
- Contains an FFN, cross-attention (Cross-Attn), and causal self-attention (Causal Self-Attn), used for generation tasks.
- How it works
- Input processing:
- Images are converted into visual features (v) by the visual encoder.
- Text is converted into text features (l) by the text encoder.
- Feature fusion:
- Visual and text features are first fused by the connected attention module.
- The fused features then pass through the asymmetric co-attention module for deeper fusion, further capturing relations between vision and text.
- Task processing:
- The model can handle several tasks, such as masked language modeling (MaskLM), image-text matching (ITM), and image-text contrastive learning (ITC).
- MaskLM: parts of the input text are masked and the model predicts the masked tokens.
- ITM: decide whether an image and a text match.
- ITC: contrastive learning to better align image and text features.
- Generation tasks:
- The Prefix LM is used for generation tasks such as text generation; it combines cross-attention and causal self-attention to generate text conditioned on the input.
- Summary
The model processes image and text data with a visual encoder and a text encoder, and fuses the features through multi-level attention (connected attention and asymmetric co-attention). It can then handle tasks such as masked language modeling, image-text matching, and image-text contrastive learning, and also supports generation. Through this multimodal fusion and multi-task setup, the model performs well on joint image-text tasks.
- Create a download_mplug.py with the following content and run it directly:
from modelscope import snapshot_download
model_dir = snapshot_download('iic/mplug_image-captioning_coco_base_en',
cache_dir='/mnt/workspace/better_synth_challenge_baseline/models',
revision='master')
The corresponding yaml in the solution is changed to:
dataset_path: input/pretrain_stage_1_10k/mgm_pretrain_stage_1_10k.jsonl
export_path: output/image_captioning_output/res_10k.jsonl
np: 8
process:
  - image_captioning_mapper:
      hf_img2seq: '/mnt/workspace/better_synth_challenge_baseline/models/iic/mplug_image-captioning_coco_base_en'  # You can replace this path with a locally downloaded HF model
      keep_original_sample: false  # we only need the recaptioned captions
Method 4: replacing the caption model with GPT-4V
Use the image_captioning_from_gpt4v_mapper op, which calls the GPT-4V API to generate captions.
In theory the caption quality should be higher; this is currently being tried and is not finished yet.
- image_captioning_from_gpt4v_mapper:  # generate samples whose texts are produced by gpt-4-vision from the images
    mode: 'description'  # mode of the text generated from images; one of ['reasoning', 'description', 'conversation', 'custom']
    api_key: ''  # the API key used to authenticate the request
    max_token: 500  # the maximum number of tokens to generate. Default: 500
    temperature: 1.0  # controls the randomness of the output (range 0 to 1). Default: 0
    system_prompt: ''  # a string prompt that sets the conversation context and provides global guidance or rules for gpt4-vision so it responds in the expected way. Used when `mode` is set to `custom`
    user_prompt: ''  # a string prompt to guide gpt4-vision generation for each sample. Default "", i.e. no prompt provided
    user_prompt_key: null  # key of the field in samples that stores a per-sample prompt, to set different prompts for different samples; if None, the `prompt` argument is used. Default: None
    keep_original_sample: true  # whether to keep the original sample; if False, only the generated text remains in the final dataset and the original text is removed. Default: True
    any_or_all: 'any'  # keep this sample with the 'any' or 'all' strategy over all images. 'any': keep the sample if any image meets the condition. 'all': keep the sample only if all images meet the condition
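For reference, the sketch below shows roughly what such an op has to do: send the image plus a prompt to a GPT-4V-style chat completions endpoint and take the reply as the caption. It is a standalone illustration under the openai>=1.0 SDK, not data-juicer's implementation; the model name, prompt, and image path are placeholders.
# Rough standalone sketch: caption one image via a GPT-4V-style chat completions API.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # assumption: the key is supplied here or via env var

def caption_image(image_path, prompt="Describe this image in one detailed sentence."):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",     # any vision-capable chat model
        max_tokens=500,
        temperature=1.0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(caption_image("input/pretrain_stage_1_10k/images/example.jpg"))  # hypothetical path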
Method 5: generating images with a diffusion model
- Create a download_sd_v1_4.py with the following content and run it directly:
from modelscope import snapshot_download
model_dir = snapshot_download('AI-ModelScope/stable-diffusion-v1-4',
cache_dir='/mnt/workspace/better_synth_challenge_baseline/models',
revision='master')
The corresponding yaml in the solution is changed to:
- image_diffusion_mapper:  # generate images with a diffusion model
    hf_diffusion: '/mnt/workspace/better_synth_challenge_baseline/models/AI-ModelScope/stable-diffusion-v1-4'  # the stable diffusion model on huggingface used to generate images (here a local path)
    torch_dtype: 'fp32'  # the floating point type used to load the diffusion model; one of ['fp32', 'fp16', 'bf16']
    revision: 'main'  # the specific model version to use; can be a branch name, tag name, commit id, or any identifier allowed by Git
    strength: 0.8  # stable diffusion parameter indicating how much to transform the reference image; if it equals 1, the input image is ignored
    guide_scale: 7.5  # stable diffusion parameter; a higher guidance scale encourages images more closely tied to the text prompt, at the cost of lower image quality
    aug_num: 1  # number of images to generate
    keep_original_sample: true  # whether to keep the original sample; if False, only the generated images remain in the final dataset and the original images are removed. Default: True
    caption_key: null  # key of the field in samples that stores a caption for each image; the caption guides the diffusion model
    hf_img2seq: 'Salesforce/blip2-opt-2.7b'  # model name on huggingface used to generate captions if caption_key is null
    mem_required: '8GB'  # this op uses a deep neural network model that consumes a lot of memory, so the system's available memory may limit the maximum number of processes that can be launched
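Conceptually, this op regenerates each image from its caption via img2img. The sketch below illustrates the same idea directly with the diffusers StableDiffusionImg2ImgPipeline; it is not data-juicer's actual code, and the image path and caption are placeholders.
# Regenerate an image from its caption with Stable Diffusion img2img.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

model_path = "/mnt/workspace/better_synth_challenge_baseline/models/AI-ModelScope/stable-diffusion-v1-4"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_path, torch_dtype=torch.float32)
pipe = pipe.to("cuda")

init_image = Image.open("input/pretrain_stage_1_10k/images/example.jpg").convert("RGB")  # hypothetical path
caption = "a dog running on the grass in a park"                                         # hypothetical caption

out = pipe(
    prompt=caption,
    image=init_image,
    strength=0.8,        # 1.0 would ignore the reference image entirely
    guidance_scale=7.5,  # higher values follow the prompt more closely
).images[0]
out.save("augmented_example.jpg")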
More methods
Of course, there are many more models and interesting operators worth exploring, and you are not limited to downloading models from ModelScope: you can also fetch the models you need directly via hf-mirror.
For how to use more operators, refer to the reference config below; these operators can be combined and reordered freely.
# Process config example including:
# - all global arguments
# - all ops and their arguments
# global parameters
project_name: 'all' # project name for distinguish your configs
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file with weights(0.0-1.0), 1.0 as default.
# Accepted format: 'weight1(optional) dataset1-path weight2(optional) dataset2-path'
export_path: '/path/to/result/dataset.jsonl' # path to processed result dataset. Supported suffixes include ['jsonl', 'json', 'parquet']
export_shard_size: 0 # Shard size of exported dataset in Byte. In default, it's 0, which means export the whole dataset into only one file. If it's set a positive number, the exported dataset will be split into several dataset shards, and the max size of each shard won't larger than the export_shard_size
export_in_parallel: false # Whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. **Notice**: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 4 # number of subprocess to process your dataset
text_keys: 'content' # the key name of field where the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
# Note: currently, we support specify only ONE key for each op, for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of `text_keys` when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null # cache dir for Hugging Face datasets. In default it\'s the same as the environment variable `HF_DATASETS_CACHE`, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: null # the path to the temp directory to store intermediate caches when cache is disabled, these cache files will be removed on-the-fly. In default, it's None, so the temp dir will be specified by system. NOTICE: you should be caution when setting this argument because it might cause unexpected program behaviors when this path is set to an unsafe directory.
open_tracer: false # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: [] # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null # The compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
# for multimodal data processing
image_key: 'images' # Key name of field to store the list of sample image paths.
image_special_token: '<__dj__image>' # The special token that represents an image in the text. In default, it's "<__dj__image>". You can specify your own special token according to your input dataset.
audio_key: 'audios' # Key name of field to store the list of sample audio paths.
audio_special_token: '<__dj__audio>' # The special token that represents an audio in the text. In default, it's "<__dj__audio>". You can specify your own special token according to your input dataset.
eoc_special_token: '<|__dj__eoc|>' # The special token that represents the end of a chunk in the text. In default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset.
# for distributed processing
executor_type: default # Type of executor, support "default" or "ray" for now.
ray_address: auto # The address of the Ray cluster.
# only for data analysis
save_stats_in_one_file: false # whether to store all stats result into one file
# process schedule: a list of several process operators with their arguments
process:
# Mapper ops. Most of these ops need no arguments.
- chinese_convert_mapper: # convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji.
mode: 's2t' # Choose the mode to convert Chinese: ['s2t', 't2s', 's2tw', 'tw2s', 's2hk', 'hk2s', 's2twp', 'tw2sp', 't2tw', 'tw2t', 'hk2t', 't2hk', 't2jp', 'jp2t']
- clean_email_mapper: # remove emails from text.
- clean_html_mapper: # remove html formats from text.
- clean_ip_mapper: # remove ip addresses from text.
- clean_links_mapper: # remove web links from text.
- clean_copyright_mapper: # remove copyright comments.
- expand_macro_mapper: # expand macro definitions in Latex text.
- fix_unicode_mapper: # fix unicode errors in text.
- nlpaug_en_mapper: # simply augment texts in English based on the nlpaug library
sequential: false # whether combine all augmentation methods to a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method would generate its augmented samples independently.
aug_num: 1 # number of augmented samples to be generated. If `sequential` is True, there will be total aug_num augmented samples generated. If it's False, there will be (aug_num * #opened_aug_method) augmented samples generated.
delete_random_word: false # whether to open the augmentation method of deleting random words from the original texts. e.g. "I love LLM" --> "I LLM"
swap_random_word: false # whether to open the augmentation method of swapping random contiguous words in the original texts. e.g. "I love LLM" --> "Love I LLM"
spelling_error_word: false # whether to open the augmentation method of simulating the spelling error for words in the original texts. e.g. "I love LLM" --> "Ai love LLM"
split_random_word: false # whether to open the augmentation method of splitting words randomly with whitespaces in the original texts. e.g. "I love LLM" --> "I love LL M"
keyboard_error_char: false # whether to open the augmentation method of simulating the keyboard error for characters in the original texts. e.g. "I love LLM" --> "I ;ov4 LLM"
ocr_error_char: false # whether to open the augmentation method of simulating the OCR error for characters in the original texts. e.g. "I love LLM" --> "I 10ve LLM"
delete_random_char: false # whether to open the augmentation method of deleting random characters from the original texts. e.g. "I love LLM" --> "I oe LLM"
swap_random_char: false # whether to open the augmentation method of swapping random contiguous characters in the original texts. e.g. "I love LLM" --> "I ovle LLM"
insert_random_char: false # whether to open the augmentation method of inserting random characters into the original texts. e.g. "I love LLM" --> "I ^lKove LLM"
- nlpcda_zh_mapper: # simply augment texts in Chinese based on the nlpcda library
sequential: false # whether combine all augmentation methods to a sequence. If it's True, a sample will be augmented by all opened augmentation methods sequentially. If it's False, each opened augmentation method would generate its augmented samples independently.
aug_num: 1 # number of augmented samples to be generated. If `sequential` is True, there will be total aug_num augmented samples generated. If it's False, there will be (aug_num * #opened_aug_method) augmented samples generated.
replace_similar_word: false # whether to open the augmentation method of replacing random words with their similar words in the original texts. e.g. "这里一共有5种不同的数据增强方法" --> "这边一共有5种不同的数据增强方法"
replace_homophone_char: false # whether to open the augmentation method of replacing random characters with their homophones in the original texts. e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的濖据增强方法"
delete_random_char: false # whether to open the augmentation method of deleting random characters from the original texts. e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的数据增强"
swap_random_char: false # whether to open the augmentation method of swapping random contiguous characters in the original texts. e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有5种不同的数据强增方法"
replace_equivalent_num: false # whether to open the augmentation method of replacing random numbers with their equivalent representations in the original texts. **Notice**: Only for numbers for now. e.g. "这里一共有5种不同的数据增强方法" --> "这里一共有伍种不同的数据增强方法"
- punctuation_normalization_mapper: # normalize unicode punctuations to English punctuations.
- remove_bibliography_mapper: # remove bibliography from Latex text.
- remove_comments_mapper: # remove comments from Latex text, code, etc.
doc_type: tex # comment type you want to remove. Only support 'tex' for now.
inline: true # whether to remove inline comments
multiline: true # whether to remove multiline comments
- remove_header_mapper: # remove header texts from Latex text.
drop_no_head: true # whether to drop sample texts without headers
- remove_long_words_mapper: # remove much too long words from text.
min_len: 1 # the min word length to keep words.
max_len: 128 # the max word length to keep words.
- remove_non_chinese_character_mapper: # remove non Chinese character in text samples.
keep_alphabet: true # whether to keep alphabet
keep_number: true # whether to keep number
keep_punc: true # whether to keep punctuation
- remove_specific_chars_mapper: # remove characters specified by users
chars_to_remove: '◆●■►▼▲▴∆▻▷❖♡□' # a string or a list including those characters that need to be removed
- remove_table_text_mapper: # remove possible table texts from text.
min_col: 2 # the min num of columns in tables to remove
max_col: 20 # the max num of columns in tables to remove
- remove_words_with_incorrect_substrings_mapper: # remove words with incorrect substrings from text.
lang: en # sample in which language
tokenization: false # whether to use model to tokenize documents
substrings: ['http', 'www', '.com', 'href', '//'] # incorrect substrings to remove
- sentence_split_mapper: # split text to multiple sentences and join them with '\n'
lang: 'en' # split text in what language
- whitespace_normalization_mapper: # normalize different kinds of whitespaces to English whitespace.
# Filter ops
- alphanumeric_filter: # filter text with alphabet/numeric ratio out of specific range.
tokenization: false # Whether to count the ratio of alphanumeric to the total number of tokens.
min_ratio: 0.0 # the min ratio of filter range
max_ratio: 0.9 # the max ratio of filter range
- average_line_length_filter: # filter text with the average length of lines out of specific range.
min_len: 10 # the min length of filter range
max_len: 10000 # the max length of filter range
- character_repetition_filter: # filter text with the character repetition ratio out of specific range
rep_len: 10 # repetition length for char-level n-gram
min_ratio: 0.0 # the min ratio of filter range
max_ratio: 0.5 # the max ratio of filter range
- face_area_filter: # filter samples according to the face area ratios in images (r=face_area/image_area). If multiple faces are available, we use the largest one.
min_ratio: 0.0 # the min face area ratio of filter range
max_ratio: 0.4 # the max face area ratio of filter range
upsample_num_times: 0 # optional argument passing to the underlying dlib face detector
- flagged_words_filter: # filter text with the flagged-word ratio larger than a specific max value
lang: en # consider flagged words in what language
tokenization: false # whether to use model to tokenize documents
max_ratio: 0.0045 # the max ratio to filter text
flagged_words_dir: ./assets # directory to store flagged words dictionaries
use_words_aug: false # whether to augment words, especially for Chinese and Vietnamese
words_aug_group_sizes: [2] # the group size of words to augment
words_aug_join_char: "" # the join char between words to augment
- image_aspect_ratio_filter: # filter samples according to the aspect ratios of images (a fraction of width by height, r=w/h) in them
min_ratio: 0.333 # the min aspect ratio of filter range
max_ratio: 3.0 # the max aspect ratio of filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_shape_filter: # filter samples according to the widths and heights of images in them
min_width: 200 # the min width of width filter range
max_width: 5000 # the max width of width filter range
min_height: 200 # the min height of height filter range
max_height: 5000 # the max height of height filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_size_filter: # filter samples according to the size of images (in bytes) within them
min_size: "0" # the min size of filter range
max_size: "1TB" # the max size of filter range
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_text_matching_filter: # filter samples according to the matching score between image and text.
hf_blip: Salesforce/blip-itm-base-coco # name of used Hugging Face blip
min_score: 0.003 # the min matching score of filter range
max_score: 1.0 # the max matching score of filter range
horizontal_flip: false # Flip image horizontally (left to right).
vertical_flip: false # Flip image vertically (top to bottom).
reduce_mode: avg # reduce mode when one text corresponds to multiple images in a chunk, must be one of ['avg','max', 'min'].
any_or_all: any # keep this sample when any/all images meet the filter condition
- image_text_similarity_filter: # filter samples according to the similarity between image and text.
hf_clip: openai/clip-vit-base-patch32 # name of used Hugging Face clip
min_score: 0.1 # the min similarity of filter range
max_score: 1.0 # the max similarity of filter range
horizontal_flip: false # Flip image horizontally (left to right).
vertical_flip: false # Flip image vertically (top to bottom).
reduce_mode: avg # reduce mode when one text corresponds to multiple images in a chunk, must be one of ['avg','max', 'min'].
any_or_all: any # keep this sample when any/all images meet the filter condition
- language_id_score_filter: # filter text in specific language with language scores larger than a specific max value
lang: en # keep text in what language
min_score: 0.8 # the min language scores to filter text
- maximum_line_length_filter: # filter text with the maximum length of lines out of specific range
min_len: 10 # the min length of filter range
max_len: 10000 # the max length of filter range
- perplexity_filter: # filter text with perplexity score out of specific range
lang: en # compute perplexity in what language
max_ppl: 1500 # the max perplexity score to filter text
- special_characters_filter: # filter text with special-char ratio out of specific range
min_ratio: 0.0 # the min ratio of filter range
max_ratio: 0.25 # the max ratio of filter range
- stopwords_filter: # filter text with stopword ratio smaller than a specific min value
lang: en # consider stopwords in what language
tokenization: false # whether to use model to tokenize documents
min_ratio: 0.3 # the min ratio to filter text
stopwords_dir: ./assets # directory to store stopwords dictionaries
use_words_aug: false # whether to augment words, especially for Chinese and Vietnamese
words_aug_group_sizes: [2] # the group size of words to augment
words_aug_join_char: "" # the join char between words to augment
- text_action_filter: # filter text according to the number of action verbs
lang: en # consider the words in what language
min_action_num: 1 # text will be filtered whose verbs less the min action number
- text_entity_dependency_filter: # filter text without non independent entity nouns
lang: en # consider the words in what language
min_dependency_num: 1 # the min number of adjacent edges of a non independent noun in dependency tree
any_or_all: any # keep this sample when any/all entity nouns are non independent
- text_length_filter: # filter text with length out of specific range
min_len: 10 # the min length of filter range
max_len: 10000 # the max length of filter range
- token_num_filter: # filter text with total token number out of specific range
hf_tokenizer: EleutherAI/pythia-6.9b-deduped # name of used Hugging Face tokenizer
min_num: 10 # the min number of filter range
max_num: 10000 # the max number of filter range
- words_num_filter: # filter text with number of words out of specific range
lang: en # sample in which language
tokenization: false # whether to use model to tokenize documents
min_num: 10 # the min number of filter range
max_num: 10000 # the max number of filter range
- word_repetition_filter: # filter text with the word repetition ratio out of specific range
lang: en # sample in which language
tokenization: false # whether to use model to tokenize documents
rep_len: 10 # repetition length for word-level n-gram
min_ratio: 0.0 # the min ratio of filter range
max_ratio: 0.5 # the max ratio of filter range
- suffix_filter: # filter to keep samples with specified suffix.
suffixes: [] # the suffix of text that will be keep. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
- specified_field_filter: # filter text with the specified field info out of specific range
field_key: '' # the target key corresponding to multi-level field information need to be separated by '.'
target_value: [] # the range of specified field information corresponding to the samples that need to be retained
- specified_numeric_field_filter: # filter text with the specified numeric field info out of specific range
field_key: '' # the target key corresponding to multi-level field information need to be separated by '.'
min_value: 0 # the min filter value in SpecifiedNumericField op
max_value: 10000 # the max filter value in SpecifiedNumericField op
# Deduplicator ops
- document_deduplicator: # deduplicate text samples using md5 hashing exact matching method
lowercase: false # whether to convert text to lower case
ignore_non_character: false # whether to ignore non-alphabet characters, including whitespaces, digits, and punctuations
- document_minhash_deduplicator: # deduplicate text samples using MinHash-LSH method
tokenization: space # tokenization method for text. One of [space, punctuation, character]
window_size: 5 # window size of shingling
num_permutations: 256 # number of permutations in minhash computing
jaccard_threshold: 0.7 # the min jaccard similarity threshold in near-duplicate detection. When the jaccard similarity of two sample texts is >= this threshold, they are regarded as similar samples and this op will only keep one of them after deduplication
num_bands: null # number of bands in LSH. Default it's None, and it will be determined by an optimal params computation algorithm by minimize the weighted sum of probs of False Positives and False Negatives
num_rows_per_band: null # number of rows in each band in LSH. Default it's None, and it will be determined by an optimal params computation algorithm
lowercase: true # whether to convert text to lower case
ignore_pattern: null # whether to ignore sub-strings with specific pattern when computing simhash.
- document_simhash_deduplicator: # deduplicate text samples using SimHash-LSH method
tokenization: space # tokenization method for text. One of [space, punctuation, character]
window_size: 6 # window size of shingling
num_blocks: 6 # number of blocks in SimHash computing
hamming_distance: 4 # the max hamming distance to regard 2 samples as similar enough pair. Should be less than num_blocks always
lowercase: true # whether to convert text to lower case
ignore_pattern: null # whether to ignore sub-strings with specific pattern when computing simhash.
- image_deduplicator: # deduplicator to deduplicate samples at document-level using exact matching of images between documents.
method: phash # hash method for image. One of [phash, dhash, whash, ahash]
# Selector ops
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
topk: # number of selected top sample
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
- frequency_specified_field_selector: # selector to select samples based on the sorted frequency of specified field value
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top specified field value
topk: # number of selected top specified field value
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
Model Iteration
Since the hardware configuration has changed, train_mgm_2b_stage_1.sh needs to be modified to:
PRETRAIN_BATCH_SIZE_PER_GPU=2
PRETRAIN_GRADIENT_ACCUMULATION_STEPS=16
PRETRAIN_DATALOADER_NUM_WORKERS=4
FINETUNE_BATCH_SIZE_PER_GPU=1
FINETUNE_GRADIENT_ACCUMULATION_STEPS=16
FINETUNE_DATALOADER_NUM_WORKERS=4
These values follow the rules below:
- PRETRAIN_BATCH_SIZE_PER_GPU * PRETRAIN_GRADIENT_ACCUMULATION_STEPS * num_gpus = 256
- FINETUNE_BATCH_SIZE_PER_GPU * FINETUNE_GRADIENT_ACCUMULATION_STEPS * num_gpus = 128
Just solve backwards from these rules given the number of GPUs; there is not much more to it (a tiny sketch of the arithmetic follows).
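A small sketch of the back-calculation, assuming the 8x V100 setup above (the global batch targets 256 and 128 come from the baseline script):
# Given a target global batch size, a per-GPU batch size, and the GPU count,
# compute the gradient accumulation steps needed to hit the target.
def grad_accum_steps(global_batch, per_gpu_batch, num_gpus):
    assert global_batch % (per_gpu_batch * num_gpus) == 0, "target must divide evenly"
    return global_batch // (per_gpu_batch * num_gpus)

NUM_GPUS = 8
print(grad_accum_steps(256, per_gpu_batch=2, num_gpus=NUM_GPUS))  # pretrain -> 16
print(grad_accum_steps(128, per_gpu_batch=1, num_gpus=NUM_GPUS))  # finetune -> 16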
Note that when switching to GPU architectures that do not support flash-attention, you also need to check whether options such as bf16 are supported, and adjust the corresponding deepspeed command accordingly.
Also, since multiple experiments will be run, change EXP_NAME to match the current method so the runs don't get mixed up.
Below is what a normal training startup looks like, as shown in the figure.
With everything in place, let the training run!
Experiment Summary
To be completed.
To get a more intuitive sense of how well the model fits the data and whether the current data strategy is effective, you can plot a curve from the training loss; a small sketch for doing this is given below.
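A minimal sketch for plotting the loss, assuming the run produced a Hugging Face Trainer-style trainer_state.json (the path below is a guess based on the baseline's output layout and should be replaced with your own EXP_NAME):
# Plot the training loss recorded in trainer_state.json (standard HF Trainer output).
import json
import matplotlib.pyplot as plt

state_file = "output/training_dirs/MGM-2B-Pretrain-your_exp_name/trainer_state.json"  # hypothetical path
with open(state_file) as f:
    log_history = json.load(f)["log_history"]

steps = [entry["step"] for entry in log_history if "loss" in entry]
losses = [entry["loss"] for entry in log_history if "loss" in entry]

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("Pretrain loss curve")
plt.savefig("loss_curve.png")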