Transformers 4.37 Chinese Documentation (Part 43)

Original: huggingface.co/docs/transformers

GIT

Original text: huggingface.co/docs/transformers/v4.37.2/en/model_doc/git

Overview

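The GIT model was proposed in "GIT: A Generative Image-to-text Transformer for Vision and Language" by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT is a decoder-only Transformer that uses CLIP's vision encoder to condition the model on vision inputs in addition to text. The model obtains state-of-the-art results on image captioning and visual question answering benchmarks.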

The abstract from the paper is the following:

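In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.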

GIT architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

Usage tips

GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on pixel_values.
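One way to picture the conditioning on pixel_values is through the attention mask. The sketch below is not the library's code. It is a minimal, dependency-free illustration based on the description in the GIT paper, and it assumes that image tokens are placed before text tokens, that image tokens attend to every image token, and that each text token attends to all image tokens plus the text tokens up to and including itself. The helper name `build_git_style_mask` and the 1/0 encoding are hypothetical choices made for this example.

```python
def build_git_style_mask(num_image_tokens, num_text_tokens):
    """Return a square 0/1 matrix; mask[q][k] == 1 means query q may attend to key k.

    Hypothetical helper for illustration only, not part of transformers.
    Positions [0, num_image_tokens) are image tokens, the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every token (image or text) may look at every image token.
                mask[q][k] = 1
            elif q >= num_image_tokens and k <= q:
                # Text tokens see earlier text tokens and themselves (causal, as in GPT-2).
                mask[q][k] = 1
            # Otherwise the entry stays 0: image tokens never look at text,
            # and text tokens never look ahead.
    return mask


if __name__ == "__main__":
    for row in build_git_style_mask(num_image_tokens=2, num_text_tokens=3):
        print(row)
```

With `num_image_tokens=0` the matrix reduces to the plain lower-triangular causal mask of a GPT-2-style decoder, which is one way to read the statement that the image input is the only difference.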

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GIT.

Demo notebooks on inference and fine-tuning GIT on custom data can be found here.
See also: Causal language modeling task guide

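If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.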


GIT

Overview

Usage tips

Resources

GitVisionConfig

class transformers.GitVisionConfig

GitVisionModel

class transformers.GitVisionModel

forward

GitConfig

GitProcessor

class transformers.GitProcessor

__call__

GitModel

class transformers.GitModel

forward

GitForCausalLM

class transformers.GitForCausalLM

forward

GroupViT

Overview

Usage tips

Resources

GroupViTConfig

class transformers.GroupViTConfig

from_text_vision_configs

GroupViTTextConfig

class transformers.GroupViTTextConfig

GroupViTVisionConfig

class transformers.GroupViTVisionConfig

GroupViTModel

class transformers.GroupViTModel

forward

get_text_features

get_image_features

GroupViTTextModel

class transformers.GroupViTTextModel

forward

GroupViTVisionModel

forward

TFGroupViTModel

class transformers.TFGroupViTModel

call

get_text_features

get_image_features

TFGroupViTTextModel

class transformers.TFGroupViTTextModel

call

TFGroupViTVisionModel

class transformers.TFGroupViTVisionModel

call

IDEFICS

Overview

IdeficsConfig

class transformers.IdeficsConfig

IdeficsModel

class transformers.IdeficsModel

forward

IdeficsForVisionText2Text

class transformers.IdeficsForVisionText2Text

forward

IdeficsImageProcessor

class transformers.IdeficsImageProcessor

preprocess

IdeficsProcessor

class transformers.IdeficsProcessor

__call__

InstructBLIP

Overview

Usage tips

InstructBlipConfig

class transformers.InstructBlipConfig

from_vision_qformer_text_configs

InstructBlipVisionConfig

class transformers.InstructBlipVisionConfig

InstructBlipQFormerConfig


`class transformers.InstructBlipQFormerConfig`

`class transformers.InstructBlipProcessor`

`batch_decode`

`decode`

`class transformers.InstructBlipVisionModel`

`forward`

`class transformers.InstructBlipQFormerModel`

`forward`

`class transformers.InstructBlipForConditionalGeneration`

`forward`

`generate`

`class transformers.Kosmos2Config`

`class transformers.Kosmos2Processor`

`__call__`

`class transformers.Kosmos2Model`

`forward`

`class transformers.Kosmos2ForConditionalGeneration`

`forward`

`class transformers.LayoutLMConfig`

`class transformers.LayoutLMTokenizer`

`build_inputs_with_special_tokens`

`convert_tokens_to_string`

`create_token_type_ids_from_sequences`

`get_special_tokens_mask`

`class transformers.LayoutLMTokenizerFast`

`build_inputs_with_special_tokens`

`create_token_type_ids_from_sequences`

`class transformers.LayoutLMModel`

`forward`

`class transformers.LayoutLMForMaskedLM`

`forward`

`class transformers.LayoutLMForSequenceClassification`

`forward`

`class transformers.LayoutLMForTokenClassification`

`forward`

`class transformers.LayoutLMForQuestionAnswering`

`forward`

`class transformers.TFLayoutLMModel`

`call`

`class transformers.TFLayoutLMForMaskedLM`