Learning Latent Graph Representations for Relational VQA

The key mechanism of transformer-based models is cross-attention, which implicitly forms graphs over tokens and acts as a diffusion operator, propagating information through the graph for question answering that requires some reasoning over the scene.
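The graph reading of cross-attention can be made concrete with a minimal sketch. It assumes toy 2-dimensional tokens and omits the learned query/key/value projections of a real transformer; the names `attention_graph` and `propagate` are hypothetical and do not come from the paper. The softmax-normalised attention scores are treated as a weighted adjacency matrix, and one attention step is one diffusion step over that graph.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_graph(queries, keys):
    """Row-stochastic adjacency matrix: A[i][j] is the weight of edge i -> j."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in keys])
        for qi in queries
    ]

def propagate(adjacency, values):
    """One diffusion step: each query token becomes a weighted average of the values."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in adjacency
    ]

# Toy example (assumed data): two question tokens attend over three image-region tokens.
question_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

A = attention_graph(question_tokens, image_tokens)  # 2 x 3 bipartite latent graph
updated = propagate(A, image_tokens)                # 2 x 2 fused question features
```

Each row of `A` sums to 1, so every question token receives a convex combination of image-token features. Stacking such steps corresponds to traversing the graph over several hops.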

We reinterpret and reformulate the transformer-based model to explicitly construct latent graphs over tokens, thereby improving performance on visual questions about relations between objects.

Coincidentally, transformer-based language encoders not only can take advantage of the tokenization trend but are also intrinsically built for information fusion and alignment, thanks to their core self-attention mechanism.

 

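The success of transformer-based VQA systems points to the effectiveness of two insights: image tokenization, and pairwise token interactions between text tokens and image tokens.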

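We observe that pairwise token interactions collectively form a graph, and that traversing this graph constitutes a form of reasoning, which may explain the reasoning capabilities claimed for these transformer-based models.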

We reinterpret transformer-based VQA systems as graph convolutions.

We show that our model benefits from its latent graph representations.

To the best of our knowledge, current transformer-based models cannot benefit from graph information, and there has been no work on taking advantage of scene graphs, or graph representations in general, for VQA.

In our model, the goal is to learn to generate a latent graph representation and then perform node classification on the resulting heterogeneous graph.

A typical task for a GCN is node classification, since a GCN can learn node representations from a given static homogeneous graph.
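As a rough illustration of node classification on a static homogeneous graph, here is a single graph-convolution layer in the commonly used form ReLU(D^-1/2 (A + I) D^-1/2 H W). The graph, features, and weights are made-up toy values (the weights are untrained), and reading the layer output directly as class scores is a simplification for brevity, not the paper's setup.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def normalize_adjacency(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj_norm, features, weights):
    """Aggregate neighbour features, apply a linear map, then ReLU."""
    hidden = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, x) for x in row] for row in hidden]

# Toy static homogeneous graph (assumed): a path 0 - 1 - 2 with one edge type.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 1.0]]  # hypothetical, untrained parameters

adj_norm = normalize_adjacency(adjacency)
scores = gcn_layer(adj_norm, features, weights)        # one score per class per node
predicted = [row.index(max(row)) for row in scores]   # argmax class for each node
```

The adjacency matrix is fixed in advance here, which is the limitation the latent-graph approach addresses: in the model described above, the graph itself is learned.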

Graph Transformer Networks (GTNs) are a model for handling heterogeneous graphs (graphs with various types of edges) as well as for generating new graphs.
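The idea of generating a new graph from a heterogeneous one can be sketched as follows, under simplifying assumptions: each edge type has its own adjacency matrix, a softmax over per-type logits softly selects a mixture of edge types, and multiplying two such selections yields a new graph whose edges correspond to two-hop paths. The edge types `left_of` and `holds`, the logits, and the helper names are hypothetical; the sketch omits details such as normalisation and learned parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(adjacencies, logits):
    """Convex combination of the per-edge-type adjacency matrices."""
    weights = softmax(logits)
    n = len(adjacencies[0])
    return [
        [sum(w * adj[i][j] for w, adj in zip(weights, adjacencies)) for j in range(n)]
        for i in range(n)
    ]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Toy heterogeneous graph (assumed): three nodes, two edge types.
left_of = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]  # edge 0 -> 1
holds = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]    # edge 1 -> 2

logits_first = [5.0, -5.0]   # hypothetical: first hop mostly selects left_of
logits_second = [-5.0, 5.0]  # hypothetical: second hop mostly selects holds

q1 = soft_select([left_of, holds], logits_first)
q2 = soft_select([left_of, holds], logits_second)
new_graph = matmul(q1, q2)  # composite relation: left_of followed by holds
```

In this toy case the only non-zero entry of `new_graph` is the edge 0 -> 2, which exists in neither input graph. In a trained model the logits would be learned, so the composition of edge types is chosen by the data.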

Takeaway: how to use scene graphs and graph representations, together with the graph-convolution view of the transformer mechanism, for VQA.

 

posted @ kkzhang