Paper Management
There are simply too many papers, and coming up with a truly scientific taxonomy is just too hard. Some papers feel like they belong in one category today, and after rereading the title tomorrow they feel like they belong in another. The boundaries between some categories are inherently blurry (it may also be that I don't really understand the essence of many of these categories).
Honestly, this collection can have many motivations. Sometimes I simply noticed papers of the same kind, e.g., a few papers follow up on their original paper, so they got filed together; forcing the original paper's idea into some higher-level category would be a stretch. So if this taxonomy makes you uncomfortable (it frequently makes me uncomfortable too, and it slows down my filing), feel free to Ctrl+F.
2024-09-28: reorganized things, because I heard of a term called "post training" and suddenly knew how to sort several of the messier categories.
Notes:
- Category names are a mix of Chinese and English; they haven't been unified into a single language yet.
- Each entry follows the format: link (mostly from arxiv) + title + (some fun facts or reading impressions I wrote myself, almost all in Chinese).
- Since I mostly follow (scavenge) papers in the LLM direction, you can assume every heading is written from the LLM's point of view.
- Entries without a [Read] mark aren't necessarily unread.
Technical reports of commercial LLMs
https://www.arxiv.org/pdf/2408.08152 DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (from DeepSeek; it seems to have been hyped quite a bit)
https://arxiv.org/pdf/2409.12122 QWEN2.5-MATH TECHNICAL REPORT: TOWARD MATHEMATICAL EXPERT MODEL VIA SELF-IMPROVEMENT
https://arxiv.org/pdf/2409.12186 Qwen2.5-Coder Technical Report
https://arxiv.org/pdf/2404.06395 MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Big lists
Survey on efficient transformers
transformer circuits (I really want to read this; when will I ever have the time)
Problems that come up in deep learning
Model training
https://zhuanlan.zhihu.com/p/694263912 the BAdam optimizer?
https://arxiv.org/pdf/2409.03137 THE ADEMAMIX OPTIMIZER: BETTER, FASTER, OLDER "So this is TCS?"
https://arxiv.org/pdf/2409.11727 Enabling Real-Time Conversations with Minimal Training Costs
https://arxiv.org/pdf/2409.01790 Training on the Benchmark Is Not All You Need. Putting this one here is kind of funny.
Distillation
https://arxiv.org/pdf/2407.05682 Retrieved In-Context Principles from Previous Mistakes. Somehow this one feels like a dud.
https://www.zhihu.com/question/309808462/answer/3365782354
https://arxiv.org/pdf/2408.11796 LLM Pruning and Distillation in Practice: The Minitron Approach
https://arxiv.org/pdf/2409.12512 Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models
Interpretability
https://arxiv.org/pdf/2406.16033 Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models
Generalization
https://arxiv.org/pdf/2103.02503 Domain Generalization: A Survey (this one landing here shows how coarse-grained my categories are)
https://arxiv.org/pdf/2405.16766 Reframing the Relationship in Out-of-Distribution Detection
https://arxiv.org/pdf/2409.07335 Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization
https://arxiv.org/pdf/2409.04787 Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models
https://aclanthology.org/2023.findings-emnlp.768.pdf Improving generalization in large language models by learning prefix subspaces. Most likely I'll never find the time to properly understand this one.
https://arxiv.org/pdf/2409.18433 Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization. A benchmark for testing LLM generalization. But in the era of brute-force scaling, would anyone propose algorithms, rather than manufacture data, to improve generalization?
Knowledge transfer
https://arxiv.org/pdf/2408.10858 Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning
https://arxiv.org/pdf/2408.12525v1 Scaling, Control and Generalization in Reinforcement Learning Level Generators
continual learning
https://arxiv.org/pdf/2403.10056 Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning
https://github.com/xialeiliu/Awesome-Incremental-Learning A collection of incremental-learning papers curated by a university teacher. Unfortunately it has no taxonomy; it is only organized chronologically.
Competitive programming
Kids, my DNA just stirred!
https://arxiv.org/pdf/2409.09054 Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
Knowledge Graph
https://arxiv.org/pdf/2409.03155 One way to improve long reasoning is to pull information from a knowledge graph, and QA is one means of interacting with a knowledge graph. This paper improves LLM reasoning by improving knowledge graph QA.
https://arxiv.org/abs/2406.07080 DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs
https://arxiv.org/abs/2407.17190 Fusing LLMs and KGs for Formal Causal Reasoning behind Financial Risk Contagion
https://arxiv.org/pdf/2402.06861 UrbanKGent: A Unified Large Language Model Agent Framework for Urban Knowledge Graph Construction. This one really does seem to be a sub-domain affair.
Representation Learning
https://arxiv.org/pdf/2409.03662 The Representation Landscape of Few-Shot Learning and Fine-Tuning in Large Language Models. The title already says it all; the content is a bit too representation-learning-heavy for me to get through.
https://arxiv.org/pdf/2212.07677 Transformers Learn In-Context by Gradient Descent. An impressive paper that zhang zhong mentioned in an earlier discussion, but I haven't read it carefully.
https://arxiv.org/pdf/2409.04318 Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs. A paper on the mechanism of in-context learning, recommended on a WeChat official account.
https://arxiv.org/pdf/2408.13661 Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models. The first affiliation is TCS Research, which startled me.
https://arxiv.org/pdf/2409.12005 Representing Positional Information in Generative World Models for Object Manipulation
LLM benchmarks
https://arxiv.org/pdf/2408.15729 LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
https://arxiv.org/pdf/2308.07201 CHATEVAL: TOWARDS BETTER LLM-BASED EVALUATORS THROUGH MULTI-AGENT DEBATE
https://arxiv.org/pdf/2409.04168 From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks. Yet another new race track.
https://arxiv.org/pdf/2409.07641 SIMULBENCH: Evaluating Language Models with Creative Simulation Tasks
https://arxiv.org/pdf/2409.12060 PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
https://openreview.net/pdf?id=4k4cocpuSw Benchmarking Edge Regression on Temporal Networks
https://arxiv.org/pdf/2206.08514 A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks
https://arxiv.org/pdf/2408.08978 See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
https://arxiv.org/pdf/2409.00844 Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
https://arxiv.org/pdf/2409.15272 OMNIBENCH: TOWARDS THE FUTURE OF UNIVERSAL OMNI-LANGUAGE MODELS. So this is the benchmark for testing multimodal capability.
tool benchmarks
https://arxiv.org/pdf/2402.15491 API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
https://arxiv.org/pdf/2310.03128 METATOOL BENCHMARK FOR LARGE LANGUAGE MODELS: DECIDING WHETHER TO USE TOOLS AND WHICH TO USE
https://arxiv.org/pdf/2311.10775 TOOLTALK: EVALUATING TOOL USAGE IN A CONVERSATIONAL SETTING. This one isn't really a benchmark; it is more of an evaluation method.
long context benchmarks
https://arxiv.org/pdf/2402.13718 ∞BENCH: Extending Long Context Evaluation Beyond 100K Tokens
https://arxiv.org/pdf/2409.16191 HELLOBENCH: EVALUATING LONG TEXT GENERATION CAPABILITIES OF LARGE LANGUAGE MODELS
https://arxiv.org/pdf/2403.12766 NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
(Math&Science) Reasoning
https://arxiv.org/pdf/2409.02834 CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
https://arxiv.org/pdf/2311.09805 DOCMATH-EVAL: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
https://arxiv.org/pdf/2402.14008 OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
https://arxiv.org/pdf/2409.13730 VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning. Multimodal reasoning.
https://arxiv.org/pdf/2408.15778 LOGICGAME: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
https://arxiv.org/pdf/2409.12746 Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination. Could it test our gaokao?
https://arxiv.org/pdf/2409.13729 MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Reasoning benchmarks
https://arxiv.org/pdf/2402.17644 Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
In this paper, we focus on two major areas of advanced quantitative reasoning: statistical reasoning and causal reasoning, with examples shown in Figure 1. Given a data table from a sample survey, statistical reasoning aims to infer the underlying probability distribution, answering questions such as "what is the 95% confidence interval for the population mean of y"; causal reasoning aims to understand the causal relationships between variables, answering questions such as "what is the average treatment effect of t on y".
No doubt LLMs are now going toe-to-toe with math and statistics undergrads.
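As a concrete illustration of the statistical-reasoning question quoted above, here is a minimal sketch (my own, with made-up numbers) of computing a 95% confidence interval for the population mean of y:

```python
# Minimal sketch: 95% confidence interval for the population mean of y,
# using a t-interval over a small made-up sample.
import numpy as np
from scipy import stats

y = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7])  # sample of y
mean = y.mean()
sem = stats.sem(y)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(y) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean of y: ({lo:.2f}, {hi:.2f})")
```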
https://arxiv.org/pdf/2307.13692 ARB: Advanced Reasoning Benchmark for Large language Models
https://aclanthology.org/2024.acl-long.515/ T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. This is evaluation for tool learning.
https://arxiv.org/abs/2403.05307 Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents
https://arxiv.org/pdf/2302.04752 Benchmarks for Automated Commonsense Reasoning: A Survey. This survey ends with a big table; not sure whether I'll ever get a chance to use it.
instruction following benchmark
https://arxiv.org/pdf/2401.03601v1 INFOBENCH: Evaluating Instruction Following Ability in Large Language Models
https://arxiv.org/pdf/2310.07641 EVALUATING LARGE LANGUAGE MODELS AT EVALUATING INSTRUCTION FOLLOWING. How come the authors are all competitive programmers?
Federated learning
https://arxiv.org/pdf/2409.18461 Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration
LLM pretrain
https://arxiv.org/pdf/2407.06654 SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training. This data reweighting is pretty important; should I read this one carefully soon?
https://arxiv.org/pdf/2409.07787 Stable Language Model Pre-training by Reducing Embedding Variability
https://arxiv.org/pdf/2407.05013 Progress or Regress? Self-Improvement Reversal in Post-training
https://arxiv.org/pdf/2407.04787 Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning
https://arxiv.org/pdf/2409.15825 Empirical Insights on Fine-Tuning Large Language Models for Question-Answering. The title tells you exactly what it does.
https://arxiv.org/pdf/2409.12903 Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
https://www.sciencedirect.com/science/article/pii/S0306457324002516 Gauging, enriching and applying geography knowledge in Pre-trained Language Models. Probably no need to read at all; a bit too clickbaity.
https://arxiv.org/pdf/2405.15319 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
https://arxiv.org/pdf/2211.00151 A Close Look into the Calibration of Pre-trained Language Models
https://arxiv.org/pdf/2409.15518 Eagle: Efficient Training-Free Router for Multi-LLM Inference
dataset
https://aclanthology.org/2022.findings-acl.17/ LEVEN: A Large-Scale Chinese Legal Event Detection Dataset
Model quantization
https://arxiv.org/pdf/2409.11055 A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
mixture of experts / fusion
https://arxiv.org/pdf/2407.04153 Mixture of A Million Experts. Written by a Google DeepMind guy.
https://arxiv.org/pdf/2401.10491 KNOWLEDGE FUSION OF LARGE LANGUAGE MODELS
https://arxiv.org/abs/2407.19985 Mixture of Nested Experts: Adaptive Processing of Visual Tokens
https://arxiv.org/pdf/2409.01483 Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
LoRA
https://arxiv.org/pdf/2311.03285 S-LORA: SERVING THOUSANDS OF CONCURRENT LORA ADAPTERS
reflection tuning
https://arxiv.org/pdf/2402.10110 Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
https://arxiv.org/pdf/2310.11716 Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
These two seem to form a series, but I've forgotten the details.
Research on LLMs themselves
https://www.nature.com/articles/s41586-024-07930-y Larger and more instructable language models become less reliable
https://arxiv.org/pdf/2409.14381 Investigating Layer Importance in Large Language Models
Constrained generation
https://arxiv.org/pdf/2408.12599 Controllable Text Generation for Large Language Models: A Survey
Catastrophic Forgetting
[Read] https://arxiv.org/pdf/2404.10306 Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model. On balancing versatility and speciality during fine-tuning: cramming in a lot of domain-specific content can destroy general capability. This paper first locates the modules in the model that most express the specialized capability, freezes the remaining parameters, and then fine-tunes on the fine-tuning data. As for the module-finding part, I couldn't even follow the pseudocode.
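The freeze-then-finetune half of that recipe is easy to picture in code. A minimal sketch (my own illustration, not the paper's module-selection algorithm; the module names are made-up stand-ins):

```python
# Freeze everything except a chosen set of "specialty" modules, then
# fine-tune as usual. Module names below are hypothetical placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
selected = ["transformer.h.10", "transformer.h.11"]   # assumed specialty modules

for name, param in model.named_parameters():
    # A parameter stays trainable only if it belongs to a selected module.
    param.requires_grad = any(name.startswith(m) for m in selected)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```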
https://arxiv.org/pdf/2405.14860 Not All Language Model Features Are Linear
embedding
https://arxiv.org/pdf/2407.12886 Whitening Not Recommended for Classification Tasks in LLMs
tokenizer
https://arxiv.org/pdf/2106.00400 Sub-Character Tokenization for Chinese Pretrained Language Models
long context
https://arxiv.org/pdf/2402.04617 InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
https://arxiv.org/pdf/2409.12181 A Controlled Study on Long Context Extension and Generalization in LLMs
https://arxiv.org/pdf/2409.02897 LONGCITE: ENABLING LLMS TO GENERATE FINEGRAINED CITATIONS IN LONG-CONTEXT QA
https://arxiv.org/pdf/2409.04774 Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models
https://arxiv.org/pdf/2305.13304 RECURRENTGPT: Interactive Generation of (Arbitrarily) Long Text
lost in the middle
[Read] https://arxiv.org/pdf/2307.03172 Lost in the Middle: How Language Models Use Long Contexts
The paper above is the seminal work. "Lost in the middle" roughly describes the following:
- When the relevant information sits in the middle of the input context, model performance drops significantly, showing that current language models do not always use information in long contexts robustly.
- There is a clear U-shaped performance curve: performance is highest when the relevant information appears at the beginning (primacy effect) or the end (recency effect) of the input context, and drops sharply in the middle.
- Even models designed specifically for long contexts suffer from this degradation.
I think this resembles attention sink, or at least attention sink makes this problem easy to think of. As for why lost in the middle happens, I think the biggest factor is still the training data: pretraining data is scraped from the internet, internet paragraphs are written by humans, and your grade-school teacher taught you to write "general, specific, general", so the two ends naturally carry more information (importance / impact on understanding) than the middle. Hence lost in the middle.
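A minimal sketch of how one could probe this effect (my own illustration, not the paper's code): place a gold fact at different depths of the context and check whether the model still answers; `ask_llm` is a hypothetical stub for whatever client you use.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your favorite LLM client here

gold = "The capital of Freedonia is Fredville."  # made-up fact
filler = [f"Paragraph {i}: irrelevant filler text." for i in range(20)]

for depth in (0, 10, 20):  # beginning / middle / end of the context
    docs = filler[:depth] + [gold] + filler[depth:]
    prompt = "\n".join(docs) + "\nQ: What is the capital of Freedonia? A:"
    answer = ask_llm(prompt)
    # Lost in the middle predicts a U-shaped accuracy curve over `depth`.
    print(depth, "Fredville" in answer)
```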
https://arxiv.org/pdf/2403.04797 Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding. Found this while searching the web; the title alone is fun.
Language models are compression models
https://arxiv.org/pdf/2305.14788 Adapting Language Models to Compress Contexts. The author list even includes Danqi Chen.
https://arxiv.org/pdf/2309.10668 LANGUAGE MODELING IS COMPRESSION
https://arxiv.org/pdf/2409.11233 Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
Model collapse
The Nature cover paper got bashed by Prof. Shen for a conclusion that is very intuitive.
https://arxiv.org/pdf/2404.01413 Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Reducing hallucination
https://arxiv.org/pdf/2405.20974 SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
LLM post training
I can't quite figure out what exactly this post training covers.
Data construction
https://arxiv.org/pdf/2409.11500 Multi-Document Grounded Multi-Turn Synthetic Dialog Generation
https://arxiv.org/pdf/2305.14233 Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. The classic UltraChat.
https://arxiv.org/pdf/2310.01377 ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. This paper provides the infra for DPO.
https://arxiv.org/pdf/2409.12568 InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning. This one is a multimodal math pre-training dataset.
self improve
https://arxiv.org/pdf/2406.01495 Re-ReST: Reflection-Reinforced Self-Training for Language Agents
https://arxiv.org/pdf/2406.03816 ReST-MCTS∗ : LLM Self-Training via Process Reward Guided Tree Search
https://arxiv.org/pdf/2407.18219 Recursive Introspection: Teaching Language Model Agents How to Self-Improve
https://arxiv.org/pdf/2409.03381 CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
RLHF
https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf Learning to summarize from human feedback
Inference acceleration
https://zhuanlan.zhihu.com/p/651359908 A neat trick for LLM inference: speculative decoding
alignment
The alignment domain is a bit big, so I pulled it out of post training.
https://arxiv.org/pdf/2401.05566 SLEEPER AGENTS: TRAINING DECEPTIVE LLMS THAT PERSIST THROUGH SAFETY TRAINING
https://arxiv.org/pdf/2308.06259 SELF-ALIGNMENT WITH INSTRUCTION BACKTRANSLATION
https://arxiv.org/pdf/2407.13692 PROVER-VERIFIER GAMES IMPROVE LEGIBILITY OF LLM OUTPUTS. Spotted while browsing the OpenAI Research homepage.
https://arxiv.org/pdf/2004.07213 Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. An old paper by now.
https://arxiv.org/pdf/2202.03286 Red Teaming Language Models with Language Models
https://www.zhihu.com/column/c_1725235995694276608 A guy on Zhihu who posts weekly roundups of sci-fi stories.
https://arxiv.org/pdf/2408.12163 Preference-Guided Reflective Sampling for Aligning Language Models
https://arxiv.org/abs/2403.05063 Aligning Large Language Models for Controllable Recommendations
https://arxiv.org/pdf/2404.18410 Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions
https://arxiv.org/pdf/2408.17003 SAFETY LAYERS OF ALIGNED LARGE LANGUAGE MODELS: THE KEY TO LLM SECURITY
https://arxiv.org/pdf/2408.12798 BACKDOORLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
https://arxiv.org/pdf/2409.11704 FROM LISTS TO EMOJIS: HOW FORMAT BIAS AFFECTS MODEL ALIGNMENT
https://arxiv.org/pdf/2409.08206 Compositional Alignment in Vision-Language Models
https://dl.acm.org/doi/pdf/10.1145/3688850 Exploiting Pre-trained Language Models for Black-box Attack against Knowledge Graph Embeddings
**https://arxiv.org/pdf/2409.13948 Aligning Language Models Using Follow-up Likelihood as Reward Signal**
https://arxiv.org/pdf/2409.14119 Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
https://arxiv.org/pdf/2403.02691 INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
https://arxiv.org/pdf/2409.18541 Align2LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation
backdoor attacks
https://arxiv.org/abs/2402.11208 Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
https://arxiv.org/abs/2402.14968 Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
https://ieeexplore.ieee.org/abstract/document/10697229 Diffense: Defense Against Backdoor Attacks on Deep Neural Networks With Latent Diffusion
https://aclanthology.org/2023.tacl-1.91.pdf Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training
unlearning
https://arxiv.org/pdf/2406.11614 Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces. Seems to be about corrupting word embeddings. The author wrote an explanation at https://zhuanlan.zhihu.com/p/708685124
https://arxiv.org/pdf/2402.16835 EIGHT METHODS TO EVALUATE ROBUST UNLEARNING IN LLMS. So this is where the experiments live.
Contrastive learning
https://arxiv.org/pdf/2402.11651 Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents. The first paper to exploit negative examples here, though the technique is very crude: it simply prepends a marker to each sample saying whether it is a positive or a negative example. The paper also covers the mixing ratio of positive to negative samples and draws some very interesting plots. A sketch of the trick follows below.
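The crude labeling trick, as I understand it from the note above, amounts to something like this minimal sketch (the marker strings are my own stand-ins, not the paper's exact prompts):

```python
# Prepend an explicit success/failure marker to each trajectory before SFT.
def format_example(task: str, trajectory: str, is_positive: bool) -> str:
    tag = ("This is a successful trajectory."
           if is_positive else "This is a failed trajectory.")
    return f"{tag}\nTask: {task}\n{trajectory}"

print(format_example("book a flight", "step 1: open the site ...", False))
```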
https://arxiv.org/pdf/2406.00888 Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback
https://arxiv.org/pdf/2409.14836 Orthogonal Finetuning for Direct Preference Optimization. One look at the figure on page 1 and it just clicks.
LLM applications
https://arxiv.org/pdf/2409.14807 Interpreting Multi-band Galaxy Observations with Large Language Model-Based Agents. I noticed the second author is in a department of astronomy, sensed something was off, then read the abstract and it gave me a real fright.
https://arxiv.org/pdf/2409.17166 ScriptSmith: A Unified LLM Framework for Enhancing IT Operations via Automated Bash Script Generation, Assessment, and Refinement
data augmentation
https://arxiv.org/pdf/2409.15376 ControlMath: Controllable Data Generation Promotes Math Generalist Models
role-play
https://arxiv.org/pdf/2409.11726 Revealing the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
LLM serves as world model
https://arxiv.org/pdf/2406.13945 CityBench: Evaluating the Capabilities of Large Language Model as World Model
https://arxiv.org/pdf/2406.13948 CityGPT: Empowering Urban Spatial Cognition of Large Language Models
https://arxiv.org/pdf/2407.13578v1 Large Language Models as Reliable Knowledge Bases?
https://arxiv.org/pdf/2408.15915 Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
https://arxiv.org/pdf/2306.09296 KOLA: CAREFULLY BENCHMARKING WORLD KNOWLEDGE OF LARGE LANGUAGE MODELS
https://arxiv.org/pdf/2409.12278 Making Large Language Models into World Models with Precondition and Effect Knowledge
text based simulators
[Read] https://arxiv.org/pdf/2406.06485 Can Language Models Serve as Text-Based World Simulators? This one argues that LLMs fail to realize that objects not currently being acted on still change over time in the environment, so the authors built some data for it.
https://arxiv.org/pdf/2107.04132 A Systematic Survey of Text Worlds as Embodied Natural Language Environments (a Ruoyao Wang paper)
https://arxiv.org/pdf/1909.05398 Interactive Fiction Games: A Colossal Adventure. The guys behind this paper built TextWorld.
https://arxiv.org/pdf/2312.11970v1 Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives
LLM sys
I can't quite define what this category is doing; roughly, it uses an LLM as the base to automate things that previously required humans? Under that definition, the following two papers aren't really agent-related. Or rather, the pipeline is only preliminary: the authors did not test its usability on complex tasks.
[Read] https://arxiv.org/pdf/2308.12261 PROMPT2MODEL: Generating Deployable Models from Natural Language Instructions
[Read] https://arxiv.org/pdf/2407.12874 SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning. Also a chenyang zhao paper. It discusses automatic synthesis of fine-tuning data: the conventional approach needs an external signal, i.e., a stronger teacher model, whereas this paper wants the student model to guide itself. Roughly that idea?
https://arxiv.org/pdf/2312.04889 KwaiAgents: Generalized Information-seeking Agent System with Large Language Models
https://arxiv.org/pdf/2409.03215 xLAM: A Family of Large Action Models to Empower AI Agent Systems
Recommender systems
https://arxiv.org/pdf/2403.06447 CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation
Vertical-domain LLMs
Law
https://arxiv.org/pdf/2409.11798 The Factuality of Large Language Models in the Legal Domain
Finance
https://arxiv.org/pdf/2408.12337 Fine-tuning Smaller Language Models for Question Answering over Financial Documents
Agriculture
https://d197for5662m48.cloudfront.net/documents/publicationstatus/223964/preprint_pdf/be49fd99a51b691cc4349b28c1d904a5.pdf Multi-Modal LLMs in Agriculture: A Comprehensive Review
Code generation
https://arxiv.org/pdf/2401.07339 CODEAGENT: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
https://arxiv.org/pdf/2312.13010 AgentCoder: Multi-Agent Code Generation with Effective Testing and Self-optimisation
https://arxiv.org/pdf/2405.17057 ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation. I did read this one, but its reflection is used for fine-tuning, so it wasn't much help to the me who was writing rubbish prompts at the time?
https://arxiv.org/pdf/2409.04114 MULTI-PROGRAMMING LANGUAGE ENSEMBLE FOR CODE GENERATION IN LARGE LANGUAGE MODEL. Apparently from a California company that does code generation!
https://software-lab.org/publications/icse2025_calibration.pdf Calibration and Correctness of Language Models for Code
https://dl.acm.org/doi/pdf/10.1145/3695993 Fine-tuning Large Language Models to Improve Accuracy and Comprehensibility of Automated Code Review
https://arxiv.org/pdf/2409.12020 Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization
https://arxiv.org/pdf/2409.00676 Fixing Code Generation Errors for Large Language Models
https://dl.acm.org/doi/pdf/10.1145/3637528.3671452 Reasoning and Planning with Large Language Models in Code Development
https://arxiv.org/pdf/2409.13928 Eliciting Instruction-tuned Code Language Models' Capabilities to Utilize Auxiliary Function for Code Generation. Possibly also paper-churning.
https://arxiv.org/pdf/2408.15658 An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation
https://arxiv.org/pdf/2409.06957 Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
Automated theorem proving
A field I genuinely don't know. Maybe in the future I can pick up some basics from classmates before reading carefully.
https://arxiv.org/pdf/2404.07382 Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving. Work from the group of jingbo shang, a former SJTU ICPC World Finals runner-up and now an associate professor at UCSD. It considers the learning value of negative examples when training agents for automated theorem proving. A 30%+ gain, which is pretty fierce. I didn't check closely whether they used RL or SFT; the trial-and-error work I'd seen before mostly used RL.
web navigation
https://arxiv.org/pdf/2404.10887 Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning
https://arxiv.org/abs/2404.03648 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
Evaluation
https://arxiv.org/pdf/2308.04026 An open-source sandbox for large language model evaluation
LLM Agents
https://arxiv.org/pdf/2402.01030 Executable Code Actions Elicit Better LLM Agents. Does an LLM using code count as a form of tool calling?
surveys
The group's agent survey list; a survey list of open-source frameworks; a tool learning survey paper list
https://arxiv.org/pdf/2409.14457 Large Model Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends
Active Agents
We want LLM agents to become proactive, rather than passively taking a prompt and emitting a completion.
https://arxiv.org/pdf/2409.17641 AP-VLM: Active Perception Enabled by Vision-Language Models
Tool calling
https://arxiv.org/pdf/2406.11200 AVATAR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval
https://arxiv.org/pdf/2406.12045 τ -bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
https://arxiv.org/pdf/2409.00920 ToolACE: Winning the Points of LLM Function Calling
https://arxiv.org/pdf/2302.04761 Toolformer: Language Models Can Teach Themselves to Use Tools
[Read] https://arxiv.org/pdf/2404.00450 Planning and Editing What You Retrieve for Enhanced Tool Learning. This one mainly optimizes the tool-retrieval part. The motivation is very down-to-earth: the existing retrieve & read pipeline has two problems. First, retrieving directly with the raw query introduces a large bias; second, hand-crafted tool descriptions don't align with queries. So the approach is to first do query decomposition (plan) and retrieve with each subpart; for each retrieved tool, align its description with the query, then pass the resulting information to the LLM.
[Read] https://arxiv.org/pdf/2410.03439 TOOLGEN: UNIFIED TOOL RETRIEVAL AND CALLING VIA GENERATION. Brute force flies: they added forty thousand tools, one token per tool, into LLaMA3.1-8B's vocabulary, followed by three-stage training.
It really feels like a deranged piece of work.
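The vocabulary-expansion trick reads like this minimal sketch (my own illustration; the token names are made up, and the paper's actual three-stage training recipe is not shown):

```python
# One new token per tool, added to the tokenizer; the embedding matrix is
# resized to match, and training then teaches the model to emit a tool
# token directly instead of retrieving one.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

tool_tokens = [f"<tool_{i}>" for i in range(40_000)]  # hypothetical tool names
tokenizer.add_tokens(tool_tokens)                     # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))         # grow the embeddings
```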
Mobile agents
https://arxiv.org/pdf/2406.11896 DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7479556 MobiGoal: Flexible Achievement of Personal Goals for Mobile Users
https://arxiv.org/pdf/2409.00088 On-Device Language Models: A Comprehensive Review
https://arxiv.org/pdf/2408.13933 MobileQuant: Mobile-friendly Quantization for On-device Language Models. This one seems to be about on-device quantization.
agents as operating systems
https://arxiv.org/pdf/2403.16971 AIOS: LLM Agent Operating System
https://arxiv.org/pdf/2409.16120 MOSS: Enabling Code-Driven Evolution and Context Management for AI Agents
agent workflow
https://arxiv.org/pdf/2311.10751 ProAgent: From Robotic Process Automation to Agentic Process Automation. Automated workflow design and monitoring.
https://arxiv.org/pdf/2410.10762 AFLOW: AUTOMATING AGENTIC WORKFLOW GENERATION
multiagent?
https://arxiv.org/pdf/2405.15677 SMART: Scalable Multi-agent Real-time Simulation via Next-token Prediction
https://arxiv.org/pdf/2408.11416 Subgoal-based Hierarchical Reinforcement Learning for Multi-Agent Collaboration
https://arxiv.org/pdf/2405.09935 DEBATE: Devil's Advocate-Based Assessment and Text Evaluation. Multiple agents can be brought into evaluation too.
https://arxiv.org/pdf/2404.01663 CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
https://arxiv.org/pdf/2305.14325 Improving Factuality and Reasoning in Language Models through Multiagent Debate. Mainly selling a new concept, I'd say.
https://arxiv.org/pdf/2408.15971 BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
https://arxiv.org/pdf/2308.10848 AGENTVERSE: FACILITATING MULTI-AGENT COLLABORATION AND EXPLORING EMERGENT BEHAVIORS
https://arxiv.org/pdf/2402.18439 Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication. This one is quite interesting: it asks what means of communication multi-agent systems have besides natural language. The method in this paper cuts token consumption by 70%+.
Agent training?
https://arxiv.org/pdf/2407.03502 AgentInstruct: Toward Generative Teaching with Agentic Flows
https://arxiv.org/pdf/2310.12823 AGENTTUNING: ENABLING GENERALIZED AGENT ABILITIES FOR LLMS
The idea of this paper: fine-tuning on a specific dataset makes the LLM lose general capability, so they concatenate the distilled dataset with a general dataset and train on the mix. Since this is an early paper, they could claim to be the first to do so.
https://arxiv.org/pdf/2312.08468 On Diagnostics for Understanding Agent Training Behaviour in Cooperative MARL. Work from a Tunisian school?
https://arxiv.org/pdf/2406.01495 Re-ReST: Reflection-Reinforced Self-Training for Language Agents
https://arxiv.org/pdf/2402.15506 AGENTOHANA: DESIGN UNIFIED DATA AND TRAINING PIPELINE FOR EFFECTIVE AGENT LEARNING
https://arxiv.org/pdf/2403.14589 ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy
https://arxiv.org/pdf/2406.04151 AGENTGYM: Evolving Large Language Model-based Agents across Diverse Environments
https://arxiv.org/pdf/2408.00764 AGENTGEN: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
https://aclanthology.org/2024.acl-long.670.pdf Agent LUMOS: Unified and Modular Training for Open-Source Language Agents
memory
https://arxiv.org/pdf/2404.13501 A Survey on the Memory Mechanism of Large Language Model based Agents
Behavior imitation
Probably a fairly fresh concept back then, but by now everyone seems to have played it to death.
https://arxiv.org/pdf/2306.02552 User Behavior Simulation with Large Language Model based Agents
https://arxiv.org/pdf/2408.07888 Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
https://arxiv.org/pdf/2409.15865 BeSimulator: A Large Language Model Powered Text-based Behavior Simulator
RAG
https://arxiv.org/pdf/2408.10497 QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention. A paper by competitive programmer y_dove.
https://arxiv.org/pdf/2305.17331 Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In
https://arxiv.org/pdf/2409.03708 RAG based Question-Answering for Contextual Response Prediction System
https://arxiv.org/pdf/2409.09916 SFR-RAG: Towards Contextually Faithful LLMs
https://aclanthology.org/2024.inlg-demos.3.pdf VideoRAG: Scaling the context size and relevance for video question-answering
https://arxiv.org/pdf/2409.14924 Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
https://arxiv.org/pdf/2409.12294 RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models
https://arxiv.org/pdf/2409.01666 In Defense of RAG in the Era of Long-Context Language Models. This title is hard to keep a straight face at.
https://arxiv.org/pdf/2405.16089 Towards Completeness-Oriented Tool Retrieval for Large Language Models. Focuses on the tool-retrieval step, capturing not only tool-task consistency but also synergy among tools. The main technical route is contrastive learning (at least the objective functions are contrastive-learning formulas).
https://arxiv.org/pdf/2404.16130 From Local to Global: A Graph RAG Approach to Query-Focused Summarization
https://arxiv.org/pdf/2409.05591 MEMORAG: MOVING TOWARDS NEXT-GEN RAG VIA MEMORY-INSPIRED KNOWLEDGE DISCOVERY
Fancy prompt engineering
https://arxiv.org/pdf/2406.06608 The Prompt Report: A Systematic Survey of Prompting Techniques
https://arxiv.org/pdf/2407.04118 MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization. Yet another adaptive-prompt work. It even has RL in it already.
https://arxiv.org/pdf/2409.11136 Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
X of thoughts
This part mainly collects methods for extending thoughts. Note that this is thought generation, not the ReAct pipeline, though some papers analyzing thoughts generalize directly to ReAct usage. The experimental datasets in these papers are extremely fixed: Game of 24, Pocket Cube, GSM8K, and so on.
https://arxiv.org/pdf/2208.14271 Faithful Reasoning Using Large Language Models. This architecture seemingly only works for multiple-choice questions?
https://arxiv.org/pdf/2205.10625 LEAST-TO-MOST PROMPTING ENABLES COMPLEX REASONING IN LARGE LANGUAGE MODELS. Mainly task decomposition; every step comes with few-shot examples.
https://arxiv.org/pdf/2211.12588 Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Also baseline number n for later papers?
DeepMind / Google Brain were already looking at reasoning in that era. True pioneers.
https://arxiv.org/pdf/2403.05313v1 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
First, a gripe: the gains this paper reports are relative percentages. The opening claims nearly 20%, and then you flip to table X and your jaw drops.
The method obtains thoughts via RAG. zhangzhong previously told me about problems with using RAG to get the next action (e.g., with naive RAG the results are semantically related but completely unrelated along every other dimension).
https://arxiv.org/pdf/2311.04254 EVERYTHING OF THOUGHTS : DEFYING THE LAW OF PENROSE TRIANGLE FOR THOUGHT GENERATION
https://arxiv.org/pdf/2406.04271 Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. You can tell from the date alone that the technique is more advanced. This one also seems to use the retrieve-thought trick.
https://arxiv.org/pdf/2409.12618 Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
[Read] https://arxiv.org/pdf/2409.12411 Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation. Roughly: CoT reasons in one shot, which naturally causes problems. One-shot reasoning can be turned into multi-step iteration, and agents are also multi-step iterative, so they rewrite the original CoT pipeline in the agent-style $(s_t, a_t, o_t)$ formulation, with the model serving as its own environment simulator.
openai-o1
https://arxiv.org/pdf/2408.08210 Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models
https://arxiv.org/pdf/2409.13183 SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models. The name alone is exciting!
https://arxiv.org/pdf/2407.00497 LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement. Hard to pass judgment on this one.
MCTS
https://arxiv.org/pdf/2409.09584 RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
multimodal
https://arxiv.org/pdf/2408.17150 Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning. People are already doing this…
https://arxiv.org/pdf/2409.09269 Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
https://arxiv.org/pdf/2408.15626 Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail. Fun title.
https://arxiv.org/pdf/2312.00849 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
https://arxiv.org/pdf/2409.07353 Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks. This one leans alignment.
https://arxiv.org/pdf/2402.15116 Large Multimodal Agents: A Survey
https://arxiv.org/pdf/2408.06040 ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers
https://arxiv.org/pdf/2408.11748 GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models
https://arxiv.org/pdf/2409.11148 Improving the Efficiency of Visually Augmented Language Models
https://arxiv.org/pdf/2408.02718 MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
https://arxiv.org/pdf/2409.05395 Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling
https://arxiv.org/pdf/2409.00844 How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
https://arxiv.org/pdf/2409.18042 EMOVA : EMPOWERING LANGUAGE MODELS TO SEE, HEAR AND SPEAK WITH VIVID EMOTIONS
https://arxiv.org/pdf/2409.14083 SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information. The name alone sounds fun.
https://arxiv.org/pdf/2409.15505 Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs. Honestly, from the title I can't tell what it does. Attributes, are they that important?
Vision-language model backbones
https://arxiv.org/pdf/2409.10488 Do Pre-trained Vision-Language Models Encode Object States?
https://arxiv.org/pdf/2409.14066 KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
https://arxiv.org/pdf/2409.13612 FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
VLM applications
https://arxiv.org/pdf/2405.10292 Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
https://arxiv.org/pdf/2409.10419 HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
https://arxiv.org/pdf/2406.13621v1 Improving Visual Commonsense in Language Models via Multiple Image Generation
The two papers above are both about visual commonsense.
prompt VLM
https://ojs.aaai.org/index.php/AAAI/article/view/28297 Self-Prompt Mechanism for Few-Shot Image Recognition
https://arxiv.org/pdf/2409.06166 Revisiting Prompt Pretraining of Vision-Language Models. First affiliation: Nankai University!
https://arxiv.org/pdf/2409.17143 Attention Prompting on Image for Large Vision-Language Models. One look at the page-1 figure and you know this paper is impressive!
https://arxiv.org/pdf/2203.05557 Conditional Prompt Learning for Vision-Language Models. This one should be very important.
https://arxiv.org/pdf/2409.14484 Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization. The title seems a bit generic.
https://arxiv.org/pdf/2409.15310 Visual Prompting in Multimodal Large Language Models: A Survey
https://arxiv.org/pdf/2305.01278 VPGTrans: Transfer Visual Prompt Generator across LLMs
Speech interaction
https://arxiv.org/pdf/2407.04051 FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
https://arxiv.org/pdf/2408.13106 NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
https://arxiv.org/pdf/2409.09554 ASR Error Correction using Large Language Models. ASR here means automatic speech recognition.
Video understanding
https://aclanthology.org/2024.acl-long.772.pdf DeVAn: Dense Video Annotation for Video-Language Models
https://arxiv.org/pdf/2104.04182 FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework. Whoa, ruoyao wang did this kind of thing too?
https://aclanthology.org/2020.lrec-1.536.pdf LifeQA: A Real-life Dataset for Video Question Answering
https://arxiv.org/pdf/2409.09348 QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems
https://arxiv.org/pdf/2409.07748 TOP-DOWN ACTIVITY REPRESENTATION LEARNING FOR VIDEO QUESTION ANSWERING
https://arxiv.org/pdf/2409.07747 MULTI-OBJECT EVENT GRAPH REPRESENTATION LEARNING FOR VIDEO QUESTION ANSWERING
Same author order, consecutive arXiv IDs. Truly two papers churned out back to back.
https://arxiv.org/pdf/2408.17443 HERMES: TEMPORAL-COHERENT LONG-FORM UNDERSTANDING WITH EPISODES AND SEMANTICS
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/05543.pdf Video Question Answering with Procedural Programs
https://arxiv.org/pdf/2408.12763 Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
https://arxiv.org/pdf/2409.14319 Scene-Text Grounding for Text-Based Video Question Answering
Robotics / Embodied Agents
https://arxiv.org/pdf/2405.13035v1 SIGMA: AN OPEN-SOURCE INTERACTIVE SYSTEM FOR MIXED-REALITY TASK ASSISTANCE RESEARCH
https://arxiv.org/pdf/2304.13705 Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
https://arxiv.org/pdf/2204.01691 Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. I think I collected this while listening to a talk.
https://arxiv.org/pdf/2212.06817 RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE. I think sergey levine plugged this in CS285.
https://arxiv.org/pdf/2407.02220 Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models
https://arxiv.org/pdf/2210.03370 GNM: A General Navigation Model to Drive Any Robot
https://arxiv.org/pdf/2409.10027 E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models
https://arxiv.org/pdf/2409.18313 Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation. Even this can be RAG'd…
https://arxiv.org/pdf/2409.15146 COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models
https://arxiv.org/pdf/2409.14908 KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems
Diffusion models
https://arxiv.org/pdf/2407.06938 RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
https://arxiv.org/abs/2402.03570 Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning
https://arxiv.org/abs/2405.12399 Diffusion for World Modeling: Visual Details Matter in Atari
diffusion policy
https://arxiv.org/pdf/2303.04137 Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
transfusion
This keyword turns up a whole pile of results on arXiv.
https://arxiv.org/pdf/2203.11496 TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers
https://arxiv.org/pdf/2403.18681 TRANSFUSION: CONTRASTIVE LEARNING WITH TRANSFORMERS
https://arxiv.org/abs/2311.09999 TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection
https://arxiv.org/abs/2210.07677 TransFusion: Transcribing Speech with Multinomial Diffusion
https://arxiv.org/pdf/2307.12667 TRANSFUSION: GENERATING LONG, HIGH FIDELITY TIME SERIES USING DIFFUSION MODELS WITH TRANSFORMERS
https://www.arxiv.org/pdf/2408.11039 Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Exploration
https://aclanthology.org/2024.acl-long.815.pdf Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models
https://arxiv.org/pdf/2409.12262 Bootstrapping Object-level Planning with Large Language Models
https://arxiv.org/pdf/2408.11815 Great Memory, Shallow Reasoning: Limits of kNN-LMs
https://arxiv.org/pdf/2409.15451 Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models (this one is an environment; more on the infra side)
DPO
https://arxiv.org/pdf/2409.17791 Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness
https://arxiv.org/pdf/2404.10719 Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
https://arxiv.org/pdf/2406.18629 STEP-DPO: STEP-WISE PREFERENCE OPTIMIZATION FOR LONG-CHAIN REASONING OF LLMS
DPO is contrastive learning over positive and negative samples. This paper argues that in long-reasoning scenarios, contrasting the entire trajectory as one positive/negative pair loses information, so the objective instead contrasts, at each step, the prediction for $y^+$ against the prediction for $y^-$.
At a glance, this is a case of making up a method, running some experiments, and shipping it; after all, the gains in the figures aren't that high either.
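For reference, the vanilla trajectory-level objective that Step-DPO refines is the standard DPO loss (my addition, from the original DPO paper, written in the $y^+ / y^-$ notation above):

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)}
        - \beta \log \frac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}
      \right)
    \right]
```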
https://arxiv.org/pdf/2406.11176 Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement
In one sentence: SFT loss + step-DPO loss + outcome-DPO loss (DPO without the steps). Very much a patchwork. But I don't think it solves the problem at the level of the method.
https://arxiv.org/pdf/2409.03650 ON THE LIMITED GENERALIZATION CAPABILITY OF THE IMPLICIT REWARD MODEL INDUCED BY DIRECT PREFERENCE OPTIMIZATION. This analyzes the difference between the DPO reward model and the RLHF reward model. The conclusion: training a model with the DPO reward model leaves it weak on OOD problems, losing 3 points on average and up to 7 points across five tests. So the DPO reward model's generalization is limited, and those iterative DPO methods are, in some sense, an ensemble of RLHF reward models?
https://arxiv.org/pdf/2406.09760 Bootstrapping Language Models with DPO Implicit Rewards
Iterative DPO needs a preference dataset built every round. This paper uses the previous round's reward model to score several responses generated by the current model; the highest and lowest scoring ones become $y_{win}, y_{lose}$, so no external supervision signal is needed.
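That pair-construction loop is simple enough to sketch (my own illustration; `generate` and `reward_fn` are hypothetical stubs):

```python
# Sample N responses from the current model, score them with the implicit
# reward from the previous round, take argmax/argmin as (y_win, y_lose).
def build_preference_pair(prompt, generate, reward_fn, n=8):
    responses = [generate(prompt) for _ in range(n)]
    scored = sorted(responses, key=lambda r: reward_fn(prompt, r))
    y_lose, y_win = scored[0], scored[-1]  # lowest vs. highest reward
    return prompt, y_win, y_lose           # fed into the next DPO round
```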
https://arxiv.org/pdf/2408.16751 A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Step Reward
https://arxiv.org/pdf/2310.10080 LET’S REWARD STEP BY STEP: STEP-LEVEL REWARD MODEL AS THE NAVIGATORS FOR REASONING
https://arxiv.org/pdf/2305.20050 Let's Verify Step by Step (from OpenAI)
https://arxiv.org/pdf/2402.01469 AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback
From LLM's point of view
https://arxiv.org/pdf/2402.02716 Understanding the planning of LLM agents: A survey
https://arxiv.org/pdf/2212.10403 Towards Reasoning in Large Language Models: A Survey
https://arxiv.org/pdf/2305.14992 Reasoning with Language Model is Planning with World Model
https://arxiv.org/pdf/2405.16376 STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making
https://arxiv.org/pdf/2403.03101 KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. Seems to be the earliest retrieve-next-action/thought paper.
https://arxiv.org/pdf/2402.17453 DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning
https://arxiv.org/pdf/2408.16737 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. A DeepMind paper.
https://arxiv.org/pdf/2404.02078 Advancing LLM Reasoning Generalists with Preference Trees
subgoal
https://arxiv.org/pdf/2406.04784 SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals
https://arxiv.org/pdf/2107.00541 Goal-Conditioned Reinforcement Learning with Imagined Subgoals
Long-horizon reasoning
https://arxiv.org/abs/2408.06318 Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example
https://arxiv.org/abs/2403.18760 MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model
https://arxiv.org/pdf/2403.08978 AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents
From RL's point of view
https://arxiv.org/pdf/2403.02502 Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
https://arxiv.org/pdf/2406.11176 Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement
https://arxiv.org/pdf/1802.07245 Meta-Reinforcement Learning of Structured Exploration Strategies. An ancient sergey levine RL paper; I didn't quite follow the intro, so I've shelved it for now.
https://arxiv.org/pdf/2409.01369v1 Imitating Language via Scalable Inverse Reinforcement Learning. This one is novel too.
https://arxiv.org/pdf/2107.10390 Reinforcement Learning Agent Training with Goals for Real World Tasks
https://arxiv.org/pdf/2110.12080 C-PLANNING: AN AUTOMATIC CURRICULUM FOR LEARNING GOAL-REACHING TASKS
https://arxiv.org/pdf/2106.10544 Learning Space Partitions for Path Planning
quiet q*
https://arxiv.org/pdf/2403.09629v1 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
https://arxiv.org/pdf/2408.02666 Self-Taught Evaluators (by Meta)
Budget-constrained reasoning
https://roboticsproceedings.org/rss20/p112.pdf AutoGPT+P: Affordance-based Task Planning using Large Language Models
Reinforcement learning
https://arxiv.org/pdf/2408.15240 Generative Verifiers: Reward Modeling as Next-Token Prediction
https://arxiv.org/pdf/2305.20050 Let’s Verify Step by Step
https://arxiv.org/pdf/2409.16663 Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models. Is this rehashing old work, or what? This group has been at it for so many years.
Step reward design
https://arxiv.org/pdf/1812.02690 Provably Efficient Maximum Entropy Exploration. Seems to be an impressive TCS paper.
https://github.com/WindyLab/LLM-RL-Papers A collection of LLM+RL papers maintained by Westlake University.
https://arxiv.org/pdf/2406.14324 Revealing the learning process in reinforcement learning agents through attention-oriented metrics
Foundational RL papers from ancient times
All things sergey levine mentioned in CS285.
https://arxiv.org/pdf/1707.01495 Hindsight Experience Replay
https://arxiv.org/pdf/1706.03741 Deep Reinforcement Learning from Human Preferences
https://arxiv.org/pdf/1912.06088 Learning to Reach Goals via Iterated Supervised Learning
https://arxiv.org/pdf/1903.01973 Learning Latent Plans from Play
AI-generated content detection
https://arxiv.org/abs/2409.14285 ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination
https://aclanthology.org/2023.emnlp-main.463.pdf Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
https://ieeexplore.ieee.org/abstract/document/10684742 Modality Perception Learning-Based Determinative Factor Discovery for Multimodal Fake News Detection
State-of-the-field summaries and reflective think pieces
https://arxiv.org/pdf/2407.01502 AI Agents That Matter. Bashes most current work for doing evaluation poorly.
https://arxiv.org/pdf/2211.16327 ON THE POWER OF FOUNDATION MODELS. A little category-theory paper.
Uncategorized
Generative agents: Interactive simulacra of human behavior
https://ysymyth.github.io/papers/Dissertation-finalized.pdf Famous guy shunyu yao's PhD dissertation
https://proceedings.neurips.cc/paper_files/paper/2011/file/e19347e1c3ca0c0b97de5fb3b690855a-Paper.pdf Unsupervised learning models of primary cortical receptive fields and receptive field plasticity. A bit too old; not sure whether it is still meaningful.
https://arxiv.org/pdf/2405.16137 Comparison between Behavior Trees and Finite State Machines
https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf Learning to summarize from human feedback. An OpenAI tour de force.
https://arxiv.org/pdf/2008.02217 HOPFIELD NETWORKS IS ALL YOU NEED
https://arxiv.org/pdf/2408.11431 Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning
https://arxiv.org/pdf/2307.06865 Effective Prompt Extraction from Language Models
https://dl.acm.org/doi/abs/10.1145/3637528.3672010 GraphWiz: An Instruction-Following Language Model for Graph Computational Problems
https://arxiv.org/pdf/2409.05283 On the Relationship between Truth and Political Bias in Language Models
https://www.sciencedirect.com/science/article/abs/pii/S0950705124010724 TabSAL: Synthesizing Tabular data with Small agent Assisted Language models
https://arxiv.org/pdf/2409.12990 Hyperbolic Brain Representations