Paper Management
There are simply too many papers, and coming up with a truly scientific taxonomy is just too hard. Some papers feel like they belong in one category today, and after rereading the title tomorrow they feel like they belong in another. The boundaries between some categories are inherently blurry (it may also be that I don't really understand the essence of many of these categories).
Honestly, this collection can have many motivations. Sometimes I simply noticed papers of the same kind, e.g., a few papers follow up on their original paper, so they got filed together; forcing the original paper's idea into some higher-level category would be a stretch. So if this taxonomy makes you uncomfortable (it frequently makes me uncomfortable too, and it slows down my filing), feel free to Ctrl+F.
2024-09-28: reorganized things, because I heard of a term called "post training" and suddenly knew how to sort several of the messier categories.
Notes:
- Category names are a mix of Chinese and English; they haven't been unified into a single language yet.
- Each entry follows the format: link (mostly from arxiv) + title + (some fun facts or reading impressions I wrote myself, almost all in Chinese).
- Since I mostly follow (scavenge) papers in the LLM direction, you can assume every heading is written from the LLM's point of view.
- Entries without a [Read] mark aren't necessarily unread.
Technical reports of commercial LLMs
https://www.arxiv.org/pdf/2408.08152 DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (from DeepSeek; it seems to have been hyped quite a bit)
https://arxiv.org/pdf/2409.12122 QWEN2.5-MATH TECHNICAL REPORT: TOWARD MATHEMATICAL EXPERT MODEL VIA SELF-IMPROVEMENT
https://arxiv.org/pdf/2409.12186 Qwen2.5-Coder Technical Report
https://arxiv.org/pdf/2404.06395 MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Big lists
Survey on efficient transformers
transformer circuits (I really want to read this; when will I ever have the time)
Problems that come up in deep learning
Model training
https://zhuanlan.zhihu.com/p/694263912 the BAdam optimizer?
https://arxiv.org/pdf/2409.03137 THE ADEMAMIX OPTIMIZER: BETTER, FASTER, OLDER "So this is TCS?"
https://arxiv.org/pdf/2409.11727 Enabling Real-Time Conversations with Minimal Training Costs
https://arxiv.org/pdf/2409.01790 Training on the Benchmark Is Not All You Need. Putting this one here is kind of funny.
Distillation
https://arxiv.org/pdf/2407.05682 Retrieved In-Context Principles from Previous Mistakes. Somehow this one feels like a dud.
https://www.zhihu.com/question/309808462/answer/3365782354
https://arxiv.org/pdf/2408.11796 LLM Pruning and Distillation in Practice: The Minitron Approach
https://arxiv.org/pdf/2409.12512 Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models
Interpretability
https://arxiv.org/pdf/2406.16033 Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models
Generalization
https://arxiv.org/pdf/2103.02503 Domain Generalization: A Survey (this one landing here shows how coarse-grained my categories are)
https://arxiv.org/pdf/2405.16766 Reframing the Relationship in Out-of-Distribution Detection
https://arxiv.org/pdf/2409.07335 Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization
https://arxiv.org/pdf/2409.04787 Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models
https://aclanthology.org/2023.findings-emnlp.768.pdf Improving generalization in large language models by learning prefix subspaces. Most likely I'll never find the time to properly understand this one.
https://arxiv.org/pdf/2409.18433 Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization. A benchmark for testing LLM generalization. But in the era of brute-force scaling, would anyone propose algorithms, rather than manufacture data, to improve generalization?
Knowledge transfer
https://arxiv.org/pdf/2408.10858 Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning
https://arxiv.org/pdf/2408.12525v1 Scaling, Control and Generalization in Reinforcement Learning Level Generators
continual learning
https://arxiv.org/pdf/2403.10056 Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning
https://github.com/xialeiliu/Awesome-Incremental-Learning A collection of incremental-learning papers curated by a university teacher. Unfortunately it has no taxonomy; it is only organized chronologically.
Competitive programming
Kids, my DNA just stirred!
https://arxiv.org/pdf/2409.09054 Evaluating the Performance of Large Language Models in Competitive Programming: A Multi-Year, Multi-Grade Analysis
Knowledge Graph
https://arxiv.org/pdf/2409.03155 One way to improve long reasoning is to pull information from a knowledge graph, and QA is one means of interacting with a knowledge graph. This paper improves LLM reasoning by improving knowledge graph QA.
https://arxiv.org/abs/2406.07080 DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs
https://arxiv.org/abs/2407.17190 Fusing LLMs and KGs for Formal Causal Reasoning behind Financial Risk Contagion
https://arxiv.org/pdf/2402.06861 UrbanKGent: A Unified Large Language Model Agent Framework for Urban Knowledge Graph Construction. This one really does seem to be a sub-domain affair.
Representation Learning
https://arxiv.org/pdf/2409.03662 The Representation Landscape of Few-Shot Learning and Fine-Tuning in Large Language Models. The title already says it all; the content is a bit too representation-learning-heavy for me to get through.
https://arxiv.org/pdf/2212.07677 Transformers Learn In-Context by Gradient Descent. An impressive paper that zhang zhong mentioned in an earlier discussion, but I haven't read it carefully.
https://arxiv.org/pdf/2409.04318 Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs. A paper on the mechanism of in-context learning, recommended on a WeChat official account.
https://arxiv.org/pdf/2408.13661 Hierarchical Network Fusion for Multi-Modal Electron Micrograph Representation Learning with Foundational Large Language Models. The first affiliation is TCS Research, which startled me.
https://arxiv.org/pdf/2409.12005 Representing Positional Information in Generative World Models for Object Manipulation
LLM benchmarks
https://arxiv.org/pdf/2408.15729 LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
https://arxiv.org/pdf/2308.07201 CHATEVAL: TOWARDS BETTER LLM-BASED EVALUATORS THROUGH MULTI-AGENT DEBATE
https://arxiv.org/pdf/2409.04168 From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks. Yet another new race track.
https://arxiv.org/pdf/2409.07641 SIMULBENCH: Evaluating Language Models with Creative Simulation Tasks
https://arxiv.org/pdf/2409.12060 PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
https://openreview.net/pdf?id=4k4cocpuSw Benchmarking Edge Regression on Temporal Networks
https://arxiv.org/pdf/2206.08514 A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks
https://arxiv.org/pdf/2408.08978 See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
https://arxiv.org/pdf/2409.00844 Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
https://arxiv.org/pdf/2409.15272 OMNIBENCH: TOWARDS THE FUTURE OF UNIVERSAL OMNI-LANGUAGE MODELS. So this is the benchmark for testing multimodal capability.
tool benchmarks
https://arxiv.org/pdf/2402.15491 API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
https://arxiv.org/pdf/2310.03128 METATOOL BENCHMARK FOR LARGE LANGUAGE MODELS: DECIDING WHETHER TO USE TOOLS AND WHICH TO USE
https://arxiv.org/pdf/2311.10775 TOOLTALK: EVALUATING TOOL USAGE IN A CONVERSATIONAL SETTING. This one isn't really a benchmark; it is more of an evaluation method.
long context benchmarks
https://arxiv.org/pdf/2402.13718 ∞BENCH: Extending Long Context Evaluation Beyond 100K Tokens
https://arxiv.org/pdf/2409.16191 HELLOBENCH: EVALUATING LONG TEXT GENERATION CAPABILITIES OF LARGE LANGUAGE MODELS
https://arxiv.org/pdf/2403.12766 NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
(Math&Science) Reasoning
https://arxiv.org/pdf/2409.02834 CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
https://arxiv.org/pdf/2311.09805 DOCMATH-EVAL: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
https://arxiv.org/pdf/2402.14008 OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
https://arxiv.org/pdf/2409.13730 VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning. Multimodal reasoning.
https://arxiv.org/pdf/2408.15778 LOGICGAME: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
https://arxiv.org/pdf/2409.12746 Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination. Could it test our gaokao?
https://arxiv.org/pdf/2409.13729 MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Reasoning benchmarks
https://arxiv.org/pdf/2402.17644 Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
In this paper, we focus on two major areas of advanced quantitative reasoning: statistical reasoning and causal reasoning, with examples shown in Figure 1. Given a data table from a sample survey, statistical reasoning aims to infer the underlying probability distribution, answering questions such as "what is the 95% confidence interval for the population mean of y"; causal reasoning aims to understand the causal relationships between variables, answering questions such as "what is the average treatment effect of t on y".
No doubt LLMs are now going toe-to-toe with math and statistics undergrads.
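As a concrete illustration of the statistical-reasoning question quoted above, here is a minimal sketch (my own, with made-up numbers) of computing a 95% confidence interval for the population mean of y:

```python
# Minimal sketch: 95% confidence interval for the population mean of y,
# using a t-interval over a small made-up sample.
import numpy as np
from scipy import stats

y = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7])  # sample of y
mean = y.mean()
sem = stats.sem(y)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(y) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean of y: ({lo:.2f}, {hi:.2f})")
```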
https://arxiv.org/pdf/2307.13692 ARB: Advanced Reasoning Benchmark for Large language Models
https://aclanthology.org/2024.acl-long.515/ T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. This is evaluation for tool learning.
https://arxiv.org/abs/2403.05307 Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents
https://arxiv.org/pdf/2302.04752 Benchmarks for Automated Commonsense Reasoning: A Survey. This survey ends with a big table; not sure whether I'll ever get a chance to use it.
instruction following benchmark
https://arxiv.org/pdf/2401.03601v1 INFOBENCH: Evaluating Instruction Following Ability in Large Language Models
https://arxiv.org/pdf/2310.07641 EVALUATING LARGE LANGUAGE MODELS AT EVALUATING INSTRUCTION FOLLOWING. How come the authors are all competitive programmers?
Federated learning
https://arxiv.org/pdf/2409.18461 Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration
LLM pretrain
https://arxiv.org/pdf/2407.06654 SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training. This data reweighting is pretty important; should I read this one carefully soon?
https://arxiv.org/pdf/2409.07787 Stable Language Model Pre-training by Reducing Embedding Variability
https://arxiv.org/pdf/2407.05013 Progress or Regress? Self-Improvement Reversal in Post-training
https://arxiv.org/pdf/2407.04787 Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning
https://arxiv.org/pdf/2409.15825 Empirical Insights on Fine-Tuning Large Language Models for Question-Answering. The title tells you exactly what it does.
https://arxiv.org/pdf/2409.12903 Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
https://www.sciencedirect.com/science/article/pii/S0306457324002516 Gauging, enriching and applying geography knowledge in Pre-trained Language Models. Probably no need to read at all; a bit too clickbaity.
https://arxiv.org/pdf/2405.15319 Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
https://arxiv.org/pdf/2211.00151 A Close Look into the Calibration of Pre-trained Language Models
https://arxiv.org/pdf/2409.15518 Eagle: Efficient Training-Free Router for Multi-LLM Inference
dataset
https://aclanthology.org/2022.findings-acl.17/ LEVEN: A Large-Scale Chinese Legal Event Detection Dataset
Model quantization
https://arxiv.org/pdf/2409.11055 A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
mixture of experts / fusion
https://arxiv.org/pdf/2407.04153 Mixture of A Million Experts. Written by a Google DeepMind guy.
https://arxiv.org/pdf/2401.10491 KNOWLEDGE FUSION OF LARGE LANGUAGE MODELS
https://arxiv.org/abs/2407.19985 Mixture of Nested Experts: Adaptive Processing of Visual Tokens
https://arxiv.org/pdf/2409.01483 Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
LoRA
https://arxiv.org/pdf/2311.03285 S-LORA: SERVING THOUSANDS OF CONCURRENT LORA ADAPTERS
reflection tuning
https://arxiv.org/pdf/2402.10110 Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
https://arxiv.org/pdf/2310.11716 Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
These two seem to form a series, but I've forgotten the details.
Research on LLMs themselves
https://www.nature.com/articles/s41586-024-07930-y Larger and more instructable language models become less reliable
https://arxiv.org/pdf/2409.14381 Investigating Layer Importance in Large Language Models
Constrained generation
https://arxiv.org/pdf/2408.12599 Controllable Text Generation for Large Language Models: A Survey
Catastrophic Forgetting
[Read] https://arxiv.org/pdf/2404.10306 Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model. On balancing versatility and speciality during fine-tuning: cramming in a lot of domain-specific content can destroy general capability. This paper first locates the modules in the model that most express the specialized capability, freezes the remaining parameters, and then fine-tunes on the fine-tuning data. As for the module-finding part, I couldn't even follow the pseudocode.
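The freeze-then-finetune half of that recipe is easy to picture in code. A minimal sketch (my own illustration, not the paper's module-selection algorithm; the module names are made-up stand-ins):

```python
# Freeze everything except a chosen set of "specialty" modules, then
# fine-tune as usual. Module names below are hypothetical placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
selected = ["transformer.h.10", "transformer.h.11"]   # assumed specialty modules

for name, param in model.named_parameters():
    # A parameter stays trainable only if it belongs to a selected module.
    param.requires_grad = any(name.startswith(m) for m in selected)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```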
https://arxiv.org/pdf/2405.14860 Not All Language Model Features Are Linear
embedding
https://arxiv.org/pdf/2407.12886 Whitening Not Recommended for Classification Tasks in LLMs
tokenizer
https://arxiv.org/pdf/2106.00400 Sub-Character Tokenization for Chinese Pretrained Language Models
long context
https://arxiv.org/pdf/2402.04617 InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
https://arxiv.org/pdf/2409.12181 A Controlled Study on Long Context Extension and Generalization in LLMs
https://arxiv.org/pdf/2409.02897 LONGCITE: ENABLING LLMS TO GENERATE FINEGRAINED CITATIONS IN LONG-CONTEXT QA
https://arxiv.org/pdf/2409.04774 Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models
https://arxiv.org/pdf/2305.13304 RECURRENTGPT: Interactive Generation of (Arbitrarily) Long Text
lost in the middle
[Read] https://arxiv.org/pdf/2307.03172 Lost in the Middle: How Language Models Use Long Contexts
The paper above is the seminal work. "Lost in the middle" roughly describes the following:
- When the relevant information sits in the middle of the input context, model performance drops significantly, showing that current language models do not always use information in long contexts robustly.
- There is a clear U-shaped performance curve: performance is highest when the relevant information appears at the beginning (primacy effect) or the end (recency effect) of the input context, and drops sharply in the middle.
- Even models designed specifically for long contexts suffer from this degradation.
I think this resembles attention sink, or at least attention sink makes this problem easy to think of. As for why lost in the middle happens, I think the biggest factor is still the training data: pretraining data is scraped from the internet, internet paragraphs are written by humans, and your grade-school teacher taught you to write "general, specific, general", so the two ends naturally carry more information (importance / impact on understanding) than the middle. Hence lost in the middle.
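A minimal sketch of how one could probe this effect (my own illustration, not the paper's code): place a gold fact at different depths of the context and check whether the model still answers; `ask_llm` is a hypothetical stub for whatever client you use.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your favorite LLM client here

gold = "The capital of Freedonia is Fredville."  # made-up fact
filler = [f"Paragraph {i}: irrelevant filler text." for i in range(20)]

for depth in (0, 10, 20):  # beginning / middle / end of the context
    docs = filler[:depth] + [gold] + filler[depth:]
    prompt = "\n".join(docs) + "\nQ: What is the capital of Freedonia? A:"
    answer = ask_llm(prompt)
    # Lost in the middle predicts a U-shaped accuracy curve over `depth`.
    print(depth, "Fredville" in answer)
```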
https://arxiv.org/pdf/2403.04797 Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding. Found this while searching the web; the title alone is fun.
Language models are compression models
https://arxiv.org/pdf/2305.14788 Adapting Language Models to Compress Contexts. The author list even includes Danqi Chen.
https://arxiv.org/pdf/2309.10668 LANGUAGE MODELING IS COMPRESSION
https://arxiv.org/pdf/2409.11233 Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
Model collapse
The Nature cover paper got bashed by Prof. Shen for a conclusion that is very intuitive.
https://arxiv.org/pdf/2404.01413 Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Reducing hallucination
https://arxiv.org/pdf/2405.20974 SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
LLM post training
I can't quite figure out what exactly this post training covers.
Data construction
https://arxiv.org/pdf/2409.11500 Multi-Document Grounded Multi-Turn Synthetic Dialog Generation
https://arxiv.org/pdf/2305.14233 Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. The classic UltraChat.
https://arxiv.org/pdf/2310.01377 ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. This paper provides the infra for DPO.
https://arxiv.org/pdf/2409.12568 InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning. This one is a multimodal math pre-training dataset.
self improve
https://arxiv.org/pdf/2406.01495 Re-ReST: Reflection-Reinforced Self-Training for Language Agents
https://arxiv.org/pdf/2406.03816 ReST-MCTS∗ : LLM Self-Training via Process Reward Guided Tree Search
https://arxiv.org/pdf/2407.18219 Recursive Introspection: Teaching Language Model Agents How to Self-Improve
https://arxiv.org/pdf/2409.03381 CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
RLHF
https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf Learning to summarize from human feedback
Inference acceleration
https://zhuanlan.zhihu.com/p/651359908 A neat trick for LLM inference: speculative decoding
alignment
The alignment domain is a bit big, so I pulled it out of post training.
https://arxiv.org/pdf/2401.05566 SLEEPER AGENTS: TRAINING DECEPTIVE LLMS THAT PERSIST THROUGH SAFETY TRAINING
https://arxiv.org/pdf/2308.06259 SELF-ALIGNMENT WITH INSTRUCTION BACKTRANSLATION
https://arxiv.org/pdf/2407.13692 PROVER-VERIFIER GAMES IMPROVE LEGIBILITY OF LLM OUTPUTS. Spotted while browsing the OpenAI Research homepage.
https://arxiv.org/pdf/2004.07213 Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. An old paper by now.
https://arxiv.org/pdf/2202.03286 Red Teaming Language Models with Language Models
https://www.zhihu.com/column/c_1725235995694276608 A guy on Zhihu who posts weekly roundups of sci-fi stories.
https://arxiv.org/pdf/2408.12163 Preference-Guided Reflective Sampling for Aligning Language Models
https://arxiv.org/abs/2403.05063 Aligning Large Language Models for Controllable Recommendations
https://arxiv.org/pdf/2404.18410 Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions
https://arxiv.org/pdf/2408.17003 SAFETY LAYERS OF ALIGNED LARGE LANGUAGE MODELS: THE KEY TO LLM SECURITY
https://arxiv.org/pdf/2408.12798 BACKDOORLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
https://arxiv.org/pdf/2409.11704 FROM LISTS TO EMOJIS: HOW FORMAT BIAS AFFECTS MODEL ALIGNMENT
https://arxiv.org/pdf/2409.08206 Compositional Alignment in Vision-Language Models
https://dl.acm.org/doi/pdf/10.1145/3688850 Exploiting Pre-trained Language Models for Black-box Attack against Knowledge Graph Embeddings
**https://arxiv.org/pdf/2409.13948 Aligning Language Models Using Follow-up Likelihood as Reward Signal**
https://arxiv.org/pdf/2409.14119 Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm
https://arxiv.org/pdf/2403.02691 INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
https://arxiv.org/pdf/2409.18541 Align2LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation
backdoor attacks
https://arxiv.org/abs/2402.11208 Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
https://arxiv.org/abs/2402.14968 Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
https://ieeexplore.ieee.org/abstract/document/10697229 Diffense: Defense Against Backdoor Attacks on Deep Neural Networks With Latent Diffusion
https://aclanthology.org/2023.tacl-1.91.pdf Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training
unlearning
https://arxiv.org/pdf/2406.11614 Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces. Seems to be about corrupting word embeddings. The author wrote an explanation at https://zhuanlan.zhihu.com/p/708685124
https://arxiv.org/pdf/2402.16835 EIGHT METHODS TO EVALUATE ROBUST UNLEARNING IN LLMS. So this is where the experiments live.
Contrastive learning
https://arxiv.org/pdf/2402.11651 Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents. The first paper to exploit negative examples here, though the technique is very crude: it simply prepends a marker to each sample saying whether it is a positive or a negative example. The paper also covers the mixing ratio of positive to negative samples and draws some very interesting plots. A sketch of the trick follows below.
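The crude labeling trick, as I understand it from the note above, amounts to something like this minimal sketch (the marker strings are my own stand-ins, not the paper's exact prompts):

```python
# Prepend an explicit success/failure marker to each trajectory before SFT.
def format_example(task: str, trajectory: str, is_positive: bool) -> str:
    tag = ("This is a successful trajectory."
           if is_positive else "This is a failed trajectory.")
    return f"{tag}\nTask: {task}\n{trajectory}"

print(format_example("book a flight", "step 1: open the site ...", False))
```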
https://arxiv.org/pdf/2406.00888 Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback
https://arxiv.org/pdf/2409.14836 Orthogonal Finetuning for Direct Preference Optimization. One look at the figure on page 1 and it just clicks.
LLM applications
https://arxiv.org/pdf/2409.14807 Interpreting Multi-band Galaxy Observations with Large Language Model-Based Agents. I noticed the second author is in a department of astronomy, sensed something was off, then read the abstract and it gave me a real fright.
https://arxiv.org/pdf/2409.17166 ScriptSmith: A Unified LLM Framework for Enhancing IT Operations via Automated Bash Script Generation, Assessment, and Refinement
data augmentation
https://arxiv.org/pdf/2409.15376 ControlMath: Controllable Data Generation Promotes Math Generalist Models
role-play
https://arxiv.org/pdf/2409.11726 Revealing the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
LLM serves as world model
https://arxiv.org/pdf/2406.13945 CityBench: Evaluating the Capabilities of Large Language Model as World Model
https://arxiv.org/pdf/2406.13948 CityGPT: Empowering Urban Spatial Cognition of Large Language Models
https://arxiv.org/pdf/2407.13578v1 Large Language Models as Reliable Knowledge Bases?
https://arxiv.org/pdf/2408.15915 Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
https://arxiv.org/pdf/2306.09296 KOLA: CAREFULLY BENCHMARKING WORLD KNOWLEDGE OF LARGE LANGUAGE MODELS
https://arxiv.org/pdf/2409.12278 Making Large Language Models into World Models with Precondition and Effect Knowledge
text based simulators
[Read] https://arxiv.org/pdf/2406.06485 Can Language Models Serve as Text-Based World Simulators? This one argues that LLMs fail to realize that objects not currently being acted on still change over time in the environment, so the authors built some data for it.
https://arxiv.org/pdf/2107.04132 A Systematic Survey of Text Worlds as Embodied Natural Language Environments (a Ruoyao Wang paper)
https://arxiv.org/pdf/1909.05398 Interactive Fiction Games: A Colossal Adventure. The guys behind this paper built TextWorld.
https://arxiv.org/pdf/2312.11970v1 Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives
LLM sys
I can't quite define what this category is doing; roughly, it uses an LLM as the base to automate things that previously required humans? Under that definition, the following two papers aren't really agent-related. Or rather, the pipeline is only preliminary: the authors did not test its usability on complex tasks.
[Read] https://arxiv.org/pdf/2308.12261 PROMPT2MODEL: Generating Deployable Models from Natural Language Instructions
[Read] https://arxiv.org/pdf/2407.12874 SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning. Also a chenyang zhao paper. It discusses automatic synthesis of fine-tuning data: the conventional approach needs an external signal, i.e., a stronger teacher model, whereas this paper wants the student model to guide itself. Roughly that idea?
https://arxiv.org/pdf/2312.04889 KwaiAgents: Generalized Information-seeking Agent System with Large Language Models
https://arxiv.org/pdf/2409.03215 xLAM: A Family of Large Action Models to Empower AI Agent Systems
Recommender systems
https://arxiv.org/pdf/2403.06447 CoRAL: Collaborative Retrieval-Augmented Large Language Models Improve Long-tail Recommendation
Vertical-domain LLMs
Law
https://arxiv.org/pdf/2409.11798 The Factuality of Large Language Models in the Legal Domain
Finance
https://arxiv.org/pdf/2408.12337 Fine-tuning Smaller Language Models for Question Answering over Financial Documents
Agriculture
https://d197for5662m48.cloudfront.net/documents/publicationstatus/223964/preprint_pdf/be49fd99a51b691cc4349b28c1d904a5.pdf Multi-Modal LLMs in Agriculture: A Comprehensive Review
Code generation
https://arxiv.org/pdf/2401.07339 CODEAGENT: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges
https://arxiv.org/pdf/2312.13010 AgentCoder: Multi-Agent Code Generation with Effective Testing and Self-optimisation
https://arxiv.org/pdf/2405.17057 ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation. I did read this one, but its reflection is used for fine-tuning, so it wasn't much help to the me who was writing rubbish prompts at the time?
https://arxiv.org/pdf/2409.04114 MULTI-PROGRAMMING LANGUAGE ENSEMBLE FOR CODE GENERATION IN LARGE LANGUAGE MODEL. Apparently from a California company that does code generation!
https://software-lab.org/publications/icse2025_calibration.pdf Calibration and Correctness of Language Models for Code
https://dl.acm.org/doi/pdf/10.1145/3695993 Fine-tuning Large Language Models to Improve Accuracy and Comprehensibility of Automated Code Review
https://arxiv.org/pdf/2409.12020 Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization
https://arxiv.org/pdf/2409.00676 Fixing Code Generation Errors for Large Language Models
https://dl.acm.org/doi/pdf/10.1145/3637528.3671452 Reasoning and Planning with Large Language Models in Code Development
https://arxiv.org/pdf/2409.13928 Eliciting Instruction-tuned Code Language Models' Capabilities to Utilize Auxiliary Function for Code Generation. Possibly also paper-churning.
https://arxiv.org/pdf/2408.15658 An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation
https://arxiv.org/pdf/2409.06957 Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
Automated theorem proving
A field I genuinely don't know. Maybe in the future I can pick up some basics from classmates before reading carefully.
https://arxiv.org/pdf/2404.07382 Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving. Work from the group of jingbo shang, a former SJTU ICPC World Finals runner-up and now an associate professor at UCSD. It considers the learning value of negative examples when training agents for automated theorem proving. A 30%+ gain, which is pretty fierce. I didn't check closely whether they used RL or SFT; the trial-and-error work I'd seen before mostly used RL.
web navigation
https://arxiv.org/pdf/2404.10887 Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning
https://arxiv.org/abs/2404.03648 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
Evaluation
https://arxiv.org/pdf/2308.04026 An open-source sandbox for large language model evaluation
LLM Agents
https://arxiv.org/pdf/2402.01030 Executable Code Actions Elicit Better LLM Agents. Does an LLM using code count as a form of tool calling?
surveys
The group's agent survey list; a survey list of open-source frameworks; a tool learning survey paper list
https://arxiv.org/pdf/2409.14457 Large Model Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends
Active Agents
We want LLM agents to become proactive, rather than passively taking a prompt and emitting a completion.
https://arxiv.org/pdf/2409.17641 AP-VLM: Active Perception Enabled by Vision-Language Models
Tool calling
https://arxiv.org/pdf/2406.11200 AVATAR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval
https://arxiv.org/pdf/2406.12045 τ -bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
https://arxiv.org/pdf/2409.00920 ToolACE: Winning the Points of LLM Function Calling
https://arxiv.org/pdf/2302.04761 Toolformer: Language Models Can Teach Themselves to Use Tools
[Read] https://arxiv.org/pdf/2404.00450 Planning and Editing What You Retrieve for Enhanced Tool Learning. This one mainly optimizes the tool-retrieval part. The motivation is very down-to-earth: the existing retrieve & read pipeline has two problems. First, retrieving directly with the raw query introduces a large bias; second, hand-crafted tool descriptions don't align with queries. So the approach is to first do query decomposition (plan) and retrieve with each subpart; for each retrieved tool, align its description with the query, then pass the resulting information to the LLM.
[Read] https://arxiv.org/pdf/2410.03439 TOOLGEN: UNIFIED TOOL RETRIEVAL AND CALLING VIA GENERATION. Brute force flies: they added forty thousand tools, one token per tool, into LLaMA3.1-8B's vocabulary, followed by three-stage training.
It really feels like a deranged piece of work.
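The vocabulary-expansion trick reads like this minimal sketch (my own illustration; the token names are made up, and the paper's actual three-stage training recipe is not shown):

```python
# One new token per tool, added to the tokenizer; the embedding matrix is
# resized to match, and training then teaches the model to emit a tool
# token directly instead of retrieving one.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

tool_tokens = [f"<tool_{i}>" for i in range(40_000)]  # hypothetical tool names
tokenizer.add_tokens(tool_tokens)                     # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))         # grow the embeddings
```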
Mobile agents
https://arxiv.org/pdf/2406.11896 DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7479556 MobiGoal: Flexible Achievement of Personal Goals for Mobile Users
https://arxiv.org/pdf/2409.00088 On-Device Language Models: A Comprehensive Review
https://arxiv.org/pdf/2408.13933 MobileQuant: Mobile-friendly Quantization for On-device Language Models. This one seems to be about on-device quantization.
agents as operating systems
https://arxiv.org/pdf/2403.16971 AIOS: LLM Agent Operating System
https://arxiv.org/pdf/2409.16120 MOSS: Enabling Code-Driven Evolution and Context Management for AI Agents
agent workflow
https://arxiv.org/pdf/2311.10751 ProAgent: From Robotic Process Automation to Agentic Process Automation. Automated workflow design and monitoring.
https://arxiv.org/pdf/2410.10762 AFLOW: AUTOMATING AGENTIC WORKFLOW GENERATION
multiagent?
https://arxiv.org/pdf/2405.15677 SMART: Scalable Multi-agent Real-time Simulation via Next-token Prediction
https://arxiv.org/pdf/2408.11416 Subgoal-based Hierarchical Reinforcement Learning for Multi-Agent Collaboration
https://arxiv.org/pdf/2405.09935 DEBATE: Devil's Advocate-Based Assessment and Text Evaluation. Multiple agents can be brought into evaluation too.
https://arxiv.org/pdf/2404.01663 CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
https://arxiv.org/pdf/2305.14325 Improving Factuality and Reasoning in Language Models through Multiagent Debate. Mainly selling a new concept, I'd say.
https://arxiv.org/pdf/2408.15971 BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
https://arxiv.org/pdf/2308.10848 AGENTVERSE: FACILITATING MULTI-AGENT COLLABORATION AND EXPLORING EMERGENT BEHAVIORS
https://arxiv.org/pdf/2402.18439 Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication. This one is quite interesting: it asks what means of communication multi-agent systems have besides natural language. The method in this paper cuts token consumption by 70%+.
Agent training?
https://arxiv.org/pdf/2407.03502 AgentInstruct: Toward Generative Teaching with Agentic Flows
https://arxiv.org/pdf/2310.12823 AGENTTUNING: ENABLING GENERALIZED AGENT ABILITIES FOR LLMS
The idea of this paper: fine-tuning on a specific dataset makes the LLM lose general capability, so they concatenate the distilled dataset with a general dataset and train on the mix. Since this is an early paper, they could claim to be the first to do so.
https://arxiv.org/pdf/2312.08468 On Diagnostics for Understanding Agent Training Behaviour in Cooperative MARL. Work from a Tunisian school?
https://arxiv.org/pdf/2406.01495 Re-ReST: Reflection-Reinforced Self-Training for Language Agents
https://arxiv.org/pdf/2402.15506 AGENTOHANA: DESIGN UNIFIED DATA AND TRAINING PIPELINE FOR EFFECTIVE AGENT LEARNING
https://arxiv.org/pdf/2403.14589 ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy
https://arxiv.org/pdf/2406.04151 AGENTGYM: Evolving Large Language Model-based Agents across Diverse Environments
https://arxiv.org/pdf/2408.00764 AGENTGEN: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
https://aclanthology.org/2024.acl-long.670.pdf Agent LUMOS: Unified and Modular Training for Open-Source Language Agents
memory
https://arxiv.org/pdf/2404.13501 A Survey on the Memory Mechanism of Large Language Model based Agents
Behavior imitation
Probably a fairly fresh concept back then, but by now everyone seems to have played it to death.
https://arxiv.org/pdf/2306.02552 User Behavior Simulation with Large Language Model based Agents
https://arxiv.org/pdf/2408.07888 Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
https://arxiv.org/pdf/2409.15865 BeSimulator: A Large Language Model Powered Text-based Behavior Simulator
RAG
https://arxiv.org/pdf/2408.10497 QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention. A paper by competitive programmer y_dove.
https://arxiv.org/pdf/2305.17331 Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In
https://arxiv.org/pdf/2409.03708 RAG based Question-Answering for Contextual Response Prediction System
https://arxiv.org/pdf/2409.09916 SFR-RAG: Towards Contextually Faithful LLMs
https://aclanthology.org/2024.inlg-demos.3.pdf VideoRAG: Scaling the context size and relevance for video question-answering
https://arxiv.org/pdf/2409.14924 Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
https://arxiv.org/pdf/2409.12294 RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models
https://arxiv.org/pdf/2409.01666 In Defense of RAG in the Era of Long-Context Language Models. This title is hard to keep a straight face at.
https://arxiv.org/pdf/2405.16089 Towards Completeness-Oriented Tool Retrieval for Large Language Models. Focuses on the tool-retrieval step, capturing not only tool-task consistency but also synergy among tools. The main technical route is contrastive learning (at least the objective functions are contrastive-learning formulas).
https://arxiv.org/pdf/2404.16130 From Local to Global: A Graph RAG Approach to Query-Focused Summarization
https://arxiv.org/pdf/2409.05591 MEMORAG: MOVING TOWARDS NEXT-GEN RAG VIA MEMORY-INSPIRED KNOWLEDGE DISCOVERY
Fancy prompt engineering
https://arxiv.org/pdf/2406.06608 The Prompt Report: A Systematic Survey of Prompting Techniques
https://arxiv.org/pdf/2407.04118 MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization. Yet another adaptive-prompt work. It even has RL in it already.
https://arxiv.org/pdf/2409.11136 Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
X of thoughts
This part mainly collects methods for extending thoughts. Note that this is thought generation, not the ReAct pipeline, though some papers analyzing thoughts generalize directly to ReAct usage. The experimental datasets in these papers are extremely fixed: Game of 24, Pocket Cube, GSM8K, and so on.
https://arxiv.org/pdf/2208.14271 Faithful Reasoning Using Large Language Models. This architecture seemingly only works for multiple-choice questions?
https://arxiv.org/pdf/2205.10625 LEAST-TO-MOST PROMPTING ENABLES COMPLEX REASONING IN LARGE LANGUAGE MODELS. Mainly task decomposition; every step comes with few-shot examples.
https://arxiv.org/pdf/2211.12588 Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Also baseline number n for later papers?
DeepMind / Google Brain were already looking at reasoning in that era. True pioneers.
https://arxiv.org/pdf/2403.05313v1 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
First, a gripe: the gains this paper reports are relative percentages. The opening claims nearly 20%, and then you flip to table X and your jaw drops.
The method obtains thoughts via RAG. zhangzhong previously told me about problems with using RAG to get the next action (e.g., with naive RAG the results are semantically related but completely unrelated along every other dimension).
https://arxiv.org/pdf/2311.04254 EVERYTHING OF THOUGHTS : DEFYING THE LAW OF PENROSE TRIANGLE FOR THOUGHT GENERATION
https://arxiv.org/pdf/2406.04271 Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. You can tell from the date alone that the technique is more advanced. This one also seems to use the retrieve-thought trick.
https://arxiv.org/pdf/2409.12618 Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
[Read] https://arxiv.org/pdf/2409.12411 Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation. Roughly: CoT reasons in one shot, which naturally causes problems. One-shot reasoning can be turned into multi-step iteration, and agents are also multi-step iterative, so they rewrite the original CoT pipeline in the agent-style $(s_t, a_t, o_t)$ formulation, with the model serving as its own environment simulator.
openai-o1
https://arxiv.org/pdf/2408.08210 Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models
https://arxiv.org/pdf/2409.13183 SKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models. The name alone is exciting!
https://arxiv.org/pdf/2407.00497 LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement. Hard to pass judgment on this one.
MCTS
https://arxiv.org/pdf/2409.09584 RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
multimodal
https://arxiv.org/pdf/2408.17150 Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning. People are already doing this…
https://arxiv.org/pdf/2409.09269 Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
https://arxiv.org/pdf/2408.15626 Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail. Fun title.
https://arxiv.org/pdf/2312.00849 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
https://arxiv.org/pdf/2409.07353 Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks. This one leans alignment.
https://arxiv.org/pdf/2402.15116 Large Multimodal Agents: A Survey
https://arxiv.org/pdf/2408.06040 ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers
https://arxiv.org/pdf/2408.11748 GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models
https://arxiv.org/pdf/2409.11148 Improving the Efficiency of Visually Augmented Language Models
https://arxiv.org/pdf/2408.02718 MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
https://arxiv.org/pdf/2409.05395 Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling
https://arxiv.org/pdf/2409.00844 How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?
https://arxiv.org/pdf/2409.18042 EMOVA : EMPOWERING LANGUAGE MODELS TO SEE, HEAR AND SPEAK WITH VIVID EMOTIONS
https://arxiv.org/pdf/2409.14083 SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information. The name alone sounds fun.
https://arxiv.org/pdf/2409.15505 Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs. Honestly, from the title I can't tell what it does. Attributes, are they that important?
Vision-language model backbones
https://arxiv.org/pdf/2409.10488 Do Pre-trained Vision-Language Models Encode Object States?
https://arxiv.org/pdf/2409.14066 KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
https://arxiv.org/pdf/2409.13612 FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
VLM applications
https://arxiv.org/pdf/2405.10292 Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
https://arxiv.org/pdf/2409.10419 HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
https://arxiv.org/pdf/2406.13621v1 Improving Visual Commonsense in Language Models via Multiple Image Generation
The two papers above are both about visual commonsense.
prompt VLM
https://ojs.aaai.org/index.php/AAAI/article/view/28297 Self-Prompt Mechanism for Few-Shot Image Recognition
https://arxiv.org/pdf/2409.06166 Revisiting Prompt Pretraining of Vision-Language Models. First affiliation: Nankai University!
https://arxiv.org/pdf/2409.17143 Attention Prompting on Image for Large Vision-Language Models. One look at the page-1 figure and you know this paper is impressive!
https://arxiv.org/pdf/2203.05557 Conditional Prompt Learning for Vision-Language Models. This one should be very important.
https://arxiv.org/pdf/2409.14484 Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization. The title seems a bit generic.
https://arxiv.org/pdf/2409.15310 Visual Prompting in Multimodal Large Language Models: A Survey
https://arxiv.org/pdf/2305.01278 VPGTrans: Transfer Visual Prompt Generator across LLMs
Speech interaction
https://arxiv.org/pdf/2407.04051 FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
https://arxiv.org/pdf/2408.13106 NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
https://arxiv.org/pdf/2409.09554 ASR Error Correction using Large Language Models. ASR here means automatic speech recognition.
Video understanding
https://aclanthology.org/2024.acl-long.772.pdf DeVAn: Dense Video Annotation for Video-Language Models
https://arxiv.org/pdf/2104.04182 FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework. Whoa, ruoyao wang did this kind of thing too?
https://aclanthology.org/2020.lrec-1.536.pdf LifeQA: A Real-life Dataset for Video Question Answering
https://arxiv.org/pdf/2409.09348 QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems
https://arxiv.org/pdf/2409.07748 TOP-DOWN ACTIVITY REPRESENTATION LEARNING FOR VIDEO QUESTION ANSWERING
https://arxiv.org/pdf/2409.07747 MULTI-OBJECT EVENT GRAPH REPRESENTATION LEARNING FOR VIDEO QUESTION ANSWERING
Same author order, consecutive arXiv IDs. Truly two papers churned out back to back.
https://arxiv.org/pdf/2408.17443 HERMES: TEMPORAL-COHERENT LONG-FORM UNDERSTANDING WITH EPISODES AND SEMANTICS
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/05543.pdf Video Question Answering with Procedural Programs
https://arxiv.org/pdf/2408.12763 Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
https://arxiv.org/pdf/2409.14319 Scene-Text Grounding for Text-Based Video Question Answering
Robotics / Embodied Agents
https://arxiv.org/pdf/2405.13035v1 SIGMA: AN OPEN-SOURCE INTERACTIVE SYSTEM FOR MIXED-REALITY TASK ASSISTANCE RESEARCH
https://arxiv.org/pdf/2304.13705 Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
https://arxiv.org/pdf/2204.01691 Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. I think I collected this while listening to a talk.
https://arxiv.org/pdf/2212.06817 RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE. I think sergey levine plugged this in CS285.
https://arxiv.org/pdf/2407.02220 Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models
https://arxiv.org/pdf/2210.03370 GNM: A General Navigation Model to Drive Any Robot
https://arxiv.org/pdf/2409.10027 E2Map: Experience-and-Emotion Map for Self-Reflective Robot Navigation with Language Models
https://arxiv.org/pdf/2409.18313 Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation. Even this can be RAG'd…
https://arxiv.org/pdf/2409.15146 COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models
https://arxiv.org/pdf/2409.14908 KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems
Diffusion models
https://arxiv.org/pdf/2407.06938 RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
https://arxiv.org/abs/2402.03570 Diffusion World Model: Future Modeling Beyond Step-by-Step Rollout for Offline Reinforcement Learning
https://arxiv.org/abs/2405.12399 Diffusion for World Modeling: Visual Details Matter in Atari
diffusion policy
https://arxiv.org/pdf/2303.04137 Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
transfusion
This keyword turns up a whole pile of results on arXiv.
https://arxiv.org/pdf/2203.11496 TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers
https://arxiv.org/pdf/2403.18681 TRANSFUSION: CONTRASTIVE LEARNING WITH TRANSFORMERS
https://arxiv.org/abs/2311.09999 TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection
https://arxiv.org/abs/2210.07677 TransFusion: Transcribing Speech with Multinomial Diffusion
https://arxiv.org/pdf/2307.12667 TRANSFUSION: GENERATING LONG, HIGH FIDELITY TIME SERIES USING DIFFUSION MODELS WITH TRANSFORMERS
https://www.arxiv.org/pdf/2408.11039 Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Exploration
https://aclanthology.org/2024.acl-long.815.pdf Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models
https://arxiv.org/pdf/2409.12262 Bootstrapping Object-level Planning with Large Language Models
https://arxiv.org/pdf/2408.11815 Great Memory, Shallow Reasoning: Limits of kNN-LMs
https://arxiv.org/pdf/2409.15451 Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models (this one is an environment; more on the infra side)
DPO
https://arxiv.org/pdf/2409.17791 Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness
https://arxiv.org/pdf/2404.10719 Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
https://arxiv.org/pdf/2406.18629 STEP-DPO: STEP-WISE PREFERENCE OPTIMIZATION FOR LONG-CHAIN REASONING OF LLMS
DPO is contrastive learning over positive and negative samples. This paper argues that in long-reasoning scenarios, contrasting the entire trajectory as one positive/negative pair loses information, so the objective instead contrasts, at each step, the prediction for $y^+$ against the prediction for $y^-$.
At a glance, this is a case of making up a method, running some experiments, and shipping it; after all, the gains in the figures aren't that high either.
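For reference, the vanilla trajectory-level objective that Step-DPO refines is the standard DPO loss (my addition, from the original DPO paper, written in the $y^+ / y^-$ notation above):

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)}
        - \beta \log \frac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}
      \right)
    \right]
```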
https://arxiv.org/pdf/2406.11176 Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement
In one sentence: SFT loss + step-DPO loss + outcome-DPO loss (DPO without the steps). Very much a patchwork. But I don't think it solves the problem at the level of the method.
https://arxiv.org/pdf/2409.03650 ON THE LIMITED GENERALIZATION CAPABILITY OF THE IMPLICIT REWARD MODEL INDUCED BY DIRECT PREFERENCE OPTIMIZATION. This analyzes the difference between the DPO reward model and the RLHF reward model. The conclusion: training a model with the DPO reward model leaves it weak on OOD problems, losing 3 points on average and up to 7 points across five tests. So the DPO reward model's generalization is limited, and those iterative DPO methods are, in some sense, an ensemble of RLHF reward models?
https://arxiv.org/pdf/2406.09760 Bootstrapping Language Models with DPO Implicit Rewards
Iterative DPO needs a preference dataset built every round. This paper uses the previous round's reward model to score several responses generated by the current model; the highest and lowest scoring ones become $y_{win}, y_{lose}$, so no external supervision signal is needed.
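That pair-construction loop is simple enough to sketch (my own illustration; `generate` and `reward_fn` are hypothetical stubs):

```python
# Sample N responses from the current model, score them with the implicit
# reward from the previous round, take argmax/argmin as (y_win, y_lose).
def build_preference_pair(prompt, generate, reward_fn, n=8):
    responses = [generate(prompt) for _ in range(n)]
    scored = sorted(responses, key=lambda r: reward_fn(prompt, r))
    y_lose, y_win = scored[0], scored[-1]  # lowest vs. highest reward
    return prompt, y_win, y_lose           # fed into the next DPO round
```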
https://arxiv.org/pdf/2408.16751 A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Step Reward
https://arxiv.org/pdf/2310.10080 LET’S REWARD STEP BY STEP: STEP-LEVEL REWARD MODEL AS THE NAVIGATORS FOR REASONING
https://arxiv.org/pdf/2305.20050 Let's Verify Step by Step (from OpenAI)
https://arxiv.org/pdf/2402.01469 AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback
From LLM's point of view
https://arxiv.org/pdf/2402.02716 Understanding the planning of LLM agents: A survey
https://arxiv.org/pdf/2212.10403 Towards Reasoning in Large Language Models: A Survey
https://arxiv.org/pdf/2305.14992 Reasoning with Language Model is Planning with World Model
https://arxiv.org/pdf/2405.16376 STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making
https://arxiv.org/pdf/2403.03101 KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. Seems to be the earliest retrieve-next-action/thought paper.
https://arxiv.org/pdf/2402.17453 DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning
https://arxiv.org/pdf/2408.16737 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling. A DeepMind paper.
https://arxiv.org/pdf/2404.02078 Advancing LLM Reasoning Generalists with Preference Trees
subgoal
https://arxiv.org/pdf/2406.04784 SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals
https://arxiv.org/pdf/2107.00541 Goal-Conditioned Reinforcement Learning with Imagined Subgoals
Long-horizon reasoning
https://arxiv.org/abs/2408.06318 Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example
https://arxiv.org/abs/2403.18760 MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model
https://arxiv.org/pdf/2403.08978 AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents
From RL's point of view
https://arxiv.org/pdf/2403.02502 Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
https://arxiv.org/pdf/2406.11176 Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement
https://arxiv.org/pdf/1802.07245 Meta-Reinforcement Learning of Structured Exploration Strategies. An ancient sergey levine RL paper; I didn't quite follow the intro, so I've shelved it for now.
https://arxiv.org/pdf/2409.01369v1 Imitating Language via Scalable Inverse Reinforcement Learning. This one is novel too.
https://arxiv.org/pdf/2107.10390 Reinforcement Learning Agent Training with Goals for Real World Tasks
https://arxiv.org/pdf/2110.12080 C-PLANNING: AN AUTOMATIC CURRICULUM FOR LEARNING GOAL-REACHING TASKS
https://arxiv.org/pdf/2106.10544 Learning Space Partitions for Path Planning
quiet q*
https://arxiv.org/pdf/2403.09629v1 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
https://arxiv.org/pdf/2408.02666 Self-Taught Evaluators (by Meta)
Budget-constrained reasoning
https://roboticsproceedings.org/rss20/p112.pdf AutoGPT+P: Affordance-based Task Planning using Large Language Models
Reinforcement learning
https://arxiv.org/pdf/2408.15240 Generative Verifiers: Reward Modeling as Next-Token Prediction
https://arxiv.org/pdf/2305.20050 Let’s Verify Step by Step
https://arxiv.org/pdf/2409.16663 Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models. Is this rehashing old work, or what? This group has been at it for so many years.
Step reward design
https://arxiv.org/pdf/1812.02690 Provably Efficient Maximum Entropy Exploration. Seems to be an impressive TCS paper.
https://github.com/WindyLab/LLM-RL-Papers A collection of LLM+RL papers maintained by Westlake University.
https://arxiv.org/pdf/2406.14324 Revealing the learning process in reinforcement learning agents through attention-oriented metrics
Foundational RL papers from ancient times
All things sergey levine mentioned in CS285.
https://arxiv.org/pdf/1707.01495 Hindsight Experience Replay
https://arxiv.org/pdf/1706.03741 Deep Reinforcement Learning from Human Preferences
https://arxiv.org/pdf/1912.06088 Learning to Reach Goals via Iterated Supervised Learning
https://arxiv.org/pdf/1903.01973 Learning Latent Plans from Play
AI-generated content detection
https://arxiv.org/abs/2409.14285 ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination
https://aclanthology.org/2023.emnlp-main.463.pdf Beat LLMs at Their Own Game: Zero-Shot LLM-Generated Text Detection via Querying ChatGPT
https://ieeexplore.ieee.org/abstract/document/10684742 Modality Perception Learning-Based Determinative Factor Discovery for Multimodal Fake News Detection
State-of-the-field summaries and reflective think pieces
https://arxiv.org/pdf/2407.01502 AI Agents That Matter. Bashes most current work for doing evaluation poorly.
https://arxiv.org/pdf/2211.16327 ON THE POWER OF FOUNDATION MODELS. A little category-theory paper.
Uncategorized
Generative agents: Interactive simulacra of human behavior
https://ysymyth.github.io/papers/Dissertation-finalized.pdf Famous guy shunyu yao's PhD dissertation
https://proceedings.neurips.cc/paper_files/paper/2011/file/e19347e1c3ca0c0b97de5fb3b690855a-Paper.pdf Unsupervised learning models of primary cortical receptive fields and receptive field plasticity. A bit too old; not sure whether it is still meaningful.
https://arxiv.org/pdf/2405.16137 Comparison between Behavior Trees and Finite State Machines
https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf Learning to summarize from human feedback. An OpenAI tour de force.
https://arxiv.org/pdf/2008.02217 HOPFIELD NETWORKS IS ALL YOU NEED
https://arxiv.org/pdf/2408.11431 Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning
https://arxiv.org/pdf/2307.06865 Effective Prompt Extraction from Language Models
https://dl.acm.org/doi/abs/10.1145/3637528.3672010 GraphWiz: An Instruction-Following Language Model for Graph Computational Problems
https://arxiv.org/pdf/2409.05283 On the Relationship between Truth and Political Bias in Language Models
https://www.sciencedirect.com/science/article/abs/pii/S0950705124010724 TabSAL: Synthesizing Tabular data with Small agent Assisted Language models
https://arxiv.org/pdf/2409.12990 Hyperbolic Brain Representations