LLM大模型：Reinforcement Learning-强化学习中思维链中COT、TOT和GOT的前世今生

　　这一轮爆火的AI热潮是被transformer架构点燃的，根据scanling law的观点， transformer这个架构有个显著的特点：大力出奇迹！计算量C=6*ND，N是模型参数，D是token数。N越大，网络压缩、承载信息的能力越大，但是需要的token也就越多，需要的算力也越多，这就是妥妥的烧钱啊！pre-train几百上千亿参数的大模型，所有成本加起来5M~10M+ dollar不等，普通的中小厂是完全无力承担的，大厂做起来也不轻松！更重要的是：

自互联网诞生以来，产生的优质数据是有限的，无法再支持更多参数的模型了，所以IIya认为pre-train的时代即将结束；
目前的LLM还有另一个“缺陷”，被yan lee chun、feifeili等一众大佬吐槽：next token的生成方式是根据上文所有token计算找到概率最大的那个token，所以LLM回答问题完全是“鹦鹉学舌”：只学到表面字符，没学到人类核心的总结和归纳能力；【人回答问题时，会先想好回复的思路，再组织和生成语言，但现在的transformer架构完全不是这样的】

　　那么问题来了: 怎么继续提升LLM的性能了？

　　记得chatGPT刚发布的时候，还是比较早期的3.5版本，经常出现幻觉，针对一些逻辑推理的问题还会出现错误，所以当时有研究人员发现了一些promt的诀窍：直接问问题，gpt的回答效果不好，但是如果给个demo样例，把解决问题的整个思路和流程详细展示出来，在这种情况下gpt回答问题的效果会好很多，这就是所谓的chain of thought，如下：

　　1、这种方式没有改变模型参数量，也没改变模型网络架构，只是在用户侧更改了prompt，效果立竿见影！但这么做也有一个问题：不是每个用户都很专业，所以不是每个用户都会使用COT的这个思路，怎么更加方便用户使用了？chatGPT o1模型诞生了：使用大量包含COT的训练预料对模型做微调，让LLM也学会人的思维方式；具体落地实施的时候，会训练两个模型：1个用来拆分问题，分成多个步骤；另一个用来逐步执行，并返回结果！根据这个思路，可以用工作流的方式模拟o1，dify举例如下：

　　由于模拟时并未fine-tune任何LLM，所以prompt尤为重要了，分解用户任务的prompt举例如下：

You are an expert AI assistant with advanced reasoning capabilities. Your task is to provide detailed, step-by-step explanations of your thought process. For each step:

1. Provide a clear, concise title describing the current reasoning phase.
2. Elaborate on your thought process in the content section.
3. Decide whether to continue reasoning or provide a final answer.

Response Format:
Use JSON with keys: 'title', 'content', 'next_action' (values: 'continue' or 'final_answer')

Key Instructions:
- Employ at least 5 distinct reasoning steps.
- Acknowledge your limitations as an AI and explicitly state what you can and cannot do.
- Actively explore and evaluate alternative answers or approaches.
- Critically assess your own reasoning; identify potential flaws or biases.
- When re-examining, employ a fundamentally different approach or perspective.
- Utilize at least 3 diverse methods to derive or verify your answer.
- Incorporate relevant domain knowledge and best practices in your reasoning.
- Quantify certainty levels for each step and the final conclusion when applicable.
- Consider potential edge cases or exceptions to your reasoning.
- Provide clear justifications for eliminating alternative hypotheses.

Example Response:
{
"title": "Initial Problem Analysis",
"content": "To approach this problem effectively, I'll first break down the given information into key components. This involves identifying...[detailed explanation]... By structuring the problem this way, we can systematically address each aspect.",
"next_action": "continue"
}

Remember: Thoroughness and clarity are crucial. Each step should provide meaningful progress towards the solution.

　　经过上述prompt，把复杂的任务分解成多个步骤，然后逐步执行，我问chatGPT4o一个经典问题：how many r in strawberry？回答如下：

[
    {
        "title": "Initial Problem Understanding",
        "content": "The problem is asking for the number of occurrences of the letter 'r' in the word 'strawberry'. To solve this, I will first analyze the word itself and identify the individual letters. By inspecting each letter in the word 'strawberry', we can count how many times 'r' appears.",
        "next_action": "continue"
    },
    {
        "title": "Word Analysis",
        "content": "The word 'strawberry' consists of the following letters: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. By reviewing this, we can clearly see that the letter 'r' appears three times. I will now verify this count to ensure accuracy.",
        "next_action": "continue"
    },
    {
        "title": "Verification of Count",
        "content": "To verify the accuracy of the count, I will manually check each letter position in the word 'strawberry'. The positions of 'r' in the word are the 3rd, 8th, and 9th positions. This confirms that 'r' appears exactly three times.",
        "next_action": "continue"
    },
    {
        "title": "Cross-checking with Alternative Methods",
        "content": "Another method to check this would be to count the occurrences programmatically. If I were to write a simple program (e.g., using Python's count() method), I would find that 'strawberry'.count('r') equals 3. This further confirms the manual count, providing additional certainty.",
        "next_action": "continue"
    },
    {
        "title": "Conclusion",
        "content": "After performing multiple checks, both manually and programmatically, I can confidently conclude that the word 'strawberry' contains 3 instances of the letter 'r'. There are no alternative explanations or errors identified during the analysis.",
        "next_action": "final_answer"
    }
]

　　看吧，第3步就得到完全正确的答案了！

　　2、COT开创性地引入了人的思维模式: 把复杂task分解成多个简单的task，然后逐个执行解决！但是COT还是不完美：只有一条链路，鲁棒性、容错差啊！试想：解决问题的方案只有一个，万一这个方案本身错了怎么办？方向都错了，后续怎么努力都白搭啊！为了解决这个鲁棒性问题，Tree of thought又诞生了：https://arxiv.org/pdf/2305.10601.pdf

　　如上图所示：树的每个分支代表解决问题的不同路径，也就是有多个备选方案，这允许模型评估不同方案并做出更明智的决策。既然TOT衍生出了很多树状形分支，怎么找到最合理、最优的分支了？对于树状结构，遍历所有节点无非就两种方法：DFS和BFS！这两种遍历方式怎么选择了？

BFS 会先探索当前层的所有节点，然后再移动到下一层，这有助于在早期阶段识别和排除不可能的路径，选择最优路径；比如24点游戏（解决步骤有限，可以在早期找到正确步骤）、创意写作（找到最优潜力的计划）等；
DFS 会沿着一个路径一直深入，直到达到解决方案或确定该路径不可行，然后回溯到上一个节点继续探索其他路径；比如填字游戏，需要沿着一个特定的线索深入探索，直到找到合适的单词或确定该线索无法继续，然后回溯并尝试其他线索；

　　论文作者使用GPT-4作为基座模型，分别采用IO、COT、TOT三种策略，分别在24 game、creatving writing做测试，结果如下：

　　Results. As shown in Table 3, IO and CoT prompting methods perform poorly with a word-level success rate less than 16%, while ToT significantly improves all metrics, achieving a word-level success rate of 60% and solving 4 out of 20 games. Such an improvement is not surprising, given IO and CoT lack mechanisms to try different clues, make changes to decisions, or backtrack.

　　3、TOT的效果比COT好，这就到头了？仔细看，TOT是树状结构，分支之间是不会连接的，换句话说，分支之间是永世分离的，这样真的好么？举个栗子：在某条岔路上往下，结果发现后续的路有问题，不能再继续往后走了，此时如果是树状结构，只能原路返回，这样做灵活性很差啊！如果在正确的节点也能走到其他分支，是不是灵活很多了？这就进一步衍生除了graph结构：graph of thoughts！与树状结构相比，图结构能够支持更多的连接，提供更加丰富和非线性的推理路径，详细如下：

非线性推理: 与树的逐层展开不同，图结构可以支持多条不同路径的交叉、重用和回溯。它允许模型根据上下文选择跳跃和返回，而不是强制按照固定的顺序进行推理。
更多的连通性: 图结构中的节点可以有多个连接，允许更多的关联和潜在的信息流动。这对于处理复杂任务或涉及多种相关因素的问题（如多模态推理）尤其重要。
提高灵活性: 在图结构中，节点和边的多样化允许更细粒度的推理和信息传递，适用于更加复杂的推理任务，比如推理过程中需要多次回溯或交叉参考的情况。
处理复杂关系: GoT对于需要多种关系类型的推理（如在知识图谱中进行推理）具有天然优势，因为图结构可以更加容易地表示复杂的关系网络。

　　（1）GOT图示如下：和COT、TOT比，最大的区别就是不同分支之间可能关联，岔路更多，更加灵活

　（2）整体的架构和执行流程图示如下：

　　论文作者创造了很多功能模块和概念，我们挨个来看看都有啥用：

prompter：发给LLM的prompt
parser：从LLM的回复中提取关键信息，用这些关键信息更新grapgh reasoning state中维护的网络节点信息
controller：控制整个LLM的thoughts转换，比如节点之间的跳转、路由

graph reasoning state: GRS，存储和维护LLM推理过程
ranking：找到排名最高的thoughts，用于后续的路径规划和备选
graph of operation：GoO，规划整个thoughts执行的步骤计划（把thoughts连接成网状）；执行thoughts过程中更新GRS中的状态信息，这里整个controller的核心
从GRS中选择合适的thoughts，传给prompter（最终目的是让LLM执行）
决定整个流程是结束还是继续，或者是启动下一轮与LLM的交互

scoring&validation：给LLM评分，判断其质量的好坏

　理解了整个architechure中每个模块的功能，其具体的执行流程就很简单了：

用户输入prompt
controller分解prompt，并生成graph of thoughts
按照特定的算法遍历thoughts，并把thought输出到LLM执行
scoring&validation验证LLM的输出质量

　原论文作者自己举了一个数字排序的例子：

初始系统提示：
- 用户提出排序请求，例如：“Hello. I want to sort the following input sequence of numbers: [3,1,9,...]”。
生成操作（Generate）：
- 使用Generate(t, k=4)提示，将输入列表分割成更小的部分。在这个例子中，将64个数字的列表分割成4个包含16个数字的子列表。
- 提示（Prompt）：“Split the following list of 64 numbers into 4 lists of 16 numbers each...”。
排序子列表：
- 对每个子列表使用Generate(t, k=1) + Repeat(k=4)提示进行排序。这意味着对每个子列表生成4个排序结果，并选择其中最好的一个。
- 提示（Prompt）：“Sort the following list of numbers in ascending order...”。
合并操作（Aggregate）：
- 使用Aggregate(t1, t2) + Repeat(k=3) + KeepBest(N=1)提示将两个已排序的子列表合并为一个更长的已排序列表。这个过程重复进行，直到所有子列表都被合并成最终的排序列表。
- 提示（Prompt）：“Merge the following 2 sorted lists of length {length1} each, into one sorted list of length {length2} using a merge sort style approach...”。
改进操作（Improve）：
- 如果合并后的列表不是完全正确的，使用Improve(t) + Repeat(k=4)提示来改进排序结果。这个过程会尝试修正排序中的错误。
- 提示（Prompt）：“The following two lists represent an unsorted list of numbers and a sorted variant of that list... Fix the sorted variant so that it is correct...”
评分和选择最佳结果：
- 每个生成的排序列表都会根据其准确性进行评分，选择评分最高的作为最佳结果。
真实值（Ground Truth）：
- 最后，将GoT得到的结果与预设的ground truth比较，以验证结果的正确性。

　　整个过程通过将一个大问题分解为多个小问题，然后逐步解决并合并结果，从而有效地利用了大型语言模型（LLM）的能力。通过这种方式，GoT能够处理比单个LLM提示窗口更大的输入序列，并且能够通过合并和改进步骤提高最终结果的准确性。

　　使用chatGPT3.5作为基座模型，各种不同thoughts的对比：

总结：

　　1、各种thought：并未改变模型的网络结构，也没额外finetune模型，仅仅是模仿人的思路：遇到复杂task时先生成解决问题的框架，分解成多个简单的task，然后逐个执行；属于工程方面的优化；

　　2、各种thought的思路和decoder最后一层softmax的思路完全一样，没有本质区别：softmax输出next token的概率，根据既定的策略选择具体的next token；各种TOT和GOT也是根据既定的策略选择下一步的方向，相比之下只是next step的颗粒度不同而已！各种thoughts一般以32或64tokens为步长做exploration！如果做token level的exploration，有两点困难：

计算量较大
token级别的reward训练语料收集较难（常见的reward都是针对整个response，而不是单个token）；貌似tsing清华的研究员在尝试解决sparse reward的问题：https://www.cnblogs.com/theseventhson/p/18662354

　　3、为什么要做各种thought？LLM核心原理是根据上文计算next token的概率，如果对复杂task做分解，得到多个简单的task，对于每个简单task，可以极大缩小next token的预测范围，从而提升回答的准确性（RAG不就是这个原理么？）；整体而言，也能处理比单个LLM提示窗口更大的序列输入，极大丰富context

COT核心思路是层层递进推导：A->B->C......->end；为啥不直接从A->end了？中间绕这么大一圈麻不麻烦啊！时间长、计算量还很大！这么多的核心原因：还是next token的概率问题！从A开始计算next token，比较容易得到B，但是无法得到end，所以只能先到B；同理，A->B作为上文，next token只能得到C，无法得到end，所以需要中间一环一环地桥接！

　　4、NLP任务，表面看是处理各种字符，本质是是在探索和研究人类的思维方式，字符只是承载和表达思维的符号！

参考：

1、https://github.com/bklieger-groq/g1

　 https://arxiv.org/pdf/2308.09687 Graph of Thoughts: Solving Elaborate Problems with Large Language Models

2、https://www.bilibili.com/video/BV1XVtfejE31/?spm_id_from=333.337.search-card.all.click&vd_source=241a5bcb1c13e6828e519dd1f78f35b2

　 https://www.bilibili.com/video/BV1vptYeLETk/?spm_id_from=333.337.search-card.all.click&vd_source=241a5bcb1c13e6828e519dd1f78f35b2

　 https://www.bilibili.com/video/BV1UBkNY6Ep5/?spm_id_from=333.337.search-card.all.click&vd_source=241a5bcb1c13e6828e519dd1f78f35b2 chatgpt工程师介绍o1

　 https://www.bilibili.com/video/BV1LASbYMERj/?spm_id_from=333.337.search-card.all.click&vd_source=241a5bcb1c13e6828e519dd1f78f35b2 TOT介绍

3、https://mp.weixin.qq.com/s/-pPhHDi2nz8hp5R3Lm_mww DeepResearch 的设计和实现，其效果吊打只能生成一次的传统RAG

　　2025.2以来，openai、google等传统厂家陆续推出了deep research的产品，能生成上万字的详细研究报告，核心思路还是TOC：

posted @ 2025-01-07 15:52 第七子007 阅读(1599) 评论(0) 收藏举报

刷新页面返回顶部

第七子007

LLM大模型：Reinforcement Learning-强化学习中思维链中COT、TOT和GOT的前世今生

公告