BAAI Conference 2023 Notes (6)
Invited Talks (Turing Award laureate Joseph Sifakis; Graphcore CTO Simon Knowles) - P1 - BAAI Community - BV1hh4y137BB
Can you hear me? Yes? OK. So, I am very glad to give this talk on testing system intelligence, and to start I want to say that
there is a lot of confusion today about what intelligence is and how it can be achieved. This confusion is fed by the media and by big tech companies, which propagate the opinion that human-level AI is only a few years away; you could call this the myth that has grown up around the question. Some believe that machines will take over, and that this is the end of the story. I certainly disagree with these opinions. If you open a dictionary, you will see intelligence defined as the ability to learn, to understand, and to think about the world in a logical way. Machines with abilities of this kind can do impressive things, but machines cannot outperform humans in situation awareness, in adapting to environmental change, and in creative thinking. I think it is very important to agree on a definition of intelligence: without a clear concept of what intelligence is, we cannot develop a theory of how it works.

I believe that today we have only weak AI, which gives us elements of intelligent systems, but we do not have the principles and techniques for composing them into larger intelligent systems, in the way bridges or buildings are built from building blocks. For the future, I think we will observe an accelerating convergence between IT and AI. We will need autonomous systems, and this will be a big step in the movement from weak AI to artificial general intelligence. Autonomous systems support a paradigm of intelligent systems that goes beyond the usual specialized machine-learning systems. They arise from the need to further automate existing organizations by replacing human agents with autonomous agents, which is what the Internet of Things envisions: self-driving cars, smart grids, smart factories. Autonomous systems are distributed systems composed of agents. These agents are often critical, and they should exhibit broad intelligence: they should manage dynamically changing, conflicting sets of goals, cope with the unpredictable uncertainty of physical environments, and of course cooperate harmoniously with human agents.

I will explain that the realization of the autonomy vision is hampered by the lack of trust we can place in the AI systems we must use, and also by some hard problems in systems engineering that do not have much to do with intelligence. This is the outline of my talk: I will try to compare human and machine intelligence, discuss the problem of trusting AI systems, introduce autonomous systems, and discuss the future.

You probably know Alan Turing, a founder of computer science, who had the idea of comparing human and machine intelligence with his famous test. His setup is as follows: you have, in two different rooms, a machine A and a human B, and an experimenter C who sends written questions to A and B and compares the answers. If C cannot tell which one is the computer and which one is the human, then A and B are equally intelligent. Why do I mention this test? Because today some people claim that their systems have successfully passed the Turing test, and are therefore intelligent systems, as intelligent as humans. The test has been criticized: success depends on the judgment of a human, so it is subjective, and the choice of test cases can introduce bias, since I can choose questions that favor either the human or the machine. Another argument is that the test is nothing but a simple conversation game, while most human intelligence is expressed through interaction with the environment: we move, we exhibit social behavior. Two years ago, in a paper, I proposed the replacement test. The idea is that an agent A, which can be a machine or a human, is as intelligent as an agent B performing a given task if A can successfully replace B. For example, I would say that a machine is as intelligent as a human driver if it can successfully replace the driver, or that a human is as intelligent as a cleaning robot if he can successfully replace it. Note that the Turing test is a special case of this test, where the task is just a conversation game.
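The replacement test can be written down as a predicate relative to a task. A minimal sketch follows; all names here are hypothetical illustrations of the structure, not code from the paper:

```python
class Task:
    """A task with a task-specific success criterion (agent -> bool)."""
    def __init__(self, name, succeeds_with):
        self.name = name
        self.succeeds_with = succeeds_with

def as_intelligent_as(candidate, incumbent, task):
    # Relative to this task only: the candidate matches the incumbent's
    # intelligence if it succeeds wherever the incumbent succeeds.
    return (not task.succeeds_with(incumbent)) or task.succeeds_with(candidate)

# The Turing test is the special case where the task is a conversation game;
# a driving task instead compares a machine against a human driver.
driving = Task("drive", lambda agent: agent.get("can_drive", False))
print(as_intelligent_as({"can_drive": True}, {"can_drive": True}, driving))  # True
```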
So this test relativizes and generalizes the concept of intelligence.

Another way to compare human and machine intelligence is to consider the different types of knowledge that humans and machines develop and use. You may know that the human mind combines two systems. What we call fast thinking is unconscious, automatic, effortless; it is the thinking behind all sorts of experiential, implicit knowledge. When I walk, when I speak, when I play an instrument, my brain solves a very hard computational problem, but I do not know how. System two is slow thinking: conscious, controlled, effortful. This is where all reasoned knowledge comes from; when I program, for example, I solve a problem thinking slowly, and I understand what I am doing. There is a striking analogy between these two thinking systems and our two computing systems today. We have conventional computers executing algorithms, which are the product of slow, conscious thinking; this is model-based knowledge, we can understand what we are doing, and it can be validated. In contrast, we have trained neural networks; this is data-based knowledge. They can tell cats from dogs, as children can, but this cannot be verified, because we do not understand how they reach their conclusions: we have the problem of explainability, a very important problem I will discuss later.

Humans and machines handle, as I said, different types of knowledge, which is illustrated in this diagram, where types of knowledge are arranged by validity and generality. We have non-empirical knowledge, which is independent of what happens in the external world: mathematical knowledge, or knowledge produced by reasoning. Below that you have empirical knowledge about the world. The simplest kind is just events and conditions: the temperature today is 20 degrees Celsius, for example. Then we have implicit empirical knowledge; this is system-one human knowledge, and it is also the knowledge produced by machine learning. We can predict, we can solve certain problems, but we do not know how. Above that we have scientific and technical knowledge, that is, empirical knowledge that relies on mathematical models; we have explanations, and we trust this knowledge. So from a technical point of view there is a big difference between the knowledge produced by machine learning and scientific knowledge, and this is what I am going to explain now. The processes of knowledge development are very similar, because in both cases it is empirical knowledge.
Consider a physics experiment, Galileo's famous experiments on acceleration that led to the formulation of his law. You have three steps: an experimentation step, then generalization, and then explanation. If you want to build a neural network to separate images of cats and dogs, say, you also start with an experimentation phase, in which a human labels the images; based on this labeling you train the neural network, hoping it will separate the images. But you have no model that explains the result. That is a very, very important difference.

Now note another way to compare human and machine intelligence: understanding where humans outperform machines in situation awareness. You have probably seen in the news that an autopilot mistook the moon for a yellow traffic light. This would never happen to a human, simply because a human understands that a traffic light cannot be in the sky. So what is common-sense knowledge? Common-sense knowledge is a semantic model of the world that is built automatically from birth, acquired through everyday experience. It is the model we use to interpret sensory information and natural language. What should also be emphasized is that human understanding combines bottom-up reasoning, from the senses to the semantic model of the mind, with top-down reasoning, from the semantic model back to perception. Let me give an example. Here, for instance, you can recognize a stop sign partially covered by snow. Why? Because the sensory information connects to a semantic model, a conceptual model of a stop sign and its attributes that we have in our minds: its size, its color, its vertical position, and so on. In contrast, a neural network must be trained to recognize stop signs under all possible weather conditions; this is a very, very important difference. Or if I show you this sequence of images, you immediately interpret it as a plane crash; in contrast, a machine can analyze each frame individually, but it cannot causally connect what happens in the frames to reach the same conclusion. So, to sum up: for machines to match human situation awareness, they should be able to build a model of the environment and understand it, in particular in completely new situations, and to combine bottom-up and top-down reasoning. This is a hard problem, probably the hardest problem in AI today, and I am not optimistic about the possibility of solving it in the near future, at least judging by the slow progress we have made so far in the semantic analysis of natural language.

A very important problem when you test the intelligence of a system is how you validate its properties. I said that one important problem of AI systems is their lack of explainability, so let me give a precise definition of what we need: today, a system is considered explainable if its behavior can be described by a mathematical model that we can understand. One approach is that a neural network is a construct that computes a mathematical function, and in theory this function can be written down, in particular for feed-forward networks, because we know the function each node computes: a weighted sum of the inputs, to which some activation function is applied. So in principle, in theory, it should be possible to construct this function, and there are papers discussing this, but you understand that complexity limits do not allow it.
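To make concrete the point that a feed-forward network is, in principle, an explicit mathematical function, here is a minimal sketch; the two-layer shape and random weights are purely illustrative:

```python
import numpy as np

def feed_forward(x, weights, biases, act=np.tanh):
    """A feed-forward net is a composition of weighted sums and activations.
    In principle this closed-form function is the 'explanation' of the net;
    in practice, at modern sizes, it is far too large to understand."""
    for W, b in zip(weights, biases):
        x = act(W @ x + b)
    return x

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]  # 4 -> 8 -> 2 units
bs = [np.zeros(8), np.zeros(2)]
print(feed_forward(np.ones(4), Ws, bs))
```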
Now, if we want to validate system properties, there are two approaches: verification and testing. Verification proceeds by reasoning on models, so verification in particular cannot be applied to neural networks. In modern systems engineering, verification is very important. Why? Because you need verification to guarantee universally quantified properties such as safety or security, since you must explore all the possible states of the system to guarantee them. For our AI systems the only way is testing, which is doomed to be an empirical validation method: you have limits on the types of properties you can check, and you cannot obtain the kind of guarantee you get from verified properties; defects can only be discovered with some probability, and this is a serious limitation.

I would also say that systems engineering is concerned with three main types of properties, and any requirement can be decomposed into these three types: safety, meaning that the system never reaches a bad, dangerous state during execution; security, meaning that the system resists attacks; and performance, which matters because performance properties characterize the technical and economic criteria about resources and their use. It is important to say that today we see publications claiming that intelligent systems satisfy certain properties, and none of them is sufficiently rigorous. In fact, epistemic and methodological requirements demand that the assertion that a system satisfies a property be accompanied by rigorous definitions. For example, if I say that my system is honest, then I should give a definition of honesty and an associated validation method. Typically, you can read that some company has driven ten billion autonomous miles in simulation, and that therefore the autopilot is safe enough; but this argument is not technically defensible, simply because they do not say how the simulated miles relate to real miles. Or people talk about responsible AI, meaning AI that meets criteria such as fairness, reliability, and so on; but you should of course provide characterizations of these properties, and say how to check that a neural network satisfies them. There are even papers talking about AI alignment, about aligning conversational agents with human values, but we do not even understand how human values arise, nor the value-based processes of human decision-making.
So I think there is a danger here, because people make claims about intelligent systems without following rigorous methods.

Let me say a few words about testing; this is very old material. What does it mean to test a system that takes an input x and produces an output y? A property is a relation between inputs and outputs. I have a system, I have a property, and there is a more technical term, the oracle: an agent that, when you apply x and observe y, decides success or failure, pass or fail. So the oracle is an agent that computes this predicate and makes the decision. Now, if I say that a system satisfies a property P and I want to validate this, it means that for every possible input x and the corresponding output y, the property is satisfied. You understand that for systems with a huge number of inputs this is impossible, so we need a method to cope with the complexity. For this, and I will not go into the details, test methods have been applied successfully to testing hardware and software; in physics we apply test methods, and in medicine we apply test methods. What characterizes a test method? A test method provides criteria for choosing among the possible test cases and for evaluating the results. A test method is characterized, in fact, by an efficiency function, which characterizes the efficiency of a set of inputs, capital X: it measures the degree to which this set explores the behavior of the system with respect to the property we want to validate. And there is a scoring function, which produces a score estimating the likelihood that the system satisfies P for a set of inputs X and the corresponding outputs Y. Without giving the details: we have no such theory for intelligent systems, and I can explain why. Another strong requirement is reproducibility. Reproducibility means that the outcome of a test campaign is independent of the choice of the inputs: if two sets of input vectors have the same efficiency for the system, their results, their scores, will be roughly the same. This guarantees the objectivity of the result; in fact, it shows that everything depends on how thoroughly you explore the system.
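Here is a minimal sketch of this vocabulary, an oracle plus a scoring function, with a toy stand-in for the system and a safety-style property; everything named below is hypothetical, not from the talk:

```python
import random

def oracle(x, y, prop):
    """The oracle decides pass/fail: does the input/output pair satisfy P?"""
    return prop(x, y)

def score(system, inputs, prop):
    """Empirical score: the fraction of test cases the oracle accepts."""
    verdicts = [oracle(x, system(x), prop) for x in inputs]
    return sum(verdicts) / len(verdicts)

system = lambda x: min(2 * x, 100)   # toy system: a saturating doubler
prop = lambda x, y: y <= 100         # toy safety property: never exceed 100

# Reproducibility: two input sets of equal efficiency (here, simply the same
# size drawn from the same domain) should yield roughly the same score.
xs1 = [random.uniform(0, 100) for _ in range(1000)]
xs2 = [random.uniform(0, 100) for _ in range(1000)]
print(score(system, xs1, prop), score(system, xs2, prop))
```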
After all, what a test campaign provides is a comparison. Here are applications of this idea of testing to various kinds of systems. For a physical system, say a flight controller, which must meet strong safety requirements, we have what we call white-box testing: we have a model of the system, so we can experimentally explore all its possible states and reach, by some criterion, what we call conclusive evidence. Then there is statistical testing, where you have no model of the system but you have some theory; to evaluate the efficacy of a vaccine, for example, you can reach statistical evidence. And here I consider three different types of intelligent systems: an image classifier, a simulation of a self-driving car system, and ChatGPT. For the first two systems, the properties can be formalized and you can have an oracle; the oracle can be a human, or it can be automated, and in fact I am working on this problem. Of course we also need a theory that defines coverage criteria, efficiency criteria, and we do not have one. For ChatGPT, anything goes: we cannot formalize the question-answer relation, and the oracles will be humans applying subjective criteria. You see that the criteria cannot be made unambiguous there. Without going into the details, I would say that for intelligent systems we are limited, because the properties should be rigorously specified, which excludes all the language transformers, and they should be effectively checkable, which excludes what we call human-centered properties, such as fairness or trustworthiness. There are other problems I have no time to discuss, but there is an issue because you can do adversarial testing of neural networks, and this is inconsistent with the reproducibility requirement; roughly, it means that observationally equivalent test cases can give very different scores.
Now let me talk about autonomous systems, because, as I said, autonomous systems are an important step for AI. To begin, let me explain the difference between automation and autonomy. I consider five different types of systems interacting with an environment; for this they need situation awareness and decision-making mechanisms. For a thermostat the problem is simple: the situation is just reading values from the environment, and the control is static; you can design a controller, this is easy. For a chess robot, things become much more complicated, not so much for situation awareness, but because we need dynamic goal management, since we have an explosion of the possible configurations and moves on the board. Then things get harder still, even for a soccer player, because the player must understand and analyze dynamic images and also handle dynamic goals. These are very hard problems, and they become even harder for self-driving cars.

So let me try to describe the behavior of an autonomous agent. Five years ago I proposed this architecture in a paper, and I have tried to apply these ideas to self-driving cars. Here I consider the autopilot of a self-driving car, a system interacting with its environment in the obvious manner: you have sensors and actuators. You receive frames here, and to achieve situation awareness you have a perception function that analyzes the frames and identifies obstacles and their kinematic attributes; then a reflection function builds a model of the external world, on which you base your decisions. You make decisions by combining two functions: goal management, which for each goal can launch a planner, and the planner generates commands for the actuators. It should be emphasized that an autonomous agent can have many different types of goals; for a self-driving car, more than fifteen different types: short-term goals, such as avoiding a collision or keeping the current trajectory, and long-term goals, such as driving to a destination. So this problem is hard to solve. So much for the reactive behavior of the agent; but it should also be emphasized that an autonomous agent should exhibit proactive behavior, based on knowledge management. You see here a component that is a knowledge base, in which I can store knowledge about the properties of the objects I may encounter; I also have a self-learning function that monitors the information and periodically updates the knowledge base. The idea is that you can use knowledge to enhance prediction and decision. Let me explain. If, in a self-driving car, there is a truck in front of me, and the perception function identifies the type of the truck, then by retrieving the properties of that truck type I can predict, for example, its maximum acceleration and its maximum speed, and so obtain better prediction; or the self-learning function can estimate some of the parameters I use for decision-making. So this is how I imagine the architecture of an autonomous agent. I should say that I do not know how to implement this architecture.
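A minimal sketch of the agent loop just described, perception, reflection into a world model, goal management, and planning, with a knowledge base consulted for prediction. All names and the toy braking rule are hypothetical illustrations of the structure, not an implementation from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    ident: str
    kind: str
    observed: dict

@dataclass
class KnowledgeBase:
    facts: dict = field(default_factory=dict)   # object kind -> known properties
    def lookup(self, kind): return self.facts.get(kind, {})
    def update(self, kind, props): self.facts.setdefault(kind, {}).update(props)

def agent_step(frame, perceive, kb, active_goals, plan):
    """One reactive cycle: perceive -> reflect -> manage goals -> plan."""
    objects = perceive(frame)
    # Reflection: the world model is the observations enriched with stored
    # knowledge, e.g. a recognized truck type yields its max acceleration.
    world = {o.ident: {**o.observed, **kb.lookup(o.kind)} for o in objects}
    commands = []
    for goal in active_goals(world):            # short- and long-term goals
        commands += plan(goal, world)
    return commands                              # sent to the actuators

# Tiny illustrative run:
kb = KnowledgeBase()
kb.update("truck", {"max_accel": 1.2})
perceive = lambda frame: [Obj("t1", "truck", {"dist": 40.0})]
goals = lambda world: ["avoid_collision"]
plan = lambda g, w: [("brake", 0.3)] if w["t1"]["dist"] < 50 else []
print(agent_step(None, perceive, kb, goals, plan))   # [('brake', 0.3)]
```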
The reason is a set of complexity limitations. There is the complexity of perception, which is well understood, and the ambiguity of the environment; the complexity of uncertainty, because the environment can change dynamically, driven by unpredictable physical and human processes; and then the complexity of decision. I said that for a self-driving car you may have to handle more than fifteen different goals, subject to different temporal constraints, and if I drop a goal, or change a goal, the change must remain consistent with the goals I am pursuing. These are technically very hard problems. I also want to explain that it is not enough to design an agent; you must integrate it into a complex cyber-physical environment. If I have an autopilot for a car, for example, the agent should also be able to cooperate harmoniously with human operators, and we know the problems: this is not just a human-machine interaction problem, and we know the setbacks the self-driving-car industry has had in trying to solve it. And if you have many agents, then you must face the problem of their coordination. What matters is that the coordination between agents does not prevent them from achieving their individual goals, and that the agents cooperate to achieve the goals of the global system, exhibiting what I call collective intelligence. Typically, in an autonomous transport system, the cars should coordinate to avoid bottlenecks, or to achieve a better, an optimal, occupancy of the road.
OK, so here is where we stand today with the challenge of building autonomous agents. On one hand, we have well-developed methods for automated systems that guarantee their trustworthiness; these are model-based, and consequently they are defeated by the complexity of autonomous systems, also because we must accommodate the integration of neural networks. The other approach, adopted today by some big companies, is to build end-to-end solutions: to implement an autopilot, for example, you have one huge neural network, often trained by imitation and simulation; you receive frames, you analyze the frames, and you produce steering angles and braking signals. I would not advise you to drive with these, because they lack trustworthiness. And there is another very, very important problem that people do not talk about, which is the problem of integrating these autopilots into electromechanical systems; the current techniques do not scale up and do not apply to monolithic neural networks. So I think that for the future we should try to take the best from each approach, and integrate, within the same architecture, model-based components and data-based components. For example, we have very good decision algorithms, and these can be model-based components; on the other hand, we need to use neural networks for perception and for solving some hard optimization problems. And of course the deployment problems remain.

This is my view of how far we are today from reaching the autonomy vision. Consider this diagram. I characterize intelligence by situation-awareness and decision-making capabilities: situation awareness for a single domain, or for many domains; decision-making for a single agent, or for a system of agents. I think the current focus of AI is here; what we are trying to do with self-driving cars and other autonomous systems is there, because you have goals over multiple domains; and in the future we should try to build systems that address the problem of the coordinated integration of agents. So for building autonomous systems, my opinion is that there is still a long way to go. I do not have time to say more.
Let me try to conclude with three slides about the future. I think the future lies in the integration of AI techniques into traditional systems engineering, because we need intelligent artifacts in general. It is equally important that traditional systems engineering is being disrupted today: the new trend is somehow eroding established practice, in particular in critical systems engineering. Some players do not follow the concept of safety by design; as I said, they adopt end-to-end solutions based on AI. Some countries allow self-certification, for example the United States, which means that it is not an independent agency but the manufacturer itself that guarantees the trustworthiness of the system, and this is a bit strange. They also allow periodic updates of critical software. So, given this disruption of systems-engineering practice, we have two important directions. One is hybrid design for intelligent systems: we should be able to integrate AI components into traditional systems-engineering development methods. And of course, since AI is here to stay, and may never be explainable, we should devise techniques for building trustworthy systems from untrusted components; we know how to do this for hardware systems, and we have to develop the theory. As for system validation, it is marked, as I explained, by the shift from verification to testing. I think that for intelligent systems the best we can hope for is statistics-based techniques that estimate confidence, and we should settle for weaker trustworthiness guarantees than we will ever have for our traditional systems. For critical systems, for example, we require fewer than ten to the minus nine failures per hour of operation; this will not be achieved.
Then, to conclude: I think there is a long way to go to bridge the gap between automation and autonomy. This is what we know, also from self-driving cars; the transition cannot be incremental, and to achieve the full autonomy vision we need to develop, in my opinion, new scientific and engineering foundations, and this will take some time. Another important issue is the characterization of intelligence. We should agree on the concept of intelligence, replace the Turing test, as I said, and generalize the notion of intelligence. One idea is that there are multiple intelligences: depending on the task you choose, a specific kind of intelligence is tested. We can also consider tasks that are not performed in the physical world, for example in virtual environments; you can compare the intelligence of a human and of a machine playing a game on a virtual map. And of course, if human intelligence is the reference, the benchmark, then artificial general intelligence should be able to perform and coordinate a set of tasks characteristic of human skills. Equally important, an idea I like is this idea of the space of possible intelligences. Why? Because humans are limited in analyzing multidimensional data, we know this, while AI systems live in multidimensional data; conversely, humans have common-sense knowledge, abstraction, creativity, and so on. So in this figure I show this complementarity, and we can imagine that by combining these skills we could have agents whose intelligence combines the extremes of human and machine intelligence. Another idea I find interesting is that the replacement test shows that equivalent systems may have very different creative processes. This is an idea that should be explored further. For example, I can have a problem that I can solve symbolically, and I can also solve it with machine learning; in fact, today there are interesting results showing that LLMs can solve some of the problems we solve with symbolic reasoning, and a tool is named here that can do this. In the future it will be very important to explore this vast space of intelligences, and especially to understand how human symbolic intelligence relates to machine intelligence.

My last slide. I think the validation of intelligent systems will be a very, very hot topic in the future.
I have already commented on the tendency to ignore the limitations of validating intelligent systems by adopting criteria that lower the logical and epistemic standards; you can read, for example, that a system understands natural language, and I do not think this is an acceptable claim, I mean technically. You can also find attributions of mental attitudes to systems, such as beliefs, desires, and intentions. I found this sentence in a paper, which I find interesting to read: we cannot prove that an agent always does the right thing, but only that it acts for the right reasons. I think this is a bad approach. It means, you see, that if an agent does something bad, you only check whether the reasons were right; but I think this is a bad approach from a technical point of view, because it assumes that machines can understand the world, and we have known at least since the Chinese Room argument that this is questionable. Of course I am simplifying, but if we apply rigorous systems-engineering criteria, we exclude, for example, the language transformers, because we cannot apply testing techniques to them. So one question is whether we can apply, for example, qualification exams to LLMs. Maybe LLMs can pass some final exams in a given domain, as students do. However, I would point out two fundamental differences between neural networks and humans. Human thinking is robust: if a person is not insane, their thinking is robust, while neural networks are not robust, and a slight variation of a question can produce a very different answer. Human thinking is also grounded in common sense and is much better at staying consistent; we know that LLMs can produce very inconsistent answers in some cases. So, to conclude, I think we should recognize the need for validation methods for intelligent systems, and work in a clear manner to overcome the current limitations, without cheating; develop new foundations where possible; and revise the epistemic and methodological requirements if we must, but carefully, understanding what we are doing. Thank you for your attention; in a book I published a year ago, I discuss many of these issues.
Thank you very much.
Thank you very much, Professor Sifakis, for this insightful talk. May I ask a few questions? Yes, of course. In the self-learning part of AI, you mentioned that advanced autopilots such as Tesla's have mistaken the moon for a yellow traffic light. Humans also make mistakes, but once a human realizes something is a mistake, he or she can correct it and move on. How can a system learn from everything and then get better without being reset or fully retrained by the manufacturer? Can it learn on the go? How can we achieve such self-learning, self-correcting mechanisms in autonomous systems?

I mean, the obvious answer is that if you want some guarantee for a self-driving car, you use redundancy. But I think the problem with neural networks is that they cannot relate information to a context, whereas humans contextualize: you receive sensory information and you place it in a context. We do not have this contextualization with neural networks. And there is a technical problem; I am working on self-driving cars, and the technical problem I face is, very precisely, the following: you receive sensory information from the cameras, and you want to combine it with information coming from maps, and the information from maps is symbolic. You can have HD maps in libraries, and you would like to match the two, because then you would have better guarantees, since you would not depend only on the cameras, and you would also have better predictability. I do not know how to solve this problem. So, put simply, today there is a gap between the knowledge produced by neural networks and symbolic knowledge; this is also the gap between symbolic AI and connectionist AI. It is an open problem, and we do not know how to bridge it.
If we could combine the two approaches, that would be fine, but we do not know how. The human mind has this capability of relating sensory information to conceptual models of the world; if we could do that, then we could reach human intelligence.

Yes. Another, similar question, about rules. For example, driving in the UK and in France: cars drive on different sides of the road. A human driver going from the UK to France can adjust automatically. So how can we instill, how do we teach an algorithm, this kind of rule? Or can it learn it on the fly? Is there a way to trigger a different common sense, a "now this is France" common sense?

Yes, yes, this is exactly the problem of contextualization. I change the context, and with the same system you would like to still be able to drive the car. If I have to retrain my system whenever it moves from France to the UK, well, I mean, that is exactly the contextualization problem, yes.

OK, my last question. At the end of the slides you mentioned that we need to improve our tests, our criteria, for judging whether a model is really good. From current data, let us take driving: say the average performance of autonomous driving already exceeds the human average; but it seems to me that society expects perfection from autonomous driving, and if it makes one mistake, that is a very big deal.

Yes, that is true, that is indeed true. Human judgment is much harsher on machines than on humans. If a human has an accident, humans can explain why it happened, and people are prepared to hear that; if a machine has an accident, of course, it is very different. So this is one criterion. But on the other hand, I think what happens with humans is easier to accept because, I mean, we are familiar with the way humans think, while a machine can have unpredictable behaviors that may be considered completely crazy. And there are such phenomena; we know, for example, that for neural networks small variations can produce very, very different behaviors. Yes. In fact, I have a question from the audience here. Many of our participants are college students,
and they see AI making progress every week, replacing coding jobs and then other tasks. So they wonder: in today's world, if I am studying an IT-related major in college, should I still learn basic sorting and logic, or should I leave that to the machines? What does a career path in computer science or IT look like today? Do you have any advice for them? Thank you.

Yes, but you see, I think this is more a matter of technique. If you want to train them in AI, OK, you will use algorithms, you will write programs. Maybe for AI you need some more specialized mathematics, but I think AI is part of computer science, and it is very important for students to know the foundations, the theory, and to have a strong mathematical background. I do not think there is an important dilemma here: students should acquire a broad culture and understand all the systems-engineering problems that will arise, because I believe in systems engineering; that is important.

OK, thank you for the advice. We are almost out of time, so thank you very much, Professor Joseph Sifakis. Thank you for inviting me. We hope we can find time to meet face to face in China. We hope so; this year I will start traveling to China again. OK, thank you for coming. Bye-bye.
Bye-bye, everyone. Today I am honored to be here to host the online talk by Simon Knowles. Simon Knowles is co-founder, CTO, and EVP of engineering at Graphcore, and he is regarded as the original architect of the Intelligence Processing Unit, the IPU. Over a career of more than thirty years, Mr. Knowles has been designing novel processors for emerging workloads, focusing on intelligence processing since 2012. Simon previously founded two other successful processor companies: the first, Element 14, was acquired by Broadcom in 2000; the second, Icera, was acquired by Nvidia in 2011. Simon's educational background includes an engineering degree from Cambridge University. So I am now very pleased to welcome Mr. Simon Knowles to give us a talk on matching silicon to AI algorithms.

OK, welcome, Simon. Thank you, Thomas, and thank you for inviting me to speak at this prestigious event. Can you all see the screen I am sharing? Yes, yes. OK, good. So, I am not an expert in AI like most people on this call, so I probably will not teach you anything about AI algorithms, but I hope to shed at least some light on chip design for AI, and perhaps on the intersection of the two:
how chip designers and AI algorithm designers can cooperate for better AI outcomes in the future.

Where is AI research? Years of intensive research have brought it to the point where it offers obviously effective AI of general utility, and investment is now shifting from discovering what AI can do, the capability phase, to deploying AI, with a focus on efficiency. Chips are adapting accordingly. Deploying AI today is expensive: it is expensive to train models, and it is expensive to deploy them for inference. So what can we do about that?

The primary resources an AI processor uses to execute any algorithm, and especially AI, which in this context is essentially dense matrix arithmetic, are these. Arithmetic, calibrated in flops per second at a certain energy per flop, per floating-point operation. Memory capacity; for those not familiar with silicon memory, it consumes essentially no power while it is just remembering things, and only consumes power when you access it. Memory bandwidth for reads and writes, also calibrated in bytes per second at a certain energy per byte. And the same for data transfer between chips.

Over the last decade, which I might call the first AI decade, we have seen a roughly 300x increase in the flops per second available for AI from a typical graphics processor, a GPU. I think Jensen Huang has advertised a million-x, but that includes another factor of about 3,000 from using 3,000 times more chips; I am talking about one chip, which has moved about 300x. I am comparing Nvidia's Maxwell generation of 2014, using 32-bit floating point, with the latest Hopper, at roughly two petaflops of 8-bit floating-point arithmetic, in 2023. Where does this improvement come from? I have tried to break it down here; these are multiplicative factors, by the way, so they all multiply up to 300. About half of it, roughly the square root of 300, comes from using matrix multipliers on smaller number formats, in other words going from float-32 to float-8. Most of the rest comes from improvements in silicon engineering: transistor density improved about 8x, going from 28 nm down to 5 nm. A little, about 1.7x, came from the clock speed increasing to 1.85 GHz from 1.05, I think; but that was already expensive in power terms, costing about 3x in chip power. Finally, a small fraction came from improvements in GPU architecture; I think this is because GPUs were already so well established in graphics that it was hard to evolve those chips for AI without destroying their utility in graphics.
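As a sanity check on that breakdown, the factors multiply as follows; the individual numbers are approximate readings of the talk, and the residual architecture factor is inferred so that the product reaches about 300:

```python
# Multiplicative factors behind the ~300x flops/s gain per chip, 2014 -> 2023.
number_format = 17.0   # float-32 -> float-8 matrix math, about sqrt(300)
density       = 8.0    # 28 nm -> 5 nm transistor density
clock         = 1.7    # 1.05 GHz -> 1.85 GHz (at roughly 3x the power cost)
architecture  = 300 / (number_format * density * clock)   # whatever is left
print(round(number_format * density * clock * architecture))  # 300
print(round(architecture, 2))  # ~1.3: only a small architectural gain
```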
I think the AI market is now big enough that the two chip types have separated, and we will now see, I think, more architectural innovation; obviously my own company is one attempt at architectural innovation. One thing I am often asked is how big the die needs to be: what it takes to build a petaflop per second of floating-point units, together with the register files they need to interact with local memory. This diagram shows a big box, the largest chip area anyone can build in any fab in the world. The whole silicon ecosystem is tailored to this maximum chip size: all the steppers, all the wafer handlers, all the mask-making machines are built to make chips this big. It may look like an odd size, but that is how it is, and it will stay that way for a very, very long time, if it ever changes at all. So nobody can build a chip bigger than 823 square millimeters, in any process. On that chip, it takes about a quarter of the area to achieve a petaflop per second of float-16 arithmetic units with small register files, which are also capable of float-8 as an alternative mode, for inference, which is increasingly useful. That assumes they run at 1.85 GHz and are built in a 5 nm process, which is roughly where Hopper sits today, and where our own latest generation sits too.
Generally speaking, the bad news is that transistor density is not really getting better. The points on this chart show normalized density against the approximate year of volume production. All the solid points are real; I can vouch for every one of them, since a processor was taped out in each of these processes, at each point. So these are real, not marketing; except for the last two on the right: the last solid point on the right is 3 nm, and the one beyond it is the current expectation for 2 nm. As you can see, I have plotted two lines. One is logic transistors, which come in pairs, one N-type and one P-type, because CMOS is complementary. The other is SRAM cells, the static memory cells on chips, which take six transistors to store one bit; chip designers usually think in terms of logic transistor pairs and six-transistor SRAM cells. You can see that RAM stopped getting denser at the last process node, 5 nm; in fact 3 nm SRAM is slightly less dense, and 2 nm is about the same. Logic is, let us say, stretching its ability to shrink a little longer, but only by reducing the number of fins per transistor, which makes the transistors weaker and less controllable. They will probably drop to one fin, so there may be one further depopulation step, but you can see that the general trend is the same as for memory: in other words, a dramatic flattening of density. There will be no more transistors per square millimeter. So: we gained 2,000x in transistors per square millimeter over the last 25 years, and now it is over. Work continues, of course, on improving the transistor itself, and there will be further improvements, but they will come slowly, and crucially they will not improve the cost per transistor.
Here I list three things, all of which are on the imec and TSMC roadmaps. At 2 nm we will see so-called nanosheet FETs, in which the vertical fins are replaced by horizontal sheet structures stacked on top of each other. This allows the transistor to operate at a lower voltage, so it gives us an energy improvement, but it does not give us any density improvement. Also at 2 nm, TSMC has announced that it intends to implement buried power rails. This is a very involved technology: it means bonding wafers together, etching through the silicon, and forming metal underneath the transistors of one of the wafers. All this heroism moves the metal that delivers power to the transistors to the opposite face of the wafer from the signal routing, and the space saved delivers only about 15% net density; that is TSMC's estimate. This will probably be in XPUs (I use the term XPU here for GPU, CPU, IPU, any form of AI arithmetic accelerator) around 2026, but it only buys 15% net density. The next big step is a decade away, and from a process-engineering standpoint it is even more dramatic: N and P transistors stacked on top of each other, again for density. That might bring an extra 50%, but it is a decade away. So you can see the message. What happens instead? People still want to build more capable single nodes, so increasingly
you will see silicon dies combined into larger assemblies: innovation in silicon packaging, in other words. Packaging sounds like a boring term, but some of it is really very interesting, and very difficult. I show some examples at the top right, very briefly. Two large dies mounted on a silicon interposer with HBM memory stacks, connected to each other through embedded bridges in the package substrate: that is the AMD graphics chip in the middle. At the top is Intel's Ponte Vecchio, probably the most ambitious, and also by far the most expensive, GPU build so far: dozens of separate silicon dies on two large silicon interposers on a substrate. On the right is something more modest from Apple, but economically very important: their laptop and desktop processors mount memory dies on an organic substrate shared with the bare processor die. This is interesting because normally those memories would be mounted on a PCB; mounting them on the substrate gives higher bandwidth to a memory technology not normally associated with bandwidth. I will come back to that later. At the bottom left is AMD's latest generation, which uses a technology called chip-on-wafer: a cache die is mounted on top of the processor cores, and those cores are then arranged on a substrate around an I/O die into quite a complex assembly. One of the most ambitious efforts so far is the Tesla Dojo D1 at the bottom center, which is a reconstituted wafer. In other words, a wafer of chips is made and sawn up, the working chips are identified, and then a wafer is rebuilt out of the known-good chips in a plastic material, with a pitch like silicon; you can then metallize over it, and you can do other things, such as attach I/O connections soldered directly to the reconstituted wafer. This technology has been in your mobile phone, at the scale of about a square centimeter, for many years, but this is the first time at the scale of 30 centimeters. Finally, at the bottom right, our own chip: the first wafer-on-wafer vertical assembly, in other words silicon wafers bonded together, before they are thinned, at wafer scale. The reason this is interesting is that the pitch of the vertical interconnect is very much better than the alternative chip-stacking at the bottom left. This permits enormous vertical bandwidth, tens of terabytes per second, probably hundreds of terabytes per second eventually,
passing between chips in a vertical stack, which obviously lets you build something that behaves like a bigger silicon chip.

The other side of silicon evolution is the evolution of performance in terms of energy. The points here are the same ones as on my density slide. We have all heard of the end of Dennard scaling: at around 90 nm, the energy scaling enjoyed through most of the silicon-scaling era abruptly stopped. We went from roughly 66% improvement per year in operations per joule to about 18%, and we are now in a third era in which, from now on, it will be less than 10% per year. I describe it here as constant voltage; it is not exactly constant, but it is nearly constant. So you see the same characteristic: in other words, not only will we not get more transistors on a silicon chip, but the energy per function will not improve much either. It will improve, but not by much.

What matters economically is energy at the system level. Obviously it matters for the environment, and we all try to build the lowest-power computers we can, but does it matter much economically? Here I show a rough breakdown for a very large computer installation, the kind a hyperscaler might build. About two-thirds of the cost is the chassis capital equipment, amortized over time, probably over three years: the computers themselves, and things like that. Only about 10% of the cost is the electricity consumed; it can be up to 15%, even 20%, in regions where power is expensive, but in most regions it is only about 10%. However, there is another component of 15 to 20%, which is the cost of the infrastructure that scales with delivered power. There is a lot of infrastructure in a large computer system associated with delivering electrical energy and removing heat, and it affects basic things such as how big a building a data center needs: the physical cost of the building is a function of the heat that must be removed. So if you add together the power-proportional components, the infrastructure proportional to the power delivered plus the electricity actually consumed, you can see that about a third of the cost of running a large AI computer is proportional to the energy component. So it is significant: about half of the capital cost of the equipment itself. OK, so where are we today? Many of you will be familiar with this:
the current generation of deployed AI systems consumes about three picojoules per delivered flop when training a large transformer-style language model on a big machine at hyperscale. So that means that today, if we have a billion parameters and we train with 20 billion tokens, roughly the Chinchilla point, then we need about 100 kilowatt-hours, say 250 chips. If we scale that up a thousand times, with the tokens scaled up a thousandfold correspondingly, then we go to something like 25,000 chips running for a year at 10 megawatts. This is why nobody deploys trillion-parameter dense models today: it is simply too expensive.
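The arithmetic behind these numbers can be checked in a few lines; this sketch assumes the standard estimate of about 6 flops per parameter per token and the roughly 3 pJ per delivered flop stated above, with chip counts and durations only order-of-magnitude:

```python
FLOPS_PER_PARAM_TOKEN = 6          # standard training-cost estimate
J_PER_FLOP = 3e-12                 # ~3 pJ per delivered flop, system level

def train_energy_kwh(params, tokens):
    flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    return flops * J_PER_FLOP / 3.6e6        # joules -> kilowatt-hours

print(train_energy_kwh(1e9, 20e9))           # ~100 kWh: 1B params, 20B tokens
print(train_energy_kwh(1e12, 20e12) / 8.76e6)  # ~11 MW sustained for a year
                                               # (kWh / hours-per-year / 1000)
```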
There are trillion-parameter sparse models, one of those has been done, but no dense ones. So we are already at the limit of practicality at today's model sizes, and as I said, Moore's law is not going to improve things much.

Why can't we just run XPUs faster? Why don't we increase the clock frequency? Many of you probably know that CPUs typically run at three or four gigahertz, but GPUs and IPUs run at under two gigahertz, and Google's TPU runs at only a little over one gigahertz. You can make silicon transistors run faster than we run them today. In the chart here I have shown how transistor speed varies with operating voltage for the last few process nodes. You can see that each process node gives you a steeper curve: faster transistors, steeper curves. The interesting thing is the intercept, in other words the point at which none of them works at any frequency at all. It is not at zero volts; it is at the threshold voltage, around 0.4 volts. This means that as you reduce a chip's supply voltage towards 0.4 volts, the speed drops very low very quickly, while the power drops only with the square of the voltage. That gives you a curve of this shape if you plot energy per operation against operations per second; these are all normalized. The small circles represent the sweet spot that Nvidia and Graphcore, and some others, have chosen to operate at, calibrated here to a recent process: roughly 1.85 GHz at about 800 millivolts. Why is that a good choice? If you look, it is the point at which the curve becomes superlinear in operations per second, which means the power becomes super-quadratic: the cost of performance rises very, very quickly. We could choose to run 20% faster at the cost of about 40% more energy per operation. We do not, because that is expensive, but we could. Or we could choose to save power: we could run 40% slower and save about 40% of the energy per operation. Maybe we will do that in the future, to save energy, if energy becomes more expensive; but the appetite for performance is still very high, people want as much performance as possible. So you can see that there is flexibility to change the speed at which an XPU operates, but it is a rather expensive flexibility.
That is why all XPUs tend to run at about the same speed. Now there is one thing we can do, and this is another Graphcore innovation: we can at least make sure we make maximum use of the supply voltage. A big silicon chip presents a very, very nonlinear load to any power supply trying to deliver power smoothly. Software can take the chip from zero activity to maximum activity in about one clock cycle, under a nanosecond, and no power supply can keep up with the dramatic demands of the latest chips, which can switch hundreds of amps, at a fraction of a volt, in under a nanosecond. No power supply can keep up. The innovation Graphcore delivered was to attach a second piece of silicon to the logic die, filled with what we call deep-trench capacitance. This is, if you like, a huge, extremely low-impedance capacitor within microns of all the transistors doing the switching. It is a fairly unique use of technology: it is formed with the same wafer-on-wafer technology used to attach image sensors to the logic processing chips in mobile phones, so it is an innovative use of an existing technology at a completely different scale, at wafer scale. On the right you can see the effect on the supply voltage: the upper trace is without this decoupling technology, and the lower trace shows how much better smoothed the supply voltage is once you attach it. What does this let us do? Since we no longer need much headroom in our supply voltage, we can reduce the average supply voltage, and because the energy of a chip is proportional to the square of the supply voltage (it is a capacitive technology), this drops the power a lot. In fact we can save 40% of the power without changing the logic at all, or we can get 40% more speed at the same power without changing the logic die at all. You will see from the earlier charts that this is like getting at least a full process node almost for free; and it is cheap to do, by the way, much cheaper than moving forward a process node. So we can at least do that.

Where does the power go? Here is a breakdown of the energy consumption of one processor tile, with all of them running a power virus, worst-case activity, on Graphcore's latest chip, the Bow, the Colossus Mk2 with wafer-on-wafer decoupling.
It comes to about one picojoule per flop with worst-case data patterns applied to all tiles on the chip simultaneously, at float-16; we also support float-8. You can see that most of the power actually goes into the floating-point arithmetic, and most of the remaining energy goes into moving data about a millimeter, actually about half a millimeter, from the local memory to the floating-point datapaths. In other words, there is not much power that is not being spent on the work we want done. The black part, which I call no-op cycles, is the overhead of building a processor at all: in other words, what we would have to have even with fixed-function hardware, beyond the blue and orange parts. I often hear the claim that fixed-function hardware for a task like this would be much more efficient than building a processor. That is not true: the programmability wrapper of a modern processor, especially for this kind of application, is really very lightweight, and costs little in silicon transistors or in energy. So we are spending the energy in the right place.

What can we do better? We have already reduced the precision of the arithmetic to 8-bit floating point, and there is now an IEEE standards group looking at low-precision arithmetic.
So far it has got to eight bits, and it may continue down to four bits. The chart on the right shows, on the horizontal axis, how wide a range of scales you get when training transformers with two different types of 8-bit floating-point format currently proposed to IEEE: one has four exponent bits and three mantissa bits, the other has five exponent bits and two mantissa bits. Nvidia has very similar, although not quite identical, number formats in their latest chips. If you stare at it long enough, you can see that for weights and activations the upper format works well, as well as 32-bit floating point; and for the two types of gradients the lower format also works, or in some cases works better, than 32-bit floating point. So 8-bit arithmetic works well, and it saves a lot of power. What about 4-bit arithmetic? There has been some recent work exploring the scaling laws for precision; I refer you to this paper by Dettmers in particular. It looks like 4-bit number encodings are probably the limit of precision for forward inference under scaling. That is great news: it means we can go further along the improvement curve. We can deliver float-4 multiply energy at perhaps half that of 8-bit floating point; the higher-precision accumulation path starts to matter, though, so it is not just a question of the multiplier, which does shrink a lot, but also of the accumulation path. The net effect is probably another 2x in flops per second where 4-bit can be used. Another interesting observation I will make is that synapses, at least in some parts of your brain, appear to have about four bits of resolution, four and a half bits. There is an absolutely fascinating paper, which you can go and look up, demonstrating this in the human hippocampus: 26 resolvable levels, logarithmically distributed over a range, which makes them look rather like 4-bit floating point.
So if you believe we are evolving towards brain-like structures, that may be encouraging. What happens to silicon over the next decade, given everything I have told you? I think, and I may be wrong, because it is certainly hard to look ten years ahead in any fast-moving technology, that we will probably get about 1.5x the transistors per chip ten years from now, probably at about the same cost per transistor. We will probably manage a 2x improvement in energy per operation at the same precision, and if we can go to 4-bit floating point, that gives us another factor of two. So performance per watt will improve by perhaps 4x. And we will see a lot of innovation at the level of multi-chip assemblies, with packaging innovation increasing as silicon slows down. Some people disagree with my forecast; in particular, the US government disagrees with my forecast. These are targets set by the US Department of Energy for their advanced supercomputing program, with two machines to be delivered in 2026 and 2028, but these are the improvements expected by 2030, seven years out: they would like 140x in performance and 50x in energy efficiency. I think I can say
that processor designers are quite sure neither of those things will happen.

Let us move on to something that is not in the processor: memory chips, which matter more and more alongside the processor today. Most AI processors use HBM, high-bandwidth memory, which is a vertically integrated stack of DRAM dies sitting on a controller, with the stack placed on top of a silicon interposer shared with the XPU, which sits on a package substrate, which goes on a PCB. I will come back to that in a moment. There is a competing technology starting to appear, which is the memory technology used in your mobile phone; it is used in laptops, and it is starting to be used in higher-performance computing environments. It is called LPDDR, low-power double data rate; that is a historical name and not very helpful. It is also a vertically stacked technology, but stacked by very, very cheap methods. You will see, highlighted on the right-hand side, the difference in cost per gigabyte between these two technologies. They use exactly the same DRAM technology, they are built by the same people in exactly the same fabs, and yet there is a 7x difference in the cost per gigabyte of the end product. The reason is the complexity of HBM. As HBM capacity grows, that number may come down, but probably not below a factor of four, so it remains very significant. The other differences between these technologies I have also highlighted in yellow. First, LPDDR allows more memory to be attached to an XPU, about half as much again; but the rate at which it can be read is only about half a terabyte per second, which means that if you want to read the whole of it, it will take you about ten times as long as reading the whole of the smaller HBM. That is particularly relevant to AI, because that is typically what we want to do. In training, we tend to read and write the entire memory on every iteration; in autoregressive token-wise inference, the currently most popular mode for language models, we tend to read the entire model for every token generated, and the size of the memory is typically dominated by the size of the model. So we are reading the entire memory every cycle. For these very important loops, HBM certainly wins on cycle rate; however, it loses on capacity and it loses on cost. I think we will see both memory technologies playing a role in AI processing, and I think Nvidia agrees with me, because if you look at the Grace Hopper device, HBM is attached to the GPU part and LPDDR is attached behind the CPU part. Maybe that is the best of both worlds, having plenty of both, but it is obviously an expensive scenario, so I think you will see chips that have one or the other,
depending on their role. Memory that you cannot cycle through, cannot touch, in about a second is not really very useful in an AI context. Let me give two examples of why. Here I have drawn the case of SGD training: the red parts are the forward path, and the black parts form the gradients in the backward path. SGD training, as everyone knows, is a first-order optimization mechanism that requires huge numbers of iterations, typically of the order of a million iterations in all. So if we are prepared to wait a few weeks to train a model, then we need each iteration to complete in a handful of seconds, and each iteration reads and writes all the memory. Hence the memory must be cycle-able, touchable, in less than a second or so; in other words, the memory bandwidth in gigabytes per second needs to be roughly the same number as the memory capacity in gigabytes. The second example is autoregressive token generation. In that case we want to read the complete model, plus any cached KV values, on each iteration, and we want to do that many times per second; so we are reading the entire memory many times per second. Again, this means that a memory so big or so slow that it takes more than a second to read is not very useful.
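The "touchable in about a second" rule can be expressed directly; in this sketch the capacity and bandwidth figures are representative of the memory classes discussed, not of any specific product:

```python
def touch_time_s(capacity_gb, bandwidth_gbps):
    """Time to read (or write) the whole memory once."""
    return capacity_gb / bandwidth_gbps

# Representative points: an HBM stack vs. a larger-but-slower LPDDR pool.
print(touch_time_s(80, 3000))    # HBM-class: ~0.03 s per full pass
print(touch_time_s(500, 500))    # LPDDR-class: ~1 s per full pass
# Rule of thumb from the talk: GB/s should be roughly the same number as GB,
# so that the whole memory is cycle-able in about a second.
```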
For this reason we will not see the class of memory used in servers, bulk DDR memory, used for AI; it will remain popular in servers for a while, but you will not see it on AI machines. You will also not see flash, things like NAND memory: it is very, very cheap, but unfortunately it wears out after a few thousand, or even a few hundred, write cycles, so if you built an AI machine with it and tried one SGD training run, you would not get very far before destroying your memory.

We can put quite a lot of static memory, memory built from logic transistors in other words, on a full-reticle die. Here is one in particular, the die plot of Graphcore's Colossus IPU; about half of the die area is in fact covered by static memory. You can see I have highlighted one of the nearly 1,500 processors on there; each processor has 624 kilobytes, so in total we have about 900 megabytes of memory on this chip, and there is considerable bandwidth between chips, so on a cluster you have several gigabytes of RAM. If your model is of the order of a billion parameters or fewer, then you do not need external DRAM at all. That is in fact the design premise of this chip: it was designed for the case where you can fit the model into the chip itself, or across a cluster of chips connected together. And when you can do that, you access memory about ten times faster even than HBM: you can access it at tens of terabytes per second, 65 terabytes per second in the case of this chip, a speed completely unreachable with any dynamic memory. So that is one option, at least for modest-size models. There is another case where having lots of RAM on the chip is very useful, the second bullet: if you have a large model, such that you really do need external RAM, it is still very useful to have enough SRAM on chip that you only need to read each matrix from DRAM once on each forward or backward pass through the model. Many of today's chips do not have that much SRAM, and must read parts of the model more than once per forward pass, say. Why is reading it only once a good idea? Because accessing data from external memory is expensive in terms of DRAM energy, and DRAM bandwidth is limited too. So if you have enough on-chip memory to read it once, that is an advantage. Finally, the other resource we have not talked about is chip-to-chip links. How much
chip-to-chip bandwidth can we fit on one chip? If we use up the long edges of the die for external DRAM, whether HBM or LPDDR, then we have the short edges, what chip designers call north and south, in which to fit the chip-to-chip links, which is what is usually done. We can fit about 128 lanes of 100-gigabit SerDes with today's technology; SerDes stands for serializer-deserializer, a lane technology supporting 100 gigabits per second in each direction, which is what 100-gigabit Ethernet uses. So 128 lanes of the 100-gigabit Ethernet sort gives us about 1.6 terabytes per second, full duplex, in other words in both directions between chips simultaneously, at a power cost of about 60 watts for the full capacity. Unlike RAM, unlike DRAM, these links consume energy all the time: even when you are not sending anything, they are still consuming power, whereas a RAM only burns power when you actually send bits. And 100-gigabit lanes do not reach very far: about 30 centimeters in a PCB, or about two meters in a well-engineered copper cable. So they are an in-chassis technology, or, at the limit of copper cabling, a rack technology. If you want to go further, between racks, then you have to use optical transceivers, and that costs about four times as much in power, and also about four times as much in dollars. So optics is a very expensive thing to build; however, the demand for more bandwidth is enormous, so we will see more optics: we will see optical transceivers on the board next to the chip, and eventually we will see some form of fiber arriving at the XPU itself, but not for many years yet.
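The link budget quoted above works out as follows; the lane count and per-lane rate are as stated in the talk:

```python
lanes = 128
gbit_per_lane = 100                            # 100G SerDes, each direction
tb_per_s = lanes * gbit_per_lane / 8 / 1000    # bits -> bytes -> terabytes
print(tb_per_s)   # 1.6 TB/s full duplex, at roughly 60 W whether or not any
                  # data is moving (SerDes burn power continuously)
```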
How much bandwidth do I actually need? Do I need all the bandwidth I can fit on the chip? Here is one example; you can work out many examples. Suppose we take float-16 and we do a million training iterations on a large model, and we distribute the optimizer across the model replicas. Then for each iteration we need to do a collective, which consists of a reduce-scatter of the gradients, a local update of a subset of the weights, and then an all-gather of the weights. That is an all-reduce operation, comprising a reduce-scatter operation and an all-gather operation. To do that, the links have to transfer about four bytes per model parameter per iteration. That is also roughly the amount of model data that has to be read from DRAM by the local processor, per replica, per iteration. Meanwhile, the total training flops are about six per parameter per token, and if we use the usual metric of 15 to 20 tokens per parameter, then the total training flops come to about 100 times the square of the number of model parameters. I have worked some examples here. For 100 billion parameters, if we use 256 model replicas, which is reasonable, it comes to about ten thousand flops per byte, per byte of memory-access bandwidth and equally per byte of chip-to-chip transfer bandwidth. Training is interesting in that you need roughly the same amount of bandwidth between chips as between chip and memory. If you minimize the memory bandwidth by keeping, say, a ten-billion-parameter model in on-chip RAM, you do not need as many replicas; the flops per byte goes up, but not by much. So you are talking about roughly ten to the fourth flops per byte, and you would like the transfer collective to take about the same time as the flops, or ideally less. So a petaflop chip, which may deliver 50% of its rated flops, needs at least around 100 gigabytes per second of memory bandwidth and at least around 100 gigabytes per second of transfer bandwidth; maybe more if you can fit it, but probably not terabytes per second.
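A sketch of the flops-per-byte estimate for data-parallel training, assuming fp16 all-reduce at about 4 bytes transferred per parameter per iteration and 6 flops per parameter per token, as in the talk; the replica and iteration counts are the worked example's:

```python
def training_flops_per_byte(params, replicas, total_iters=1e6,
                            tokens_per_param=20):
    tokens_per_iter_per_replica = (tokens_per_param * params
                                   / (total_iters * replicas))
    flops = 6 * params * tokens_per_iter_per_replica  # per replica, per iter
    bytes_moved = 4 * params       # all-reduce traffic, ~= DRAM reads as well
    return flops / bytes_moved

print(training_flops_per_byte(100e9, 256))   # ~1e4 flops per byte
```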
This is interesting, because GPUs in particular are sold today on the basis that bandwidth is useful for training. It is not, really, for training: training remains utterly dominated by flops, as are many types of inference. A training machine does not need multiple terabytes per second of DRAM bandwidth, and it does not need multiple terabytes per second of chip-to-chip connectivity; but it does need at least hundreds of gigabytes per second of both.

Now let us look at the case where we really do need memory bandwidth, which is of course autoregressive token generation. This diagram is a cartoon of one layer of a transformer stack: the left part is the attention part and the right part is the feed-forward network part; the red objects are weight matrices, and the black planar objects are activation matrices. Because this is autoregressive generation, only one token in each of those activation matrices is active on each iteration, except for the cached K and V history, which I have highlighted in light blue. I have used D for the dimension and Q as the symbol for the context length. If you look at the number of flops per value read from memory, you arrive at this equation. Note the batch term in it: the equation allows you to batch multiple conversations together to get better flops per byte, which is commonly done. Without it, the arithmetic intensity of autoregressive token-wise generation is just two flops per value read; in other words, if you have a three-terabyte-per-second state-of-the-art HBM memory system, you can only use a similar number of teraflops per second, not the petaflops per second the chip can deliver. If the context is small, then batching helps to a degree: we can get to an asymptote of twelve times the dimension divided by the context. If the dimension, for example, is 8,000 and the context is 2,000 tokens, then that is only of the order of tens of flops per byte, and you would still be utterly RAM-bandwidth-limited on a modern GPU. And for large contexts, which are of course increasingly what we want (research on how to handle large contexts is exploding), batching cannot improve the compute intensity: you get your two flops per byte, as I have highlighted in red.
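The batching behavior can be sketched with a rough per-layer model consistent with the numbers in the talk: about 12·D² weight values per layer, 2 flops per value touched, and a K/V history of about 2·Q·D values per conversation, with byte counts taken as one byte per value. This is an illustrative reconstruction, not the exact equation on the slide:

```python
def flops_per_byte(d, q, batch):
    """Arithmetic intensity of one autoregressive decode step, per layer."""
    weights = 12 * d * d              # attention (~4d^2) + FFN (~8d^2) values
    kv_hist = 2 * q * d               # cached keys and values per conversation
    flops = batch * 2 * (weights + kv_hist)   # 2 flops per value touched
    bytes_read = weights + batch * kv_hist    # weights once, KV per stream
    return flops / bytes_read

print(flops_per_byte(8192, 2048, 1))      # ~2: unbatched, bandwidth-bound
print(flops_per_byte(8192, 2048, 1024))   # approaches 12*d/q, here ~50
```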
So, a fatal statement: if the deployed general-purpose model is doing AI token-wise autoregressive inference, which it is today, and we want a large cached history, so that these models do not forget what you said ten minutes ago, then we do not need XPUs: none of the current accelerator chips is suited by design to this constraint. We should instead build chips with lots of memory bandwidth and not many flops; they would be much cheaper and would consume much less power. In other words, there is suddenly a huge mismatch between the expensive chips we build and the algorithm that is currently most valuable to us. What can we do about it? We can build different chips, and if the algorithms do not evolve, that is what will happen; it takes time, because a new chip generation takes a couple of years, but that is what will happen: chips that maximize their memory bandwidth and do not have many flops. Algorithmically there are other options. We could compress the history term that stops batching from working so well. We could sparsely access the history term, which has the same effect: we fetch less history. We could do a hierarchical form of autoregression, using some sort of super-tokens and then generating tokens within those super-tokens, which reduces the iteration rate for most of the model. Or we could avoid autoregression altogether. Not all generative AI uses autoregression: image generators, for example Midjourney or Stable Diffusion, are not themselves autoregressive. They are iterative, but they iterate over the whole image, the whole data sample, not over tokens within the data sample. That means their flops per byte is much higher; the arithmetic intensity makes them behave much more like training. In other words, they are well suited to flop-centric chips, being more flops-limited than bandwidth-limited. So maybe language will yield to iterative diffusion denoising instead of autoregression; or maybe language is intrinsically better with autoregression, and images are intrinsically better with diffusion denoising; who knows what else we will discover.
Another thing we can do at the moment: we can model part of all the information we wish to serve up in an AI, and not model the other parts. I am particularly attracted to this idea of separation of concerns, because I think it is something both algorithms and hardware can play to. So what do the foundation models we train learn? The first thing we learn, of course, is the ability to communicate: to use a language, to form well-constructed grammar, to be expressive, fluent; in multimodal models we can use other media, such as pictures. So fluent communication is learned by a big foundation model. We also learn a lot of knowledge about the world. And finally we learn how to do some reasoning, at least common-sense reasoning, or you could describe it as computation: the ability to construct and execute algorithms emerges at large foundation-model scale. Now, these three things are separable, to some extent, much as humans are separate from knowledge bases like Wikipedia, and from algorithm-execution engines like Python. Maybe humans have to be, of course, because our brains have limited capacity; maybe AIs do not have to be. But it could be efficient if they did separate these concerns, because then the autoregressive part, if it is token-wise, would only have to handle the construction of language, and that part of the model could be much smaller than the bulk of the model. Here is an example I am sure you are all familiar with: the idea of retrieval of some of the information in a large language model. A query comes in; a first network performs analysis on the query and constructs a retrieval; the retrieval command goes to a knowledge database, which is not modeled but indexed, in other words organized so that we can retrieve the relevant things, but not actually part of the parameters of the autoregressive model. The indexed, unmodeled data is then fed to a third network, which synthesizes a response together with the query. In other words, we separate the knowledge state from the language state, the state that has to be good at language, and this allows the knowledge state to be very much larger than the iterated part of the model, at least the token-iterated part. It also allows us to handle issues such as the currency of the model: we can add data to the knowledge base very quickly, and it is immediately available. And we can derive attribution: we can ask the model for the references it used to serve up a response.
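A minimal sketch of that retrieval-augmented pattern; all names here are hypothetical, and real systems use learned retrievers and vector indexes rather than this toy keyword index:

```python
class ToyIndex:
    """Indexed, unmodeled knowledge: organized for lookup, not trained into
    the parameters; new entries are available as soon as they are added."""
    def __init__(self, docs): self.docs = docs
    def search(self, terms):
        return [d for d in self.docs if any(t in d for t in terms)]

def answer(query, analyzer, index, synthesizer):
    retrieval_cmd = analyzer(query)            # network 1: form the lookup
    documents = index.search(retrieval_cmd)    # knowledge base: indexed data
    # Network 2 synthesizes a response from query + retrieved passages, which
    # also yields attribution: the documents used can be cited directly.
    return synthesizer(query, documents), documents

analyzer = lambda q: q.lower().split()
synth = lambda q, docs: f"{q} -> based on {len(docs)} retrieved passage(s)"
idx = ToyIndex(["ipus excel at graphs", "hbm costs ~7x lpddr per gb"])
resp, sources = answer("HBM cost", analyzer, idx, synth)
print(resp, sources)
```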
So this separation of concerns, characterized by retrieval networks, I find very attractive, and it certainly helps with the mismatch between chips and algorithms.

So, my conclusions. There will be no universal XPU, where XPU means GPU, IPU, TPU, whatever you want to call it. Machines designed for training, or for parallel inference such as image generation, will maximize batched flops; in other words, they will have high arithmetic intensity. Autoregressive generators, which operate on small quanta such as language tokens, will prioritize memory bandwidth, and will have low arithmetic intensity. You cannot design one efficient chip for both high and low arithmetic intensity; you need two different chips. Finally, the compact box for large-model inference, or for fine-tuning, will prioritize memory capacity over flops per second or gigabytes per second: the question there is how many gigabytes can I have, can I fit my GPT-3-class model on one chip? That will be possible with LPDDR memory, and it again suits a different chip design from the previous two. So there will be all sorts of XPU machines, and we will learn to use all of them, because there are all sorts of AI requirements, and AI is valuable enough to justify more than one chip architecture. There will be no universal XPU.
如果我们建立深度神经网络结构,我们可以以一种令人惊讶的好方式从数据中学习,呃,这种学习是有监督的,因此数据是昂贵的,所以我们缩放模型的能力受到数据成本的限制,然后我们学会了如何使用无监督学习。
所以数据变得便宜了,我们可以利用互联网,不得不对它做一些过滤,但基本上有很多廉价的数据,就在那时,模型爆炸了一千倍,我们有今天的语言模型,比如说,千亿参数的多模态模型。
而不是Resnet时代的图像模型订单一亿种方法,所以千层甲板的规模增加了,嗯,如果我们现在利用稀疏访问状态的能力,就有机会走得更远然后受到内存成本的限制,左上角的注释是所有计算机设计的真理,那就是。
可以计算的是你有多少内存的函数,每秒失败的次数,以及您可以访问的速率,内存决定了你能计算多快,但是你系统中的状态量,确定可以计算什么,所以最终任何特工的情报都会受到现有信息的限制,换句话说。
通过内存容量,不是每秒的失败,不是每秒千兆字节,所以我认为会有对内存容量的压力,以及其他与速度有关的参数,就这些非常感谢你的收听。
OK, is my microphone working? It works now, yes. Thank you, Simon, for a very informative talk on how silicon shapes the chips, the XPUs, that run very clever AI algorithms. Based on that, I have a few questions. You mentioned in your conclusions that there is no universal XPU. Is that to say that no single XPU architecture can be optimal across the various AI algorithms?

Yes, that is exactly what I am saying. I remember when multimedia emerged as a computing workload, the processing of images, video, and sound; this was in the early 1990s, and all the CPU vendors at the time said: you do not need new chips, we can do this on the CPU. Of course, a CPU is a general-purpose compute engine, so you can do anything on it; however, it is not efficient to do so. So we saw the emergence of media processors, the signal-processor types that are now everywhere: your mobile phone, your laptop, they all have media processors embedded in them. It always comes down to that question: there will be special-purpose chips for any market big enough to justify the expense of producing them, with enough volume. AI is obviously a very, very broad technology that will be used in many, many ways; it will probably touch every business on the planet. We already know there are many different models, and different modes of operation, such as training and fine-tuning versus inference, all with different requirements, and they will all be very, very large-scale workloads. It is therefore worth developing architectures for these things. I also think CPUs will play a role in AI, especially as they develop higher-bandwidth memories; they may well serve the middle bullet. In other words, we may find that, a few years from now, we are running inference of autoregressive language models not on GPUs but on CPUs. But GPUs will still exist, IPUs will still exist, TPUs will still exist, and DSPs and media processors and everything else will exist. There is a very rich diversity of processor chip types today, and I think AI will add to that diversity; AI will not be all GPUs.

OK, thank you. The next question is about the IPU. Everyone knows you are one of the main architects of the IPU. What are some of the challenges you have faced in encouraging
adoption of IPUs over more established technologies such as GPUs?

That is a good question. For a new company to rise successfully, at least in the chip space, and this is my career experience, I think there are two essential requirements. First, you need a new market that presents a new opportunity, a market not already owned by somebody; AI represents that sort of market, a market fit for new chips, and big enough. The second thing you need, if there are big players entering that market at the same time as you, and in our case the big player would be Nvidia, is to do something different from what they are doing. I am amazed at how many startups have set out to address AI by essentially trying to clone what Nvidia is doing, making a GPU. I cannot see the point of that; I do not think it is a viable business plan, because the incumbent always has the advantages of scale and incumbency. So you have to do something different, and the challenge we took on was to build a different type of architecture, one that does not look like a GPU but is still very useful for at least some parts of the AI compute space. The characteristics of the IPU that distinguish it from a GPU include, for example, the large amount of on-chip SRAM, so that mid-size models can fit on the chip without any DRAM accesses at all, which gives you enormous performance and enormous power efficiency; that is one thing. Another characteristic of IPUs, very different from GPUs, is that they are composed of a large number of completely independent processors; in other words, the parallelism is more at the program level than at the data-path level, whereas most of the parallelism in a GPU comes from vector operations, from vector data paths. IPUs do have vector data paths, but they are small relative to a GPU's, and you have many more independent programs. That means that if your data structures are not very regular, for example in graph neural networks, then the IPU has a big advantage. We have demonstrated this: we made our first submission to the OGB-LSC, the Open Graph Benchmark Large-Scale Challenge, last year, and at our first attempt we came top in two of the three categories we entered; we beat all the other technologies in that area. That demonstrates that the IPU is very, very good at graph-structured data: molecules, for example, or knowledge graphs, which represent knowledge bases.
OK, that is great. One last question, about the overall market. There are now many of these large language models in the world. How do you estimate the demand for computing power, looking far ahead, maybe over the next five to ten years? How do you estimate the overall scale of demand and the growth rate of computing power in the long term, out to around 2030?

I think an estimate is probably impossible, because I believe the emergence of AI is the biggest technology advance for humanity since the harnessing of machine power, in other words since the invention of engines. Just as it would have been very difficult for the inventor of the steam engine to tell you what would happen ten years later, it is very difficult for me to tell you what will happen with AI machines ten years from now. What is obvious to me is that all computing will come to have at least elements of learning from data; so if AI is described as learning from data, then all computing in the future is AI, to some extent. It is equally obvious that AI presents opportunities to use computation in many domains where computation is not used today. In other words, the size of the computing market will go up dramatically, and it will be AI; it will be much bigger than it is today. Let me give you an example of one area, not one that is a particular focus right now, since it has gone somewhat out of fashion, which is autonomous or semi-autonomous driving. Cars sell every year in volumes about ten times as large as the number of server computers sold every year. So if a car becomes a powerful server computer, in order to deliver semi-autonomy or autonomy, then your computing market has suddenly gone up tenfold. And that is just one application, one technology; I think there will be many, many such applications. So it will be enormous, and that is why energy efficiency is so important. I am not an alarmist about the amount of energy that infrastructure computing uses today, but potentially that energy will go up a hundredfold if this happens.

Oh, that is great. OK, thank you, Simon. You may have noticed that Graphcore has an office in Shanghai, so perhaps we can meet; I am really looking forward to meeting you in person, maybe next time at your Shanghai office. I still have many questions to ask about computing power, and about the magical relationship between AI algorithms, the software part, and the hardware part of chip design. OK, thank you, thank you for your time, Simon. Goodbye. Thank you, goodbye.
Invited Talks & Summit Dialogue (Zhang Hongjiang, Kenneth Stanley, Will Knight, Susan Zhang) - P1 - BAAI Community - BV1nM4y1n7ww
These models, where they just started to be eerily human-like in some ways; obviously it's not human at this point, but it has some human-like qualities. So this has clearly diverted an enormous amount of attention, funding, research, and effort to those kinds of models, and we're going to be seeing more investment in that direction, obviously, in the near future. But the thing about AI, and science in general, and technology in general, is that there are always going to be surprises. So it doesn't mean it's just a straight shot of larger and larger language models from here to AGI or something like that; I would expect surprises still to be coming. But we've certainly learned some lessons about size and data and scale that will probably continue to apply, even as architectures perhaps surprise us and shift.
Yeah, so Dr. Zhang, obviously the theme of this conference... or maybe Hongjiang is easier? OK, Hongjiang; I'm trying to pronounce the surname properly. So obviously it's understandable that the theme of the conference is large language models; it's just such an exciting time. For someone who covers AI, I've never seen anything like it. But I would love to hear your perspective, and how you're thinking about it as the chairman of BAAI.

Well, the acoustics here are not that good; could you repeat the question one more time? Yeah, OK, sorry. So what I was asking was: given that we are experiencing a big moment in AI, what do you think, in the big picture, it means for the status of AI research, and the direction that you'll be taking at the institute?

It's definitely a big breakthrough, one that makes every one of us who has been working in this field rethink the approaches we have been using, the system architectures we have been building, and the algorithms we have been working on.
You know, before ChatGPT, there were tons of efforts looking at various algorithms, but I have always been a big fan of the system approach, meaning that AI technology, AI itself, is a system; it is not just a single algorithm. That's actually one of the reasons this particular forum discussion appeals to me, and also, Kenneth, your book on how success cannot be planned. If you look at what we have been doing, in most of the computer science field, especially in academia, we tend to look at a single algorithm and try to improve it bit by bit. But OpenAI took a totally system approach. Especially if you think about it: the transformer was invented by Google researchers, and they came out with many quite successful models, but none of them really showed the emergent abilities and the power that ChatGPT has shown. OpenAI so brilliantly combined the data, the alignment algorithms, and the infrastructure that it led us to this breakthrough. So I think the entire field is rethinking how we carry on research, and what the right, or most appropriate, most effective approaches are to these AI problems.
To give you an example from the natural language processing field, which is a very fundamental sub-area of AI: when ChatGPT came out, at least among the top groups I know of in Chinese universities, they basically told themselves, now we need to look back. One of the universities actually told their PhD students: if you graduate this year, we can't stop you, because you have to graduate; but if you graduate next year, you need to rethink your thesis, because the problem you're trying to address has, to a large extent, already been solved by GPT models. So yeah, although you could still graduate if you continued along that direction, your work would be meaningless, I mean, in terms of adding to the state of the art.

Yeah, that's pretty extraordinary. I was at an event at MIT recently where they had some linguists and cognitive scientists who were also saying that GPT-4 and these large language models were changing their fields, changing other areas of science, I guess. Kenneth, on the subject of what might be missing: is there anything about what we've seen, especially with ChatGPT, that makes you think, here is a really exciting new direction? I think Hongjiang alluded to some of the things, but I'm curious what you think.
Yeah, when we think about what might be missing, and about exciting new directions: there are exciting new directions that build upon it, that are now opened up and weren't possible before, and then there are limitations, of course, problems that still exist in the models. So maybe to start with exciting directions that build upon what we have. One is very unusual, one I don't usually hear mentioned, but I think about it a lot, which is that we haven't had, before, the ability of a computer to actually grapple with the question of what is interesting. If you think back even two or three years, you could never imagine, from a subjective point of view, even beginning to say: look at this idea and tell me what you think; is this actually a good idea, does this lead in a good direction? For the first time, you can actually have the computer begin to grapple with this kind of subjective question. And if you think about it, this is an extremely important question, what is interesting and what is not interesting, even though it's totally subjective, because it's the seed from which all research and innovation grows. I decide what to do based on what I think is interesting. So if these kinds of models are someday going to actually solve big, momentous problems in our world, they need to think about which directions are the most interesting to pursue, so that those become stepping stones to actually solving those problems. And interestingness is a separate issue from whether you're solving a problem; it's just the question: was this an interesting research idea, or an interesting piece of art, or an interesting story? It's very intriguing that suddenly it can actually start to engage with that question, and not just in the sense of giving you a rating; it can even give you an articulate analysis of why something is interesting. And this is the beginning of innovation, of autonomous innovation. So I think it's super interesting that that's now possible. I could also talk a little bit about what I think are interesting limitations, but I'm not sure if you want to go in that direction yet.

Well, actually, why don't we... Hongjiang, what do you think of this idea of algorithms
identifying what's interesting as maybe being a kind of innovation? Does that sound like a promising concept to you?

Definitely it is, although my own expertise is not necessarily in this area; I definitely think this is a very promising direction.

Yeah. I wonder, Kenneth, do you think it can tell you something interesting that a person can't? Do you know what I mean? Because sometimes when you see ChatGPT, it is impressive, but it doesn't seem that original. So do you see examples where it finds something interesting that maybe no person would?

Good question. Yeah, I definitely believe there are some serious limitations when it comes to comparison with a human's instinct for interestingness; these models don't come close, that's true. So it's just the beginning of a glimmering of the ability to grapple with this question of what's interesting. But that's still extremely useful, because it always comes up whenever you're thinking about what to do next. If you want something, on its own, to think about what it's going to do next, now that it has finished this task, what the next interesting thing would be, it has to think about that a little, and even being able to do it a little bit is still really intriguing.
But it's obviously something we need to build on. And in fact, when you talk about it being unoriginal, that's totally true: they're not going to be original. That's one of those really interesting limitations that I think is going to require improving the models. I would just point out that being original is related to being novel, so novelty comes up, and there's a problem, I think, with the current paradigm in being able to identify novelty in a genuine way. Because if you think about it, novelty is a function of chronology; it depends on the order in which events happened, whether an idea that you have now is novel or not. But the model is exposed to all of history simultaneously: it doesn't experience its training data as a chronology that happens in an order, and therefore it's not actually experiencing that moment of epiphany when you say, oh, this is really interesting, because I've never seen anything like it before. If the data, for example, says something like "that's a really novel idea," it's not in the context of what came before; it's in the context of everything that came before and everything that came after. So it's very different from the way we experience novelty, and because of that, novelty is really not in the data in any substantive way. That means I would expect it not to be good at thinking about novelty, generating novelty, and so forth; and solving that will require, I think, somewhat of a paradigm shift, because you've got to deal with chronology.
So fascinating. I was talking earlier this week with Demis at DeepMind, and I was thinking of this, Hongjiang, because you mentioned alignment and some of the technologies that went into ChatGPT; he was talking about reinforcement learning being very important. And obviously one of the things with AlphaZero and AlphaGo was that they could come up with completely novel strategies; it's very different from a language model, but it's interesting: things that people could never come up with. So I wonder, Hongjiang, since you mentioned alignment, whether reinforcement learning or other types of machine learning are of interest to the institute, to you, as a way to broaden the capabilities?
Definitely. Actually, reinforcement learning and alignment, in building big AI models, are not two different things: reinforcement learning is used in the alignment process, in the learning process. That's exactly what made ChatGPT, what made GPT-4, much better than GPT-3. The development from GPT-3.0 to InstructGPT and then to ChatGPT is really the alignment process, which used reinforcement learning: you use human dialogue data, you use human feedback, through reinforcement learning, to get the alignment. So reinforcement learning is a super-important learning algorithm in the alignment process, and alignment is a very critical step on the approach to AGI, to making large models safe and aligned with human values. So it is super important. And alignment itself is also a very effective way to refine a trained model for specific applications: you feed more domain-specific data into the model through the alignment process, and that helps us really adapt the model to various scenarios, various applications, various verticals.
That's great. And are there other techniques that you're interested in? I mean, I know there's the WuDao model, which, my understanding is, is somewhat different from some of the other ones; it was multimodal to begin with, right? Are you looking at other techniques from machine learning? What are you interested in for the next generation of these language models?

Definitely. Before ChatGPT, before GPT-4, people in the field had been working on various models; Google Brain came out with BERT before that, and all those studies and researchers have contributed to the field of large models. Today we see that pretraining based on the transformer, combined with alignment, is the most efficient and effective approach, the one that led to GPT-4 and to the many models that have tried to repeat GPT-4's success. But we do see that there is a large space of issues that still have not been solved, which will require further study and further research, and that calls for new algorithms, even new architectures. You also mentioned the multimodal model: that's definitely one direction people in the field are pursuing very hard, and we do see it as the future direction, if not the ultimate direction, of AI models. We humans perceive information and knowledge through multiple modalities: we read, we learn from language, but we also watch movies, we watch video, we look at pictures; the way we acquire information is multimodal. I'm not a neuroscientist, but I believe the thinking in our brain is also multimodal, so there is no reason our AI models should be language models only.
But I want to emphasize that the language model is the baseline, the platform. It is true that in learning how to build language models, the technology, the know-how, the insight we gained will help us develop multimodal models; actually, a multimodal model could simply be a continuation of a language model. The good thing about using the transformer as the basic architecture here is that data in every modality, to the transformer, is just a sequence. Text, language, is a sequence; an image, if you scan it in patches, is also a sequence; and video is a sequence, and music is a sequence. So it can handle, can host, all that information and embed it into the training structure and into the model itself. And if we believe the future will be autonomous intelligence, meaning that the model itself can reason, understand, plan, and take actions, then the model itself has to be multimodal; and certainly when applied to robotics, to future autonomous robotics, to general-purpose autonomous robotics, it will definitely be a multimodal model.
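The point that every modality becomes a sequence for a transformer can be illustrated with image patches; in this minimal sketch the 16-pixel patch and 224x224 shape follow the common ViT convention and are used only as an illustration:

```python
import numpy as np

def image_to_sequence(img, patch=16):
    """Flatten an image into a sequence of patch tokens, the same shape of
    input a transformer sees for text token embeddings."""
    h, w, c = img.shape
    patches = [img[i:i+patch, j:j+patch].reshape(-1)
               for i in range(0, h, patch) for j in range(0, w, patch)]
    return np.stack(patches)          # (num_patches, patch*patch*c)

seq = image_to_sequence(np.zeros((224, 224, 3)))
print(seq.shape)                      # (196, 768): a 196-token "sentence"
```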
Well, that's a great segue to Ken. Your work on open-ended learning and continual learning, and the point you made a moment ago about temporal data, shows that maybe there are more dimensions along which we're not yet approaching intelligence; is that fair to say? Do you think this multimodality, and other ways of building intelligent systems, require completely different architectures and approaches?

Yeah, I think about this a lot, because these models are so powerful. It's intriguing for me, from the point of view of a researcher, to think about what is missing; that's what I think a lot about, what kinds of fundamental things are still missing. And there are not a lot of things that are very, very clearly missing, because you could say, well, as long as it's in the data, it's there somewhere; it'll eventually get picked up. So we have a big advantage in the amount of data we have. But there are these very specific kinds of things that are not intrinsically in the data, because of the way the data is presented. One of those is chronology: chronology is not in the data, because the data is not presented chronologically. And another one like that is, of course, multimodality: modality is not in data that is only text. So that is clearly an opportunity, and we're going to see, no doubt, advances with multimodality. But chronology is a little different, because you can't just put it in; it's not clear what that means exactly. You can't just feed chronological data into something that doesn't process things chronologically. There is one little place you can sneak chronology into these models, which is the context itself, the prompt; that's a place where things can have an order. But the thing is, all of human history generally won't fit into the current kinds of prompt space, and probably won't for a long time, and that goes for all of the internet, for that matter. So that's a problem; it's just an interesting research problem. So I think there are a few of these things, like chronology and multimodality, that you can point to concretely, and then others are more wishy-washy, like hallucination, where we see problems but can't really point to exactly the thing that's missing; what is the thing that's the problem? Sometimes I've thought that maybe the hallucination problem is that it's a nonverbal activity to understand what you actually know and don't know, which would mean it wouldn't be in the data. Like the reasoning process I go through to think about whether I actually remember something: when I'm inside my head, not actually articulating anything out loud, just trying to remember something someone's asking me, there's some reasoning process where I come to a conclusion, I don't actually know that, or I do know that. Maybe that's nonverbal, and it's not in the data, because the data is just what came out of your mouth, not what was implicit in your mind before that. So maybe there's something missing there; it's a little more amorphous to point to, but I think it's a generally interesting exercise to think about what's still missing, the places where we can press forward.

That makes me think maybe there are some important insights to glean from cognitive science. If you think about experiments that show the way people
think or reason, sometimes it's not verbal, or not text. Do you think that's an important approach, just to continue what you were saying?

Maybe. I mean, I think historically that hasn't panned out that well. If you look at large language models, and the successful side of where they are, which is quite remarkable, most of it is not a consequence of looking at cognitive-science experiments, perhaps to the chagrin of cognitive scientists. But that doesn't mean it can't be helpful going forward; I would imagine mostly as inspiration, because this very implicit, nonverbal reasoning, if it exists, is just so inaccessible. I would expect it to be something emergent from the right kind of training, rather than something you could extract explicitly and then write down, like "this is how it works." So I would imagine you would rather want to reorganize training in some way, maybe with reinforcement learning or something like that, so that we can elicit these kinds of nonverbal steps, which correlate with what we do when we're trying to determine whether something is true, or can be remembered, or not.

Hongjiang, do you have...
I know, I think, you have neuroscientists and cognitive scientists at the institute. What do you think we can learn from those fields? It's a strange time, because it feels like language models solve so much, but maybe there are still things...

We definitely think we can learn from those fields, but that research is still ongoing, and to be honest there is still a lot of work to do; at this moment we haven't come to any significant conclusions that we can apply to building big models. On the other hand, in contrast, I don't know if you have read the recent work published by the OpenAI folks on using GPT-4 to analyze GPT-2, down to the point of identifying what the function of a neuron in GPT-2 is when it generates a particular piece of content or output. That is very, very interesting. So I would actually encourage those working on neuroscience to try to borrow some ideas from here; it's not just that we borrow ideas from neuroscientists, the other way around is also a very interesting research direction. But coming back to your question: yes, at BAAI we have an extended group of scientists from Tsinghua University and from other universities in Beijing, looking precisely at the problem of learning from neuroscience. We also have a small team building what we call life models: simulations of human organs and simulations of brains, to help neuroscientists study, if a particular neuron gets activated, how the entire brain reacts to it. So we have a small team working on that, and we actually reported on it at the conference yesterday.
That's very cool. Well, talking about the brain, the complexity of the brain brings me to another subject, which is the size of these models and the amount of computing power required. It's extraordinary, right, and I think we all know that; it's one of the reasons we're seeing such amazing results. But Hongjiang, to stay with you: what do you think that means for research? Does it mean it's going to become less accessible, less possible for as many people to work on these models? Do you think we'll see more efforts to make smaller ones? What do the size of these models and the amount of data tell you about future directions?
Well, you actually raised quite a number of questions. One very straightforward one I would like to address: you mentioned the researchers in academia and how they would react to this. Anything to do with big models requires a large amount of computing power, and that simply requires them to work on systems and to collaborate, collaborate among themselves and with their industry partners. My experience with Tsinghua University is that if you count how many professors and researchers at Tsinghua work on topics related to big models, many of them actually have quite a high number of GPUs, but scattered among different groups; so getting them together, pooling their resources, is the obvious solution if they want to work on bigger problems. But I would also say that scientists and academics, especially those in universities, should, and tend to, work on basic issues, basic problems, much of which can still be researched without a huge amount of computing power. If they want to build systems, though, then they definitely need to collaborate among themselves and to collaborate with industry.
One thing I would say, looking at this problem from a positive angle, is that the breakthrough of GPT-4 actually made many of us in the research community, and especially in academia, rethink what the best way is to conduct research in computer science and in AI. If we want to build a system, if we believe AI is a system and the problem can only be solved by a system approach, then we should pull our efforts together, we should pool our resources together, and we should really formulate the research problems into something we can work on together.

Yeah, I think you've done some impressive work in bringing together academics and industry so far, and that must require a lot of effort. It really requires a lot of effort, yeah.
But what is the most challenging thing about that?

Well, I would first say that GPT-4's success actually helped us a lot; from now on I think it will be much easier, but two years ago it was much harder. A fundamental characteristic of academia is freedom, right? Professors get to work on whatever they are interested in, and that is a good thing about academia. But when we want to put everybody from academia together to work on one problem, first of all, naturally, they look at the problem from different angles: oh, I'll do this part, I'll do that part. Having them work on one thing, or even trying to segment a bigger problem into pieces and have each of them work on one piece, is just so hard, because that is not how academia operates. But that actually says a lot about why the first success came from OpenAI: because they took a system and engineering approach. Google Brain had even bigger resources, a bigger team, and more well-known scientists, but they couldn't pull the effort together into one model; they came out with many models. That showcases how, in academia, in the universities, the situation is much more fragmented. So it really takes a lot of effort, conviction, and the capability to motivate people, and definitely the capability to allocate the resources in the right way.
Yeah, well, that's great. Ken, you were at OpenAI; I'm guessing they had quite a lot of conviction around this singular approach. But you've gone to academia, so what was your thinking? Why not stay at a place where you had huge amounts of resources?

You mean, you're saying that I've gone to academia? I was actually an academic before joining OpenAI. Okay, so you actually went in the opposite direction. Okay, oh I see; so was that what drew you in?

I mean, there's a long story, but part of the story is that I recognized I'd have access to vastly greater resources. That certainly entered into my mind, and it's a big problem, I think, for academia, that this pulls professors out of academia, which just can't provide the same resources; I think it's a somewhat irreconcilable problem. It's not like you can just get enough resources to match the unbelievable amount of industrial resources, of money, that we're talking about for the future. It may not be happening yet, but when you talk about things like systems that might cost a hundred billion dollars; that hasn't happened yet, but I don't know what academia could do to match something like that if it does happen. So it's a really interesting question. But you have to remember that hypotheses in science can be both related to scale and not related to scale, so there certainly are hypotheses that can be addressed even if there's a hundred-billion-dollar system sitting in some very wealthy company. For the ones related to scale, yeah, I don't know; we're talking about scale beyond imagination at that point, and it raises a lot of questions. Another thing to realize, though, which I think is also interesting, is that even within these companies they don't have infinite optionality. Imagine you have a hundred-billion-dollar experiment: it's not like you can try fifteen different times and test all your hypotheses. You're taking a bet and you're going with it. It's actually not very scientific; it's really based on gut, because you can't do the kind of systematic testing that, as an academic, I really believed in; that's how I was trained: I'll try all the different parameterizations, I'll look at all the different angles on this and understand how the system works. You can't afford to do that with a system like that. So it's not like they can just do all the experiments any academic would like to do, and even the researchers at these companies are very restricted in terms of their individual hypotheses and whether those will ever see the light of day. So the scale is just unimaginable in the implications it has. But I still want to emphasize that it's not a reason for anyone to give up, because I still think there are many hypotheses that can be worked out at small scale, and some of those can disrupt the larger scale, for example a major architectural revision. It's not like we go back to a neural network with 20 connections or some tiny thing, but today's large model is tomorrow's old news. The things that take a lot of compute today won't in five years, and they're enough to test some pretty significant hypotheses; a lot of people will have access to that within a couple of years, and those hypotheses could reinvent everything, in such a way that even the really big models would need to be rethought. So I wouldn't give up at all just because I don't have access to the fanciest thing; but it is an interesting dichotomy that's developing, one that didn't exist before.
Do you think there could be better collaboration or cross-pollination? I mean, I'm sure you're right that in some ways it's irreconcilable, but it would seem that maybe industry could benefit a lot from the perspectives of academics, and academics could benefit from the compute, without completely, you know, derailing academic principles, or giving up all the secrets. But when, you know, the GPT-4 paper has literally zero information, you kind of wonder if there are ways to maybe have more collaboration.
Yeah.
I mean, historically there's been collaboration; it's not like there's never been collaboration. But I think there's, you know, the cynical version and the optimistic version, and both probably play out at the same time. Some people are going to think, well, we can just read their papers, we don't need to work with them, or we can just hire them, because we can offer them five times their current salary, so why bother with all this? There's a lot of complexity, at least in the US, to collaborating with academics, because you've got to go through the office of research, and then there's all this legalistic stuff about who owns what and where the IP is going. Some companies will just say, forget that, we'll just hire the person out if we really care, or wait for their paper to come out. But the optimistic version also exists, which says there is a lot to gain and it's worth it to create the collaboration. The theory there is that this person is actually more comfortable in an academic setting, more free, like Hongjiang said, there's more freedom, which may actually be in the interest of the commercial entity: to have collaborators who are more free to explore unusual directions. Maybe they want to encourage that to some extent, because they do see it as in their interest, and so they go through the effort to actually make the collaboration
work. And I think you're just going to see opinions vary; both kinds of opinions will be expressed, maybe even within the same company sometimes, and we'll see a mix overall.
Yeah. Well, I'm curious, you know, Hongjiang, have you seen benefits from having academics help, or be more involved, even though they're still within academia? Specifically when there are problems with models, things like hallucinations, and, you know, when people are worrying about alignment. It makes me wonder if actually there's going to be more good reason to have outsiders take a look, and to have more people with access to models.
I would say it's super important and critical to have researchers from academia involved in large model research. You mentioned alignment specifically, but it applies to all aspects of large model research. The fact is, today the model itself is still very costly. Just to give you an example: the model itself is very costly not just in training, but also in operating, in serving, in running the model. The training process is not necessarily the most efficient one either, and the trained model size could be optimized. So for all of those, if we find a way to abstract those issues into problems that academics are best at, then the academics will, you know, make their contributions, and those contributions are critical. So the point is, we need people from academia who are able to extract, or abstract, those problems, those issues, and work on them. So for instance, we
worked with a few visiting scientists from various universities on training optimization, so training efficiency, to, you can say, speed up the training. They also worked on model size optimization: for the same performance, do we really need that model size? And they worked on new model architectures, to make them modularized. Issues like that definitely require deeper research. Even on building the model itself, we do see the benefit of researchers from universities. Actually, they also find it beneficial; otherwise they may not see that many research problems. I think research is definitely about solving problems, but research, to some extent, in my view, is actually more about finding problems, defining problems, than just solving them.
Okay.
Ken, you're nodding; do you want to say anything more on that?
Yeah, I strongly agree with the point that academia is an essential and critical link in the chain. That ability of professors to explore with a different kind of freedom than exists in industry will expose opportunities that just won't happen in industry, and that's going to be essential to progress. So, you know, this is a vexing problem. I think that industry tends, at the moment, to suck people out of academia and hurt these departments and the entire academic enterprise, and it's worth a lot of thought, I think, from the university side, how to counteract this, because from their point of view it's unprecedented. Most academic fields don't have this kind of thing happening to them, where it's just so much more lucrative to go somewhere else; people don't have this kind of optionality. But this field is very different, and so universities, I think, need to treat professors in this area differently, so that we can maintain the fabric which trains the next generation and exposes the ideas that are going to be essential, the ones you're not going to get in industry. So both sides are totally important.
That's great, those are great thoughts. Well, we've talked a lot about language models and ChatGPT, but we haven't brought up safety, in the kind of AGI, existential-safety sense. I wouldn't normally bring it up, but it's such a big topic of discussion, and it seems like a lot of people are taking both the short- and the longer-term risks seriously. So, speaking of both long- and short-term risk, I want to ask how that affects research. Is that going to become a huge new area of research? It seems likely that it will. And will it affect disclosure? So Ken, how do you think this is going to change things? I don't know where you sit on the spectrum of worry about AGI and AI.
Yeah, no, I think it's worth worrying, but I guess where I sit is more in, you know, somewhat of ambiguity; I'm not totally sure how worried to be. But I think it's worth being worried, because there are a lot of things that could turn out to be very significant short-term and long-term threats. A lot of people discuss this issue with certitude, and I think that at this point in time that doesn't really make sense; it actually makes it hard to disentangle what we don't know from what we do know, and there's mostly what we don't know, I think. Okay, so that said, I think that clearly it's a field; people already say, I'm an AI safety researcher. But I don't think it's a separate discipline, in the sense that creating more intelligent machines is actually building AI safety, and this is a paradox that is very difficult to resolve for the field. You could think of it as this simple dichotomy, where there's safety research and then there's improving the model, and then you can say, let's put the brakes on the models and just work on safety, and then later we can make the models more powerful. But the problem is that making the models more powerful could be what makes them safer. After all, sanity itself is a really important aspect of safety, and sanity is a function of high-level intelligence. So this relates a little bit to points made in my book, the book that I wrote with Joel Lehman, where we talk about how often the things that lead to what you want don't actually look like what you want. So if we just focus myopically on safety, safer and safer and safer, the problem is that the thing that really leads to a profound and fundamental shift in safety might not look like safety research right now. So other kinds of research need to be happening, because that could be the stepping stone that leads to the real revolution in safety, and we don't know, because we don't know what the future is. So we have to keep our options open, and that leaves us in a very awkward position. Because obviously, at the same time as this kind of advance could lead to safer systems, it could also lead to more dangerous systems; it's just as possible that getting really powerful is actually extremely dangerous. So we're walking a tightrope on this, and I think we should acknowledge that we don't actually understand the parameters of the dimensions around us as we try to walk the safety tightrope, and we should be very careful moving forward because of that.
Wow, yeah, those are very great points. Hongjiang, how do you think about safety and long-term risk, and how is the Institute looking at short- and long-term risk?
Yeah, just for your information, at this AI conference, yesterday morning in the opening keynote addresses, we had two scientists with kind of opposite views. We had Max Tegmark from the Future of Life Institute, at MIT, who, as you know, actually initiated the petition to pause giant AI research for six months; that definitely raised awareness of the potential risk of AGI. And we had Yann LeCun, who said, oh, we are far from AGI at this moment, we still need to work on, you know, getting AI models more intelligent; we are far from AGI, GPT cannot understand things at the level of human intelligence. Also, today we have a one-day session on safety and alignment, and we actually started that session with Sam Altman, just one hour ago, addressing the audience; he is on his world tour, really focused on this particular topic, so we had a Q&A session with him along these lines. I think Max did the right thing, and Sam did the right thing, to raise awareness of the potential risk, and I really think he's doing humanity a kind of service by making these tours, by talking to various governments, various institutions, and
on our research side at BAAI: I think nine months after we established BAAI, we actually joined forces with academia in China and published the Beijing AI Principles, and safety was definitely an issue there. Also, after the petition that Max initiated, we actually had quite a few Chinese scientists sign that petition, including the very director of BAAI, who runs BAAI's daily operations; he signed it as well, and he has been in this effort ever since Max held the first conference in 2017. When we're looking at this issue of AGI, we definitely spend a lot of time thinking about AI safety, that's for sure, and thinking about safety then comes naturally after that. We have a team looking at data, we have a team working on algorithms that clean up the data, and we definitely have more effort going into alignment. But I tend to agree with Yann LeCun a little more than with Max on this, in that we are still far away from, you know, human-level AGI, so we still spend a lot of time working on that. I like Ken's point that a smarter model could actually make it safer. But I definitely support, and BAAI definitely supports, Max's effort to raise awareness, to establish consensus, and to mobilize the community to look into this.
That's great, yeah.
I'd heard that some scientists there had signed Max's pledge. And I'm wondering, though: in the US and in the West, I think this issue has become quite headline news, it's become a big, big story. Is it a big topic in China generally, would you say?
Definitely, definitely, definitely: among the AI circle for sure, in the media for sure, and on the government agenda as well, as far as I can read from the media. Yeah, definitely. So that's why I said it's a good thing that Max and the community stepped up the effort to raise awareness.
Yeah, it's good to hear that, and I think that scientists are the ones who are probably going to have to take a leading role in that, I would imagine. So, you know, Kenneth, I don't know if you have thoughts on this, but there's talk of regulation, and of how things would work internationally, and of how, you know,
countries that are very big in AI would ever figure things out. Do you think there's a path where scientists can offer some ways to, sort of, align themselves, to use that word, around certain principles or whatever?
I mean, this is a really big question: how to organize, not just scientifically but politically, in order to somehow route this giant oncoming thing in some good direction. Scientists, of course, but one problem to accept is that just because you understand the AI doesn't mean you understand its social implications. And there's another problem, which is that just because you work in, say, the social sciences doesn't mean you understand the social implications of AI either. So we have a problem that there's not really anybody who is truly the right authority, who can just tell us the ground truth here, like what we should do, and this is leading to the very ambiguous and confusing situation that we have when we talk about regulation, because there isn't an ultimate backstop of authority figures who could just say, of course, you just talk to us. But we do have this tendency to go to the AI scientists, which I think we have to be cautious about, because they're not necessarily experts on the social implications of the scientific insights they've had. Of course, their insight into what the technology does is important, so they obviously have to be part of the conversation. Overall, though, it seems like where things are heading is that regulation will probably mostly hit the larger models, and so it won't affect smaller academic types of research that much, that's my guess; but at the high level of power, and Sam Altman has said similar things, that's where you'll probably see much tighter regulation, for models that are actually threatening in some way to human stability. And then, yeah, it's interesting that this is sort of similar to the problem of academics not having access to the most powerful models to begin with; it's almost like it doesn't even matter that much, because they won't be able to do anything with those models anyway, since they won't have access to them. So it's really going to be in these kinds of very refined circles. But it's very important nevertheless, extremely important, because these models are potentially dangerous. And so, yeah, I think it's a group effort to find some kind of amalgam of people that we feel we can trust, because the biggest problem I have is that I don't know who to trust. I wouldn't even trust myself, and that's a big problem, you know, in actually getting a grip on this issue.
Well, yeah, those are great points. I think I want to finish on an upbeat, positive note, because obviously this is a really exciting moment, you know, once in a generation, I guess, at least for AI. We're pretty much out of time, but just briefly, Ken, what are you most excited about when it comes to the future of AI research?
Well,
I think I'd maybe just throw out two things. One is the amplification of human creativity. We're looking at a world where, right now, there are many things you may want to do and have ideas about, but you cannot do them, because you don't have the skills or the talent, and this is about to change. That's really interesting in terms of empowering human creativity: I can't write the story that I have in my mind, I can't paint the picture that I have in my mind, I can't build the robot that I have in my mind, but suddenly a facilitator can come between my ideas and the actual implementation and make all these things reality. It's hard to even imagine that world, where I have the germ of an idea for a song and it becomes a fully produced, radio-quality song within five minutes. What is that going to be like? I think that's very empowering to people, and it has a lot of upside. And the other thing, which I think you don't hear as much, is that maybe this can help us get better at finding ways to interact with each other, finding ways to connect with each other that are more virtuous than social media today, where obviously there are all kinds of issues with toxicity and so forth. The ability of machines to maybe think a little more deeply, on our behalf and at a huge scale, like a billion people, about what we really need and what's healthy for us, might help us in some way to connect better with each other. I think that would be a nice antidote to this thing we're facing right now, where a huge proportion of our time is going to be spent talking to a cold machine; it'll be nice if it can actually help us connect more with other humans too.
Those are great points. Really good. Hongjiang, what are you excited about for the future?
I'm very much in agreement with Ken. I think, after 60 or 70 years of research in AI, we have finally come to the point where we realize AI can really empower people. Ken mentioned, you know, empowering people to create things online. And I'm also very excited that, finally, we can have AI that can empower robotics. Today's robots are so specific-task-oriented; they can only do one thing, inspect a component or pick up one thing, serving a very, very specific task. But with the advancement in large language models, and especially in future multimodality models, we can see that this will be completely rewritten. Even autonomous driving will be completely rewritten, with systems that have more autonomous ability, more planning ability; today, even GPT-4 does not have those kinds of capabilities.
As a scientist who has been working hard for the last 30 years, trying to build something that can convince people that, yes, finally the machines can do better, finally the machine can complete the task by itself, it's very exciting. But I do want to quote Ken's book: you know, great success cannot be planned. So for what we are doing now, and what I have been telling my funding agencies when I raise money from government agencies and from industry: they often ask me, you know, in five years, what can you deliver? I always tell them that all I do is increase the probability of success in AI, success that will bring good things to society. But I cannot promise you exactly what I can bring, you know, on what timetables; it is the probability that I'm going to increase.
Okay, well,
that's a great note to end on, and a good ad for Ken's book as well. Well, I've really enjoyed this. Thank you very much; thank you, Hongjiang, for hosting this and for being part of it, and thank you, Ken, for joining us and being a guest here. It was an excellent discussion.
Well, thank you very much for staying up so late, and Ken, thank you for taking the time; great to see you over Zoom, but hopefully we'll see each other in person, if not in Beijing, then maybe in San Francisco or Cambridge.
Sounds good. Great, all right, thank you, bye. Take care. Okay.
Hello, everyone. For the next session, we welcome a researcher from Meta AI, who is also a first author of the OPT model, Susan Zhang, to give an online talk. The talk is titled "Trials of developing OPT-175B". Let me briefly introduce Susan: she graduated from the mathematics department of Princeton University and has long focused on the design of large-scale AI infrastructure. She is also an author of OPT, with more than 10 years of experience in model and software system development. Now, let's welcome Susan.
Hi everyone,
yeah, thanks for having me. Do I just share my screen directly?
Yes, yes, you can start the presentation now. After that, I'll gather some questions from online, and we'll have a short Q&A session after your talk.
Okay, let me... I think I need to fix my settings, one second. Sorry.
We can show the slides, right? Um, share screen.
I'm trying to... oh, I have to restart, one second, sorry.
No problem. Please wait a moment, everyone.
Okay. Can you see my screen? Is the screen good?
Hello everyone, everything's good now. Susan, your talk, please.
Okay, right, hi everyone, thanks for having me. So today I'm talking about the trials of developing OPT, a 175-billion-parameter model that was released last year, back in May.
Just a bit about me: I studied math many years ago at Princeton University. Afterwards, I spent a significant portion of my career building large-scale distributed systems to support data processing workloads. I then moved into building reinforcement learning systems at OpenAI from 2018 to 2020, mainly focused on the Dota 2 and OpenAI Five project. Afterwards, I briefly moved into the hardware space, for photonic chip design at Luminous Computing; when that didn't really work out, I moved back into AI software in 2021 to develop large language models at Meta. So this talk is mostly focused on the early parts of developing LLMs at Meta, specifically the 175-billion-parameter model we trained back in 2021.
So the setup here: a team of about five engineers, tasked with training this model in just about three months, using 1,024 80GB A100 GPUs. These were the latest generation of GPUs from NVIDIA at the time, and with the training efficiency of the code that we had, we still needed about 33 days of continuous training, assuming no hardware issues, no failures, nothing, in order to go through 300 billion tokens. We didn't really have an explicit infrastructure or systems team to support us, outside of a customer support team from the cloud provider, and the data was kind of whatever we had available at the lab at the time; this was a combination of a lot of the data sets that were used for RoBERTa and also for the BlenderBot work from the dialogue agents team.
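As a rough sanity check on those numbers, the common 6ND FLOPs rule of thumb for dense transformer training (a heuristic I'm applying here, not a figure from the talk) puts the implied per-GPU throughput in a plausible range:

```python
# Back-of-envelope check of the "33 days on 1,024 A100s" figure,
# using the ~6 * params * tokens FLOPs heuristic for dense training.
n_params, n_tokens = 175e9, 300e9
total_flops = 6 * n_params * n_tokens            # ~3.15e23 FLOPs
gpus, seconds = 1024, 33 * 24 * 3600
per_gpu = total_flops / (gpus * seconds)         # sustained FLOP/s per GPU
print(f"{per_gpu / 1e12:.0f} TFLOP/s per GPU")   # ~108, roughly a third of A100 peak
```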
For the hyperparameter settings, we were familiar with a very different set of hyperparameters that circulated within the FAIR NLP groups at the time; it was very different from the settings that we saw from Microsoft and NVIDIA for their Megatron-Turing NLG work, and from what was published by OpenAI for GPT-3. So, as you'll notice, in the beginning of these runs we spent a lot of time trying to figure out how to bridge that gap.
So in October, that's when we first started our training runs. The safest thing to do at the time was to go with the hyperparameters used for the existing language model setups within the NLP groups; that gave some empirical proof of how they could work. The largest model trained up until that point was about a 13-billion-parameter dense model, so the hope was that these settings would transfer to the 175-billion scale without issues. Of course, that didn't turn out to work. So for the second run, we started increasing weight decay: we went from 0.01 to 0.1, and that didn't really work out either. For the next run, we started drastically shifting towards the GPT-3 settings, by setting a gradient norm clipping threshold of 1.0, reducing Adam beta2 from 0.98 to 0.95, and increasing Adam epsilon from 1e-8 to 1e-6, thinking that could help stabilize the runs. It turns out none of these settings actually mattered.
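For concreteness, here is a minimal sketch of what that GPT-3-leaning optimizer configuration looks like in PyTorch; the toy model and the learning rate shown are illustrative, not taken from the OPT codebase:

```python
import torch
from torch import nn

model = nn.Linear(16, 16)  # stand-in for the real transformer

# GPT-3-leaning settings described above: beta2 0.98 -> 0.95,
# eps 1e-8 -> 1e-6, weight decay 0.01 -> 0.1
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4,   # lr illustrative here
    betas=(0.9, 0.95), eps=1e-6, weight_decay=0.1,
)

loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
# gradient norm clipping at threshold 1.0, applied before each step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```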
We realized afterwards that there was actually a bug in the code we used to implement tensor parallelism to scale to 175 billion parameters; we were checking at the time with a smaller-scale run and noticed that it couldn't converge. So this is a very obvious lesson: we should have started very small before going to the largest run, but given the time crunch we kind of short-circuited that, and it came back to bite us in the end. While we were debugging this, we had to rework this code, causing the training runs to go a bit slower, so we figured we would at least check the hyperparameter settings and see whether they would actually work without the really efficient tensor parallelism code. So here we went back to the old weight decay settings, thinking that could work fine, and that didn't really help; we started clipping gradients again and increased the weight decay. This was us trying to figure out how these settings interacted with one another at the 175-billion-parameter scale. We also increased warmup, thinking that could help with the stability of the run; that wasn't enough.
By run six, we had actually fixed our tensor parallel code so that we could train a bit faster, and we went back to the original settings, but still kept gradient clipping, thinking it should be pretty safe to include. That still didn't work. For run seven, we added weight decay again, increased warmup again, and also skipped the last partial batch, in case computing the gradient there was an issue, causing a little instability with a smaller-than-normal batch. We did some more fiddling around, and we also increased the batch size from 2 million to 4 million; none of these seemed to really help. For the batch size, since we saw no noticeable improvements, we decided to go with the lower batch size; the hope was that by taking more optimization steps, maybe we would converge to a better minimum.
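As background on what that tensor-parallel code is doing, here is a toy, single-process illustration of column-wise weight sharding; real implementations such as Megatron's split the shards across GPUs and add communication collectives, so this sketch only shows the math:

```python
import torch

def column_parallel_linear(x, weight, n_shards=8):
    """Toy tensor parallelism: split a Linear's weight into n_shards
    along the output dimension; each shard computes a slice of the
    output, and the slices are concatenated (an all-gather on GPUs)."""
    outs = [x @ w.t() for w in weight.chunk(n_shards, dim=0)]
    return torch.cat(outs, dim=-1)

x = torch.randn(2, 32)
w = torch.randn(64, 32)
assert torch.allclose(column_parallel_linear(x, w), x @ w.t(), atol=1e-5)
```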
So by this point, we thought we had settled on a few settings that we could all compromise on, and we were ready to launch our actual run; we told ourselves that for the 11th run we should start indexing with decimals, so this is run 11.0. For this we settled on a 2-million batch size; we kept our Adam states in FP32, so the highest precision; and we used tensor parallelism by sharding the model across 8 GPUs in parallel. We were doing data ablations, which I won't go into detail about here, but pretty much we were trying to figure out the ideal data composition. We had also noticed some bugs: when we exported the data set it added a bunch of escape characters, which caused an artificially low loss while the model just learned to memorize those extra escape characters, so we were trying to make sure the data set was actually good to go. We also used learned positional embeddings, but we weren't sure whether we wanted the absolute learned positional embeddings initialized as a Gaussian, similar to GPT-2, or with sinusoidal initialization, matching the original Transformer implementation; so we met in the middle by initializing the learned positional embeddings with sinusoidal init. Similarly, for weight decay we weren't sure whether it should be 0.01 or 0.1, so we split the difference and used 0.05.
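A minimal sketch of that compromise: a learned (trainable) embedding table whose starting values are the classic sinusoidal encoding. The helper below is illustrative, not the actual implementation:

```python
import math
import torch
from torch import nn

def sinusoidal_init(max_len, dim):
    # classic fixed sinusoidal position encoding from the Transformer paper
    pe = torch.zeros(max_len, dim)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# learned positional embeddings, initialized sinusoidally
pos_emb = nn.Embedding(2048, 768)
with torch.no_grad():
    pos_emb.weight.copy_(sinusoidal_init(2048, 768))
```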
We also used a pretty high learning rate to start. Historically, in the implementations we had at FAIR, the learning rates were usually set a lot higher than what was published externally; for GPT-3 the learning rate was 6e-5, I believe, and we used 3e-4, so a factor of five higher. We also didn't apply dropout on embeddings, and we included NormFormer; that was some work that had come out a couple of months earlier from the lab, and the thinking was that adding a bunch of extra layer norms could help stabilize the run, as it had in some other settings.
So just in the beginning here, things were already looking pretty unstable in the first few hundred steps; that's the first green run you can see at the top there. We thought that maybe the learning rate was simply too high, so the first thing we did was quarter it, from 3e-4 to 7.5e-5. That didn't last very long; that's the yellow line right up top (this is our loss curve). At this point we lowered the gradient clipping threshold from 2.5 to 1.5, so now we clip more frequently, especially in the beginning; that's the purple line above. This kept training for quite some time, and for us at this point, getting past a few hundred steps was a blessing. This is also when we hit our first actual hardware issue, an uncorrectable ECC error, so we just restarted the run and got rid of the machine; that's the gray line in the middle there. We also noticed that we were validating too frequently, which was causing a 10% to 20% overhead, so we reduced the frequency of validation; while that gives us less visibility into the health of the run, we thought the speedup would be worth it. Then, for the next few runs, the gradient norms started spiking (the green line), so we lowered the clipping threshold again, from 1.5 to 1.0, and continued training. After that point, things started going really, really bad. This is the zoomed-in portion of what we were just looking at; you can see a ton of restarts in the middle, with us trying to figure out exactly how to get the run back to where it was, hopefully with the loss continuing to go down, but we didn't make it very far. The first thing we tried was remediating the instability by skipping batches when the gradient norm was too high.
Instead of just clipping, we thought maybe getting some new data into the mix would help stabilize the run. That didn't work out very well, so we rolled back a bit further; you see the next restart is at an earlier checkpoint. Here we finally started changing some of the other hyperparameters: increasing weight decay, thinking that could help regularize the updates, and lowering beta2 so that we averaged over fewer steps, thinking that maybe we could adapt to the gradients more quickly. That didn't last very long; that's the first pink wiggle there. For the next bit, we kept beta2 at 0.95 but went back to trying to figure out whether changing weight decay or changing gradient clipping was what we needed. So you see this overarching theme of changing a few things at a time and then bisecting. That was mostly motivated by the fact that each time we do this, it costs a lot of manual overhead and it costs time, so we tried to batch together the changes we thought were safe. Of course, that makes experimentation hard; you can't disentangle the effect of one change versus another. But that's the expense of having only one chance to train this model, and of having to correlate a lot of these hyperparameters together.
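The batch-skipping remediation mentioned a moment ago (dropping an update entirely when the global gradient norm spikes, rather than clipping it) can be sketched in a few lines; `SKIP_THRESHOLD` is an illustrative value, not one from the run:

```python
import torch

SKIP_THRESHOLD = 2.0  # illustrative; would be tuned per run

def maybe_step(model, optimizer):
    # compute the global grad norm without clipping (max_norm=inf)
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=float("inf"))
    if grad_norm < SKIP_THRESHOLD:
        optimizer.step()      # healthy batch: apply the update
    optimizer.zero_grad()     # spiked batch: discard it and move on
    return grad_norm
```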
So that still didn't work out too well for us. The next thing we did was finally go into this lowering-the-learning-rate mode of operation, but obviously we can't do that indefinitely, and it still didn't last very long; that's the orange line here.
Now, at this point we were also watching what was happening in the open-source community. The BigScience effort was starting up, they were trying to train a 100-billion-parameter run, and there was a pull request where they mentioned a potential numerical stability issue with the MHA calculation, the multi-head attention calculation. Specifically, it's a very simple thing: when you're doing a multiplication by a factor n, just split it into multiplication by the square root of n, twice. In some cases, for large values of n, and especially for large models, that can help with stability. So we implemented this change and restarted our run. We also noticed that the GeLU activation has this x-cubed term, which could also cause instabilities in certain layers, so we swapped in ReLU instead, so that we didn't have to deal with that x-cubed factor. And this trained a bit longer; this is the pink run, so clearly we could make it past some instabilities here.
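For clarity, here is a toy version of that attention-scaling trick as described. Mathematically the two forms are identical, but applying the factor before the matmul keeps the intermediate values smaller, which matters in FP16 (a sketch, not BigScience's actual patch):

```python
import torch

def scores_naive(q, k, scale):
    # scale applied after the matmul: the intermediate q @ k^T can
    # reach large magnitudes and overflow in fp16
    return (q @ k.transpose(-2, -1)) * scale

def scores_split(q, k, scale):
    # split the factor into sqrt(scale) applied to each operand,
    # so the matmul itself sees smaller values
    s = scale ** 0.5
    return (q * s) @ (k * s).transpose(-2, -1)

q, k = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
scale = 1.0 / 64 ** 0.5   # the usual 1/sqrt(d_head)
assert torch.allclose(scores_naive(q, k, scale),
                      scores_split(q, k, scale), atol=1e-5)
```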
But the problem was that the run didn't seem to be going anywhere; it just kind of plateaus. On the side we were also doing some other ablations with initialization, thinking that maybe this gradient explosion or vanishing issue, whatever it was we were dealing with, was a product of not initializing properly for such a large model, but none of those ablations turned out to be meaningful in any way.
One thing to note here is that when we were training this run back in November 2021, even though we were using A100s, which had BF16, we were most familiar with FP16 training with no mixed precision. In order to make such runs converge, there's this extra factor called a loss scalar that is usually implemented. The thinking is that when you're training in FP16 with no mixed precision, so you're not keeping any copy of the weights in FP32 or above, you use this loss scaling term to try to preserve small gradient values: when we haven't overflowed in a while, we scale the loss up, so that the signal from small gradient values can surface, and when we start overflowing, we scale the loss down. Usually what we see is that when this loss scalar crashes to near zero, effectively causing your loss and your gradients to become zero, your model stops updating; that's a very unhealthy signal, and when it happened, training had effectively stopped. So for the pink run here, where things never converged, you'll see that near the end the loss scalar was very low, so we weren't actually getting any meaningful gradients flowing through.
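A minimal sketch of that dynamic loss-scaling loop; the constants are illustrative, and real implementations (for example in fairseq/metaseq or PyTorch AMP) differ in details:

```python
class DynamicLossScaler:
    """Grow the scale after a streak of overflow-free steps; shrink it
    on overflow. A crash toward zero means no useful gradient signal."""
    def __init__(self, scale=2.0**15, growth_interval=2000, factor=2.0):
        self.scale = scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._ok_steps = 0

    def update(self, overflowed: bool):
        if overflowed:
            self.scale /= self.factor      # back off on inf/nan grads
            self._ok_steps = 0
        else:
            self._ok_steps += 1
            if self._ok_steps % self.growth_interval == 0:
                self.scale *= self.factor  # stable for a while: grow

# per step: (loss * scaler.scale).backward(), check grads for inf/nan,
# call scaler.update(overflowed), and unscale grads before stepping
```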
So at this point we were looking at the clock. To recap: we knew we needed at least 33 days of training for a 175-billion-parameter model on 300 billion tokens, using 992 GPUs. Notice that we changed from 1,024 to 992: after seeing a lot of hardware issues come up, we had to switch GPUs in when GPUs went down, so in order to have minimal downtime we trained on a smaller subset of machines and left a pool idle that we could swap in. We also knew we had to benchmark the model before the end of the year, so we really didn't have much time to explore all the hyperparameter settings. Looking at all the restarts from the previous lineage of experiments, there wasn't any strong signal that those settings would work out if we kept training with them, so we decided to make the complete, drastic switch to the GPT-3 and Megatron codebase settings, since the two already seemed relatively consistent with one another, and there was some evidence that these settings could successfully train models, even though we weren't using exactly the same codebase.
Specifically, we updated our weight initialization; overall, the change was to reduce the standard deviation of all the weights, so that effectively you're initializing closer and closer to zero. We also removed a lot of the extra layer norms from the NormFormer setup, so that we now pretty much exactly mirrored what we think the GPT-3 architecture looks like. We removed embedding scaling; this was a term that scaled the embeddings to a Gaussian with standard deviation 1, which may be too high: if you look at the GPT-2 codebase, the standard deviations there are 0.01 and 0.02, so still pretty small. We also went back to Gaussian initialization for the positional embeddings; we did a series of ablations separately and noticed that when we initialized the positional embeddings with sinusoidal init, they weren't actually being updated in any meaningful way, so we might as well just go back to Gaussian initialization. Weight decay: we finally went to 0.1 and stayed there. Gradient clipping: same thing, we stayed at 1.0. Adam beta2 was set to 0.95. These are all pretty standard now; if you look at a lot of the work coming out, most folks use an Adam beta2 of 0.95 with gradient clipping of 1.0, but at the time it wasn't clear that this was the go-to setup for these larger models. And the learning rate, to recap: we initially started with 3e-4 in the previous lineage of runs, and now we went to just 2x the GPT-3 learning rate, which is 1.2e-4.
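Pulled together, the final recipe looks roughly like the sketch below. The init standard deviation shown (0.006) is the value reported in the OPT paper; the toy model and helper names are illustrative:

```python
import torch
from torch import nn

def init_weights(module, std=0.006):
    # small-std Gaussian init, pulling initial weights toward zero
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)

model = nn.Linear(16, 16)   # stand-in for the real transformer
model.apply(init_weights)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.2e-4,        # 2x the GPT-3 175B rate
    betas=(0.9, 0.95), weight_decay=0.1,  # the settled-on values
)
# per step, before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```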
And so here, for the first 15 restarts (you'll notice that we're now indexing with two decimal points instead of one, because we were prepared for up to 100 restarts), we mostly just faced a lot of systems issues. This was when these 80GB A100 machines were just coming online, and so we faced a lot of lost-GPU errors, CUDA errors, NCCL errors; sometimes the job would just stop updating, things would slow down, et cetera. While these were not convergence issues, and we could generally work around them by figuring out which machine was bad, a lot of the tooling for that didn't exist for us at the time, since this was new hardware. Of course, we couldn't blame the hardware the whole time; we had some code issues ourselves as well. When we were doing checkpoint storage, in some cases it wasn't operating successfully, or was taking too long given the compute environment we were in, and we also noticed that our loss scaling logic wasn't fully deterministic: when we checkpointed, we lost the state of what the loss scalar was at the time. This is specifically for the FP16 training once again, and it actually turned out to be a blessing in disguise; we ended up using this nondeterminism later on to get through some instabilities as well.
So overall, when we see instabilities, these are the four metrics we look at. At the top left here are the last layer's L2 activation norms; when those spike, we also see, to the right, the gradient norms for the overall model spike as well, and the two are generally quite correlated. Of course we look at the loss curve, or perplexity, and for FP16 training we also look at the loss scalar, which, when it crashes to zero, is when everything else, activation norms and gradient norms, spikes. These are all leading indicators of potential divergence, which doesn't necessarily get reflected in the loss until a little later.
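A sketch of how those indicators might be collected each step; everything here is illustrative rather than the actual tooling (the last-layer activation norm would typically come from a forward hook):

```python
import torch

def health_metrics(model, loss, loss_scale):
    # global L2 gradient norm across all parameters
    sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            sq += float(p.grad.detach().float().norm()) ** 2
    return {
        "grad_norm": sq ** 0.5,    # sudden spikes precede divergence
        "loss": float(loss),       # noisy; lags the other signals
        "loss_scale": loss_scale,  # crashing toward zero is unhealthy
    }
```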
So when we hit some instabilities here, the first thing we did was reduce clipping again: we went to a clipping threshold of 0.3. If you look at the previous plot, most of the gradient norms are around 0.2, so we decided to be pretty drastic and do a lot of clipping, in case gradient spikes could lead to a compounding effect. And we had a backup plan of pretty much resetting the Adam state, in case a stale optimizer state was causing issues and mismatching the current batch, and then repopulating the Adam state. We later explored why you would want to do this in a more recent work that came out back in April of this year, where we studied this up to a 546-billion-parameter model in order to remediate instabilities.
For the next roughly 17 restarts, we were back to the whole hardware-issue story: ECC errors, lost GPUs, even high rates of correctable errors, which, even though they're not uncorrectable, can still cause issues in the kind of compute environment we had. There were also times when the job would just stop updating, and we had to manually trigger restarts, and so on.
Here is where we did something quite drastic. The thinking was that all these instabilities were potentially linked to having our optimizer in a bad state, or that Adam, in the configuration we had, just wasn't scaling well for some reason. So we decided to test something quite drastic (it's kind of a bad idea): we started switching to SGD without momentum, to try to get rid of Adam entirely. First we tried to be clever by approximating SGD: you can do that by running Adam but setting beta1 to 0 and increasing epsilon to 100, pretty much wiping out the second-moment term, and increasing the learning rate by the same factor. In all these runs, it turned out it wasn't working the way we thought: the way we were reloading Adam states, the hyperparameters were not being reset, so none of those experiments actually yielded anything meaningful. So then we did the honest thing of swapping the code entirely and using SGD, but in this implementation of SGD we had our own bug of not actually applying weight decay, and we also weren't tuning the learning rate properly. There were findings from a couple of years back suggesting that for SGD you might need a higher learning rate; we weren't tuning that properly, so we didn't see any useful signal.
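The Adam-as-SGD approximation works because, with beta1 = 0, the update is lr · g / (sqrt(v̂) + eps), and with eps much larger than sqrt(v̂) that is roughly (lr / eps) · g, i.e. plain SGD with the learning rate divided by eps; hence scaling the learning rate up by the same factor. A sketch, with illustrative values:

```python
import torch

params = [torch.nn.Parameter(torch.randn(16))]
eps, base_lr = 100.0, 1.2e-4

# beta1 = 0 removes momentum; a huge eps swamps sqrt(v_hat), so each
# update ~ (lr / eps) * grad. Scale lr up by eps to keep the step size.
sgd_like = torch.optim.Adam(
    params, lr=base_lr * eps, betas=(0.0, 0.95), eps=eps)
```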
So by this point, after all this fiddling, we figured we probably shouldn't try to be clever anymore and should just do the honest thing of lowering the learning rate, so we proceeded to do that. The final learning rate curve, if you squint, looks quite like an inverse square root: we went with a kind of piecewise linear decay, linearly interpolating our way down. The thinking is that you still want to keep the learning rate as high as possible without causing instability, but we didn't know exactly what that value was at the time. You see there's this brief moment in the middle where we increased the learning rate; the thinking there was that maybe we had been too aggressive in lowering it, things looked too stable for once, and maybe increasing it back could help us converge faster. That didn't last very long, and we still had to reduce it two more times afterwards to keep it training.
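That ad-hoc schedule amounts to a hand-maintained piecewise-linear interpolation, re-anchored at each restart; a sketch, with anchor values invented purely for illustration:

```python
# anchors: hand-set (step, lr) pairs, extended at each restart
ANCHORS = [(0, 1.2e-4), (30_000, 8e-5), (90_000, 5e-5), (180_000, 3e-5)]

def lr_at(step, anchors=ANCHORS):
    """Piecewise-linear interpolation between hand-set anchor points;
    past the last anchor, hold the final value."""
    for (s0, lr0), (s1, lr1) in zip(anchors, anchors[1:]):
        if s0 <= step <= s1:
            t = (step - s0) / (s1 - s0)
            return lr0 + t * (lr1 - lr0)
    return anchors[-1][1]
```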
So overall, when we look at the health of these runs, what we stare at is a TensorBoard reflecting these metrics. The main plot up top is the loss, or perplexity, curve, and it's very noisy; even with smoothing, it's very hard to tell what direction the loss is going. You can vaguely see near the end that maybe things are diverging, but that's only a few hundred steps, and given the noise it's still not definitively clear. If you look at the other metrics, though, it's very obvious. For the gradient norms, the bottom row here, you see things spiking; the loss scalar crashes, roughly inversely related to the gradient norms going up; and similarly for the activation norms, we usually see them decay over the course of training, and when they reverse direction and start increasing, it's a leading indicator that something is going wrong. In this case specifically, and this is still a zoomed-out version of what we were looking at earlier, just by lowering the learning rate and restarting from a checkpoint, you can change the direction of all these metrics. The top left is the loss curve: the green line is the restart with a lower learning rate, the loss keeps going down, and there's a very clear directional change there. Similarly for the gradient norms, the second plot in the middle of the top row: instead of spiking suddenly, they keep stabilizing, training around 0.14. So you can see that a lot of metrics can shift just by fiddling with the learning rate.
So after 56 days of dealing with all these issues, the hardware issues and the instabilities, we finally ended up with a loss curve that looks something like this over the course of 300 billion tokens. It's very colorful; there were many, many restarts, and even some missing logs in the middle when our entire cluster went away. But this just summarizes how manual this process was to start with, before we eventually began automating a lot of this work later on.
So I can stop here and take some questions. I also have some more slides for the 66-billion run; I don't know if that's helpful to share, but I can go into that as well if it's useful for folks. It looks very similar to the 175-billion run.
Hello. Let me see whether there are any questions from the online audience. If you have any questions, you can type them in the comment section, and I will relay them to the speaker. Actually, Susan, I have a question.
Do you think that scaling laws still held for your model during this research?
Scaling laws in which sense? Like for scaling up model parameters and data size, or just in general, whether scaling laws are useful?
Like for this research specifically.
Yeah, so in this case we were mostly focused on cluster stability and hardware issues, and making sure that the tooling could capture all of the manual operations we were doing to restart the run and whatnot. So it wasn't so much scaling laws as the infrastructure side that we needed to build out, and a lot of automation we had to build, say, to detect when things were diverging and how to remediate that. So it's not quite the same thing, but it definitely is valuable to know the optimal model size and data set size to go for given a certain amount of compute.
Okay, I've gathered two questions online. The first is: any idea of training a 1,000-billion-parameter language model?
Like what that would look like, a 1-trillion-parameter language model?
Yes, the parameters, yeah.
So in that sense, right before this work happened, FAIR actually released a 1-trillion-parameter mixture-of-experts model. That's not a dense model, but the parameter count matches what you're asking for. I do think that scaling to a 1-trillion-parameter dense model could be quite interesting, but it's also a question of how we want to spend compute: if we can feed in enough data, it might make sense to do, but a lot of the data sets we're looking at may not be sufficient for that.
Another question is: why choose to use NormFormer for scaling to deep models, for training stability?
Yeah, the thinking at the time was that the NormFormer work showed improved convergence and, potentially, the ability to stabilize exploding gradients through the model. That was very exploratory, because it had been demonstrated at very small scales; when we started we were trying it out, but we eventually did remove it. We didn't want to take the chance of finding out whether NormFormer was actually useful at 175 billion parameters, in case it was actually causing issues.
Okay, what frameworks do you use for data processing and model training?
Yeah, so for data processing, at the time there wasn't really a defined framework; this was using existing data sets that had been processed by other teams in previous works. We started building out a cluster running Spark jobs, very standard CPU-intensive workloads, but for us it was very ad hoc: for every new data set we brought online, we would have custom logic to process it. For the training side, the code is open-sourced; we used metaseq, under facebookresearch, to train this model. It's a fork of fairseq, which was a very common sequence modeling framework that folks were using within the lab, but we specifically forked the codebase to focus on decoder-only autoregressive dense transformers, and to make sure it could scale to the kind of large-scale workloads we cared about.
Okay, there's a follow-up question: in the process of the IFT period, do we have to use the open-source framework of the original author, or should we transfer to a more mature framework like DeepSpeed-Chat?
What's the IFT period? I'm not too familiar with that.
I'm not sure either; it's asking whether we should use the original open-source framework, or transfer to a more mature framework like DeepSpeed-Chat.
Okay, yeah, I'm not as familiar with DeepSpeed-Chat. I think the question here is really what is already running on your hardware, your infrastructure, and your systems, because a lot of the detail is not so much which library you use but the integration with your actual compute environment. We had a lot of glue code there, like where you load data from, how you do your checkpoint storage; that's most of the complexity in adapting any codebase. So if you are able to swap out these frameworks, that's great, but we spent quite a bit of overhead making sure the code could work in our kind of compute environment.
Another question is: what do you think is the main cause of the difference in results between OPT and the BLOOM model?
Oh yeah, my understanding is that BLOOM is very multilingual, and so if you focus on English-only results, I think BLOOM just did not see enough English-only data. So if you just look at English NLP benchmarks, you'll see a delta, and I don't think that's really meaningful, given the different training set that was used for BLOOM.
Oh, there's a question asking whether OPT is able to surpass the performance of the LLaMA model. Actually, the author of the LLaMA model came to our community to give a talk just a few...
Yeah, yeah, if you added more data for the OPT models. I mean, there's nothing fundamentally that different between the two, and I don't think the architecture differences are meaningful at all. Really, we just didn't have enough data: we had to train this model in a very finite amount of time, and only 180 billion tokens were available to us at the time. So if we 10x'd that data set, I don't think there would be any issues with the benchmarks. Also, a huge chunk of the benchmarks are English-only, and we did not filter our corpus for that either; we were trying to see whether there was multilingual behavior we could capture with very little data.
Any latest progress in your team related to enhancing smaller LLMs' reasoning ability, or do we still not understand reasoning?
Yeah, I can't really comment on that either; there's a lot of internal work going on.
Okay. How would you compare the OPT and LLaMA models, and do you have any future plans to improve OPT? That's a pretty general question.
Yeah, so, like I mentioned before, the main difference is data. We also worked on the data for LLaMA, and if you look at the details of the LLaMA paper, they do mention that. The main thing we're focused on, and you can also see this from the Adam instability work, is scaling to larger models, with the 546-billion-parameter run. In that regard, if you feed OPT as much data as LLaMA, or more, we can probably see a difference there.
Do you use any extrapolation techniques in OPT model training? If yes, do you think they're useful in improving context length?
Oh, extrapolation techniques. Is that like taking a pretrained model, increasing the context, and then fine-tuning on a longer context window? I'm not quite sure what extrapolation means here; it could mean something else.
I'm sorry, I'm not sure about the details either. The question is, "do you use any extrapolation techniques in OPT model training, like..." how do I read this... ALiBi?
ALiBi, I see. No, we did not use ALiBi; we used learned absolute positional embeddings, without any ALiBi.
Okay, thanks.
Do we have more general questions, instead of those technical details? Actually, as a member of BAAI, I have a question concerning the research mechanism. I wonder what it's like working in a research team at Meta AI, and how you propose subjects to research. Is it very flexible, the research mechanism?
Oh, like how to choose your research direction, and how much flexibility we have within Meta for that? Like, what's it like doing research at an AI lab?
Yeah.
So it's changed just in the short time I've been there. There is a subset of FAIR that is very open-ended research: you have a lot of flexibility in where you choose to take things, and there's a shared pool of compute that you can use. Then there's a portion of the lab that is very focused on specific projects; one example is a lot of the BlenderBot work, driven by a very specific team, and similarly with recent releases like Cicero back in November, where a specific team worked on solving the game of Diplomacy with a combination of RL and language modeling. So there are also focused efforts where you can form these very specific project teams and just go towards that. But that's also slowly shifting, as there's a lot more focus on generative AI these days, and so more resources are going into generative AI research as well.
There's one more technical question: are there any optimizers other than Adam that speed up or stabilize training?
Yeah, there have been a lot of efforts in trying this. I have yet to see much that can scale well past, say, a few billion parameters with gains that are actually meaningful. I'm sure there is still work to be done in this space to improve large-scale training, but I personally haven't been able to get many to work above, say, 30 billion parameters.
How does Meta think of LoRA for fine-tuning LLMs?
It is definitely useful, and there are many folks exploring applying LoRA to fine-tuning.
What's the purpose of trading OP T, Where do you think the emergence ability of LLM come from?
Actually, I want to ask that about that, too, Like。
what do you think about the emergence ability of the model, Yeah, so。
You have to remember the context here back in like August and September 2021 right back when like nobody was releasing large models。
there was this kind of aura of like you know it is risky that especially coming out of GPT2 when they did a stage release just for a 1 billion parameter model so a lot of focus for OPT was a combination of testing out new hardware and making sure we can actually sort of set ourselves up for more scaling efforts going forward and sort of make things like Lama be almost trivial to work on and then as well as sort of being to open sources so that people can start prototyping at larger scales and making these larger models more inference efficient and so on so you know there's a lineage of work that came out of that right sparse GPT Fl strand etc cetera these are all kind of different works testing whether or not you can make these large models kind of have the same performance as they were with like full precision and whatnot but then say you know lower precision or more sparse or whatnot and seeing if that can actually run a commodity hard。
So the release of OPT, especially with all the smaller models you 125 million parameters up to 66 billion and then also obviously the ones thatve had a billion parameter is to sort of enable people to study scaling laws seeing if we can actually scale up and transfer kind of methodologies from the smaller scale to a larger scale thatll work just as well it also obviously sort of showcase the kind of complexity of actually operating at the scale on new hardware by releasing a logbook showing kind of the excruciating detail it takes to do this for the first time on new compute environments and also releasing the codebase as it was for training these models which and I think megaron tuuring in theory sort of did but it was very hard to get that really running anywhere and it wasn't clear if that was exactly the codeb that was used to train their largest models whereas we literally just open source codebase that we use so this is one of those scenes where like you see a lot of papers published details about training and whatnot a lot of folks it's not intentional if they like left out details it's just there's so many implementations in details included in code that may。
I'Be captured in papers and we just wanted to make sure that everything was out there。
Oh, and then I think there was another question, about emergence. That's the other part, where I think that's the bitter lesson of scale: if you have unbounded data,
and the compute to train a much larger model on the same amount of data as a smaller model, then purely from having more capacity to model any underlying pattern in the data, you will see more capabilities emerge, whatever those capabilities look like.
But a lot of that is also contingent on what you're using to measure, and on whether that emergence is actually some hockey-stick jump or a smooth curve. So I think the ruler by which we claim there's emergence is itself kind of ill-defined.
Okay, about the future technical path of large language models:
do you think it's decoder-only, or rather encoder-decoder?
This is tricky. From a lot of discussions with researchers in this space, if you fiddle with the number of encoder and decoder layers, in the limit you see that reducing the number of encoder layers actually gives better performance in the generative case. Of course, that's subject to what you mean by performance: if you're trying to adapt the model, fine-tune it, and shift the pretrained model's domain distribution to some new data distribution, maybe you do need the encoder to help with that tunability, which is very different from training the most capable model. So in the final use case it's also unclear whether there's a material architectural difference that gives better results as a function of the data you have. From all of our experimentation, for ease of scalability, decoder-only definitely scales much better on the hardware we have, but I do see a world in which you could probably make encoder-decoder work just as well.
But in that sense, I think scale matters more than the specific details of the architecture here.
So decoder-only and encoder-decoder excel at different specific tasks? From what I've seen so far, yes,
though maybe that's completely arbitrary, because we only have very few tasks that we look at here too.
Okay, we have a general question: any experience to share from developing and managing such a huge project,
for example, how to set benchmarks and deadlines? Yeah,
so overall there is a trade-off between the research risk you can take, trying new things for the first time at scale, versus the engineering risk of pretty much purely executing on what is already out there and not trying novel things. Given some fixed amount of compute with which you must produce a model at the end, you probably don't have the luxury of trying many novel architectures
or new optimizers; most of the complexity comes in the form of engineering execution and getting things to run on new hardware. So even in the somewhat haphazard way it looks when we change these hyperparameter settings, we're only fiddling with a very small number of variables. In theory that combinatorial explosion can be somewhat avoided, but still, we're not taking too many risks there, mostly because each time we restart,
and these runs are not cheap, we're aiming to get the run back on track as soon as possible.
In essence, a lot of this work comes down to minimizing unknowns and getting to the trained model as quickly as possible;
it might look a little haphazard, but that's mostly because we have a very bounded amount of compute time to spend.
Okay, I can see that time is running out, so let's take three more questions.
The first: we want LLMs to remember the knowledge from the pre-training phase.
Does this mean we can't use dropout in the pre-training phase?
I think the two are pretty decoupled. Dropout was introduced as a form of regularization; it helps break up correlations in the model so that maybe you learn a more robust representation. Anecdotally, we see even from PaLM, the 540-billion-parameter model that Google trained, that they removed dropout and it seems there was no difference; I believe LLaMA did as well, though I'd have to double-check. Even for our very large runs we took dropout out and didn't really see a difference. So it's one of those things that might make a meaningful difference at smaller scales, similar to the Primer paper from a couple of years ago that showed squared ReLU is a great activation function that makes things train a lot faster, but at scale, with numerical-instability issues, that just wasn't feasible for us to apply. Things that look promising at smaller scales may not extrapolate to larger scales. So yeah, I'd say the two are,
in the case of dropout, kind of unrelated. I forget what the other part of the question was, but yeah. Okay.
What are the main challenges and limitations when applying autoregressive models such as GPT and LLaMA to African languages?
Oh, I don't have any experience with African languages, unfortunately,
so I can't comment on this at all. Okay, what do you think of generative models versus the joint-embedding architectures mentioned by Yann LeCun?
Oh, I don't think I can comment on this either; I'm more of an engineer here than someone speculating on research directions.
Okay, I think it's about time, and really, thank you again,
Susan, for giving us this presentation. Next time I hope you can join our conference offline.
Yeah, I would love to. Thanks so much for having me. This was great. Thank you so much.
Yes, thank you again.
Generative Models Forum (生成模型论坛) - P1 - 智源社区 - BV1e14y1m7Rr
Hello everyone, and welcome to the Generative Models Forum of the 2023 BAAI Conference. I am the forum chair and host; my name is Chongxuan Li (李崇轩). I'm very happy and honored to have the opportunity to organize this event
with BAAI's support and to share with you the latest progress in generative models. Today we are very honored to have invited Professor Ermon of Stanford University, Professor Zhao Zhou (赵洲) of Zhejiang University, researcher Liu Guang of BAAI, Professor Bolei Zhou (周博磊) of UCLA,
and Professor Jiajun Wu (吴家俊) of Stanford, to bring us the frontier advances in generative models. At the end there will be a short roundtable of about half an hour, where Professor Jun Zhu (朱军) of Tsinghua University will join the speakers for a more detailed discussion.
Our first talk today comes from Professor Ermon of Stanford; the title is "Advances in Generative Models". Let me introduce him: Ermon is an associate professor at Stanford University. His research focuses on machine learning and AI; he has won multiple best-paper and outstanding-paper awards at top conferences, and he is very well known for his work on score-based generative models.
So it's time; let's welcome him to give the talk. Okay, can you share your screen? Oh, thank you,
thank you so much. Perfect, can you see my slides? Oh yeah, yeah. Okay, great.
And you can hear me okay? Yeah, everything works. Okay, perfect, great. Thanks for the introduction; it's a pleasure to give this talk remotely. This is joint work with my former PhD student Yang Song, who is an alumnus of Tsinghua University and will soon be an assistant professor of computer science at Caltech. ... And here's another example of the kind of thing you can do: perhaps you don't want to control the generative process through a caption;
maybe you want to provide a sketch of what the painting should look like. This is the kind of image I could provide myself, and then you ask a generative model to create a painting, a beautiful image, consistent with the sketch provided by the user, and you might get an image like the one at the bottom: consistent with the user's guidance, but much more beautiful.
Underlying this technology is a model that understands the structure of natural images, that understands which sequences of pixels are likely
and reasonable and which are not. The underlying assumption is that there is some data distribution,
some function that assigns high probability to images whose pixels make sense, whose objects have the right structure and look reasonable,
physically consistent with the things we might see in the real world or in images from the internet.
The issue is that this function is unknown, and all we have access to is a large number of samples, say images harvested from the internet.
The goal is to come up with a model of this distribution,
some computable function that can tell us which images have high probability and which don't.
If we can come up with this model distribution,
we can do many of the interesting things we've seen before. We can sample from it:
ask the computer to generate sequences of pixels that the model assigns high probability, thereby generating new images with the right structure.
They are similar to the ones seen during training, but they are different, they are new;
there is some aspect of creativity in generating new content here.
Another thing we can do is ask the model
whether a given input image is likely or not, perhaps to detect adversarial attacks or to figure out whether something is wrong with the inputs provided to our machine learning system. So there are many
interesting use cases for this kind of generative AI tool: a probabilistic model, a generative model of the distribution over natural images.
Now, the problem, and why we haven't been able to do this before, is that building such a generative model is challenging, because this probability distribution, drawn on the left as a pretty simple object, is in fact very complicated: it is a distribution over a very high-dimensional space. If you think about the space of possible images,
images have many, many pixels, and so there are many,
many combinations of colors you can assign to those pixels, which means the model must assign probabilities to an extremely large number of possible objects.
And that's challenging. So the question is: how do we construct a distribution flexible enough to capture the complexity of something like the distribution over natural images?
One thing we can do is pick a simple statistical model like a Gaussian distribution. You can think of a Gaussian as a really simple neural network that takes as input an image x,
viewed as a vector of pixels, and maps it to a scalar value,
the probability of that input under the Gaussian,
via a relatively simple formula: you subtract off the mean and then use the standard normal expression to calculate the probability of the data point x.
You can think of this as a very shallow neural network. And, you know,
Gaussian distributions are great, but as you can imagine
they're not flexible enough to represent something as complicated as the distribution of natural images.
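For concreteness, the closed-form density being described is the standard multivariate Gaussian (standard notation, not from the slides):

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big),$$

i.e., subtract the mean and evaluate the normal expression, exactly the "shallow network" computation just mentioned.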
So ideally we need something more powerful: we need deep neural networks to represent this complicated function that takes an image as input and maps it to a probability value. The idea is to use a deep neural network to construct the probabilistic model; that's the whole idea behind deep generative models, using deep neural networks to build generative AI tools.
Now, the reason this is not entirely straightforward is that probability density functions are not arbitrary functions,
and if I take an arbitrary neural network, its output for a given input image x might be,
for example, a negative value, while probabilities cannot be negative. So if we take an arbitrary network, call it f_theta, the output might be negative, and we have to change the structure of the network so the output is non-negative. That part is relatively easy: for example, take an exponential of the output.
The more complicated constraint is that the probabilities of all possible inputs must sum to one:
if we sum the probability over all possible images, we get one.
Enforcing this normalization constraint is not easy. To make the function normalized,
we basically have to divide by a normalizing constant, Z_theta,
which is the integral of the unnormalized probability over all possible inputs.
This object is easy to compute for something simple like a Gaussian,
but in general it's very hard to compute for an arbitrary neural network.
It involves an integral or a sum over a very high-dimensional space,
and this is provably computationally intractable: even with a discrete input space,
it is #P-complete, believed to be even harder than NP-complete problems. And, you know,
there's been a lot of work in the past few centuries, from different
fields and disciplines including statistical physics and statistics;
a lot of smart people have thought about ways to deal with this partition function and come up with approximate algorithms to compute or approximate this quantity efficiently.
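In symbols, the construction just described is the standard energy-based model (my notation, matching the talk's f_theta):

$$p_\theta(\mathbf{x}) = \frac{e^{f_\theta(\mathbf{x})}}{Z_\theta}, \qquad Z_\theta = \int e^{f_\theta(\mathbf{x})}\,d\mathbf{x},$$

where the exponential enforces non-negativity and the intractable partition function $Z_\theta$ enforces normalization.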
The way we are going to make progress is that instead of working with probability density functions,
which have to be normalized, we're going to work with their gradient,
called the score function: the gradient of the log density function. Intuitively,
you can think of the probability density function as a function that takes an input x and maps it to a scalar,
the probability under the model. The score is just the gradient of that function with respect to x.
So if the density p(x) is a mixture of two Gaussians, like you see here, where the color represents the likelihood under the model,
with one Gaussian on the top right and one on the bottom left,
the corresponding score is a vector field of gradients pointing toward the high-probability regions: at every point, the score gives you the direction to follow if you want to increase the probability most rapidly.
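As a minimal illustration of this definition (my own sketch, not code from the talk), here is the score of a two-mode Gaussian mixture computed with PyTorch autograd; evaluating it at any point yields an arrow pointing toward the nearest mode:

```python
# Minimal sketch (not from the talk): the score of a two-component Gaussian
# mixture, computed with autograd. The modes at the top-right and bottom-left
# mirror the example on the slide.
import torch

means = torch.tensor([[2.0, 2.0], [-2.0, -2.0]])  # two mixture components

def log_p(x):
    # log p(x) for an equal-weight mixture of two unit-variance Gaussians,
    # up to an additive constant (constants don't affect the gradient)
    sq_dists = ((x[None, :] - means) ** 2).sum(dim=-1)
    return torch.logsumexp(-0.5 * sq_dists, dim=0)

def score(x):
    # score(x) = grad_x log p(x): a vector pointing toward higher density
    x = x.clone().requires_grad_(True)
    return torch.autograd.grad(log_p(x), x)[0]

print(score(torch.tensor([0.5, 0.0])))  # points toward the (2, 2) mode
```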
And the interesting thing is that, as we'll see, this will allow us to define a very flexible class of models and directly bypass the issue of the normalization constant, while at the same time getting very high-quality images. This is the technology behind a lot of the advances we've seen in generative models of images,
video, speech, and so forth. It will also allow us to do a number of other interesting applications built on top of generative models, like evaluating probabilities, outlier detection,
controllable generation, and so forth. So let's start by looking at why modeling the score is a better alternative to modeling the underlying probability density function directly.
As we discussed, when you model the probability density function,
you're really using a neural network to map inputs to probability values. And as we've seen,
this means the integral over all possible inputs has to be one; in the one-dimensional case, you must choose curves whose area under the curve is fixed to one.
To achieve that with an arbitrary neural network,
you basically have to divide by the area under the curve, the normalization constant,
which is typically hard to compute. The nice thing is that if you instead look at the score,
the gradient of the log density function, this object doesn't have to satisfy any such normalization constraint;
it's a much easier function to model. And we can see it mathematically: if we take the log of the expression on the left and take the gradient with respect to the input,
which in this one-dimensional case is just a standard derivative, you know,
by taking the log of the expression on the left we get two components: f_theta,
the output of the neural network, and the log of the partition function.
The interesting thing is that the log partition function is just a constant, the area under the curve, and does not depend on x;
it's the same value regardless of the value of x, and when we take the gradient of a constant with respect to x,
we get zero. So we see that
by taking the gradient we've eliminated the dependency on the partition function, the normalization constant, and we get an object that is much easier to model using a neural network.
We are no longer constrained to make the area one, to keep the object normalized; we can directly use an arbitrary neural network to model the function on the right.
That's going to be the score model, and this is the key innovation that has allowed us to use more powerful neural networks to develop probabilistic models of images;
this is really the key machinery behind a lot of the success we're seeing in score-based diffusion models.
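The one-line derivation being described, in the talk's notation:

$$\nabla_x \log p_\theta(x) = \nabla_x f_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0} = \nabla_x f_\theta(x),$$

so an unconstrained network can model the right-hand side directly.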
Now, the question is: when we're given some training data,
the usual setting with a training set of samples drawn from some unknown data distribution, how do we fit the model?
When we fit a generative model by working directly with the density function,
we know how to choose its parameters to match the data distribution as closely as possible:
typically you do maximum likelihood estimation,
choosing the parameters theta so that the likelihood of the observed data points is as high as possible.
The question is, now that we no longer work with the density directly but with its gradient,
how do we estimate the score of the data distribution?
How do we come up with a score function that is a good approximation to the score of the true data-generating process, when we only have access to samples?
We only have access to a bunch of training examples,
and we want to estimate the underlying vector field of gradients. It turns out this can be done.
So, given a set of examples sampled i.i.d. from some unknown data distribution,
how do we estimate the score of the data-generating process, the gradient of log p_data?
The idea is that we're going to define a score model:
say a neural network s_theta, a vector-valued function that for every input
gives us a vector, an estimate of the gradient of the log density of the
true data-generating process at that point. What we can do is find parameters for this neural network so that it approximates the true vector field of gradients as well as possible.
Now, in order to do that, we need to be able to compare
the vector field produced by the model against the ground-truth vector field.
You can imagine that there is a ground-truth vector field of scores,
and an estimated vector field of scores for a particular choice of the neural network's parameters theta.
How do we compare these two objects? A reasonable thing to do is to overlap the two vector fields: at every point there is a true arrow,
a true gradient, and an estimated gradient. We can look at their difference, and if these differences are small,
then we have a good approximation of the true vector field of scores. In practice,
we can look at all these errors, take their norms,
average them, and get a scalar value capturing how similar the two objects are.
Mathematically, this is known as the Fisher divergence;
it's the average squared difference between the ground-truth score and the estimated score at every point.
It's a reasonable metric for comparing two vector fields of scores:
if you can drive this quantity to zero, you have a perfect model of the vector field of gradients.
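Written out (the standard form of the Fisher divergence, with the conventional 1/2 factor):

$$\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{\text{data}}}\Big[\big\|\nabla_x \log p_{\text{data}}(x) - s_\theta(x)\big\|_2^2\Big].$$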
The problem is that this looks like a quantity we don't know how to evaluate,
let alone optimize, because it depends on the ground-truth vector field of gradients, which we don't know; we only have access to samples from the data distribution,
and the goal is precisely to estimate that ground-truth data score. How can we evaluate this quantity,
how do we optimize it, when it depends on something we don't have access to?
It turns out that if you do integration by parts, you can rewrite this objective function into an equivalent form, up to a constant,
that no longer depends on the unknown ground-truth score; it only depends on s_theta,
our deep neural network model. So we get an equivalent objective function
that we can evaluate, that we can approximate using our samples, and that only depends on theta.
As you can see, we're basically trying to minimize the norm of the estimated score evaluated at the data points,
while at the same time minimizing the trace of the Jacobian of the score evaluated at the data points in the training set.
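For reference, the integration-by-parts identity being invoked is Hyvärinen's score-matching objective, equal to the Fisher divergence up to a constant that does not depend on $\theta$:

$$\mathbb{E}_{x\sim p_{\text{data}}}\Big[\tfrac{1}{2}\big\|s_\theta(x)\big\|_2^2 + \operatorname{tr}\big(\nabla_x s_\theta(x)\big)\Big] + \text{const}.$$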
So now this is a reasonable objective function, but the challenge is that it still depends on the trace of the Jacobian,
which is potentially expensive to compute for high-dimensional data: the trace of the Jacobian is a sum of partial derivatives with respect to the different input dimensions, and evaluating it exactly might require many backpropagation passes through the network. Luckily, there are several ways to approximate this quantity.
One that I like is the idea that instead of comparing the vector fields of gradients directly,
we can compare their random projections: pick a few directions, and if the vector fields match,
their random projections should also match. And if we pick a
larger number of different directions, this becomes a pretty good approximation.
By comparing random projections, we are now comparing one-dimensional objects,
and this leads to an objective function that is much more computationally efficient.
It basically no longer depends on the dimensionality of the data,
so it scales to very high-dimensional datasets like images,
while still retaining many of the nice properties of score matching,
like consistency and asymptotic normality. I'm going to skip some of the details,
but the key takeaway is that there is a way to train a score model that depends only on the data and scales to high-dimensional datasets, either through sliced score matching or other methods like denoising score matching developed by other researchers.
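A compact sketch of the random-projection idea (my own illustration in PyTorch, not the authors' code; `score_net` is an assumed score network mapping a batch of points to a batch of score vectors):

```python
# Minimal sketch of sliced score matching: compare random projections of the
# score field so the cost no longer scales with the data dimension.
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    x = x.clone().requires_grad_(True)
    s = score_net(x)                       # estimated scores, shape (B, D)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)            # random projection direction
        sv = (s * v).sum()                 # sum of v^T s over the batch
        # v^T (Jacobian of s) v via one backward pass, instead of the full trace
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
        loss = loss + ((grad_sv * v).sum(dim=-1)
                       + 0.5 * (s * v).sum(dim=-1) ** 2).mean()
    return loss / n_projections
```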
Now that we've talked about bypassing the normalization constant by working with the score instead of the density directly,
and we've seen how to estimate the score from data,
let's talk about how to use score models to generate new samples, do controllable generation, and solve other interesting tasks.
What we've seen so far is that it's possible to take samples from a data distribution and come up with a good approximation of the underlying vector field of gradients.
Now, the question is: how do we use this object to, say, generate new samples?
How do we use these estimated arrows to generate new samples,
when all we have access to is this vector field of gradients? Intuitively,
you could imagine a strategy where we initialize particles at random
and then follow the arrows to try to go toward high-probability regions.
Intuitively this kind of works, but it's not quite a valid sampling strategy, because at some point you'll get stuck in a local optimum, and that's not quite what we want when sampling from a distribution.
But it turns out that a simple modification of this strategy gives a valid sampling algorithm.
In particular, there is a sampling procedure called Langevin dynamics, or Langevin MCMC,
which works by following the gradient and adding noise at every step.
It turns out that if you do this,
if instead of just following the gradient you add a little bit of noise at every step, and you do this for a sufficiently long time,
then asymptotically you will produce samples from the underlying distribution. So to the extent that we have a good approximation of these arrows, knowing which direction to go, we know how to generate good samples from the model.
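The update rule being described is the standard (unadjusted) Langevin step:

$$x_{i+1} = x_i + \frac{\epsilon}{2}\,\nabla_x \log p(x_i) + \sqrt{\epsilon}\, z_i, \qquad z_i \sim \mathcal{N}(0, I),$$

which samples from $p$ in the limit of a vanishing step size $\epsilon$ and an unbounded number of steps.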
The challenge is that if you try to do this naively, it doesn't work. Here's an example: you learn a score-based model on the CIFAR-10 dataset,
then you run Langevin dynamics, and you get the kind of samples you see on the right. They don't even look like real images;
they look like random gray noise with some color patches,
images without the right structure, nothing like the ones used for training the model.
So what is going on here? Well, think about the training objective we have:
we are training the model based on samples,
and most of the samples come from high-probability regions under the data distribution.
So we're getting a pretty good estimate of what these arrows should look like
close to high-data-density regions,
but we don't have a good approximation of these arrows far from the high-density regions, because we've never seen training points there:
all our samples come from the data distribution, nice,
clean images, and those are the only samples we know how to deal with;
we've never trained on random noise, for example. However, when we initialize our Langevin chain,
our samples are going to be initialized all over the place, and by following the arrows they need to reach the high-probability regions.
Here you see again the simple data density we've been using as an example throughout the talk, with two modes,
one at the top right and one at the bottom left. If you do score matching, the arrows are pretty accurate close to the modes but very inaccurate elsewhere: comparing them, they point in the wrong directions; they don't quite match the ones in the middle figure,
the true ones. The estimated arrows look different, and what this means is that if you follow them using Langevin dynamics,
you will have trouble exploring the low-data-density regions; you will get lost, and you will not be able to produce good samples.
One solution is to add noise to the data. If you add noise to the data,
you get a new perturbed density that is supported over the whole space,
and we will see samples from this perturbed density all over the space.
In practice, if you think about an image,
it means adding noise to the image: you take this dog and add noise to it so
it becomes a little bit fuzzier. If we do this, we get a new density,
a perturbed data density, whose score is relatively easy to estimate,
because we see samples all over the space
and can therefore estimate the scores all over the space,
which means our Langevin dynamics procedure can sample from this distribution.
However, we're no longer sampling from the clean data density;
we're sampling from an approximation, a data density that has been artificially perturbed with noise:
we won't generate clean images of the dog like the one on the left;
we'll generate images plus noise, like the ones on the right,
which is not what we want. So the solution is to consider multiple noise levels,
different views of the data distribution perturbed with increasingly large noise levels,
say sigma_1, sigma_2, sigma_3, and so forth,
which you can think of as adding increasingly large amounts of noise to, let's say,
images of this dog, until the structure in the image is completely destroyed.
What we do now is jointly estimate the vector fields of gradients of all these data distributions perturbed with increasing amounts of noise,
using a single score network, just like before:
a neural network that takes x as input, takes the noise level sigma as input,
and produces an estimate of the score for that x at that noise level.
We can train this score-based model using a score-matching objective just like before, via the Fisher divergence between the two distributions,
basically using the same score-matching loss we talked about before.
And now that we've estimated all these vector fields of gradients,
we can produce samples using a variant of the Langevin dynamics procedure I described earlier.
We initialize our samples at random and start by following the gradients of the data distribution perturbed with a very large amount of noise.
These gradients will be pretty accurate, so we start moving toward high-probability regions.
Then we use these samples to initialize a second Langevin chain, where we reduce the amount of noise and again follow the arrows,
but now the gradients correspond to a data distribution perturbed with a smaller amount of noise.
Then we use those samples to initialize another Langevin chain for a data distribution perturbed with an even smaller amount of noise,
until the noise level is so small that we're essentially sampling from a distribution indistinguishable from the true clean data density.
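Putting the procedure just described into a short sketch (illustrative only; `score_net(x, sigma)` is an assumed noise-conditional score network, and the step-size rule follows the published annealed-Langevin recipe):

```python
# Minimal sketch of annealed Langevin dynamics: run a short Langevin chain at
# each noise level, from large sigma to small, warm-starting each level from
# the previous one.
import torch

def annealed_langevin(score_net, shape, sigmas, n_steps=100, eps=2e-5):
    x = torch.rand(shape)                         # initialize anywhere
    for sigma in sigmas:                          # sigmas sorted high -> low
        step = eps * (sigma / sigmas[-1]) ** 2    # scale step to noise level
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, sigma) + step ** 0.5 * z
    return x
```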
And this procedure actually works. Here's an example of how it can generate samples on some common image datasets;
this was back in 2019, and we were very excited, because we were finally able to get this procedure to generate MNIST digits and pretty realistic CIFAR-10-style images. You can see how the procedure goes from noise to data: it follows these gradients, these arrows, pushing random noise toward high-probability regions and generating images with the right structure.
Scaling this up led to state-of-the-art sample quality on CIFAR-10 back when we published this work at ICLR.
For the first time, this kind of procedure was able to beat GANs, generative adversarial networks,
which had been the state of the art in generative modeling for several years.
Despite all the engineering that went into GANs and a lot of investment from large tech companies,
we were able, for the first time, to beat GANs in terms of image quality on this academic dataset,
CIFAR-10, which was very exciting, because it was the first time a very different class of models was able to beat generative adversarial networks.
Pushing a bit further, by scaling up to bigger datasets
and higher-resolution images, we were able to generate, let's say,
faces like the ones you see here. And this is the key technology behind things like Stable Diffusion, Imagen, DALL·E 2, and Midjourney;
all these excellent text-to-image generative models are, at their core, based on this idea of estimating the scores of a set of distributions,
data distributions perturbed with increasingly large amounts of noise.
It started out as an academic project, eventually had a deep impact in industry, is now used by many users all over the world, and has really unlocked incredible new capabilities in terms of the kinds of images we can generate with these models.
Now, I think I still have some time, so maybe I'll talk a little about how diffusion models allow us to control the generative process in a very natural way.
So let's say that we've
trained a model that can generate images of cats and dogs.
Now suppose we want to generate only images corresponding to the class label "dog", and say we have a classifier that can tell whether an image is of a dog or a cat.
Is it possible to sample from the posterior distribution of images, given that the corresponding class is "dog"?
this posterior distribution is basically just defined but through Bay as rule。
posterior distribution of images given the class label is something that we obtained from the prior distribution of our images P of x。
The likelihood P or Y given X, which could just be a classifier or some kind of way of assigning a label Y to an image X。
and then we have to normalize by this denominator PO Y。Which is typically intractable to compute。
Can you see that this behaves essentially like the normalization constant。
the partition function that we talked about before。
this is essentially a number that you have to divide the expression with to make sure that it's normalized and it's a valid probability distribution。
Computing with this denominator is what makes Bayesian inference so hard。
it's typically very difficult to compute because once again it involves some integration of a high dimensional space。
The good news is that Bayes' rule simplifies if we start working with score functions instead of densities directly.
If we take the log of the expression at the top and then take the gradient with respect to x,
we get several pieces, but the important thing is that the denominator p(y) does not depend on x.
So when we take the gradient with respect to x, the term we didn't know how to compute disappears.
The score of the posterior distribution is just the score of the prior model p(x) plus the score of the likelihood p(y|x), which we might have access to directly.
So if we have a pretrained score model that tells us what the scores of images look like,
we can combine it with the score of any forward model we want: it could be a classifier,
or it could be anything else, and we get the score of the posterior distribution;
just by adding up these two components, we get a model that can sample from the posterior distribution.
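The identity being used, with the intractable term dropping out:

$$\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x) - \underbrace{\nabla_x \log p(y)}_{=\,0}.$$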
This allows for many different applications. For example, instead of a class label,
y could be a stroke painting.
Given a pretrained generative model of images, we can combine it with a likelihood function
p(y|x) and get a model that synthesizes images from stroke paintings. Here you see some examples: you can create an image that has the structure of the stroke painting provided by the user but is photorealistic.
Another example is language-guided image generation:
if you have a good image-captioning model that can tell you whether a caption y is consistent with an image x,
we can create a conditional generative model that produces an image given a caption.
You can use this machinery to build language-guided, text-to-image
generation: take an image-captioning model,
combine it with a generative model of images,
an unconditional model, and you get language-guided image generation.
Another example is medical imaging, where the idea is that we might want to reconstruct a medical image in a setting like MRI or CT scanning.
The machine in this case gets some kind of measurement of the patient,
and we can think of the medical imaging problem as reconstructing the cross-sectional image of the patient
given the measurement obtained from the machine. In this case,
the model relating the cross-sectional image of the patient to the measurement given by the machine comes from some physical simulation,
but that doesn't matter to us: we have the forward model
p(y|x), we can combine it with a prior model p(x) over medical images, and we get a powerful medical image reconstruction tool that actually beats deep learning models trained specifically for this task.
It outperforms task-specific deep learning methods even though our method is fully general, obtained just by combining a prior model with a likelihood, which means it's much more general:
it can be applied to different numbers of projections, different kinds of measurements,
and it gives better performance even though it's more general. And, you know,
this kind of technology has led to state-of-the-art results on a variety of other datasets: audio,
material design, text-to-speech, shape generation, and many more.
Now, I think I have a few more minutes, so I'll talk briefly about why these models are also called diffusion models.
That connection arises if we start thinking about what happens if we consider an infinite number of noise levels.
Recall that the underlying idea of this modeling framework is to model not just the data distribution,
but the data distribution perturbed with increasingly large amounts of noise.
Here I'm showing, let's say, the data distribution p_0, and then several perturbed views of the data distribution with increasingly large amounts of noise,
sigma_1, sigma_2, sigma_3. You can imagine what happens if we consider more perturbed distributions, an increasingly
fine-grained set of data distributions perturbed with different noise intensities.
In particular, think about an infinite number of noise distributions that interpolate between the clean data distribution at time zero
and a data distribution perturbed by a maximal amount of noise at time t = T.
So you can imagine a sequence of distributions: we start with clean data on the left,
and as we move toward the right, we get data distributions perturbed with increasingly large amounts of noise.
In particular, you can imagine a stochastic process that
behaves according to this set of distributions marginally,
and think about what happens if we take clean data
and perturb it by adding increasingly large amounts of noise
until the structure in the data is completely destroyed.
This stochastic process can be described by a simple differential equation,
a stochastic differential equation where we basically take data and add a little bit of noise at every step, until after adding a sufficiently large amount of noise,
all structure in the data is completely destroyed.
It turns out that it's possible to think about the reverse process,
where instead of going from data to noise, we go from noise to data:
instead of going from time zero to T, we go from time T to zero. And if we go from noise to data,
this reverse process is solving the problem of generating data for us:
going from noise to data is exactly generative modeling.
The interesting thing is that the reverse process of going from noise to data can again be described by a stochastic differential equation.
It's not important that you understand exactly what these equations mean;
the important bit is that this reverse stochastic differential equation can be described in terms of the score function.
So once again, if we can estimate the score functions of these data densities perturbed with increasingly large amounts of noise,
then we can describe the reverse differential equation that maps noise to data.
And that's essentially the connection with diffusion models:
if we use score matching to estimate the score functions, we can reverse
this stochastic differential equation and get what is essentially a continuous-time diffusion model that we can use to generate samples.
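For readers who want the equations, the standard forward and reverse SDEs from this line of work are (standard notation; $w$ and $\bar{w}$ are forward- and reverse-time Wiener processes):

$$dx = f(x,t)\,dt + g(t)\,dw, \qquad dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w},$$

so the only unknown needed to run the reverse process is the time-dependent score $\nabla_x \log p_t(x)$.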
And I think I'm running out of time,
but I'll briefly mention that this stochastic-differential-equation perspective is very helpful, because
it allows us to use a lot of different techniques from the numerical-methods literature.
In particular, it's possible to use very fast solvers for computing the solutions to these differential equations;
it turns out that you can convert the stochastic differential equation into an ordinary differential equation
in which the process is no longer stochastic but deterministic. Again, the ordinary differential equation depends on the score function, so to the extent that we can estimate the score function,
we can define an ordinary differential equation that maps noise to data.
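The deterministic counterpart being referred to is the probability-flow ODE, which shares the same marginals $p_t$ as the SDE above:

$$\frac{dx}{dt} = f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x).$$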
And we can use numerical methods to solve this differential equation. I'll skip the details,
but we can use this ODE to accelerate sampling:
we can solve it very efficiently using techniques from numerical methods developed over decades to solve ODEs fast.
For example, we can coarsely discretize the time axis and take big steps, achieving very high speedups with comparable sample quality;
we can also use parallel ODE-solving methods to accelerate sampling; and we can use distillation:
we can train a student model to do in one step what the ODE solver would do in two steps,
then repeat this process recursively many times until we get a student model that can generate samples in one step.
And these samples are extremely good; here you can see some examples when we apply these techniques to Stable Diffusion:
we can generate very high-quality images requiring only one or two steps. And again,
this is made possible by interpreting the sampling procedure as solving an ordinary differential equation
and using these techniques to accelerate sampling. Maybe let's skip this, and yeah, I think,
I think I'm out of time, so I'll conclude here. This was a high-level overview of diffusion models. The key ideas are using scores to model the distribution instead of likelihoods, which enables us to use essentially arbitrary neural networks to model these vector fields of gradients;
the fact that we can train these models without resorting to adversarial methods and minimax objectives as in generative adversarial networks,
since there is a stable, proper scoring rule based on the Fisher divergence that we can use to train them;
and that we can use these models for controllable generation.
We can use many ideas from numerical methods to sample from these models very efficiently. One thing I skipped is that we can also get likelihoods out of these models: not only can we generate samples, but given a sample we can evaluate how likely it is under the model, which is pretty useful for, say, anomaly detection and a variety of other applications. This is really the core technology behind a lot of the super exciting advances we've seen in industry, and we're seeing more and more applications of this to other domains. One question that is still open is whether these models can be competitive with autoregressive ones on text; that's going to be a big open challenge, and hopefully we'll see some progress on that soon. And yeah, I think that's it;
I'm happy to take questions. Okay, thank you so much, thank you so much for your beautiful talk.
Maybe you cannot see it through the camera, but actually
the forum is full of people, and many people are standing on both sides.
We have time for one or two questions in this session. Oh, okay. Can you get the microphone?
Do you believe we can build a general large model for images? Could you repeat the question, sorry?
Do you believe we can build a general large model? Is it possible,
or impossible, to build a general large image model? A general large model, yeah.
What's a general large model? (The questioner clarifies in Chinese that he means a general-purpose large model, i.e., AGI, which the interpretation wasn't conveying.) Okay, I can translate into English:
the question is, do you believe that we could build a general-purpose model, like AGI?
Oh yeah, that's a good question. I think it's hard to say. I don't feel like there is any impossibility result that will prevent us from getting there; I mean, evolution has developed a system that can do it, and I feel it should be possible to replicate it. I don't know how close we are; I personally don't think we are very close, and there's going to be a lot of work to be done to get to AGI, but I don't think it's impossible. I think we'll get there eventually; we just need to keep working on it, and we might need new ideas,
new models, new methods. Nobody knows how close we are, but I do believe we will eventually get there; I don't think there is anything fundamentally preventing us from getting there.
Okay, thank you. We have another question, the last one. Okay, thank you. Thank you.
So my question is: the success of the score-based model depends on successful estimation of the score.
Does the score estimation rely heavily on the architecture of the neural network? I mean,
currently most models use the U-Net. So is it possible for the U-Net to estimate any distribution of general data?
Thank you. It's a great question, and I think a lot of the advances were actually enabled by the fact that we could use more complex architectures that don't have to be autoregressive or invertible. The fact that we can just plug in a U-Net really is one of the key things that enabled these models to work so well.
As for whether there could be better architectures: they probably are out there, and we just need to discover them. There are certainly a lot of architectures that don't work, which we tried before getting U-Nets to work. I know people have had some success with transformers too, so I believe there are probably better ways to do it, and depending on the modality you might want to use something else. U-Nets seem pretty good,
on images at least, but there might be other things: on graphs we've used graph neural networks, so it depends a bit on the application. And, well, it is deep learning, so
the architecture is super important. There's some beautiful math on top, but the architecture matters, and there is some magic here in these neural networks being able to estimate scores reliably; that is really enabling the success of these models. A priori, estimating the scores could be hard, and there is definitely some deep-learning magic going on for sure.
Okay, thank you. Thank you. Okay, thank you again, Ermon, for your time and the nice talk. Okay,
so thank you. Yeah. Alright, after this talk, we are very honored to have invited Professor Zhao Zhou (赵洲) of Zhejiang University to bring us the latest progress on multimodal generative speech and audio models.
Let me briefly introduce Professor Zhao Zhou. He is a professor and PhD supervisor at the College of Computer Science, Zhejiang University. His main research directions are natural language understanding, computer vision, and generative models. He has published a great many influential papers in international journals and conferences,
with more than 8,000 Google Scholar citations. He has many multimodal generative works, including speech models such as FastSpeech,
as well as well-known generative vision models and algorithms, including the pioneering PNDM. His work has been applied in projects at top companies such as Microsoft and Huawei, and in the very famous open-source model Stable Diffusion. He has won
a First Prize of the Ministry of Education Science and Technology Progress Award and a First Prize of the Chinese Institute of Electronics Science and Technology Progress Award. Let's welcome Professor Zhao to give his talk. Thank you, teachers, and thank you, students. It's a great honor to introduce some of our recent work at the BAAI conference.
The topic I bring you today is multimodal generative audio models. Compared with Professor Ermon's theory, we will mainly introduce applications of generative models, including diffusion models. This talk
mainly covers three recent lines of work. The first is text-to-speech, based on our FastSpeech models; the second is generating singing voices;
the third is generating general audio. So the talk introduces applications in audio generation from these three angles. First, since many earlier speakers have covered text generation:
audio generation is also a kind of cross-modal generation. Given text, we generate its audio form, for example turning the text of "Moonlight over the Lotus Pond" (荷塘月色) into a speech signal. This has many applications,
such as audiobooks and automated customer service. A typical synthesis framework consists of three parts. The first is called the frontend,
whose job is to turn the text into phonemes, using NLP techniques to extract pronunciation and prosody from the text. The second stage, after the frontend,
takes the phonemes and synthesizes the mel-spectrogram; you can picture it as the mapping from text to the spectrogram. The third is the vocoder, whose input is the spectrogram
and whose output is the speech audio. This talk focuses on generative models for the acoustic model. Notable past work is autoregressive TTS, including Deep Voice
and Transformer TTS. Here we mainly introduce our generative models for the acoustic model, i.e., the phoneme-to-spectrogram mapping: the low-latency FastSpeech, the highly expressive DiffSinger, and, for open-domain
audio with its many kinds of generation, Make-An-Audio. Let's first look at FastSpeech.
FastSpeech has been developed since 2019,
and up to 2023 we have run into a great many problems. What was the original backbone? It was the Transformer: in 2019 the Transformer was applied to text-to-speech
and achieved very good results, but many shortcomings remained.
For example, Transformer inference is rather slow; making inference faster is the first difficulty. Second, for this cross-modal task, the Transformer's generation quality is not particularly good,
so how do we raise the generation quality? Once quality is good, the third problem is this:
having improved inference speed and quality, if we want to deploy on device, can we compress the model size as much as possible?
The fourth problem: much of the foregoing, whether parallelization, high quality, or lightweight models, applies to Chinese and English alike, but Chinese differs greatly from English in its polyphony:
each character can be pronounced differently in different contexts. How do we solve this polyphony problem?
And the last one: beyond polyphony, can our synthesis be well personalized? In other words, we want to generalize the model in a series of ways. So let's go step by step.
First, Transformer TTS was a very good piece of work: in 2019 it used the Transformer model for text-to-speech synthesis
and achieved very good results compared with Tacotron. Building on the Transformer framework, we made a series of improvements. The first came in 2019:
with the Transformer framework, inference is relatively slow, and because the Transformer decodes autoregressively,
it can skip words.
To speed up inference and fix word skipping at the same time, we did the non-autoregressive work FastSpeech.
The main idea of the non-autoregressive approach is that instead of predicting one frame after another autoregressively at the decoder, we predict non-autoregressively. The important block here is the length regulator,
which bridges the input modality and the output modality: it predicts each phoneme's duration, i.e., its length, in the mel-spectrogram,
which in effect learns an alignment from one modality to the other.
But this duration predictor is quite hard to train, because mapping from one modality to another is one-to-many:
the same sentence can be spoken and expressed in different ways, so the mapping is relatively difficult. To solve this, the duration predictor
is not trained directly on ground truth; instead, an autoregressive Transformer TTS acts as a teacher from which we extract the durations, and the predictor is trained with an MSE loss.
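A minimal sketch of the length-regulator idea just described (my illustration, not the FastSpeech source code): expand each phoneme encoding by its predicted duration so the sequence length matches the mel-spectrogram frames.

```python
# Minimal sketch: a length regulator expands phoneme hidden states by their
# predicted durations, aligning the phoneme sequence with mel frames.
import torch

def length_regulate(hidden, durations):
    # hidden: (num_phonemes, dim); durations: (num_phonemes,) integer frame counts
    return torch.repeat_interleave(hidden, durations, dim=0)  # (num_frames, dim)

h = torch.randn(3, 8)               # three phoneme encodings
d = torch.tensor([2, 5, 3])         # predicted per-phoneme durations
mel_aligned = length_regulate(h, d) # 10 frames feed the mel decoder
```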
With that in place, we can look at the performance. On the left is MOS, a 1-to-5 rating where higher is better: Transformer TTS sits at 3.88, ground truth at about 4.0,
and FastSpeech reaches 3.84, a very small drop, while delivering roughly a 270x speedup; the lower figure compares FastSpeech's inference speed against the earlier Transformer TTS. That was the first problem. The second problem: with inference sped up, what remains?
What remains is that we still want the performance to keep improving. So we made an extension up front:
where previously the encoder was followed by the length regulator, there is now a variance adaptor. Besides the duration
prediction, it also predicts pitch, energy, and other attributes.
After extending the predictor into the variance adaptor, we saw two effects. First, quality improved a great deal:
the version-two model is a big step up from version one, and in evaluation its quality even surpasses Transformer TTS.
Meanwhile, on inference: with purely autoregressive decoding, synthesizing one second of speech takes on the order of 1e-1 seconds,
whereas the non-autoregressive model drops from 1e-1 to about 1e-3 seconds, a further inference speedup. One problem still remains, so we kept digging.
We found that in many cases we need the model size to be as small as possible:
while keeping the inference speed and the output quality, we still want to compress the size further.
Hence the work PortaSpeech; "porta" means portable, and the hope is that the model can keep shrinking. There are two kinds of generative models here. The first is the VAE-based generative model:
through experiments we found that although a VAE doesn't need many parameters and can capture the overall prosody of the synthesis, its loss introduces some blurriness.
The second is the flow-based, invertible model: with enough parameters
the output can be made quite realistic, but it needs a lot of parameters. So what do we do?
We cascade the VAE-based and flow-based models together, and that is PortaSpeech. What does PortaSpeech achieve?
We ran two versions, a normal version and a small version. In the normal version, combining the VAE and flow models,
the quality is better than the state of the art. Second, compressing the size,
the model uses only about a quarter of the parameters while still achieving very comparable performance. So at this level we compressed the model's parameters,
so that a much smaller model attains very comparable performance. Now back to Chinese. Text-to-speech
synthesis often involves Chinese, and Chinese has a big problem that no one had really solved: compared with English, a Chinese character is often pronounced differently in different contexts. So when we map to the mel-spectrogram,
from the semantic space to the acoustic space, we must look not only at the current character but also at its context.
How do we bring the context into the process? Suppose a text is input:
we can consult the dictionary. Take 听乐队 ("listen to the band"): the character 乐 has several pronunciations, one as in 快乐, 乐观, 取乐 (lè),
which is one pronunciation converted into its phonemes, and another as in 音乐 (yuè),
the second pronunciation. So we leverage the Chinese dictionary to map characters to their phonemes as well as possible, plugging the dictionary into the model. That addresses the Chinese polyphony problem. The last item here is that we want to generalize the model.
Since large language models are now very popular, we have also done work converting speech into tokens and modeling them with a language model.
But we found that not every speech module is suited to being turned into discrete tokens.
What works well is running the discrete-token language model over the prosody codes; that gives good performance. So for generalization, you can see here
the decoupling of the mel-spectrogram: we run a language model only over the prosody,
take the timbre from a reference audio, and synthesize the target speech. Let's try the demo.
Here is an Obama voice: "...workers lost their lives;
17 others were injured.
And soon, nearly a mile beneath the surface of the ocean, oil began spewing into the water."
"Good afternoon, everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning,
the course of Carnegie Mellon University. In the first part of the course,
we will talk about the generative deep learning models that are used to generate data that never existed in reality." (The same Carnegie Mellon clip is then replayed in the cloned voice.)
Next, let me introduce the singing voice model. Singing voice synthesis is very interesting, and it benefits greatly from diffusion models.
What can applying a diffusion model here do? It enables highly expressive synthesis of the singing voice.
Let's first look at the diffusion model: on the left is the diffusion model, and on the right our application. Of course, one can directly apply a diffusion model to singing voice synthesis, but we found a more interesting idea.
The first option is to generate directly from the condition. The second option, carrying over the PortaSpeech idea, is to cascade two models:
the first model captures the semantics, and the second captures the voice quality. It's the same here. On the manifold figure, what is M?
M is the original data, noised step by step up to step T. And what is M'? M' is
the spectrogram generated by another, earlier model:
we use our FastSpeech-style models to generate these spectrograms. When we diffuse both,
we find that both converge to white noise, and the interesting part is that at some step k
the two trajectories overlap.
That suggests a change of approach: one way is to use one single model to do the whole job;
the other is to use two models, where the first, an auxiliary model, generates M', capturing the semantic information
and producing a coarse spectrogram. Given the coarse spectrogram, we add noise for k steps in one shot and then denoise for only k steps,
generating the final singing audio.
That's the performance, and we found two interesting things. First, with this strategy
we can reduce the T-step denoising to k steps.
Second, using two models, carrying over the earlier FastSpeech/PortaSpeech idea, gives higher quality than a single singing model; it's a coarse-to-fine process. Let's listen to a demo.
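A rough sketch of this shallow-diffusion trick (illustrative; the function names are assumptions, not the DiffSinger API): jump to step k by noising the auxiliary model's coarse spectrogram, then run only k reverse steps instead of denoising from pure white noise for all T steps.

```python
# Minimal sketch of shallow diffusion for singing voice synthesis.
import torch

def shallow_diffusion_sample(aux_decoder, denoiser, phonemes, k, alphas_bar):
    m_coarse = aux_decoder(phonemes)      # coarse mel from a FastSpeech-style model
    noise = torch.randn_like(m_coarse)
    a_k = alphas_bar[k]                   # cumulative noise schedule at step k
    # jump straight to diffusion step k, where the two trajectories overlap
    x = a_k.sqrt() * m_coarse + (1 - a_k).sqrt() * noise
    for t in range(k, 0, -1):             # only k denoising steps, not T
        x = denoiser(x, t, phonemes)      # one reverse diffusion step
    return x
```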
(Plays two singing demos.) Yes, it can synthesize quite expressive singing; this task is harder than the earlier speech work,
because synthesized speech has nothing like singing's pitch vibrato and expressiveness. Next, we also made M4. What is M4?
M4 extends our earlier DiffSinger to a number of different applications. We just showed synthesis; what else? There is key shifting: listen to the original audio first,
then, using the earlier decoupling, we shift the pitch up, and then shift it down. (Plays the raised and lowered versions.)
Yes, and here is a fun one:
singing voice cloning with a target singer's timbre; this is the target audio, and this is the cloned result.
Besides raising and lowering the key, we can also take notes that didn't reach pitch and automatically correct them,
which yields a singing voice beautification application. Here are Chinese and English examples, before beautification and after fine-tuning the pitch.
(Plays the English song, "We're beautiful like diamonds in the sky", before and after.) Beyond this,
all the previous work was based on fine-grained scores, that is, input files with phoneme-level pitch. For the scores found online, can we just feed in a score directly?
This amounts to modeling a singer over word-level scores, working straight from realistic sheet music. Let's listen to its performance. (Plays a demo.)
Besides that, there is another fun thing built on the earlier work: not only can we go from text to speech, we can convert speech directly into singing.
Listen: this is the spoken voice, and here is the alignment, a mapping from speech to singing. Since the earlier work is open source,
有不同的一些其他人用开源的做出来的一个效果。那么这个是一个B站。我们呃可以搜索一个B站呃的的一个呃呃这个一个创作。那我们可以看一下放一段呃我们的呃B站的一个呃呃第三方的一个用工具来做出了一个效果。嗯。
喂,这个是有有问题。🎼这句一把星辰在手心啊,这个是第三方的歌曲合成一个完整的歌曲,遥远的眼睛。🎼窗外传来记忆的声音。😔,🎼在半夜迷失,在梦间消失却幻想着夜晚之前的。🎼一种逃离,听成中少年的声音。
有着清澈的眼睛,嘴里还说着,因为我们还年轻,所以总有再一次的。🎼的权利我也消受再家,风月梦幻,把相遇念留下,可看到盛开。唯摇曳的花,却能以自把。🎼曾经路上的风吹。🎼一个的他就不必害怕笑的问。🎼放大。🎼。
🎼对,那个呃其他有很多的case,那么直接进入呃B站搜索。嗯。🎼最翻页是吧?好好。🎼对呃,其他有很多的case可以呃搜索进入B站。那么呃点击keyword呃d。那么呃这个是一个那个搜索的一个呃。
这个是一个哎。🎼对,这个是一个呃搜索页面,大家可以呃 try一下,就是说呃有一些不同的一些呃第三方。那么用的呃工具,那么做出来的一个工作。我们可以看一下这个上面是一个呃fin的一个乐谱。
它的一个是基于这个来进行一些合成。那么这整首歌是呃完整的一些合成,大约是4分钟的一个合成的一个一个一个现象,合成现象。对。🎼那只看这下匆匆一簇繁在手中,那也有更多的一些例子。
因为我这次demo的一个例子,并不是它这个排名最高的例子。那我们可以看一下,它排名最高的呃都有非常多的一个浏览量是呃今年的呃突然是大家对这个是非常感兴趣,非常感兴趣,有非常强的一个浏览量。
用呃的呃呃开源的model来来制作这么一个呃歌曲。那么放到呃放到呃B站上就会有非常强的一个强的一个呃访问量和观看量。🎼し。🎼是你。🎼想说再见吃再见,把你和我留下与自己重叠,并格在一刹那短暂秋叉。OK好。
So that gives you a rough feel for what today's generative models can achieve on singing, and there are more demos on Bilibili. Next, having extended generative models from speech to expressive singing, let's go one step harder: general audio. Compared with singing, audio is not only expressive but also strongly open-domain, so open-domain generation is the first thing we tackle. Let me introduce the first work, which is again a diffusion application.
It is called Make-An-Audio. Thanks to text-to-audio diffusion, we can generate a soundtrack from text; we can also generate audio for an image, or for a video; and we support audio inpainting as well. Although the figure shows text, we actually support four input modalities for audio generation. Second, our handling of text was still imperfect: with a short text we can generate a matching clip, but audio is analogous to video in that it carries temporal information. To model that temporal structure we built Make-An-Audio 2, an upgraded version: through the input text we can control the order of events, e.g., first a bird chirps, then a truck passes, and so on. Since Make-An-Audio already handles several modalities, we then built Make-A-Voice: besides discretizing the text and the other inputs as in Make-An-Audio, we decouple the audio representation into two levels, first mapping to a semantic model, then to an acoustic one, before synthesis; so it does both decoupling and discretization. Third comes AudioGPT: whatever you feed it, speech or text, it routes to different tasks with different responses: it can generate audio, sing, do speech translation, do speech-to-talking-face synthesis, and so on. AudioGPT 2 pushes AudioGPT 1 further: version 1 is built mainly on GPT as the foundation model, orchestrating different generative models, while AudioGPT 2 uses a unified language model that supports translation and synthesis between arbitrary modalities.
Let's look at Make-An-Audio first. Audio generation is a third major track in generation, very broad and open, with strong applications such as audiobooks and dubbing. It benefits from today's strong pretrained models, CLIP and CLAP: CLIP aligns vision with language, and CLAP aligns audio with text and guides the generation. On top of that we use a latent diffusion model with classifier-free guidance. The biggest problem for audio is data: to make the generated audio truly open-domain we need very large amounts of strong paired data, but on the web there simply is not that much. What to do? In Make-An-Audio 1 we designed a rule-based augmentation with many hand-written rules for splicing audio and text: for example, take a bird-chirp clip here and a footsteps clip there; by concatenating them in different combinations we compose many more audio-caption pairs for training. So for data we did this pseudo prompt enhancement, producing more training pairs by concatenation (a sketch follows below).
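Here is a toy illustration of that rule-based audio-caption concatenation (the templates and clip format are made up for illustration, not the actual rules used):

```python
import random

TEMPLATES = [
    "{a} and {b}",
    "{a} followed by {b}",
    "{b} in the background while {a}",
]

def make_pair(clip_a, clip_b):
    """clip_x = (waveform_list, caption_str); concatenation composes a new pair."""
    wav_a, cap_a = clip_a
    wav_b, cap_b = clip_b
    template = random.choice(TEMPLATES)
    new_caption = template.format(a=cap_a, b=cap_b)
    new_wave = wav_a + wav_b          # naive temporal concatenation
    return new_wave, new_caption

bird = ([0.01, -0.02], "a bird chirping")
steps = ([0.03, 0.00], "footsteps on gravel")
print(make_pair(bird, steps)[1])      # e.g. "a bird chirping followed by footsteps on gravel"
```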
The final model was trained on 3,000 hours of audio with one million audio-text pairs. Now let's look at Make-An-Audio 1. Its input prompts are fairly simple. For instance, feeding in "thunder and rain" as the text prompt: (demo plays) yes, we support thunder and rain from a text prompt. Second, we support images: (demo plays) this is the sound of a car driving, generated with the image in the top-right corner as the prompt. Below that is video: (demo plays) fireworks as a short video prompt, and the model produces the matching audio. The audio-inpainting demo is not shown here, but all the different modalities are supported.
One problem remains: look at the prompts it accepts. The relations between events are basically all parallel ("A and B"), and the supported prompts are not particularly complex. If we want to generate more complex audio, i.e., support more complex prompts, we need Make-An-Audio 2. It largely follows Make-An-Audio 1, but for data augmentation it does not rely on hand-written rules; instead it uses today's large language models to do the augmentation. Look at this example: first a man speaks, then a dog barks, then a bird chirps; the generated caption is along the lines of "a man speaking, then a dog barking, with birds chirping in the background", which is much more complex than before. In total Make-An-Audio 2 used 3.7K hours of data. I will skip the numbers and play a couple of examples: (demo plays, with captions roughly like "a man, followed by goats bleating, then metal gating, as a duck passes a window" and "a vehicle engine revving, then...") clearly more complex compositions than Make-An-Audio 1 could handle. It strengthens version 1 considerably; the key ingredients are the LLM-based understanding and tricks such as modeling the temporal order in which the events occur.
After that we did something more general: we want the model to be more universal. Not only the text: our speech itself is decoupled into semantic tokens and acoustic tokens, the same idea as before. The semantic tokens capture the semantic information, and the acoustic tokens capture the audio detail. Just as in our TTS and speech work, we decouple the semantic side from the acoustic side; control signals are basically injected through the acoustic condition, while the semantic meaning is kept fixed: a clean decoupled scheme (a sketch of this two-stage token pipeline follows).
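As a rough illustration of such a "semantic then acoustic" token pipeline (hypothetical modules; real systems use trained language models and neural codecs, and the `generate`/`rollout` style APIs here are assumptions):

```python
import torch
import torch.nn as nn

class TwoStageVoiceGen(nn.Module):
    def __init__(self, sem_lm: nn.Module, ac_lm: nn.Module, codec_decoder: nn.Module):
        super().__init__()
        self.sem_lm = sem_lm                # text/speech -> semantic tokens
        self.ac_lm = ac_lm                  # semantic tokens (+ style prompt) -> acoustic tokens
        self.codec_decoder = codec_decoder  # acoustic tokens -> waveform

    @torch.no_grad()
    def generate(self, text_tokens, style_prompt_tokens):
        sem = self.sem_lm.generate(text_tokens)                    # stage 1: what to say
        ac = self.ac_lm.generate(sem, prompt=style_prompt_tokens)  # stage 2: how it sounds
        return self.codec_decoder(ac)                              # decode to audio
```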
Let's listen: (demo) "With a start, I remembered how I lived alone, writing bad poems and eating out." "The head of the patchwork girl was the most curious part of her." It supports different tasks: zero-shot text-to-speech, zero-shot voice conversion, and zero-shot singing voice synthesis. Singing first: (demo) "梦也不自由……" with the input prompt followed by the generated continuation. Then zero-shot voice conversion: (source) "Nothing is more lugubrious than the contemplation, thus, in its nudity, in the broad light of thought, of the horrible swarming of slang." (converted) "We wouldn't engineer alone." That is the Make-A-Voice version.
Finally, let me introduce the follow-up work, AudioGPT. AudioGPT integrates our previous works so it can support many different tasks. As you can see, we implemented audio-to-text, audio-to-audio, audio-to-event, audio-to-video, as well as text-to-audio, image-to-audio, and music-score-to-audio, all in one capability. Let me play an example: (demo plays: the assistant generates a piece of music, generates audio, and writes a caption about an audio clip: "I'm happy to help you, here we go...") What you just saw is the leap from a text-to-speech model to a far more generalized, universal model: in a speech-based dialogue setting it can carry out many different tasks. The work is on GitHub, and you can also try it on Hugging Face, where demos are provided. What comes next, not yet released, is the follow-up version: for encoding we use our own discretization codec framework together with a unified language model, a series of upgrades over version 1.
To wrap up: today I shared our recent series of work on audio synthesis, three lines of work solving three problems. The first line is speech, the human voice: parallel inference, one-to-many mapping, lightweight models, polyphone handling, and language-model-based synthesis. Beyond that we did singing: expressive singing work and diffusion-based applications like Make-An-Audio. The singing line covers different kinds of expressiveness, and the audio line covers different open-domain tasks. Thank you, thank you all.
I am very glad to have had this chance to share our recent work at the BAAI Conference; thanks to all the teachers and students. (Host) Thank you very much, Professor Zhao Zhou, for the wonderful report. We have time for one question from the floor; further discussion can continue at the panel later. (Audience) Thank you, Professor Zhao, for the work; I have been following your models and datasets. My question is simple: in AudioGPT 2.0, you encode with a VQ codec, and I wonder whether the VQ loses some information. And earlier, when presenting Mega-TTS, you...
(Audience, cont.) ...you found that tokens are suitable specifically for modeling the prosody of speech. (Zhao Zhou) Thank you, good question. One current practice is to convert speech directly into tokens, but we found a problem: speech is audio, a different modality from text. Text carries only semantic information, so converting it into tokens is a very reasonable process. Audio is more complex than text: it has semantic information, and it has acoustic information, such as prosody, duration, energy, pitch, and other attributes. Drawing on our TTS experience, we ran many experiments and found that converting speech directly into tokens and modeling them with a language model does not work that well in the audio and speech setting, because not all of the attributes are suited to discretization: you lose things such as audio quality. So for tokenization we took a second strategy: we first decouple speech into a series of attributes. Currently we tokenize prosody and pitch with our discrete representation, while attributes such as duration are still predicted with the duration predictor from our earlier non-autoregressive TTS framework. So in our work we first decouple the different attributes; duration keeps a duration predictor, pitch and prosody go through tokens, and finally everything is fused into one model. We do not simply take existing work and directly decouple and discretize everything, because the resulting quality, the audio fidelity in particular, is not good. (Audience) Great, thank you.
(Host) Once again, let's thank Professor Zhao Zhou for the wonderful report.
(Host) Our next talk is from Liu Guang, a researcher at the NLP and Multimodal Research Center of the Beijing Academy of Artificial Intelligence (BAAI). Dr. Liu is a core contributor to FlagAI, and his main research directions are large language models and multimodal text-to-image generation. Today's report is also very well known: he will present a low-resource, multilingual text-to-image model, AltDiffusion, a very nice multilingual text-to-image generation model. Let's welcome Dr. Liu Guang.
(Liu Guang) Thank you all for coming to this session. My topic is an improvement to text-to-image models under a low-resource, multilingual setting. Everyone is familiar with text-to-image by now, but the field has moved extremely fast: just over a year ago most people had barely heard of it. After OpenAI released DALL-E 2 last year, progress became very rapid, but OpenAI did not open-source the code or the model, so many companies that followed this work, such as Baidu, Google, and Midjourney, stayed closed as well. Their quality is excellent, and the community interaction around them kept pushing quality up, but there was no open code. Against that backdrop, Stable Diffusion arrived: it released all of its weights and code, with stunning results, and from last September onward new models have kept coming continuously. This chart shows its star growth on GitHub (some projects, like OpenAI's ChatGPT, grow even faster). Many open-source community projects have since built improvements on Stable Diffusion, and people use its derivative products and models for all kinds of interesting applications. Our work, too, builds on Stable Diffusion.
So let me first quickly recap what Stable Diffusion consists of. It has three main components. The first, at the top right, is a frozen CLIP text encoder. As Professor Zhao Zhou also mentioned, CLIP is a fairly powerful text-image matching model; it provides a text-to-image-aligned representation, encoding the text into an embedding in a latent space. The second component is the UNet that performs the denoising conditioned on the text, i.e., it gradually restores a white-noise sample into an image; the CLIP embedding acts as the condition that tells the denoiser which direction to move in so the output matches the text input. The third component is the autoencoder, which compresses an image into the latent space and decodes a latent back into an image. Those are the three main parts (a schematic sketch follows).
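A minimal pseudo-pipeline of the three components just described (illustrative only; the module names and the `scheduler_step` function are placeholders, not the actual Stable Diffusion API):

```python
import torch

# 1) frozen text encoder  2) UNet denoiser  3) VAE autoencoder

@torch.no_grad()
def generate(text_encoder, unet, vae, scheduler_step, prompt_ids, timesteps, guidance=7.5):
    cond = text_encoder(prompt_ids)                 # text -> conditioning embedding
    uncond = text_encoder(torch.zeros_like(prompt_ids))
    z = torch.randn(1, 4, 64, 64)                   # start from latent white noise
    for t in reversed(timesteps):                   # UNet denoises step by step
        eps_c = unet(z, t, cond)
        eps_u = unet(z, t, uncond)
        eps = eps_u + guidance * (eps_c - eps_u)    # classifier-free guidance
        z = scheduler_step(z, eps, t)               # scheduler update (placeholder)
    return vae.decode(z)                            # latent -> image
```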
We have followed text-to-image for a long time (the field itself is young), and in our view there are three main problems. First, the lack of high-quality datasets. You might object: LAION has a 2B English dataset (2B being two billion pairs) and a 5B multilingual dataset, a huge amount of data. But open datasets are uneven in quality: you cannot necessarily train a high-quality text-to-image model directly on everything that is released. The language distribution is also extremely unbalanced: the vast majority of image-text data is English, and languages such as Chinese have far less. And paired data is hard to acquire: there is no good channel for obtaining high-quality artwork, cover designs, posters, and similar data. So high-quality data is a principal challenge for text-to-image research. Second, controllable generation. Methods such as DreamBooth and LoRA do fast learning of new concepts, styles, and characters, but the quality still needs improvement, or needs extra components or other models combined in to work well; and at generation time the precision of control is still limited. ControlNet and related work already control things reasonably well, but for backgrounds, or for the video-level, fully controllable precision we would like, there is still a gap. And then there is complex editing: when we want to apply a whole series of operations at once, how do we keep the generation controllable? That is the second challenge. Third, evaluation. Many text-to-image models look quite good, but how do we evaluate them? This is a hard problem common to generative models: the automatic metrics correlate poorly with human subjective judgment, while human evaluation is expensive, so there is no unified definition of an evaluation standard. Among these three difficulties, we mainly attacked the first one, high-quality data, and I will start with some analysis.
On the left (the black chart) you can see the Chinese open image-text pair datasets, such as Noah-Wukong and Zero, with their sizes, against COCO and the 2B-pair LAION set below: there is a clear gap in scale and quality. Within LAION-5B there is a 2B multilingual subset, and its language distribution, as you can see, is very unbalanced: some languages have a lot of data, others very little. So if we want to train a text-to-image model for a language other than English, the data is simply insufficient. How do we solve that? Recall the two main components: the CLIP text encoder and the UNet denoising module. To train a multilingual version of Stable Diffusion, we first trained a multilingual version of CLIP, distilling the CLIP model through what you could call teacher learning.
A model that normally needs huge numbers of image-text pairs to train, we matched while using essentially no image-text pairs at all, only a parallel corpus for the distillation: see the results at the bottom right. The method is very simple, and honestly the effect was better than we expected: distilling English and Chinese separately this way works very well. It sounds trivial, and several similar approaches had been tried before, but they had a flaw: while gaining Chinese ability, their English ability dropped drastically, degenerating into a pure single-language (e.g., Chinese) model. With our method, on English we stay very close to the original CLIP's performance, while on Chinese and other languages we reach a state-of-the-art level. Simple as it looks, the paper was accepted to ACL Findings. And this gave us the multilingual CLIP we needed for a multilingual text-to-image model (a sketch of the distillation setup follows below).
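A minimal sketch of parallel-corpus distillation for a multilingual text encoder (hypothetical names; the real recipe has more stages and data, but the core signal is pulling the student's embedding of a translation toward the teacher's embedding of the English sentence):

```python
import torch
import torch.nn as nn

# teacher: frozen CLIP text encoder (English)
# student: multilingual encoder (e.g., XLM-R-based) being trained
# proj:    linear layer mapping student space into the teacher space

def distill_step(teacher, student, proj, en_ids, zh_ids, optimizer):
    with torch.no_grad():
        target = teacher(en_ids)                      # teacher embedding of English text
    pred = proj(student(zh_ids))                      # student embedding of translation
    loss = nn.functional.mse_loss(pred, target)       # pull the two together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```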
So we built the multilingual CLIP, and then plugged it into the original diffusion model as an extension, turning the original Stable Diffusion 2.1 into a text-to-image model supporting 18 languages. In the process we found an interesting phenomenon: take the same sentence, translate it into different languages, and feed them into the diffusion model; the generated images are similar, yet each carries something like cultural background information. For example, prompting "a little boy" in an Asian language, Chinese, Korean, or Japanese, tends to produce a boy with an Asian face; prompting in a European, Western, or even Arabic language can produce faces with those ethnic characteristics. This is very interesting: we had not expected it to carry this kind of cultural flavor, but we can observe it, and we want to dig out more of this kind of information later. The point is this: if we have a very high-quality English text-to-image model, is that already enough? Can the cultural information of other languages be fully expressed through English? We believe a lot of information, semantic or cultural, is bound to the language itself: express it in English, or any other language, and you lose the original cultural meaning. Think of Beijing's siheyuan courtyard houses, or the famous Beijing snack jiaoquan: translated into English, they are simply no longer what they originally mean. So, next, let me describe what we did once we had AltDiffusion-M18.
Our analysis shows that the model understands and renders culture and language quite well, so we used prompts to activate its Chinese-specific knowledge; we will see examples in a moment. On the other hand, we wanted to plug into the open-source ecosystem: AltDiffusion-M18 connects to ControlNet and LoRA, and in the cases we collected the integration is completely seamless. We also attempted the high-precision controllable editing mentioned earlier (that is a separate research project). You can see that just by prompting in Chinese, with a qualifier like "中国画" (Chinese painting), the model outputs many images consistent with Chinese culture: Chinese tigers, lotus flowers, sunrise over the sea. Such generation does require some choice of parameters and prompts, but working from Chinese directly makes that effort much smaller. On top of this base model one can also continue pretraining on Chinese datasets, and the quality improves further. The inpainting capability is there as well: the "Girl with a Pearl Earring" can be changed into different styles and figures; and in this case the input is actually mixed Chinese and English: masking out the glass of water, we can perform all kinds of inpainting generation. We think these results are quite striking, which is why we show them. The model has already been integrated into the open-source community's web UI, AUTOMATIC1111's stable-diffusion-webui, so it can be called directly and connects seamlessly with all the open tools built on Stable Diffusion. Before this we released two earlier versions, a bilingual one and a nine-language one; the nine-language version connects seamlessly with ControlNet: for example, we can extract feature information from an image and generate with high-precision control on top of it, all using today's open tools. Here is another case: from the picture at the top left we extract its depth map, feed the depth map into our M9 model together with the ControlNet framework.
By just switching in these extra parameters we get depth-map-conditioned generation, which is quite fun. Of course, for controllable or personalized generation, LoRA is the most widely used approach these days. We ran a demo set: feed in roughly seven or eight images in the style shown at the top left, and after our training the model generates new images in the same style. We also did something similar to what Professor Zhang described earlier: we connected BAAI's newly open-sourced large language model with our diffusion model, so you can drive image generation with text input, and we attached a module for multi-step controllable editing. Let me show the effect: it does two things. First, chain-of-thought decomposition: given a complex multi-step instruction, say "make his skin fairer, make his eyes blue, and turn the image into anime style", three operations fed into the language model at once, the language model first decomposes the instruction into multiple atomic instructions. Then, for each one, it calls our instruction-tuned controllable image-editing model, which preserves all the detail information to a large degree while making a high-precision modification to exactly the region the text describes. These are our current attempts; the system is still under development (a sketch of the decomposition loop follows).
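An illustrative loop for this multi-step instruction editing (hypothetical APIs; the prompt format and the `editor` call, e.g. an InstructPix2Pix-style model, are assumptions rather than the actual system):

```python
def multi_step_edit(llm, editor, image, instruction):
    # Ask the LLM to split a compound instruction such as
    # "make his skin fairer, eyes blue, anime style" into atomic edits.
    plan = llm(f"Decompose into atomic image edits, one per line:\n{instruction}")
    for step in [s.strip() for s in plan.splitlines() if s.strip()]:
        image = editor(image=image, prompt=step)   # apply one edit at a time
    return image
```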
On evaluation: we proposed an evaluation system and metrics for image-text multimodal models and for text-to-image. By that evaluation our models are very strong across the board: state of the art across the different languages, and state of the art on the multilingual setting overall. All of the models, tools, training pipelines, and fine-tuning pipelines I mentioned are open-sourced on the FlagAI platform; you can scan the QR code to join, and I invite everyone to take part in this development community. Thank you.
(Host) Thank you, Dr. Liu Guang, for the excellent report. We have time for perhaps one question; you are welcome to talk with Dr. Liu offline afterwards. There is a question over there; could someone pass the microphone? (Audience) Thank you, Professor Liu, for the sharing. We have actually been following this work too, and in the process we found a problem: with the Chinese-English parallel corpus, both the CLIP ability and the generation ability align well with the English ability of Stable Diffusion, but it is then hard to push the Chinese understanding any further, for example the culture-specific concepts like the jiaoquan you mentioned. When we brought in our own higher-quality internal data, we found it quickly breaks the Chinese-English alignment. Is there an efficient way to keep the English ability at the level of the best foreign models while further improving the Chinese ability? Thank you.
(Liu Guang) Thanks; we have actually thought about exactly this question. When extending from bilingual to multilingual, even to eighteen languages, we hit the same issue whenever we tried to improve further: whether for the CLIP model or the diffusion model, the data imbalance, or mixing different languages together for contrastive learning or model training, produces this problem. We do not have a particularly good solution yet, but we suspect the answer lies in the learning strategy: for example, mixing data from different languages together for continued training. It is similar to language models: if you keep training on only one task, the model degrades elsewhere. If we regard the diffusion model or the CLIP model as a base model, then training it on only one language will certainly damage its cross-language alignment; training on parallel data from multiple languages together may alleviate the problem. (Audience) A follow-up: you are saying this data-mixing-and-expanding approach may be a correct direction, but doesn't it gradually become a rather inefficient one? (Liu Guang) That is a sharp question. Indeed: what we provide now is a fast way to reach a reasonably good baseline stage; how to keep improving efficiently beyond that, I think, needs further exploration and research, and we welcome discussion. (Audience) Good, thank you.
(Host) Let's thank Dr. Liu Guang again for the very good report, and thanks for that question. Our next talk is from Professor Zhou Bolei, an assistant professor at UCLA. His research lies at the intersection of computer vision, machine autonomy, and interpretable AI, and he is also very interested in the human-centric properties of current AI models that go beyond their accuracy, such as interpretability, controllability, generalization, and safety. I am sure everyone knows Professor Zhou well; he has many very famous works, including CAM and Network Dissection. Without further ado, the title of his report is "Controllable and Interactive Large-Scale Scene Generation from Bird's-Eye-View Layouts".
(Host) Let's welcome Professor Zhou. (Zhou Bolei) OK, thank you for the introduction, and thanks to the BAAI Conference for the invitation. Hello everyone, I am Zhou Bolei, now an assistant professor at UCLA; before coming to the US I was an assistant professor at the Chinese University of Hong Kong for three years. I have worked on a range of different topics; during my CUHK years I did a lot on generative models, mainly interpretability analysis of GAN models and controllable image generation, and over the last two years my work has been moving gradually toward decision making. Today I want to share our research on scene generation and scene simulation: connecting generative models with machine decision making.
I hope this offers some new ideas. Let me start from scene generation. We focus on conditional scene generation, where we give the model an input. Say we want to generate street-view images. Earlier methods like pix2pixHD take a semantic map as input, where each color denotes a class (this region is car, this is tree), and generate the street view; and now there are large models, DALL-E 2 and the whole family of text-to-image systems, where you give a prompt and get the corresponding image. Those results are very good, but problems remain. Suppose we want to edit an image interactively, for example add a car to the street scene. One option is to paint a car mask directly onto the input semantic map. The problem is that cars do not live only in image space: in the real world they sit in 3D, in exactly the space a bird's-eye view captures, so an edit made purely on the image cannot express well how far the car is from the camera. A more direct representation here is the bird's-eye view (BEV): it encodes the objects' positions and spatial structure, and compared with LiDAR data, pure 3D point clouds, it is a very compact representation: still 2D, but viewed from above. That makes editing convenient: if we want a car at a certain position, we place it directly in the BEV, and the model can then synthesize that car into the image. Another interesting property of the BEV is that it supports scene simulation: in the video playing here, the cars interact with one another. Many autonomous-driving datasets record the vehicles' trajectories, so we can replay and reconstruct the dynamic motion of the scene. So BEV is a representation with very good properties, and my talk has two parts. First, scene generation using BEV: given a bird's-eye-view layout as input, generate first-person driving images. Second, scene simulation based on BEV: once the images are generated you still cannot truly interact with them, so we want to bring the scenes into a simulator, adding physical and collision representations so that the whole scene can really move, and connect to downstream tasks such as autonomous driving.
First, some background on BEV perception. BEV perception is not a new problem, but it is one people have gradually noticed in recent years, and most of the work concentrates on the perception, i.e., detection, task: given, say, six camera views from a car, predict a top-down semantic map marking where the vehicles are; with that BEV result you can then do planning, e.g., path planning, computing the trajectory the car should follow. So BEV perception has been a relatively hot research topic in these last couple of years, important both for object detection and for intelligent driving; there is a survey paper covering this BEV perception work. Our group actually worked on this very early: the View Parsing Network (VPN), a project I supervised with an intern, Pan Bowen, when I was finishing up at MIT. The story is interesting: the work was completed in 2018 but only published in 2020, after two years and five or six rejections from the computer-vision conferences, finally appearing at a robotics venue. So VPN is among the one or two earliest BEV works; I cannot claim it was the very first, but it is at least in the top three. We felt the task was clearly useful: for a robot, turning a first-person view into a top-down view enables many applications. So if a piece of work is genuinely good, it can still become a good work despite rejections: nowadays papers on BEV perception routinely cite it. Anyway, that is an aside; back to the BEV task we want to do here.
Previous work maps input images to a BEV; we want to do the inverse problem, BEV generation: the input is a bird's-eye-view layout and the output is the images. I supervised a UCLA undergraduate, Alex, on the first exploration of this problem; when we surveyed the area, nobody seemed to have realized it was an interesting problem, so we posed it for the first time. Given a bird's-eye view, we generate first-person images from different viewpoints: on the right there are three cameras, since an autonomous car actually carries a ring of cameras, e.g., three facing forward and three facing backward, for surround perception; here we generate the three front views. The technical difficulties are, first, how to inject the BEV as conditioning into the image generation, and second, cross-view consistency: the left view and the front view overlap, and since the three images are decoded separately, we want the overlapping content, for example the same vehicle, to stay consistent across them. We modeled this with a fairly standard encoder-decoder structure using VQ-VAE-style discrete latents, learning the BEV encoding and the image generation separately and then tying the two together. The model is in fact analogous to DALL-E-style text-to-image generation, except that what the encoder encodes is not text: it is the BEV, turned into feature vectors and then decoded back into images. One special design: we want the attention between different views to respect their spatial relations, so in the positional encoding we inject the cross-view feature correspondence into the self-attention, which helps ensure the consistency (a sketch below).
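As a rough illustration of a camera-aware attention bias for cross-view consistency (this is an assumption-laden sketch of the general idea, not the paper's actual design; the cosine bias is made up for illustration):

```python
import torch
import torch.nn as nn

def cross_view_bias(yaws_q, yaws_k):
    # yaws_*: (N,) camera yaw angle of each token's source view, in radians.
    # Tokens from views looking in similar directions get a higher bias.
    return torch.cos(yaws_q[:, None] - yaws_k[None, :])

class BiasedAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.scale = d ** -0.5

    def forward(self, x, yaws):
        # x: (N, d) tokens pooled from all camera views.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        att = (q @ k.transpose(-2, -1)) * self.scale
        att = att + cross_view_bias(yaws, yaws)   # spatial prior across views
        return att.softmax(-1) @ v
```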
Here are some results: on the left is the top-down BEV input, on the right the images generated for the six viewpoints; the quality is fairly good. The decoder here is VQ-VAE-based; our university lab did not have the resources to train diffusion models, so we used the VQ-VAE decoder directly, which is why the image quality still has artifacts: swapping in a better decoder should improve the images further. But that is not the point of this work; the point is to pose the problem for the first time and establish a baseline others can build on. Here is another result: look at the BEV on the left, and notice the building that appears on the right side of the front view also appears in the right view, and again in the rear-right view: the method is quite good at keeping the results decoded from different views consistent. And here is a video: the top left is the BEV input; the upper row shows our three generated views; the lower row is the ground truth, the actual camera frames. The correspondence is quite good. Note that the BEV is an abstract representation: it only tells you there is a car at a position, not what the car looks like, so there is inherent randomness; and since we did not add any temporal-consistency constraint, the appearance changes from frame to frame. Still, it matches the real footage below: cars where there should be cars, buildings where there are buildings, sometimes restyled, with some trees added, because the BEV does not segment the background that finely. So the generated results are quite fun: the scene gets reconstructed fairly well. We also built a playful application: since the BEV is a compact representation, we can produce BEV layouts with a simulator. On the left, the BEV comes from MetaDrive, a driving simulator we develop ourselves that I will introduce later; we feed the simulator's BEV into our model and use it to render the first-person images. You can imagine that once temporal consistency is added, trajectories laid out in the BEV could directly drive generated scene videos. That is the BEVGen work.
A potential issue: that work does pure image generation, and the generated first-person images are hard to use directly in a simulator or in a real driving-simulation task. We actually want more 3D information, because the more 3D information we have, the better we can bring the scene into the simulator. So the next work I will describe goes from 2D images toward 3D-aware image generation: not a complete 3D representation, more like 2.5D, but enough to control the properties of the cars much better. This is our CVPR paper from this year, done by my student Yinghao Xu during an internship at NVIDIA, called DisCoScene. What we try there is to combine generative models with NeRF, the neural radiance field: NeRF was originally designed for reconstruction and has no generative ability, but the NeRF model itself carries a lot of 3D information, so we want to fuse the two. The pipeline we want is this: starting from a 2D bird's-eye-view layout, build a 3D structure of the scene, and then do neural rendering from that 3D structure to render the scene out.
So we proposed this model. Its biggest difference from NeRF is that it is a generative NeRF: you can sample from it to generate new scenes. We therefore add the input as well: the input is again a BEV-like abstraction, a 3D layout of object boxes marking where each object sits, which goes into the object generator; the background is handled separately by its own generator, so foreground and background are combined. We then use the volume rendering that is standard in NeRF to render the scene and sample the images, and we also bring in GAN machinery: a discriminator against real images, plus a separate discriminator dedicated to the foreground, so that the foreground image quality is better as well. In effect we married a GAN-style generative model with a NeRF model, with both local and global discrimination pushing the realism of the images further up. As a result, by changing the input layout we can correspondingly edit the generated image: move a car's position, change its orientation, and the neural network renders the edited view (a compositional sketch follows).
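A sketch of layout-conditioned compositional volume rendering in the spirit just described (illustrative pseudo-structure, not the paper's exact implementation; `box` and `field` expose assumed placeholder APIs):

```python
import torch

# Each object box owns a small radiance field; a background field fills the rest.

def render_pixel(ray_o, ray_d, boxes, obj_fields, bg_field, n_samples=64):
    ts = torch.linspace(0.1, 20.0, n_samples)          # sample depths along the ray
    pts = ray_o + ts[:, None] * ray_d                  # (n_samples, 3) world points
    sigma = bg_field.density(pts)
    color = bg_field.color(pts)
    for box, field in zip(boxes, obj_fields):
        local = box.world_to_local(pts)                # transform into the box frame
        inside = box.contains(local)
        sigma = torch.where(inside, sigma + field.density(local), sigma)
        color = torch.where(inside[:, None], field.color(local), color)
    # Standard volume-rendering compositing:
    delta = ts[1] - ts[0]
    alpha = 1 - torch.exp(-sigma * delta)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1 - alpha[:-1]]), dim=0)
    weights = alpha * trans
    return (weights[:, None] * color).sum(0)           # final pixel color
```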
We compared with previous methods on CLEVR, 3D-FRONT, and the Waymo dataset, choosing VolumeGAN, EG3D, and GIRAFFE as baselines, and we obtain the best results in all of these scenes: currently the best results in this 3D-aware generation sub-area. And because the generative model is fused with NeRF, the useful properties NeRF carries can all be brought in. You can see here that, since NeRF lets you adjust the camera's viewing direction, we can adjust the scene's 3D structure and, because the images come out of volume rendering, freely change the structure of the scene. We can also edit objects explicitly: to rearrange the furniture, we drag its box in the bird's-eye layout, and the image renders the object at the corresponding position; both translation and rotation achieve the expected edits. We can add and remove objects: delete an unwanted car from the layout input at the front, and the car at the corresponding position is removed. If you instead tried such edits on 2D images, for example with a diffusion model's inpainting, its understanding of 3D is not that good and the results are not ideal: 3D editing tasks really should be carried out under a 3D representation. We can also add objects: in this scene we want more cars, so we place a car into its position in 3D space in the front-end input, and the model generates the image in 2D space: a good link established between 2D and 3D. Restyling works too: changing a car's appearance, its color, or its shape and structure. And here are comparisons with earlier methods: against EG3D our camera control is better, and against GIRAFFE we achieve better disentangled control, because in our setting we model the different objects much more explicitly and operate in the BEV space, which yields very intuitive, well-behaved control. We can also edit real scenes: use an encoder to invert a given image into the latent space, then apply the corresponding edits. Next, we wanted to keep enlarging the space.
The previous work feeds the model the BEV of a single location and can only generate what amounts to one scene image. But in real life scenes are enormous, like this map, and we want to generate very large scene structures. Hardly anyone has explored this problem, so we wanted to extend the generative model to generation at very large scale. Our first thought was: isn't large-scene generation essentially the same concept as video generation? On the right I am showing driving datasets we downloaded: the footage moves through a very large spatial structure, so we perceive that structure; perhaps we can turn the problem into a video-generation problem. We made an initial modeling attempt, our ICLR paper this year on image and video generation: we modified a StyleGAN-V-style architecture, adding alias-free design elements, pretraining tricks for the decoder, and changes to the discriminator, obtaining generation results that look visually acceptable. If you change perspective, a video generated with enough continuity is like moving a camera through 3D space, implicitly reconstructing the scene. But the results were still not ideal: we compared with previous methods, and although we beat them on image-level FID, the model was not truly representing the spatial structure. So we switched strategy again.
If we can generate the entire large scene, then we can place a camera inside it and move it around, which gives us the video of that motion for free: so the problem switches back to large-scene generation. In our most recent work we aim at unbounded scene generation: at training time the model can be given relatively small scenes, a BEV in and an image out, and at test time we expand the space. If the model behaves like a translation-equivariant environment structure, then in principle, feeding it a larger BEV should let it generate the scene at every location. We made an initial attempt at this problem (I will skip the technical details; the paper is not yet on arXiv). The goal we reach is: given a varying bird's-eye view, generate locally, and then stitch the local generation results into a result covering a much larger space. Here is an initial result on the CLEVR dataset: on the left, the BEV fed in at the current location; in the second row, the local generation result; and at the far right, the whole BEV swept across and integrated: a much larger-scale reconstruction of the space. We can also achieve editing: feed in the BEV, and if we want to push the block in front forward, then in the image that block can be pushed forward; change the style of the block at a particular position from a cube to a sphere or a cone, and the edit works; drop more objects into the scene, and the model generates them all. And here is an interactive interface we built, because the BEV is a very intuitive way to edit: the user clicks an object, drags it to the desired position in the space, and on the right the generative model renders the result at the corresponding camera position, placing the corresponding object there. It is as if we used a neural model to learn a kind of simulator, in which these objects can be edited interactively. OK, that was the first part. Now the second part.
Let me go through, more quickly, the scene simulation we built on top of BEV. This is now a fairly major direction of my lab: integrating machine decision making with visual perception. We find that BEV is a very compact representation, and many driving datasets represent their traffic scenes exactly as BEVs. Here are traffic driving scenes imported from the Waymo dataset: each car has its corresponding trajectory, captured by Waymo's data-collection vehicles, which record the trajectories of the surrounding cars; so we can import such real data into a simulator and, through replay, reconstruct the scene. To better support this kind of research, combining machine decision making with machine perception, my lab has been continuously developing a simulator, a driving simulator called MetaDrive. Compared with previous simulators, one of its strengths is efficiency: on a single PC we can reach a training throughput of about 500 frames per second. We also ensure its scenes can be imported from real datasets, and we can learn generative models to produce new scenes. MetaDrive is already open source; students who are interested can take a look. With MetaDrive we can import real data: here we imported a nuScenes driving log. The top left is the actual RGB scene, which we reconstruct directly in our simulator; the bottom right is the BEV at the actual spatial location; and the two images at the bottom are the depth map and the point cloud our simulator renders. The simulator also has quite good scene-generation capability.
Our scene generation here is different from image-based scene generation: we first define a vocabulary of traffic intersections and other basic road structures, and then use a procedural generation method, sampling from these predefined structures and snapping them together like Lego bricks, then converting the result into an environment you can interact with. That avoids using a neural network to generate everything directly (a sketch below).
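A toy sketch of that block-sampling style of procedural map generation (the block types and compatibility rule are made up for illustration, not MetaDrive's actual block library):

```python
import random

BLOCKS = ["straight", "curve", "intersection", "roundabout", "ramp"]

def compatible(prev, nxt):
    # Made-up rule: no two intersections back to back.
    return not (prev == "intersection" and nxt == "intersection")

def sample_map(n_blocks, seed=0):
    rng = random.Random(seed)
    road = [rng.choice(BLOCKS)]
    while len(road) < n_blocks:
        cand = rng.choice(BLOCKS)
        if compatible(road[-1], cand):   # snap blocks together like Lego
            road.append(cand)
    return road

print(sample_map(8))   # e.g. ['curve', 'straight', 'intersection', ...]
```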
We also simulate and reconstruct the scenes fairly realistically: traffic lights and pedestrians, for example, can be placed in. Here is a result imported from Waymo data: four different traffic scenes, with different vehicles and different people walking, all driven by the physics simulator. You can imagine that the next step is to combine this physical simulation with the real image generation from before, further improving the realism of the 3D assets inside. And here are results and scenes importing actual driving data: we can import HD maps together with their corresponding trajectories, which effectively builds something like a digital twin. Here we synchronize the simulator with nuScenes: on the left the nuScenes data, on the right our reconstruction using its BEV as the intermediate medium, placing each car at its corresponding position: although we have not yet synchronized the cars' shapes or their visual appearance from the left side. Some ongoing work aims to take the vehicles from the real footage of actual driving data into the physics engine, which will make our simulation more realistic still.
Based on the imported data, we also trained some generative models, here for generating trajectories. One drawback of the imported trajectories is that they are very fragmented: because they were collected in the real world, there are many short trajectory segments. So we design a generative model; once it is trained, we can sample from it to generate diverse traffic: traffic structure and traffic flow. That is TrafficGen, our paper at ICRA this year, the robotics conference. It is also a generative model, but unlike image generation it directly generates trajectories, so we split the generation into two stages. Stage one: feed in an HD map and first let the model learn how to place vehicles, scattering the cars onto the bird's-eye view. Stage two: once the cars are placed, for each vehicle we generate its future trajectory. That yields the actual trajectory of every car in the scene. The video here is playing the vehicle-placement process: the heat map marks the most probable position for the next placement. After placement we roll out the future trajectories, which lets us simulate the scene, and we can further feed the BEV into MetaDrive for interactive physical simulation (a sketch of the two-stage loop follows).
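A sketch of the two-stage traffic-generation loop in the spirit just described (`placement_model` and `trajectory_model` are hypothetical trained modules with assumed `sample`/`rollout` interfaces):

```python
def generate_traffic(hd_map, placement_model, trajectory_model, n_cars, horizon=50):
    scene = {"map": hd_map, "cars": []}
    # Stage 1: autoregressively place vehicles on the map.
    for _ in range(n_cars):
        pose = placement_model.sample(scene)     # position, heading, speed
        scene["cars"].append({"pose": pose})
    # Stage 2: roll out a future trajectory for each placed vehicle.
    for car in scene["cars"]:
        car["trajectory"] = trajectory_model.rollout(scene, car, horizon)
    return scene   # ready to replay in a simulator such as MetaDrive
```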
Here are some result plots: on different road sections, the different traffic trajectories generated; and at the same location, perturbing the model generates different traffic structures and traffic flows. We also apply it to some existing editing tasks. For inpainting: our inpainting here means extending the fragmented, broken trajectories. And we can augment existing scenes: if the original traffic scene does not have many cars, we can add more cars to make the scene more complex, which lets us stress-test our driving systems better. We also ran experiments showing that the generated data is actually useful: we used the generated data to train a reinforcement-learning driving model. First, two baselines: if we train on the procedurally generated scenes alone, there is a large gap to the real data, so the success rate drops considerably; if instead we train on scenes produced by the TrafficGen generative model, we obtain results close to training on real scene data. A further benefit is that we can freely edit the generated data: for example, raise the traffic density in each scene to create more challenging scenarios, which further improves the reinforcement-learning agent's safety: the bottom right shows the safety improving further. The TrafficGen model is also open-sourced in the MetaDriverse repository: students and teachers who are interested, please take a look. We also built some plugins: the traffic scenes TrafficGen produces can be exported into other simulators, such as CARLA or even GTA, the Grand Theft Auto game, helping those simulators extend their scenarios. Here is a result importing our TrafficGen output into CARLA; you can see a variety of different vehicle behaviors.
OK, a brief summary of today's report: I shared our work on generating scenes from bird's-eye views and on simulating scenes, and our next step is to fuse the two more tightly. If you are interested, you can follow this research direction, which I have named MetaDriverse: MetaDrive combined with the metaverse: where we hope to do more interesting research. Thanks to my students and my collaborators, and thank you all. (Host) Let's thank Professor Zhou for this very systematic and pioneering work. Given the time, we have one question. (Audience) Hello, Professor Zhou, thank you for this great work. My question: autonomous driving today suffers from corner cases, and your BEV-to-image generation could help fill that gap. Have you made any attempts at improving the robustness of autonomous driving this way? If so, could you describe the results? Thank you. (Zhou Bolei) Thanks for the question. Just yesterday we submitted a paper doing exactly this kind of corner case, what we call safety-critical scenario generation. We model it in an adversarial-generation fashion: we want to generate harder scenes that make the driving model fail. The work should be on arXiv very soon, perhaps in two or three weeks; you are welcome to follow it. (Audience) Thank you. (Host) Let's thank Professor Zhou Bolei again for the wonderful report, and please do follow MetaDriverse: sorry if I am not pronouncing it exactly right, a very inventive coinage. Thank you, Professor Zhou.
(Host) Our next report is from Wu Jiajun, an assistant professor at Stanford University. Jiajun, you don't mind if I introduce you in Chinese, right? (Wu) Sure, go ahead. (Host) Wu Jiajun is an assistant professor in the Computer Science Department at Stanford. His research is on machine perception, reasoning, and interaction with the physical world, drawing inspiration from human cognition. Before joining Stanford he was a visiting researcher at Google Research; he received his PhD from MIT, where his advisors were the greats, Bill Freeman among them, and he also collaborated very closely with Josh Tenenbaum; before that he did his undergraduate studies at Tsinghua University, already working closely with MSRA. So Professor Wu trained with the best and is himself extremely accomplished, with very famous work on 3D perception and on using physics-based priors to interact with the environment. We are honored to have him present his latest work, titled "Understanding the Visual World Through Naturally Supervised Code". Let's welcome Professor Wu. (Wu) Is this working now? Can you hear me? (Host) Yes, yes, that works.
(Wu Jiajun, in Chinese) Great, thank you. Thank you, Professor Li, for the invitation. I was just telling Professor Li: I have tried this before, once at a company in China, giving the talk in Chinese, or in a mix of Chinese and English, and it went badly: I spoke poorly and the audience could not follow. So I suggested to Professor Li that I give the talk itself in English, which may actually be easier, and then for the Q&A and the panel afterwards we can switch back to Chinese. I have the organizers' blessing for this. (Host) No problem at all. (Wu) Thanks again for the invitation; it is a great honor to be here. I am Wu Jiajun, currently an assistant professor at Stanford University.
Okay, so today I am going to talk about understanding the visual world through naturally supervised code. The "visual world" part is easy to understand: we live in this world and use human vision to see all these patterns, geometry, objects, and textures. And code as well: every day we code, using Python or whatever. We all understand what code means, although I hope that by the end of this talk I will have shown that code, or symbols (as when people say "neurosymbolic AI"), or programs, can be interpreted in a broader sense: not just Python or for-loops. We can have a much broader interpretation of what code is, as well as of what natural supervision is. What do we mean by code being naturally supervised? I will give a few examples throughout the talk, and hopefully we will have more clarity on that as well.
Okay. So the question is: how can we leverage the kind of rich structure, symbols, programs, that exist in the natural world, in our visual world, for better perception and scene understanding? To begin with, there is a lot of rich structure in the visual world. If you look at scenes like these corridors or buildings, you realize they are not just pixels, although generative models typically model scenes as pixels: there is actually richer structure. For example, there are planes: the scene is made of planes: there is a floor and walls; there is symmetry: the scene is reflection-symmetric; and there are repetitions: the lights at the top of the corridor repeat themselves, and if you look at the buildings, the windows and floors repeat as well. So the question is whether we can leverage such structural information, beyond and together with pixels, for smart scene understanding and editing. Let me give an example of what that means. Here is a video: we wrapped it in a Photoshop-like GUI, but the underlying algorithm is really ours. First we do interactive segmentation: the user selects the building with one interaction: this is standard, everyone can do it: and we compute a vanishing point in 3D. Then, with just one more interaction, the user can drag to make the building taller, in one single step, or likewise make the building wider. The problem looks simple, but in practice it is not, because you have to understand the scene at multiple levels of abstraction. At the lowest level you have to understand texture: what is the texture, what is the color of the building, so the result looks similar. At the middle level you have to understand 3D geometry: the building is in 3D, and every facade has its surface normal, facing a particular direction, so when you make the building taller each face should still face the same direction. And at the highest level you have to understand repetition: the floors repeat themselves, so if you make the building taller there is no single perfect answer, but if you have to guess, the floors and the windows should just keep repeating. Such higher-level structure and repetition must also be kept in your answer.
So how can we do that? We were very much inspired by earlier work that uses program synthesis for visual data, which starts with very simple images. If you have a sketch-like line drawing with clear patterns, they first use a combination of learning and stochastic search to identify the entry-level primitives in the scene: in this case just lines and rectangles. Think of it as factorizing the image: the raw image is rasterized, 200-by-200 or 300-by-300 pixels, a very high-dimensional space that poses a challenge for program-synthesis algorithms, which usually work in very low dimensions and are hard to scale up. So first you factorize: you turn a JPEG file into, say, an SVG file: you vectorize the image, converting the high-dimensional space into a much lower-dimensional one where the scene is just a collection of primitives, lines and rectangles. Then you can use (arguably learning-based) program-synthesis methods to search for a program that explains this lower-dimensional space, and once you have the program you can do things like extrapolation. This was earlier work, and the methods may feel like old machine learning, but you can imagine replacing all these pieces with GPT-4: at CVPR 2023 you see old ideas being re-implemented and realized with newer tools: fundamentally you can do similar things, extrapolating the patterns to make them larger. This is what they did back in 2018, on 2D line drawings, and a clear limitation, as you may have noticed, is the assumption that you have a library of objects: the world is made of lines and rectangles, so you just find them to vectorize the image and then search for a program. But the world is not made of lines and rectangles. If we want to generalize from sketches to natural images: say a bowl of milk with a lot of cereal: there is clearly some structure (maybe a kid was playing with the cereal and made this triangular shape), but what is an object here? It is a cereal piece, but how do you represent it? It is not as simple as lines and rectangles.
So in particular: is it possible to find a way to identify these entry-level primitive objects without requiring a lot of prior knowledge? We were inspired by classic computer-vision work on internal learning, or single-image learning, led by Michal Irani's group in Israel, who have worked on this topic for one or two decades. Internal learning relies on a key observation: if you look at even a single image, the patches within it are very likely to repeat themselves. Such repetition happens at the same scale (the red boxes here), but it can also happen across different scales (the green boxes). Why would these repetitions exist? Because if you look at a scene like this one with all the grass, you realize the blades are fundamentally the same type of object: they are the same species, the same category, so they have to look similar, being different instances of the same object category. And because the real world is 3D while you are seeing objects in a 2D image, there is perspective projection: the 3D space is perspective-projected into 2D, which is why closer objects appear larger: so this kind of repetition and similarity can happen across scales, in addition to within the same scale.
Now that we have this observation, how can we leverage it? What people have tried is to combine it with deep features. For example, if you have a picture with a repetitive pattern, you send it through a pretrained neural network, say an ImageNet classifier, and take the feature maps, the activation maps. Then, for every possible displacement: shifting the feature map horizontally by x pixels and vertically by y pixels: you compute how correlated the shifted feature maps are with themselves: the self-correlation of the feature maps under each displacement. Then you take the argmax, finding the x and y that maximize this self-correlation. What do these x and y mean? They most likely indicate the gap between two neighboring repeating objects: if you shift the picture by a certain number of pixels and the feature map aligns with itself, the displaced objects must look very similar, so (x, y) is very likely the distance between two repeating objects. The reason to use feature maps instead of raw RGB pixels is that recognition networks are trained to be invariant to the nuisances of natural images: imaging noise, occlusions, lighting changes, and so on: so the measure is more robust. We take a version of this idea with some changes, which requires no training on a particular dataset: only an off-the-shelf pretrained network: so you can test on a single image and, from that single image, identify the centroids of the repeating objects (a sketch below).
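A minimal sketch of finding the repetition period via feature-map self-correlation (illustrative; the actual method adds normalization and robustness tricks, and a brute-force scan is used here purely for clarity):

```python
import torch
import torch.nn.functional as F

def repetition_displacement(feat, max_shift=32):
    # feat: (C, H, W) feature map from a pretrained recognition network.
    best, best_xy = -1.0, (0, 0)
    for dy in range(1, max_shift):
        for dx in range(1, max_shift):
            shifted = torch.roll(feat, shifts=(dy, dx), dims=(1, 2))
            score = F.cosine_similarity(
                feat.flatten(1), shifted.flatten(1), dim=1).mean().item()
            if score > best:
                best, best_xy = score, (dx, dy)
    return best_xy   # most likely gap between neighboring repeating objects
```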
You can think of this as vectorizing a natural image, which is kind of a crazy thing to do: you have the rasterized pixels of a natural picture, and you are trying to represent the image in a much lower-dimensional space: the centroids of the objects. Once you have this lower-dimensional space, you can do what people did before: search for a program that explains where the objects are. But one thing is no longer captured: because we no longer assume the world is made of lines and rectangles, you do not know what the object is anymore. You can say "at this point there is something", but what is that thing, and how do you parameterize it? You can parameterize it with a neural network, a generative neural network. So this is a neurosymbolic, generative representation of the picture, and it allows you to do interesting things.
For example, if you have this picture of a floral pattern with a missing patch, the question is how to fill that missing patch. Intuitively a human would say: there are all these crosses next to it, so I assume I should put a cross there too. The power of this hybrid method, or representation, is that the programmatic structure, the code, tells you where to look: it tells you the centroids of the other objects and which patches you should look at; but how to use those patches to fill the missing region is delegated to a neural network, which takes all these reference patches and smartly does image inpainting. Look at it closely and you find it is not simply copying a neighboring patch, not copy-and-paste: it is looking where it should look and using the neural network to put in the floral texture, so the output image is realistic but also intuitive: it really matches our intuition that a cross belongs there. Then you can do extrapolation: add another row: not another row of rectangles, but another row of crosses in a natural image; extrapolation on natural images. And because there are no perfect pictures in the natural world: the program tells you where every cross is supposed to be, but of course there are slight deviations: you can identify those deviations and magnify them: magnify the deviations of the objects from where they are supposed to be, or magnify the irregularities: which is potentially very useful for defect detection in industrial production. Going back to the milk-and-cereal example: you can find the centroids of the objects, do inpainting to put back a missing cereal piece, do extrapolation to add another column of cereal, and do irregularity analysis to magnify the odd irregularities. Okay, this all looks nice, but there is a big difference now that we are in natural images.
There is a big difference between that cereal picture and the corridor picture I showed at the very beginning. In the cereal picture, though it is a natural image, you are effectively assuming everything lies on a single 2D plane seen from a top-down view. That is not the case in natural pictures in general: in this corridor, not everything is on a single plane. Clearly there are multiple planes: a floor, a ceiling, two walls: all these different planes. So the question is: is it possible to generalize this structured representation from a single plane to multiple planes? Conceptually this should not be too hard, because all you need is: the camera parameters and where the vanishing point is; a fair way to partition the image into multiple planes; and, for every plane, an estimate of its pose, its six-degree-of-freedom pose: its position and surface normal. Once you have those, you can rectify each plane: if you know where the plane is and what its surface normal is, you can warp the image so you are seeing the plane from a top-down view: and then you have reduced the problem to one you already know how to solve: search for programs that explain the rectified planes. So you are able to generalize from a single-plane program to a multi-plane program.
This is a hard problem, of course, and we still rely on bottom-up visual cues: for example, we estimate where the vanishing point is, as well as the wireframes. 3D wireframe estimation is a really hard problem: it was not that robust back in 2020 when we did this work, and it is still not very robust now, though it is getting better: so we used 2D wireframes, which give you a lot of correct answers but also a lot of noise, many false positives; more on that later. Before moving on, let me draw an analogy: we always try to go bottom-up, from raw pixels to low-level visual cues and then all the way up to high-level structured programs and repetitions. That was the case for line drawings, where people first identify the lines and shapes and then search for a program. For single-plane images it is the same: you go beyond lines and shapes, find the centroids of the repeating objects, and then search for a program to explain them. And here there is just one more step in the bottom-up process: using the vanishing point and the wireframes to help with the plane partition. You can draw this analogy to guide your thinking about how this is done. But as the problem gets harder and harder: from synthetic images to natural images to multi-plane images: multiple possible explanations start to appear. In this particular image, the vanishing point is estimated pretty accurately, but the wireframes have so many false positives that there seem to be many different explanations of where the planes could be: where is the wall, where is the ceiling, where is the floor: numerous candidate partitions. So which one is correct? As humans we have a lot of prior knowledge, and we say candidate 2 is correct, because we have seen so many corridors and so many walls and floors, and we know what they look like. But that is not the case for machines, especially if the machine is given only this single image: how would it know that candidate 2 is better than candidate 1 or candidate 3, especially when they all satisfy the vanishing-point and wireframe constraints? As we go to more and more complex pictures, these visual cues become more and more limited, and the problem gets harder, because the hypothesis space grows with all the additional uncertainty.
How can we address this problem? I think a fundamental piece of understanding is required. We have to ask again why these structures exist in the first place: why are these objects and planes regular, why does this kind of symmetry exist, why do these repetitions exist, why do the lights repeat themselves? It is fundamentally because of human preference. Humans introduce preferences, sometimes explicitly, sometimes very subtly: for this particular corridor, it could well be that, because we like such regularity, such structure, whoever constructed the building or the corridor introduced the prior that the whole thing has to be symmetric and the lights have to repeat themselves at a fixed interval. Because of this fundamental human preference, the structure exists.
What does that mean, and how does it help us solve the inverse problem? It means that if I can solve the low-level problem really well, the high-level problem also becomes easier to solve. Let me give a concrete example and just go forward with this one. Here I do not know which of the candidate partitions is best, but I can say: sure, I cannot tell, so let us just assume each one is correct, proceed, and see what happens. If I assume candidate 1 is the correct one, or candidate 2, or candidate 3, what happens? We move forward: we assume the partition is correct, estimate the surface normals and positions, use them to rectify each plane, and then run the algorithm we already know: search for a program that explains the rectified planes. And what you find is this: if you had the correct plane partition, the estimated and rectified planes are very regular: because that is where the human preference lives: and therefore the program you identify, the inferred program, is much simpler (you can see it in the middle), and if you use the program to reconstruct the planes, the reconstruction is much better. If, on the other hand, the plane estimation: the low-level problem: is not solved well, you get incorrect planes, you estimate the surface normals incorrectly, and after rectification the planes look awkward; you cannot find a good, simple program to explain them: the program you get is much more complex: and the reconstruction is much worse. So the fundamental observation, which I think is very deep, is that the bottom-up problems we typically think about: surface-normal or vanishing-point estimation, or plane partition in this particular case: are not independent of the high-level program-search problem. Plane partition and program synthesis can and should really help each other, and what truly connects them is human priors: the human preference for regularity.
Okay, so this bottom-up problem, visual perception, and the top-down program-synthesis and reasoning problem should really help each other, and in this particular case we can use program synthesis to tell us which plane partition is best: objectively, candidate 2. Then you obtain the right plane partition, the right program for each plane, and this kind of programmatic, neurosymbolic boundary representation for the scene (a scoring sketch follows).
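A sketch of selecting the plane partition by "program simplicity plus reconstruction error", an MDL-style criterion in the spirit of what was just described (all functions here are placeholders standing in for the actual pipeline):

```python
def score_candidate(partition, image, rectify, synthesize, reconstruct):
    total = 0.0
    for plane in partition:
        patch = rectify(image, plane)           # warp the plane to a top-down view
        program = synthesize(patch)             # fit a repetition program to it
        recon = reconstruct(program)
        total += program.description_length()   # simpler programs are preferred
        total += ((recon - patch) ** 2).mean()  # plus the reconstruction error
    return total

def best_partition(candidates, image, rectify, synthesize, reconstruct):
    return min(candidates,
               key=lambda p: score_candidate(p, image, rectify, synthesize, reconstruct))
```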
And what you can do afterwards: you can ask what would happen if I move forward. Compared with purely autoregressive methods (this was back in 2020; they can probably do a bit better now), even for very long-horizon prediction you will see that the scene keeps its structure instead of just getting blurrier and blurrier. Beyond prediction, there is extrapolation: if you are standing in this corridor and say "I am not moving forward, I am moving backward", what would you expect to see? You should expect the model to tell you that new lights keep coming in: if you move backward, another light should come in, because all these things repeat themselves: rather than producing a blurry picture that just makes the existing scene look smaller and farther away. And given a single image, what if I look around, if I turn to my back? If I have only seen the front of a corridor and you ask what is behind me, of course I do not know, and there are infinite possibilities; but if I have to pick one, I would say: I will just produce an infinite corridor. Of course there are no infinite corridors in the world, but if I pick one, that seems the most plausible explanation I have: so I turn around and produce an infinite corridor where the lights and shadows and everything keep repeating themselves, instead of producing something very, very blurry. You can go from the corridor example to the building example, where everything is the same except that in a corridor you can think of yourself as standing inside the box, whereas for a building you are looking at the same box from the outside; consequently, compared with corridors, where you have one vanishing point, a building has two vanishing points: everything else is the same. So you can do extrapolation as we showed earlier, and compared with existing methods: some of which keep the structure really well but do not respect the input that well, and some of which respect the input but reconstruct less well: the structured method can do both.
One final thing I want to say: now we have these diffusion methods that generate much better images, and I think it is still very interesting to ask how such pixel-level methods can be integrated more effectively with structured representations. For example, a common issue with today's very powerful generative models, especially those for text-to-3D, is that the generated dogs often have multiple heads. That is because the pictures people take of dogs are very likely taken from the front, so every picture of a dog in the training data has a head facing you: hence 3D dogs with multiple heads. Having some kind of structured knowledge will hopefully help address exactly this problem. I should say this work was really done by two fantastic students: they now work on very different things, but I like the work a lot and still talk about it: one is now a PhD student at MIT and the other at Stanford. Okay, now I want to move on to some more recent work.
We have talked about program synthesis for visual data, starting from sketches and moving to natural images with single-image learning, and then from a single plane to multiple planes. So what is next? You can think of the line drawings and single-plane images as doing everything in 2D; with multiple planes, every plane has a surface normal, you warp them and put them together: there is no detailed geometry, so it is not quite 2.5D: you have more of an envelope representation of the scene: arguably a little less than 2.5D, but let me call it 2.5D. Naturally, the next step is to move to 3D, and there is a fundamental difference I want to emphasize, one that exists in human perception as well. When you say 2D or 2.5D, you usually have a viewer-centric representation: everything is about scenes, the camera is at your eyes, and the coordinate system is centered around your eyes: it is essentially the world coordinate system. When we talk about 3D, there is often a big change: you center around objects. Objects have their own coordinate systems, and you see an object from a totally different perspective, with the origin no longer at your eye but at the center of the object. Scenes versus objects: that requires a whole paradigm shift in a lot of the methods and representations we talked about before; conceptually it should be transferable, but there are deep discrepancies between the two that warrant a lot of further study. In this particular case we look at 3D shapes, which often have very abstract, program-like structure. Again, this is because when humans make objects, especially human artifacts, we really want them to be regular: the legs of a table, the legs of a chair: we have a strong preference for them to be regular and repetitive; and there are pragmatic concerns too, because if the table legs are not equally long, the table will not be stable: you want it stable, cheap, and efficient to make: all these considerations suggest that these shapes have to have this kind of structure.
Due to time constraints I will not be able to talk about how we do this in detail, but let me quickly show the results. We use learning methods to take a shape and infer a programmatic representation for it, very much inspired by work in both program synthesis and computer graphics: there is a huge line of work on procedural models for shapes in graphics. We use neural networks for inference, and because there are very limited annotations for shape programs, we also use a neural network as the program executor, so that training can be mostly self-supervised. But if you do this in a very simple way: saying the shapes have regular structure, the legs of the chair are cuboids, the top of the table is a cylinder: you fall back to the existing limitation I mentioned early on: the line drawings were assumed to be made of rectangles and lines, and the world is not made of rectangles and lines; in 3D you have the same issue: it is funny to say a table is nice because its top is a cylinder and its legs are cuboids, because the world is not made of just cylinders and cuboids. Look at the chair you are sitting on right now: it does have the structure: it has to be stable, there are repetitions in the legs: but the detailed geometry of the legs, the back, the seat has fine, beautiful curvature and so on, which is not captured by simple geometric primitives.
which are not captured by simple geometric primitives. So most recently, at NeurIPS, we had newer work where we integrate the shape-program representation—highly structured and symbolic—with neural primitives parameterized by implicit representations. Neural implicit representations have become very popular these days, particularly because of NeRF, where people use them for appearance; but even before that, people used implicit representations for geometry with deep networks, in work like DeepSDF. The concept is very similar to what we did before in 2D: you still have the structured program, but what a part is—what a cylinder is—is now parameterized by a neural network. It is the same story: you have an airplane,
or a chair, and you have a programmatic representation for it. What does that mean? Take the left wing and the right wing of the airplane. All the program structure tells me is that they have to be the same thing, because the plane has to be symmetric: if there is a left wing and a right wing, they have to be the same—otherwise I don't want to sit in that airplane. That is what a programmatic representation can tell you: the repetition of parts. But what *is* a wing? Just like the question of what a cylinder is, the wing is no longer parameterized by lines or cuboids; it is parameterized implicitly by a neural network. The network parameterizes what the wing, or the engine, of the airplane is, and the programmatic structure tells you that the wing should be repeated and reused, with its pose as the only thing allowed to change: it is reused on the left and on the right, and likewise for the engines.
So you have the programmatic part—the program tells you the structure, the repetition, the symmetry of the object—while the neural primitives parameterize the actual detailed geometry. Of course you could do appearance as well, as in NeRF, but here it is the geometry of the parts. A minimal sketch of this split is below.
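A sketch of the program-plus-neural-primitive idea (my assumptions, not the paper's exact architecture): the program says the two wings are the *same* part, so one implicit network is reused and only the pose/reflection differs. A tiny SDF MLP stands in for a part.

```python
import torch
import torch.nn as nn

class PartSDF(nn.Module):
    """Implicit geometry of one part: point (x, y, z) -> signed distance."""
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                               nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, xyz):
        return self.f(xyz)

wing = PartSDF()                          # ONE network for both wings
mirror_x = torch.tensor([-1.0, 1.0, 1.0])

def airplane_sdf(xyz):
    # Program structure: the left wing is the right wing reflected across x=0.
    right = wing(xyz)
    left = wing(xyz * mirror_x)           # reuse enforces symmetry by design
    return torch.minimum(left, right)     # union of the two parts

pts = torch.randn(5, 3)
print(airplane_sdf(pts).shape)            # torch.Size([5, 1])
```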
Compared with earlier work on symbolic shape representations, which mostly used hand-designed ways to parameterize the low-level primitives, using a neural network to learn the primitives gives much higher fidelity, while the symbolic structure still enforces, for example, the symmetry of the airplane and the regularity of the legs. We are happy to talk about this work offline; I think we are out of time, so I should move on to the final—and, I feel, the most exciting—piece of the work.
I have been hinting at this a few times: we have all these visual programs, and we have been talking about how to infer them, but there is a more fundamental question—why would these structures exist in the first place? I have been suggesting that much of this program-like structure originates from human preferences in the fabrication process.
For example, look at this vase—it is beautiful, right? It is just an RGB picture, but the vase looks the way it does because of its intrinsic images: the underlying components that get put together to produce the final RGB picture. These include, for example, the geometry of the object—its surface normals; the albedo, which is the texture; and the material, which is how the object reflects light. This vase looks like porcelain to me precisely because of the way it reflects light. These are the underlying components that combine to produce the image.
In computer graphics the process of combining them is called rendering; that is why a large part of computer vision is called inverse graphics, or inverse rendering—you try to invert this process and recover the underlying components from the image. But the problem is very hard. Think about it: you have components A, B, and C, and you know that when they are combined, the output is the image; but given only the picture, how can you tell what the underlying A, B, and C were? On its own that is impossible—the problem is so ambiguous that you have to rely on different levels of inductive bias. And what inductive biases do we have? I would say it is exactly the program-like structure, the human preferences, that exist in these intrinsic images.
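To make the "components combined into an image" concrete, here is one minimal forward model commonly used in such de-rendering setups (my notation; not necessarily the exact model in the talk)—a Lambertian diffuse term plus a specular term:

$$ I(p) \;=\; a(p)\,\max\!\big(0,\ \mathbf{n}(p)\cdot\mathbf{l}\big) \;+\; s(p), $$

where $a$ is the albedo, $\mathbf{n}$ the surface normal, $\mathbf{l}$ the light direction, and $s$ the specular/lighting residual. Inverse rendering means recovering $a$, $\mathbf{n}$, and $s$ from $I$ alone, which is under-constrained without the biases discussed next.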
In the case of this particular vase, let us look at the components one by one. Are the surface normals regular or not? I would say they have very strong regularity, because when we made the vase we wanted it to be rotationally symmetric—we have this strong preference for making things regular. The same holds for the material: the object is homogeneous, made of the same material everywhere, so every point on the object reflects light in the same way. The surface normals and the material therefore have very strong regularity: rotational symmetry and homogeneity. I will call this explicit regularity. And what about the albedo,
the texture of the object? The texture is only implicitly regular, I would say. It is not that every pixel on the vase has the same color—it is not homogeneous, not a pure color—but the albedo map does look similar to me everywhere; the vase has similar texture all over. So it has what I would call implicit regularity: I do not know how to write it down as equations, but there is regularity that is apparent to us. Then there are the lighting components—the specular components, the environment maps. They may sometimes have a little regularity, but especially indoors they are just so complex that I would simply say they are not regular, and not worry about them. So now we can see that even these intrinsic images, which look so complex, give us some signal—asymmetries, structures, biases—which may allow us to solve this seemingly impossible problem: to disentangle and infer the intrinsic images from a single RGB image as input.
This work was done by a very talented student, Shangzhe Wu, who was then a PhD student at VGG in Oxford; I was at Google at the time and he came to do an internship. We asked: is it possible to leverage this kind of structure for image de-rendering—taking a picture and exploiting the asymmetries and structures, the different levels of inductive bias we know exist in the intrinsic images, to de-render the input and infer them? Once you have that, you can do fancy applications:
for example, novel view synthesis—turning a single picture of the vase into 3D and seeing how it would look from different views—and also, because you have modeled the material, how the object reflects light, you can relight it: from a single unannotated picture, imagine how the vase would look under different lighting conditions.
Let me talk about how we enforce these different constraints. The first, the shape, is simple: we have explicit regularity in the geometry—we know it is rotationally symmetric—so we parameterize the shape as a solid of revolution, with a scalar height and a vector of radii, one radius per height. With this structured parameterization of the object shape, you can render the silhouette and compare it with the ground-truth silhouette; that is your first loss, and that is how you enforce the regularity of the geometry and the surface normals. A minimal sketch:
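A hedged sketch of the solid-of-revolution parameterization just described: the shape is only a radius profile (one radius per height slice), so rotational symmetry is built into the parameterization itself. The crude soft rasterizer below is a stand-in for the differentiable renderer a real system would use.

```python
import torch
import torch.nn.functional as F

K = 32                                        # radius samples along the height
radius = torch.rand(K, requires_grad=True)    # r(h): the learnable profile

def silhouette(radius, width=64, height=64):
    """Soft orthographic side-view mask of the surface of revolution:
    pixel (row, col) is inside iff |col - center| < r(row)."""
    r = F.interpolate(radius[None, None], size=height,
                      mode="linear", align_corners=True)[0, 0]
    u = torch.linspace(-1, 1, width)
    return torch.sigmoid((r[:, None] - u[None, :].abs()) * 50)  # differentiable

gt_mask = (torch.linspace(-1, 1, 64).abs() < 0.5).float().expand(64, 64)
loss = F.mse_loss(silhouette(radius), gt_mask)   # "first loss": mask match
loss.backward()                                  # gradients reach the profile
```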
Once you have the shape, you can unwrap the object surface, so you get the surface normals and the texture, now unwrapped into 2D. There you can do the standard intrinsic image decomposition into lighting, materials, and albedo. Then you can put everything back: during re-rendering, compute the lighting components, apply the albedo to reconstruct the texture, put in the object pose and shape, and reconstruct the original image—that is your second loss, the photometric (pixel) loss. Here you also assume the object material uses the same parameters everywhere: every point on the object has the same material parameters and reflects light in the same way. This is how you enforce explicit regularity on the object material; a sketch of such a homogeneous shading model follows.
These parts are not that surprising. The harder problem is how to enforce the regularity—the structure, the code—that is only implicit, namely in the albedo. The albedo looks similar everywhere, but how do we enforce that? It is not as if this pixel has the same RGB value as that pixel. We did it in two ways. The first is easy: if the albedo looks similar everywhere, then if I compute the mean albedo and put it back, the reconstruction from the mean albedo should still look similar to the original picture. A sketch of this loss is below.
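A minimal sketch of the mean-albedo consistency loss, as I read it from the talk: replace the predicted albedo by its spatial mean, re-render, and ask the result to still resemble the input. `render` here is just elementwise shading, a stand-in for the full differentiable renderer.

```python
import torch

def mean_albedo_loss(albedo, shading, image):
    """albedo, image: (H, W, 3); shading: (H, W, 1)."""
    mean_a = albedo.mean(dim=(0, 1), keepdim=True)   # one average color
    re_rendered = mean_a * shading                   # put the mean back
    return (re_rendered - image).abs().mean()        # should still match input

a = torch.rand(64, 64, 3, requires_grad=True)
s = torch.rand(64, 64, 1)
img = torch.rand(64, 64, 3)
mean_albedo_loss(a, s, img).backward()
```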
The more exciting way of enforcing it is this. When we say the albedo maps look similar *to us*, what does "perceptually similar to humans" mean? It means that no matter which patch of the albedo map you pick—whether it comes from the highly specular region of the vase or from a non-specular region—the albedo should look similar to a human everywhere. And if patches look similar to humans, they should also look similar to a machine. So you can enforce this by sampling patches from the albedo map and sending them to what we call a self-supervised albedo discriminator, which should not be able to tell where an albedo patch came from. Concretely, you predict the albedo maps and the specularity maps, and the question is: whether an albedo patch comes from a highly specular region or from a region with no specularity at all, a discriminator whose goal is to classify where the patch came from should not be able to do well—and the goal of your generator is to produce albedo maps that are consistent everywhere, specularity or not,
so as to confuse the discriminator. This is important because the generator then has to solve a genuinely hard problem: intrinsic image decomposition in highly specular regions is really difficult, because those regions are overexposed—pure white. But if you tell the generator that even when the input is pure white, it must produce an albedo patch that looks like the non-specular regions, that forces the system to respect the implicit regularity and produce a consistent albedo map. This was really important for achieving good results. A minimal sketch of the adversarial piece:
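A hedged sketch of the self-supervised albedo discriminator: patches are cropped from the predicted albedo at specular versus non-specular locations, a small CNN tries to tell which region a patch came from, and the generator is trained adversarially so it cannot. Architecture and patch locations are toy stand-ins.

```python
import torch
import torch.nn as nn

disc = nn.Sequential(                    # patch -> "from specular region?" logit
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1))

def crop_patches(albedo, corners, size=16):
    """albedo: (3, H, W); corners: top-left (y, x) of each patch."""
    return torch.stack([albedo[:, y:y + size, x:x + size] for y, x in corners])

albedo = torch.rand(3, 128, 128, requires_grad=True)   # generator's prediction
# In the real pipeline these locations come from the predicted specularity map.
spec = crop_patches(albedo, [(10, 10), (40, 40)])      # from specular regions
diff = crop_patches(albedo, [(80, 80), (100, 30)])     # from diffuse regions

bce = nn.functional.binary_cross_entropy_with_logits
logits = disc(torch.cat([spec, diff]))
d_loss = bce(logits, torch.tensor([[1.], [1.], [0.], [0.]]))  # D: tell them apart
g_loss = bce(disc(spec), torch.zeros(2, 1))  # G: make specular patches look diffuse
```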
Here are the final results. By putting in these different levels of inductive bias over the different intrinsic image components, we can work from a single image at test time—and at training time from nothing more than a collection of images of vases, with no annotations at all, no 3D annotations: purely unsupervised training on a collection of vases. At test time, from a single vase, you can infer all the intrinsic image components—albedo, surface normals, diffuse and specular components, and materials—and from that single image you can turn the vase into 3D, view it from different angles, and relight it.
Here are more results on a different dataset, which is a bit more complex, with more background; you can see we do equally well. It even works for paintings: you can turn a painting into 3D, see it from different views, do novel view synthesis, and do relighting.
Finally, let me wrap up by going back to the original story. At first I said line drawings are made of lines and rectangles, which is not general, so you want to generalize to natural images. Then for 3D shapes made of cuboids and cylinders—that is not good enough either; you want general objects. Now, for intrinsic image decomposition, I have been saying: great, you have all these intrinsic components, but you assume the object is rotationally symmetric. That is cool, but not every object in the world is rotationally symmetric—very few are. How can we go from the assumption of symmetric objects to general objects? Suppose I show you this beautiful bouquet of roses: how can we generalize from things that are rotationally symmetric to general objects like these?
Again, we have to think about why the structure exists in the first place. Humans have a strong preference for making things regular and repetitive—but the roses are not made by humans; they are made by nature. And nature has its own very strong bias, as I said earlier: all these grasses look similar because they are the same species, and all these roses look similar—have regularities—because they belong to the same object category. They are just different instances of the same category. The similarity exists because, fundamentally, they are the same type of object, multiple instances from one object category. There is a reason we call them the same object class, a reason we give them one name. What they really share is not a rotationally symmetric representation; it is something that belongs to them—their object intrinsics. Every rose is a rose because they share this distribution of intrinsics,
including their geometry and shape; their texture, how they look; their material, how they reflect light; and of course their physics—how heavy they are, and so on. For vision and graphics purposes, we care about learning a generative distribution over their geometry, texture, and material, so that we can drop the rotational-symmetry assumption and instead truly learn a generative distribution of object intrinsics.
So we basically adopt the pipeline we had before, but drop the assumption that everything is rotationally symmetric. We still have regularity—structure, code—but where does the code come from now? It comes from natural supervision: these instances all share the same intrinsic distribution, provided by nature, so they have to share the same distribution of geometry and albedo. You enforce that by learning a generative model of the intrinsics, coupled with extrinsics from the world—object pose, lighting—so you get the shape, shading, and appearance representations, compose them into a picture, and require that the picture look like a natural picture from the real world; a sketch of this objective is below.
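A sketch of the "generative distribution of object intrinsics" objective as described: one generator holds a distribution over intrinsics, extrinsics (pose, lighting) are sampled separately, and a discriminator only ever compares rendered instances against crops of individual objects from the single real photo. Every module body below is a toy stand-in.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 128), nn.ReLU(),
                  nn.Linear(128, 3 * 32 * 32), nn.Tanh())   # codes -> instance
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128),
                  nn.ReLU(), nn.Linear(128, 1))

def render(intrinsic_code, pose, light):
    """Stand-in for: decode intrinsics, apply sampled extrinsics, render."""
    z = torch.cat([intrinsic_code, pose, light], dim=-1)
    return G(z).view(-1, 3, 32, 32)

z_i = torch.randn(8, 8)           # sampled intrinsics: geometry/texture/material
pose = torch.randn(8, 4)          # sampled extrinsics
light = torch.randn(8, 4)
fake = render(z_i, pose, light)
real = torch.rand(8, 3, 32, 32)   # crops of individual roses from ONE photo

bce = nn.functional.binary_cross_entropy_with_logits
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
g_loss = bce(D(fake), torch.ones(8, 1))   # fakes should look like real roses
```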
By enforcing this more general constraint, or code, you can again learn from a single image. All we need is a single picture of roses—maybe there are 20 or 30 roses in it, but it is just one image—and we learn a distribution of their intrinsics: geometry, texture, material. That lets us do novel view synthesis, as in the bottom row, and relighting, as in the second row from the bottom—things we knew how to do before—but also capture a generative distribution of the roses, in the middle row. We can sample roses of different shapes and sizes, because we have learned the generative distribution of their intrinsics, while still doing novel view synthesis and relighting as before.
And here are more results on other object types. All you need is a single image, and you learn a generative distribution of the object intrinsics; you can do novel view synthesis, and you can do relighting.
So, to wrap up: we talked about 2D and 2.5D, and then 3D—how we can generalize what we could already do into 3D—but more fundamentally, why these program-like structures exist in the first place, and how to exploit them to do fancier things, like relighting, or even capturing generative distributions. Something I have no time to talk about at all is time: there is similarly clear programmatic structure in, for example, human motion, but that is an interesting topic for offline discussion.
To summarize, the key innovation behind this line of work: whenever we build something that is not purely neural, there is some kind of neural network for recognition, and then a symbolic representation—code—for generalization. But first, you need a very broad interpretation of what code is. At the beginning we said code can be for-loops, but more importantly, code can simply be different types of inductive bias you put into a neural network, so that the whole thing is still a neural network and still end-to-end trainable. When people think of code, they often object: for-loops are not very generalizable. And for-loops may indeed not generalize beyond specific domains—but code does not have to be for-loops. Code can be very general inductive biases introduced into neural networks, with the whole system remaining a neural network, end-to-end trainable, as we showed at the end in the rose example.
Now, where does this code come from? Sure, you can put in inductive biases, but which ones should you pick? I would say the ones that are naturally supervised. When I say naturally supervised code, I mean there are fundamentally just two sources, two reasons, for the code. One is humans: humans have strong preferences, and when we fabricate things we bake those preferences in—that is the prior, the code, coming from humans. The second is nature: natural supervision comes from nature because, if you believe in evolution, the instances of a species share fundamentally similar intrinsic properties—geometry, reflectance. And because the world is 3D while we look at pictures in 2D, there is perspective projection. These are fundamental, naturally supervised codes we may consider incorporating, and we like them because they are universal: they hold and apply everywhere, for every picture we take. So there is no reason to believe they would hurt generalization; they should genuinely help it.
In the future, we can think about extending this to more complex scenes—with complex backgrounds and complex interactions among lighting, objects, and background—some more programmatic, some less. If we want more generalizable representation learning, what is really the role of symbols? And if we do want something that is not purely neural, how can we design more efficient inference algorithms, since this also makes the optimization problem much harder? How can we go from passive perception to interaction—to interacting with the scene? A lot of this is cognitively inspired, so how can we draw connections to human cognition, and to natural language—because language, how we refer to things and talk about them, is another important source of natural supervision? Very interesting, but I have no time for it here. That's it—thank you.
(Host) Good. Let us thank Prof. Wu for this wonderful and very inspiring talk—a truly unique line of work, using the structure that human understanding, or nature itself, naturally carries, to help us model the 2D and 3D world efficiently and with strong generalization. We have time for one question, and then we will move into the panel session.
Let's pass the microphone over here—thank you. (Audience) Hello, Prof. Wu. In the examples you showed, the objects all seem to be rigid—their shapes do not change. How should we handle deformable objects? (Jiajun Wu) That is a very good question. Before I answer, let me turn the lights on—it is getting dark in here, sorry. Right—what you raised is a very good point. We do make assumptions, but not exactly that one. For the roses, the assumption is more about the scene setup; the earlier vase work did have the rotational-symmetry assumption, though with some flexibility. But you point to a very good future direction: not just capturing whatever static structure we care about, but also capturing the dynamics—having an additional dynamics representation of how things evolve, so the model can capture change.
For example, there is some recent work along these lines. If you look at Shangzhe's latest work—the student I mentioned for the last project, who was also involved in the rose work—he studies how to learn articulated objects from collections of images or video: a horse, say, has many different parts, and the question is how to learn the articulation of those parts, and on top of that not only to learn the parts but to animate them, to make them move. I think that is a very good future—or not even future, but ongoing—direction that many groups are looking at. (Host) Let's thank Prof. Wu again for the wonderful talk. Jiajun, please stay with us; we have a panel session next.
For the panel we also invite Prof. Zhao Zhou on stage, and Prof. Zhu Jun joins us online. Unfortunately, Prof. Zhou Bolei cannot join the panel for personal reasons, but I believe that from his talk you have already heard a great deal of his far-sighted research taste.
(After some stage arrangements) All right, our panel session officially begins. Let me briefly introduce myself again: I am the host, Li Chongxuan; the two speakers just now were Jiajun and Prof. Zhao, and I am also honored to have Prof. Zhu Jun joining online. Our first topic: there have been recent reports of AI being used for fraud—synthesizing voices, even making live video calls impersonating your boss or a friend to ask you for money. As generative AI develops, there may be many such abuses, raising safety and other social problems. We would like to discuss how this can be addressed, technically or otherwise. Perhaps Prof. Zhu Jun could start.
(Zhu Jun) Sure—hello, everyone, and apologies: I am at another meeting, joining from outside. First, thanks to the speakers, including Jiajun and Prof. Zhao Zhou—Prof. Bolei may already be offline—and thank you all for supporting the BAAI Conference. I listened to several of the talks, including Jiajun's excellent one just now. Back to the question: as AIGC develops, these techniques may indeed be used for particularly malicious purposes. In the previous wave—not that long ago, around 2019—there was a wave of deepfake detection, because with the development of deep generative models such as GANs, synthesized fake videos could spread online with malicious effect. Since then, people have been using programs—AI—to automatically detect such generated video, images, even audio and text. On the technical side, even though generation quality is now much higher with AIGC, computer-generated content still differs from real, natural images and video in many ways: there are differences in feature distributions, for example, and content synthesized through face swapping and the like differs in naturalness and smoothness. Using such cues, computer algorithms can detect fakes fairly precisely, and there has been considerable progress and quite a few applications. But it is inherently an arms race: will AIGC eventually defeat detection entirely? To human eyes the results are already largely convincing, and future advances will put further pressure on detection algorithms—I am sure that will happen. Still, for detection, the most important thing is probably to detect content with harmful effects,
not merely whether something was generated by an algorithm—that is less urgent. For the malicious cases just mentioned, the content itself, what it expresses, can be examined further, not only the visual features it displays. (Host) Thank you, Prof. Zhu. Jiajun, would you share your view? (Jiajun Wu) I actually don't have much to add—I agree with Prof. Zhu. On the other hand, as he said at the end, generation will inevitably get better and better, and eventually you will not be able to tell. So it is really a broader, systemic problem rather than something a single detector solves: these systems can already be quite realistic for a specific person in a specific scenario. Video may not be that good yet, but it will surely keep improving—there is so much video data online that there seems to be no reason the dynamics, at least in terms of looking real, will not become very convincing. So in the end I think it is not a purely scientific question; it is one that society has to consider holistically. (Host) Thank you, Jiajun. Prof. Zhao, would you also share your view? (Zhao Zhou) I agree with Prof. Zhu and Jiajun.
Right now our generative models still produce some unnatural artifacts that we can detect. But, as Jiajun said, once models become larger and more realistic, it will be very hard to tell afterwards. My view, echoing Prof. Zhu, is that the generation technology itself is not the problem; the problem is malicious content. And if models keep getting more realistic and more detailed, I think there is another approach: embedding digital watermarks at generation time. With a digital watermark we can identify which model, or which organization, produced a given piece of generated content, so we can quickly check where content came from—or whether it is real. That is the digital-watermarking approach; that is my view. Thank you.
(Host) Thank you for sharing, Prof. Zhao. Our next three questions are one for each speaker. The first is for Prof. Zhu: looking at algorithms and fundamental model families, do you think there will be another big breakthrough after diffusion? After all, we have gone from explicit density models to VAEs, to GANs, to diffusion—is most of it done, or do you believe there will be a next, more powerful generative modeling framework? (Zhu Jun) That is a hard question to answer, but if you ask whether I believe there will be one, I certainly do—that is why we all do research. It will be full of surprises. Before diffusion models appeared, people said GANs were already very good and dominated many tasks; then diffusion arrived, and now most people have embraced diffusion models. But diffusion models are not without fundamental limitations, and we already see many open problems. As mentioned in the talks, diffusion currently works quite well for continuous data like images, from many angles; but for data like text, can we make a breakthrough there? And can we unify the two, handle them together? I think these are still open questions everyone is curious about, and I am willing to believe that one day this gap will be closed much better. So for the students present: stay hopeful. We ourselves are working hard either to make diffusion models better or, more fundamentally, to replace them with something entirely better; that effort has to continue. I hope that day becomes reality—there is no answer right now; if I had one, I would share it with you.
(Host) We look forward to making it happen—thank you, Prof. Zhu. The second question is for Prof. Jiajun Wu, and it differs a little from what I prepared, because I was inspired by your talk—it was my first time hearing the full long version, and the work is truly unique. Starting from knowledge, or from natural regularities, you get strong ways to constrain and control the model, good generalization, and easy transfer to different tasks. But recently there is another approach—large-scale pretrained models, a very data-driven, black-box way of handling open-ended tasks. Both approaches have their problems and their distinct strengths. How do you view the two schools, and the outlook, especially for 3D? (Jiajun Wu) I don't actually think they are that different. They may share the same goal, just from different starting points. Take NeRF: it has a strong built-in notion of 3D space, of light transport, and rendering follows that structure. So I don't think these approaches are fundamentally opposed. There is also some recent work I find interesting on what a concept is. For example, say I take some photos of Prof. Li Chongxuan: what makes a photo "Li Chongxuan"? How far can I edit it before it becomes, say, Prof. Zhu Jun? There are interesting concepts in there, with many applications: from a few photos of Prof. Li, generate him lecturing, or skiing, in a way that both fits the scene and keeps his identity. Several recent papers—including some presented today—are closely related. You can call them data-driven, but to a large extent notions like concepts, like 3D space, are important inductive biases. As I said, so-called inductive bias or structure means *some* structure is important—not that the structure must be a concrete, hand-written program. As tasks get more and more general, the minimal structure you need may appear in a very subtle way. NeRF, diffusion—these works also carry some structure. So the two are not contradictory; it is really a spectrum from one end to the other. (Host) Great—thank you, Jiajun. Our third question is for Prof. Zhao:
with more and more modalities—text, images, video, and the audio synthesis you focus on—how do you see multimodal models developing? How many modalities are enough, and how can different modalities reinforce one another? (Zhao Zhou) Thank you, Prof. Li. In today's talk I mainly shared work on generative models: we want generative models with faster inference, smaller model size, and stronger expressiveness, and this year we can generate not only visual content but also audio—different modalities. From my perspective—I have long worked on multimodality and human-computer interaction—what I am thinking about next is this: right now we mostly focus on generation, and the next step is to move toward understanding. Previously, understanding and generation were solved separately: people tackled either multimodal understanding or multimodal generation. In this era of large models, we hope to put understanding and generation into a single model. Taking human-computer interaction as an example, the inputs include a talking face, speech, and other modalities; with better understanding, we can generate better. The larger vision is that, combining understanding and generation—since we work on multimodal human-computer interaction—we can build a better multimodal dialogue model, so that what we generate, including the agent interacting with people, is more lifelike and more engaging. That is what we want to do next with multimodal generative models in human-computer interaction. Thank you. (Host) Thank you, Prof. Zhao. We are almost out of time, so one final question.
Prof. Zhao can answer first, then we will go around, and finally Prof. Zhu can close. Looking at where generative learning stands today—large models, all kinds of generative and autoregressive techniques—what future breakthrough in generative models would excite you the most? (Zhao Zhou) Although we already have the explosion of large models, and diffusion making results ever more realistic, we can see that AIGC can transform our lives and reshape many scenarios. For example, human-computer interaction built on virtual humans, virtual worlds, virtual environments—person-to-person interaction. We are not only very realistic in the physical world; we can also have a very faithful mapping of it in the online environment. Going in that direction is, I think, very exciting: many problems to solve, many new scenarios to conceive. That is my view. (Host) Thank you, Prof. Zhao. Jiajun, please share your view.
(Jiajun Wu) In the short term—first, I see no reason there cannot be a next breakthrough. Beyond that: how to find an effective way to do human-AI interaction, and what the best handle is for bridging human and machine. I think there is still a great deal to do there. Of course it also touches many social-science questions, but that is more long-term. (Host) Good—thank you, Jiajun.
Finally, let me ask Prof. Zhu Jun for an outlook and a closing word. (Zhu Jun) I can't really call it a summary—I fully agree with what the two professors said. Coming back to the question of what feels most exciting to work on next, let me add one point, related to what Jiajun said about interaction. Right now we look at interaction between people and algorithms or machines; there is another direction: once these multimodal capabilities reach a certain level, they can be combined with robots, with physical bodies—what is now called embodied intelligence. In the future, what we see will not just be an algorithm; it will be embodied in a physical agent that can interact with, and evolve together with, the environment, with people, with everything around it. There is a lot of discussion in embodied intelligence now—including a dedicated session at this conference—and a fair amount of recent progress to follow. Beyond generation in one or many modalities, I think this may produce an even bigger impact: once the technology reaches a certain capability, the change for all of us will be substantial. That, we think, is an exciting part of the future. And since Chongxuan asked me to say a couple more words: let me again thank all of our speakers. As one of the organizers of this conference, I thank Chongxuan and the generative-models session for this closing discussion. Although I could not hear everything, the talks I did hear were all excellent and very much in line with our consistent aim—to run a professional, expert, high-level conference. Reaching that goal depends on our experts' wonderful talks, so my last words are thanks to everyone, including the audience still here on site—thank you for your support. (Host) Thank you, Prof. Zhu, and thanks to all our guests—Prof. Zhao and Prof. Jiajun Wu. That is all for today's forum. We thank the audience for your support. Thank you, everyone!