
BAAI Conference 2023 Notes (Part 6)

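It is to check these properties. Of course, there are even papers that talk about AI alignment, which would align conversational agents with human values, but we do not even understand how human will arises, nor all the value-based processes behind human decision-making.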


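I have a property, and there is a more technical term for this, an oracle. That is an agent: when you apply x, you observe and you know whether it succeeded or not, pass or fail. So an oracle is an agent that computes this predicate and decides. Now, if I say that A satisfies a property P that I want to verify,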

















































































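The first, Element 14, was acquired by Broadcom in 2000; the second, Icera, was acquired by Nvidia in 2011. Simon's educational background includes a degree in electrical engineering from Cambridge University. So I am now very pleased to welcome Mr. Simon Knowles.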



















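CMOS is complementary, and an SRAM cell, the static memory cell on the chip, needs six transistors to store one bit. So chip designers usually think in terms of pairs of logic transistors and sets of six SRAM transistors. You can see that the RAM is no longer getting denser.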













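mounting an L2 cache die on top of the processor cores, and then arranging those cores on a substrate around an IO die, into a fairly complex assembly. One of the most ambitious things so far is, at bottom center, the Tesla D1, which is a reconstituted wafer; in other words,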















































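It is starting to be used in higher-performance computing environments. This is called LPDDR, which stands for low-power double data rate; that is a historical name and not a very helpful one. It is also a vertically stacked technology, but one that uses a very, very cheap method, as you will see. I have highlighted on the right-hand side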


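That number may come down, but probably not below a factor of four difference, so it is still very low. Another difference between these technologies, which I have also highlighted in yellow, is first that LPDDR allows more memory to be attached to one XPU, about half a terabyte, but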














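then you can access memory even ten times faster than with HBM. You can access it at tens of terabytes per second, in the case of this chip 65 TB per second, a speed that is completely unattainable with any dynamic memory. So this is an option, at least for models of ordinary size.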
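To make the bandwidth comparison concrete, here is a rough back-of-the-envelope sketch of how long one full read of a model's weights takes at different memory bandwidths. The 65 TB per second figure and the roughly tenfold gap to HBM come from the talk; the model size, the bytes per parameter, and the function name are illustrative assumptions.

```python
def weight_read_time_ms(num_params: float, bytes_per_param: float,
                        bandwidth_tb_s: float) -> float:
    """Milliseconds needed to stream every weight once at the given bandwidth."""
    total_bytes = num_params * bytes_per_param
    bytes_per_second = bandwidth_tb_s * 1e12
    return total_bytes / bytes_per_second * 1e3


# Assumed example: a 400-million-parameter model stored in 16-bit floats (0.8 GB).
PARAMS = 400e6
BYTES_PER_PARAM = 2

# 65 TB/s is the on-chip SRAM figure quoted in the talk.
sram_ms = weight_read_time_ms(PARAMS, BYTES_PER_PARAM, 65.0)
# Assumption: HBM-class memory at one tenth of that, following the "ten times" remark.
hbm_ms = weight_read_time_ms(PARAMS, BYTES_PER_PARAM, 6.5)

print(f"SRAM: {sram_ms:.4f} ms per pass, HBM-class: {hbm_ms:.4f} ms per pass")
```

Under these assumptions the tenfold bandwidth gap turns directly into a tenfold difference in the time spent just reading weights on each pass.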


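each time you pass forward or backward through the model. Many of today's chips do not have that much SRAM and must read parts of the model more than once to do, say, a forward pass. Why is reading it only once a good idea? Because access, in terms of DRAM energy,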


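how much chip-to-chip bandwidth we can fit on one chip. If we use up the long edges for external DRAM, using HBM or LPDDR, then we have the short edges, what chip designers call north and south, in which to fit the chip-to-chip links. That is what is usually done.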













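because GPUs in particular are sold today on the basis that their bandwidth is useful for training. It is not really used for training. Training is still entirely dominated by FLOPs, and so are many types of inference. A training machine does not need multiple terabytes per second of DRAM bandwidth, nor does it need multiple terabytes per second of chip-to-chip connectivity.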






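then, you know, that might be around 100 FLOPs per byte, and you would still be completely DRAM-bandwidth limited on a modern GPU or similar with large contexts. And of course, large contexts are increasingly what we want; research on how to handle them is exploding.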
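The FLOPs-per-byte argument can be sketched with a simple roofline estimate: attainable throughput is the smaller of peak compute and arithmetic intensity multiplied by memory bandwidth. The figure of 100 FLOPs per byte is from the talk; the peak-compute and bandwidth numbers below are round, hypothetical values and not the specification of any particular GPU.

```python
def attainable_tflops(intensity_flops_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tb_s: float) -> float:
    """Roofline model: throughput is capped either by compute or by memory traffic."""
    memory_bound_tflops = intensity_flops_per_byte * bandwidth_tb_s
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) above which the chip is compute bound."""
    return peak_tflops / bandwidth_tb_s


PEAK_TFLOPS = 1000.0    # hypothetical peak compute
BANDWIDTH_TB_S = 3.0    # hypothetical DRAM bandwidth
WORKLOAD_INTENSITY = 100.0  # FLOPs per byte, the figure mentioned in the talk

ridge = ridge_point(PEAK_TFLOPS, BANDWIDTH_TB_S)
achieved = attainable_tflops(WORKLOAD_INTENSITY, PEAK_TFLOPS, BANDWIDTH_TB_S)
is_bandwidth_limited = WORKLOAD_INTENSITY < ridge

print(f"ridge point: {ridge:.1f} FLOPs/byte")
print(f"attainable: {achieved:.1f} TFLOP/s, bandwidth limited: {is_bandwidth_limited}")
```

With these assumed numbers the workload sits well below the ridge point, so the chip would reach only a fraction of its peak compute, which matches the speaker's point that such a workload stays bandwidth limited.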

















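So this separation of concerns that characterizes retrieval networks is something I find very attractive, and it certainly helps with the mismatch between chips and algorithms. So my conclusion is that there will be no universal XPU, where an XPU is a GPU, an IPU, a TPU, whatever you want to call it.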




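The value of AI is enough to justify more than one chip architecture; there will be no universal XPU. Finally, this is my last slide. I like to think about the evolution of AI algorithms, because we went through two steps to get to where we are today. Initially we discovered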
















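adopting IPUs over more established technologies such as GPUs. That is a very good question. For a new company to rise successfully, I think there are two basic requirements, at least in the chip space, which is where my professional experience lies.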


















these models, where they just started to be eerily human-like in some ways. Obviously they're not human at this point, but they have some human-like qualities, and so this has clearly diverted an enormous amount of attention, funding, research, and effort to those kinds of models. So we're going to be seeing more investment in that direction, obviously, in the near future. But the thing about AI, and science in general, and technology in general, is that there are always going to be surprises. So it doesn't just mean it's a straight shot of larger and larger language models from here to AGI or something like that. I would expect surprises still to be coming, but we've certainly learned some lessons about size and data and scale that will probably continue to apply even as architectures perhaps surprise and shift.

Yeah. So, Dr. Zhang, obviously the theme of this conference is...

You can always refer to me as Hongjiang; it may be easier.

Okay, Hongjiang, okay. I was trying to pronounce the surname properly. So, you know, obviously it's understandable that the theme of the conference is large language models, and it's just such an exciting time. As someone who covers AI, I've never seen anything like it. But I would love to hear your perspective and how you're thinking about it as the chairman of the BAAI.

Well, the acoustics here are not that good. Could you repeat the question one more time?

Yeah, okay, sorry. So what I was asking was, given that we are experiencing a big moment in AI, what do you think, in the big picture, it means for the status of AI research, and for the direction, maybe, that you'll be taking at the institute?

It's definitely a big breakthrough, one that makes every one of us who has been working in the field rethink the approaches we have been using, the system architectures we have been building, and the algorithms we have been working on.

You know, before ChatGPT there were tons of efforts looking at various algorithms, but I have always been a big fan of the system approach, meaning that AI technology, AI itself, is a system and not just a single algorithm. That's actually one of the reasons why this particular forum discussion is dear to me, and also, Ken, your book on how, you know, success cannot be planned.

If you look at what we have been doing in most of the computer science field, especially in academia, we tend to look at a single algorithm and try to improve it bit by bit. But, you know, OpenAI took a totally system approach. Think about it: the transformer was invented by Google researchers, and they came out with many quite successful models, but none of them really showed the emergent abilities or the power of ChatGPT. OpenAI so brilliantly combined data, alignment, algorithms, and infrastructure that it led us to this breakthrough.

So I think the entire field is rethinking, you know, how we carry on research, and what the right, most appropriate, most effective approaches to AI problems are. To give you an example from the natural language processing field, which is a very fundamental subarea of AI: when ChatGPT came out, at least among the top groups I know in Chinese universities, they basically told themselves, now we need to look back. One of the universities actually told its PhD students: you know, if you graduate this year, we can't stop you, because you have to graduate. But if you graduate next year, you need to rethink your thesis, because the problem you're trying to address is already, to a large extent, solved by GPT models. So, you know, although you could still graduate if you continue along that direction, your work is really meaningless, I mean, in terms of adding to the state of the art.

Yeah, that's pretty extraordinary. I was at an event here at MIT recently where they had some linguists and cognitive scientists who were also saying that GPT-4 and these large language models were changing their fields, changing other areas of science, I guess. Kenneth, on the subject of what might be missing: is there anything about what we've seen, especially with ChatGPT, that makes you think, here is a really exciting new direction? I mean, I think Hongjiang alluded to some of these things, so I'm curious what you think.

Yeah, when we think about what might be missing and about exciting new directions, there are exciting new directions that build upon it, that have now opened up and weren't possible before, and then there are limitations, of course, where these are problems that still exist in the models. So maybe to start with exciting directions that just build upon what we have. One, and this is a very unusual one which I don't usually hear mentioned, but I think about it a lot, is that we haven't before had the ability for a computer to actually grapple with the question of what's interesting. If you think even back two or three years, you could never imagine, from a subjective point of view, even beginning to say: look at this idea and tell me what you think. Is this actually a good idea? Does this lead in a good direction?

And for the first time you can actually have the computer begin to grapple with this kind of subjective question. If you think about it, this is an extremely important question, what is interesting and what is not, even though it's totally subjective, because it's the seed from which all research and innovation grows. I decide what to do based on what I think is interesting. So if these kinds of models are someday going to actually solve big, momentous problems in our world, they need to think about which directions are the most interesting to pursue, so that those will become stepping stones to actually solving these problems. And interestingness is a separate issue from whether you're solving a problem. It's just a question: was this an interesting research idea, or an interesting piece of art, or an interesting story? It's very intriguing that suddenly it can actually start to engage with that question, and not just in the sense of giving you a rating; it can even give you an articulate analysis of why something is interesting. This is the beginning of innovation, the beginning of autonomous innovation. So I think it's super interesting that that's now possible. I could also talk a little bit about what I think are interesting limitations, but I'm not sure if you want to go in that direction already or not.

Well, no, actually, why don't we... Hongjiang, what do you think of this idea of algorithms identifying what's interesting, and maybe that being a kind of innovation? Does that sound like a promising concept to you?

Definitely it is. Although my own expertise is not necessarily in this area, I think this is definitely a very promising direction.

Yeah. I wonder, Kenneth, do you think that it can really tell you something interesting that a person can't? Do you know what I mean? Because sometimes when you see ChatGPT, it is impressive, but it doesn't seem that original. So do you see examples where it finds something interesting that maybe no person would?

Good question. Yeah, I definitely believe that there are some serious limitations when it comes to comparing with a human's instinct for interestingness. These models don't come close, that's true. So it's just the beginning, a glimmering of the ability to grapple with this question of what's interesting. But that's still extremely useful, because it always comes up whenever you're thinking, what should I do next? If you want something to think on its own about what it is going to do next, now that it has finished this task, what the next interesting thing would be, it has got to think about that a little. And even being able to do it a little bit is still really intriguing, but it's obviously something we need to build on.

In fact, when you talk about being unoriginal, that's totally true. They're not good at being original. That's one of those really interesting limitations that I think is going to require improving the models. I would just point out that being original is related to being novel, so novelty comes up, and there's a problem, I think, with the current paradigm in being able to identify novelty in a genuine way. If you think about it, novelty is a function of chronology: whether an idea that you have now is novel or not depends on the order in which events happened. But the model is exposed to all of history simultaneously. It doesn't experience its training data as a chronology that happens in an order, and therefore it's not actually experiencing that moment of epiphany when you say, oh, this is really interesting, because I've never seen anything like that before. So if somewhere in the data it says something like "that's a really novel idea," it's not in the context of what came before; it's in the context of everything that came before and what came after. That's very different from the way that we experience novelty, and because of that, novelty is really not in the data in any substantive way. That means I would expect it not to be good at thinking about novelty, generating novelty, and so forth. To solve that will require, I think, somewhat of a paradigm shift, because you've got to deal with chronology.
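One way to picture the chronology argument is with a toy sketch. Judged in the order events happened, an idea is novel if nothing like it came earlier. Judged against the whole corpus at once, which is closer to how a model trained on all of history simultaneously sees it, the first occurrence is indistinguishable from later repeats. The functions and the tiny timeline here are hypothetical illustrations, not a description of how any language model is trained.

```python
from collections import Counter


def novelty_in_order(events: list[str]) -> list[bool]:
    """Mark each event as novel if it has not appeared earlier in the timeline."""
    seen: set[str] = set()
    flags = []
    for event in events:
        flags.append(event not in seen)
        seen.add(event)
    return flags


def novelty_all_at_once(events: list[str]) -> list[bool]:
    """Mark each event as novel only if it is unique in the whole corpus, ignoring order."""
    counts = Counter(events)
    return [counts[event] == 1 for event in events]


timeline = ["wheel", "printing press", "wheel", "transistor", "printing press"]

print(novelty_in_order(timeline))     # [True, True, False, True, False]
print(novelty_all_at_once(timeline))  # [False, False, False, True, False]
```

In the ordered view the first "wheel" counts as novel; in the all-at-once view it does not, because the later repeat is already part of what is being compared against.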

So fascinating. I was talking earlier this week with Demis at DeepMind, and I was thinking of this, Hongjiang, because you mentioned alignment and some of the technologies that went into ChatGPT, and he was talking about reinforcement learning being very important. Obviously one of the things with AlphaZero and AlphaGo was that they could come up with completely novel strategies. It's very different from a language model, but it's interesting: things that people could never come up with. You mentioned alignment, and I wonder, Hongjiang, whether reinforcement learning or other types of machine learning are of interest to the institute, to you, as a way to broaden the capabilities.

Definitely. Actually, reinforcement learning and alignment in building big AI models are not two different things; reinforcement learning is used in the alignment process, in the learning process. That's exactly what made ChatGPT, what made GPT-4, much better than GPT-3. The very development from GPT-3.0 to InstructGPT and then to ChatGPT is really the alignment process that used reinforcement learning: you use human dialogue data, you use human feedback, through reinforcement learning, to achieve the alignment. So reinforcement learning is a super important learning algorithm in the alignment process, and alignment is a very critical step on the approach to AGI, to large-model safety, and to aligning with human values. So it is super important. Alignment itself is also a very effective way to refine a trained model for specific applications.
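A common ingredient of alignment with human feedback is a reward model fitted to pairwise human preferences, which reinforcement learning then optimizes against. The sketch below shows only that pairwise preference loss, in plain Python. The function name and the example scores are assumptions for illustration, and the speaker does not specify this formulation.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred response outscores the rejected one
    under a logistic (Bradley-Terry style) model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) equals -log(sigmoid(margin)); adequate for moderate margins.
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for pairs of responses to the same prompt.
agree = preference_loss(2.0, 0.5)      # reward model agrees with the human label
tie = preference_loss(1.0, 1.0)        # reward model cannot tell the two apart
disagree = preference_loss(0.5, 2.0)   # reward model prefers the rejected response

print(f"agree={agree:.4f} tie={tie:.4f} disagree={disagree:.4f}")
```

The loss shrinks as the reward model scores the human-preferred response higher, which is what pushes the reward model toward human judgments before any reinforcement learning step uses it.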

You feed more domain-specific data into the model through the alignment process, and that helps us really adapt the model to various scenarios, various applications, various verticals.

That's great. And are there other techniques that you're interested in? I mean, I know there's the Wu Dao model, and my understanding is it's somewhat different from some of the other ones; it was multimodal to begin with, right? Are you looking at other techniques from machine learning? What are you interested in for the next generation of these language models?

Definitely, we are.

Before ChatGPT, or before GPT-4, people in the field had been working on various models; Google Brain came out with BERT, for example, even before that. All those studies and researchers have contributed to the field of large models. Today we see that pretraining based on the transformer, combined with alignment, is the most efficient and effective approach; it led to GPT-4 and to the many models that have tried to repeat the success of GPT-4. But we do see that there is a large space: there are many issues that still have not been solved and that will require further study and further research, and that calls for new algorithms, even new architectures.

You also mentioned multimodal models. That's definitely one direction people in the field are pursuing very hard, and we do see it as the future direction, if not the ultimate direction, of AI models. We humans perceive information and acquire knowledge through multiple modalities. You know, we read, we learn from language, but we also watch movies, watch video, look at pictures; the way we acquire information is multimodal. I'm not a neuroscientist, but I believe the thinking in our brain is also multimodal, so there is no reason our AI model should be only a language model.

But I want to emphasize that the language model is the baseline, the platform. The technology, the know-how, and the insight we have gained from building language models will help us develop multimodal models. Actually, a multimodal model could simply be a continuation of the language model. The good thing about using the transformer as the basic architecture here is that data in every modality is, to a transformer, just a sequence of data. You know, text, a language, is a sequence; an image, if you scan it through patches, is also a sequence; video is a sequence; music is a sequence.
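The idea that an image becomes a sequence once it is scanned in patches can be sketched in a few lines. This is a minimal pure-Python illustration with a hypothetical `image_to_patch_sequence` helper and a toy 4x4 single-channel image. Real systems typically also project each patch to an embedding vector, which is omitted here.

```python
def image_to_patch_sequence(image: list[list[int]], patch: int) -> list[list[int]]:
    """Scan a 2-D image in row-major order and flatten each patch-by-patch tile
    into one sequence element."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[row][col]
                    for row in range(top, top + patch)
                    for col in range(left, left + patch)]
            sequence.append(tile)
    return sequence


# Toy image whose pixel values are simply 0..15, laid out row by row.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patch_sequence(image, patch=2)

print(len(tokens), tokens[0], tokens[-1])  # 4 [0, 1, 4, 5] [10, 11, 14, 15]
```

The 4x4 image turns into a sequence of four elements, one per 2x2 patch, which is the same shape of input a sequence model expects from text tokens.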

So it can handle, it can host, all that information and embed it into the training structure and the model itself. So if we believe the future will be autonomous intelligence, meaning that the model itself can reason, understand, plan, and take actions, then we believe the model itself has got to be multimodal. And if we apply it to robotics, to autonomous robotics in the future, to general-purpose autonomous robotics, it will definitely be a multimodal model.

Well, that's a great segue to Ken.

Your work on open-ended learning and continual learning, and the point you made just a moment ago about temporal data, show that maybe there are more dimensions along which we're not approaching intelligence. Is that fair to say? I mean, do you think about multimodality and other ways of building intelligent systems, and do you think that requires completely different architectures and approaches?

Yeah, I mean, I think about it a lot, because these models are so powerful that it's just intriguing for me, from the point of view of a researcher, to think about what is missing. That's what I think a lot about: what kind of fundamental things are still missing? And, you know, there are not a lot of things that are very clearly missing, because you could say, well, as long as it's in the data, it's there somewhere and it'll eventually get picked up. So we have a big advantage with the amount of data that we have. But there are these very specific kinds of things that are intrinsically not in the data because of the way the data is presented. One of those is chronology: chronology is not in the data because the data is not presented chronologically. Another one like that is multimodality, of course. Multimodality is not in data which is only text, so that's clearly an opportunity, and we're no doubt going to see advances with multimodality.

Chronology is a little different, though, because you can't just put it in; it's not clear what that means exactly. You can't just put chronological data into something that doesn't process things chronologically. There is a little place where you can sneak chronology into these models, which is the context itself, the prompt; that's a place where things can have an order. But all of human history generally won't fit in the current kinds of prompt space, and probably won't for a long time, and the same goes for all of the internet, for that matter. So that's a problem, which is just an interesting research problem. So I think there are just a few of these things, like chronology and multimodality, that you can point to concretely, and then others are more wishy-washy, like hallucination, where we see problems but we can't really point to exactly the thing that's missing. What is the thing that's the problem?

You know, sometimes I've thought that maybe the hallucination problem is that understanding what you actually know and don't know is a nonverbal activity, which would mean it wouldn't be in the data. Think of the reasoning process I go through to decide whether I actually remember something, when I'm inside my head and not articulating it out loud, just trying to remember something someone is asking me about. There's some reasoning process where I come to a conclusion: I don't actually know that, or I do know that. Maybe it's nonverbal, and it's not in the data because the data is just what came out of your mouth, not what was implicitly in your mind before that. So maybe there's something missing there, but it's a little more amorphous to point to. It's just a generally interesting exercise, I think, to think about what's still missing, about places where we can press forward.

That makes me think that maybe there are some important insights to glean from cognitive science. If you think about experiments that show the way people think or reason, sometimes it's not verbal, not text. Do you think that's an important approach, just to continue what you were saying?

Maybe, maybe. I mean, I think historically that hasn't panned out that well. If you look at large language models and the successful side of where they are, which is quite remarkable, most of it is not a consequence of looking at cognitive science experiments, perhaps to the chagrin of cognitive scientists. But that doesn't mean it can't be helpful going forward. I would imagine it mostly as inspiration, though, because very implicit nonverbal reasoning, if it exists, is just so inaccessible. I would expect it to be something that emerges from the right kind of training rather than something you could extract explicitly and then write down, like, this is how it works. So I would imagine you would want to reorganize training in some way, maybe with reinforcement learning or something like that, so that we can elicit these kinds of nonverbal steps, which correspond to what we do when we're trying to determine whether something is true or not, or whether we remember it or not.

Hongjiang, do you have...

I know, I think, you have neuroscientists and cognitive scientists at the institute. What do you think we can learn from those fields? It's a strange time, because it feels like language models solve so much, but, yeah, maybe there are still things...

We definitely think we can learn from those fields, but the research is still ongoing and there's still a lot of work to do. To be honest, at this moment we haven't come to any significant conclusions that we can apply to building big models. On the other hand, in contrast, I don't know if you have read the recent work published by the OpenAI folks on using GPT-4 to analyze GPT-2, down to the point of what the function of a neuron in GPT-2 is, what it does when GPT-2 generates a particular context or output. That is very, very interesting. So I actually encourage those who are working on neuroscience to try to borrow some ideas from here. It's not just that we borrow ideas from neuroscientists; the other way around is also a very interesting research direction.

But coming back to your question: yes, at BAAI we have an extended group of scientists from Tsinghua University and from other universities in Beijing who are looking precisely at the problem of learning from neuroscience. We also have a small team building what we call life models, you know, simulations of human organs and simulations of brains, to help neuroscientists study, if a particular neuron gets activated, how the entire brain reacts to it. So we actually have a small team working on that, and we reported on it at the conference yesterday.

That's very cool. Well, talking about the complexity of the brain brings me to another subject, which is the size of these models and the amount of computing power required. It's extraordinary, right? I think we all know that, and that's one of the reasons why we're seeing such amazing results. But, Hongjiang, to stay with you, what do you think that means for research? Does it mean that it's going to become less accessible, less possible for as many people to work on these models? Do you think we'll maybe see more efforts to make smaller ones? What do the size of those models and the amount of data tell you about future directions?

Yeah, well, you actually raised quite a number of questions. One very straightforward one I would like to address: you mentioned researchers in academia and how they would react to this, because anything to do with big models requires a large amount of computing power. That simply requires them to work on systems and collaborate, both among themselves and with their industry partners. My experience with Tsinghua University is that if you count how many professors and researchers there are working on topics related to big models, many of them actually have quite a high number of GPUs, but the GPUs are scattered among different groups. So getting them together, pooling their resources, is the obvious solution if they want to work on bigger problems.

But I would also say that scientists and academics, especially those in universities, should work on, and tend to work on, basic issues, basic problems, and much of that can still be researched without a huge amount of computing power. If they want to build systems, they definitely need to collaborate among themselves and collaborate with industry.

But there is one thing I would look at from a positive angle. I think the breakthrough of GPT-4 made many of us in the research community, especially in academia, rethink what the best way is to conduct research in computer science, in AI. If we want to build a system, if we believe AI is a system and the problem can only be solved by a system approach, then we should pool our efforts, we should pool our resources, and we should really form the research problems into something that we can work on together.

Yeah.

I think you've done some impressive work in bringing together academics and industry so far.

And that requires a lot of effort; it really requires a lot of effort, yeah.

But what is the most challenging thing about that?

Well, I would first say that GPT-4's success actually helped us a lot. From now on I think it will be much easier, but two years ago it was much harder. You know, a fundamental characteristic of academia is freedom, right? Professors get to work on whatever they are interested in, and this is a good thing about academia. But when we want to put everybody from academia together to work on one problem, first of all, they naturally look at the problem from different angles: oh, I'll do this part, I'll do that part. Having them work on one thing, or even trying to segment a bigger problem into pieces and have each of them work on one piece, is just so hard, because that's not how academia operates.

But, you know, that actually says a lot about why the first success came from OpenAI: they took a system and engineering approach. Google Brain had even bigger resources, a bigger team, and more well-known scientists, but they couldn't put the effort together behind one model; they came out with many models. That's a showcase; in academia, in the universities, the situation is much more fragmented. So it really requires a lot of effort, conviction, the capability to motivate people, and definitely the capability to allocate resources in the right way.

Yeah, well, that's great. Ken, you were at OpenAI, and I'm guessing they had quite a lot of conviction around this singular sort of approach. But you've gone to academia, so what was your thinking? Why not stay at a place where you had huge amounts of resources?

You mean, you're saying that I've gone to academia after? I was actually an academic before OpenAI.

Okay.

He actually went in the opposite direction.

Okay, oh, I see. Was that what drew you in?

I mean, there's a long story, but it's a part of the story that I recognized I'd have access to vastly greater resources. That certainly entered into my mind, and it's a big problem, I think, for academia, that this pulls professors out of academia, which just can't provide the same resources. I think it's a somewhat irreconcilable problem. It's not like you can just get enough resources to match the unbelievable amount of industrial resources or money that we're thinking about in the future. This may not be happening yet, but when you talk about things like systems that might cost $100 billion, I mean, that hasn't happened yet, but if it does happen, I don't know what academia can do to match something like that. So it's a really interesting question.

But you have to remember that hypotheses in science can be either related to scale or not related to scale. So there certainly are hypotheses that can be addressed even if there's a $100 billion system sitting in some very wealthy company. But the ones related to scale, yeah, I don't know; we're talking about scale beyond imagination at that point, and it raises a lot of questions. Another thing to realize, though, which I think is also interesting, is that even within these companies they don't have infinite optionality. Imagine you have a $100 billion experiment: it's not like you can try 15 different times and test all your hypotheses. You're taking a bet and you're going with it. It's actually not very scientific; it's really based on gut, because you can't do the kind of systematic testing that, as an academic, I really believed in. That was what I was trained in: I'll try all the different parameterizations, I'll learn all the different angles on this, and understand how the system works. You can't afford to do that with a system like that. So it's not like they can just do all the experiments any academic would like to do, and even the researchers at these companies are very restricted in terms of their individual hypotheses and whether those are ever going to see the light of day. The scale is just unimaginable in the implications that it has.

But I still want to emphasize that it's not a reason for anyone to give up, because I still think there are many hypotheses that can be worked out at small scale, and some of those can disrupt the larger scale, for example a major architectural revision. It's not like we go back to a neural network with 20 connections or some tiny thing, but today's large model is tomorrow's old news. The things that take a lot of compute today won't five years from now, and they're enough to test some pretty significant hypotheses. A lot of people will have access to that within a couple of years, and those hypotheses could reinvent everything in such a way that even those really big models need to be rethought. So I wouldn't at all give up just because I don't have access to the fanciest thing. But it is an interesting dichotomy that's developing, one that didn't exist before.

Do you think there could be better collaboration or cross-pollination? I'm sure you're right that in some ways it's irreconcilable, but it would seem that industry could benefit a lot from the perspectives of academics, and academics could benefit from the compute, without completely derailing academic principles or giving up all the secrets. But when the GPT-4 paper has literally zero information, you kind of wonder if there are ways to have more collaboration.

Yeah, I mean, historically there has been collaboration; it's not like there has never been collaboration. But I think there's the cynical version and the optimistic version, and both are probably going to play out at the same time. Some people are going to think: well, we can just read their papers, we don't need to work with them; or we can just hire them, because we can offer them five times their current salary, so why bother with all this? There's a lot of complexity, at least in the US, to collaborating with academics, because you've got to go through the office of research, and then there's all this legalistic stuff like who owns what and where the IP is going. Some companies will just say, forget that, we'll just hire the person and take them with us if we really care, or wait for their paper to come out. But the optimistic version also exists, where it says there is a lot to gain and it's worth it to create this collaboration. I think the theory there is that this person is actually more comfortable in an academic setting, more free. As Hongjiang said, there's more freedom, and it may actually be in the interest of the commercial entity to have people collaborating who are freer to explore unusual directions. Maybe they want to encourage that to some extent because they do see it as in their interest, and so they go through the effort to actually make the collaboration work. I think you're just going to see opinions vary; both kinds of opinions will be expressed, maybe even in the same company sometimes, and we'll see a mix overall.

Yeah. Well, I'm curious, Hongjiang, have you seen benefits from having academics help or be more involved, even though they're still within academia? Specifically, when there are problems with models, things like hallucinations, and when people are worrying about alignment, it makes me wonder if there's actually going to be more good reason to have outsiders take a look, and for more people to have access to models.

I would say it's super important and critical to have researchers from academia involved in large model research. You mentioned alignment specifically, but it holds in all aspects of large model research. The fact is that today the model itself is still very costly. Let me give you an example: the model is very costly not just in training, but also in operating, in serving, in running the model. The training process is not necessarily the most efficient one either, and the model size could be optimized. If we find a way to abstract those issues into problems that academics are best at, then the academics will find what they can do well and make their contribution, and those contributions are critical. So the matter is that we need people from academia who are able to extract or abstract those problems, those issues, and work on them.

For instance, we worked with a few visiting scientists from various universities on training optimization, so training efficiency, you can say speeding up the training. We are also working on model size optimization: for the same performance, do we really need that model size? And we also work on new model architectures to make them modularized. Issues like that definitely require deeper research.

Even in building the model itself, we do see the benefit of researchers from universities. Actually, they feel it is beneficial as well, because otherwise they may not see that many research problems. I think research is definitely about solving problems, but in my view research is actually more about finding problems and defining problems than just solving them.

Okay.

Ken, you're nodding. Do you want to say anything more on that?

Yeah, I mean, I strongly agree with this point that academia is an essential and critical link in the chain. The ability of professors to explore with a different kind of freedom than exists in industry will expose opportunities that just won't happen in industry, and that's going to be essential to progress. So this is a vexing problem. I think that industry at the moment tends to suck people out of academia and hurt these departments and the entire academic enterprise, and it's worth a lot of thought from the university side about how to counteract this, because from their point of view it's unprecedented. Most academic fields don't have this kind of thing happening to them, where it's just so much more lucrative and better to go somewhere else; people don't have this kind of optionality. But this field is very different, and so universities, I think, need to treat professors in this area differently so that we can maintain that fabric, which trains the next generation and exposes these ideas that are going to be essential and that you're not going to get in industry. So both sides are totally important.

That's great, those are great thoughts. Well, we've talked a lot about language models and ChatGPT, but we haven't brought up safety, the kind of AGI existential safety. I wouldn't normally bring it up, but it's such a big topic of discussion, and it seems like a lot of people are taking both the short-term and the longer-term risks seriously. So, speaking of both long-term and short-term risk, I want to ask how that affects research. Is that going to become a huge new area of research? It seems likely that it will. And will it affect disclosure? So Ken, how do you think this is going to change things? I don't know where you sit on the spectrum of worry about AGI and AI.

Yeah, I think it's worth worrying, but I guess where I sit is somewhat in ambiguity; I'm not totally sure how worried to be. I think it's worth being worried, because there are a lot of things that could turn out to be very significant short-term and long-term threats. But a lot of people discuss this issue with certitude, and I think at this point in time that doesn't really make sense, and it actually makes it hard to disentangle what we don't know from what we do know. And it's mostly what we don't know, I think.

That said, clearly it's a field; people already say "I'm an AI safety researcher". But I don't think it's a separate discipline, in the sense that creating more intelligent machines is actually building AI safety, and this is a paradox that is very difficult for the field to solve. You could think of it as a simple dichotomy where there's safety research and then there's improving the model, and you can say, let's put the brakes on the models and just work on safety, and then we can make the models more powerful. But the problem is that making the models more powerful could be what makes them safer. After all, sanity itself is a really important aspect of safety, and sanity is a function of high-level intelligence.

This relates a little bit to points that are made in my book, the book that I wrote with Joel Lehman, where we talk about how often the things that lead to what you want don't actually look like what you want. If we just focus myopically on safety, on being safer and safer and safer, the problem is that the thing that really leads to a profound and fundamental shift in safety might not look like safety research right now. So other kinds of research need to be happening too, because that could be the stepping stone that leads to the real revolution in safety. We don't know, because we don't know what the future is, so we have to keep our options open, and it leaves us in a very awkward position.

Because obviously, at the same time as this kind of advance could lead to safer systems, it could also lead to more dangerous systems. It's just as possible that getting really powerful is actually extremely dangerous. So we're walking a tightrope on this, and I think we should just acknowledge that we don't actually understand the parameters of the dimensions around us as we try to walk the safety tightrope, and we should obviously be very careful moving forward because of that.

Wow, yeah, those are great points. Hongjiang, how do you think about safety and long-term risk, and how is the institute looking at short- and long-term risk?

Yeah.

Just for your information, at this AI conference, in the opening keynotes yesterday morning, we had two scientists with kind of opposite views. We had Max Tegmark from the Future of Life Institute and MIT, who, as you know, initiated the petition to pause AGI research for six months, which definitely raised awareness of the potential risk of AGI. And we had Yann LeCun, who said, oh, we are far from AGI at this moment; we still need to work on making AI models more intelligent; we are far from AGI, and these models cannot understand things at the level of human intelligence.

Also, today we have a one-day session on safety and alignment, and we actually started this session with Sam Altman, just an hour ago, addressing the audience. He is on a world tour on this particular topic, so we had a Q&A session with him on it. I think Max did the right thing and Sam did the right thing in raising awareness of the potential risk, and I really think he's doing humankind a service by making this tour, talking to various governments and various institutions.

On our research side at BAAI, I think nine months after we established BAAI we joined forces with academia in China and published the Beijing AI Principles, and safety was definitely an issue there. Also, after the petition that Max initiated, quite a few Chinese scientists signed on, including our very own director of BAAI, who runs BAAI's daily operations. He signed it as well; he has been part of the effort ever since Max held the first conference in 2017.

When we're looking at this issue of AGI, we definitely spend a lot of time thinking about ASI, that's for sure. When you think about ASI, safety then comes naturally after that. We have a team looking at data, we have a team working on algorithms that clean up the data, and we definitely have more effort going into alignment. But I tend to agree with Yann LeCun a little more than Max on this, in that we are still far away from human-level AGI, so we still spend a lot of time working on that. I like Kenny's point that a smarter model could actually make it safer. But I definitely support, and BAAI definitely supports, Max's effort to raise awareness, to build consensus, and to mobilize the community to look into this.

That's great, yeah.

I'd heard that some scientists there had signed Max's pledge. And I'm wondering, though: in the US and in the West, I think this issue has become headline news; it's become a big, big story. Is it a big topic in China generally, would you say?

Definitely, definitely. Among the AI circle for sure, in the media for sure, and on the government agenda as well, as far as I can read from the media. So that's why I said it's a good thing that Max and the community put in the effort to raise awareness.

Yeah, it's good to hear that, and I think that scientists are the ones who are probably going to have to take a leading role in that, I would imagine. Kenneth, I don't know if you have thoughts on this, but there's talk of regulation and of how things would work internationally, and how countries that are very big in AI would ever figure things out. Do you think there's a path to that, that scientists can offer some ways to align themselves, to use that word, around certain principles or whatever?

I mean, this is a really big question: the way to organize, not just scientifically but politically, in order to somehow route this giant oncoming thing in some good direction. I think you need scientists, of course, but one problem we have to accept is that just because you understand the AI doesn't mean you understand its social implications. And there's another problem, which is that just because you work in, say, the social sciences doesn't mean you understand the social implications of AI either.

So we have a problem that there's not really anybody who's truly the right authority, who can just tell us the ground truth about what we should do. This is leading to the very ambiguous and confusing situation that we have when we talk about regulation, because there isn't an ultimate backstop of authority figures who could just say, oh, of course, you just talk to us. We do have this tendency to go to the AI scientists, which I think we have to be cautious about, because they're not necessarily experts on the social implications of the scientific insights they've had. Of course, their insight into what the technology does is important, so they obviously have to be part of the conversation.

Overall, though, it seems like where things are heading is that regulation will probably hit the larger models most, so it won't be affecting smaller, academic types of research that much, is my guess. But at the high level of power, and Sam Altman has said similar things, that's where you'll probably see much tighter regulation: models that are actually threatening in some way to human stability. It's interesting that this is similar to the problem of academics not having access to the most powerful models to begin with. It almost doesn't matter that much, because they won't be able to do anything with those models anyway, since they are too powerful to get access to. So it's really going to be within these very refined circles. But it's very important nevertheless; it's extremely important, because these models are potentially dangerous.

So yeah, I think it's a group effort to find some kind of amalgam of people that we feel we can trust, because the biggest problem I have is that I don't know who to trust. I wouldn't even trust myself, and that's a big problem for actually getting a grip on this issue.

Well, yeah, those are great points. I want to finish on an upbeat, positive note, because obviously this is a really exciting, once-in-a-generation moment, I guess, at least for AI. We're pretty much out of time, but just briefly, Kenneth, what are you most excited about when it comes to the future of AI research?

Well.

I think I'd maybe just throw out two things. One is the amplification of human creativity. We're looking at a world where right now there are many things that you may want to do and have ideas about, but you cannot do because you don't have the skills or the talent, and this is about to change. That's really interesting in terms of empowering human creativity. I can't write the story that I have in my mind, or I can't paint the picture that I have in my mind, or I can't build the robot that I have in my mind, but suddenly the facilitator can come between my ideas and the actual implementation and make all these things reality. It's hard to even imagine that world, where I have the germ of an idea for a song and it becomes a fully produced, radio-quality song within five minutes. What is that going to be like? I think that's very empowering to people and has a lot of upside.

The other thing, which I think you don't hear as much, is that maybe this can help us to be better at finding ways to interact with each other, finding ways to connect with each other that are more virtuous than social media today, where obviously there are all kinds of issues with toxicity and so forth. The ability of machines to think a little more deeply, on our behalf, about what we really need and what's healthy for us, at a huge scale like a billion people, might be able to help us in some way to connect better with each other. I think that would be a nice antidote to the thing we're facing right now, which is that a huge proportion of our time is going to be spent talking to a cold machine. It'll be nice if it can actually help us to have more human contact too.

Those are great points, really good. Hongjiang, what are you excited about for the future?

I'm very much in agreement with Ken. I think after 60 or 70 years of research in AI, we have finally come to the point where we realize AI can really empower people. Ken mentioned empowering people to create things online. I'm also very excited that finally we can have AI that can empower robotics. Today's robots are so specific-task-oriented; they can only do one thing, inspect a component or pick up one thing, serving a very, very specific task. But with the advancement in large language models, and especially, in the future, in multimodal models, we can see that this will be completely rewritten. Even autonomous driving will be completely rewritten, with systems that have more autonomous ability and more planning ability. Today, even GPT-4 does not have those kinds of capabilities.

As a scientist who has been working hard for the last 30 years, trying to bring up something that can convince people that, oh yeah, finally the machines can do better, finally the machine can complete the task by itself, it's very exciting. But I do want to quote Ken's book: great success cannot be planned. When I raise money from government agencies and from industry, my funding agencies often ask me a question: in five years, what can you deliver? I always tell them that all I do is increase the probability of success in AI that will bring good things to society. I cannot promise you exactly what I can bring, but it is the probability that I'm going to increase.

Okay, well.

That's a great note to end on, and a good ad for Ken's book as well. I've really enjoyed this. Thank you very much, Hongjiang, for hosting this and for being in it, and thank you, Ken, for joining us and being a guest here. It was an excellent discussion.

Well, thank you very much for staying so late, and Ken, thank you for taking the time. Great to see you over Zoom, but hopefully we'll see each other in person, if not in Beijing, then maybe in San Francisco or Cambridge.

Sounds good. Great, all right, thank you, bye. Take care. OK.


The talk is titled "Trials of Developing OPT-175B". Let me briefly introduce Susan: she graduated from the mathematics department at Princeton University and has long focused on the design of large-scale AI infrastructure. She is also an author of OPT, and has more than ten years of experience in developing models and software systems. Now let's welcome Susan.

Hi everyone, thanks for having me. Should I just share my screen directly?

Yes, you can start the presentation now. After that I'll gather some questions from online and we'll have a short Q&A session after your talk.

Okay, I think I need to fix my settings, one second. Sorry.

We can show the slides, right? Yes, she is sharing her screen.

I'm trying to share, oh, I have to restart, one second, sorry.

Okay, no problem. Everyone, please wait a moment.

Okay. Can you see my screen? Is the screen good?

Hello everyone, everything's good now. Susan, please go ahead.

Okay, right. Hi everyone, thanks for having me. Today I'm talking about the trials of developing OPT-175B, a 175-billion-parameter model that was released last year, back in May.

Just a bit about me: I studied math many years ago at Princeton University. Afterwards I spent a significant portion of my career building large-scale distributed systems to support data processing workloads. I then moved into building reinforcement learning systems at OpenAI from 2018 to 2020, mainly focused on the Dota 2 and OpenAI Five project. Afterwards I briefly moved into the hardware space, doing photonic chip design at Luminous Computing. When that didn't really work out, I moved back into AI software in 2021 to develop large language models at Meta. So this talk is mostly focused on the early parts of developing LLMs at Meta, specifically the 175-billion-parameter model we trained back in 2021.

The setup here was a team of about five engineers. We were tasked with training this model in about three months using 1,024 80 GB A100 GPUs, which were the latest generation of GPUs from NVIDIA at the time. With the training efficiency of the code that we had, we still needed about 33 days of continuous training, assuming no hardware issues, no failures, nothing, in order to go through 300 billion tokens. We didn't really have an explicit infrastructure or systems team to support us outside of a customer support team from the cloud provider. The data was whatever we had available at the lab at the time, so it was a combination of a lot of the datasets that were used for RoBERTa and also for the BlenderBot work from the dialogue agents team.
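As a rough sanity check on the figures above (1,024 GPUs, about 33 days, 300 billion tokens), the sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule of thumb, the 175-billion parameter count plugged in, and the function name are assumptions for illustration, not something stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_tflops_per_gpu(params, tokens, gpus, days):
    """Sustained TFLOP/s each GPU must deliver to finish in `days`.

    Assumes total training compute ~= 6 * params * tokens (a rule of thumb
    that ignores overheads such as activation recomputation).
    """
    total_flops = 6 * params * tokens
    gpu_seconds = gpus * days * SECONDS_PER_DAY
    return total_flops / gpu_seconds / 1e12


if __name__ == "__main__":
    # Figures quoted in the talk; 175e9 is the OPT-175B parameter count.
    for gpus in (1024, 992):
        rate = implied_tflops_per_gpu(175e9, 300e9, gpus, 33)
        print(f"{gpus} GPUs -> about {rate:.0f} TFLOP/s per GPU")
```

Under these assumptions the implied sustained throughput is on the order of 100 TFLOP/s per GPU, which is only an estimate of what the quoted schedule requires.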

For the hyperparameter settings, we were familiar with a set of hyperparameters that circulated within the FAIR NLP groups at the time. It was very different from the settings that we saw from Microsoft and NVIDIA for their Megatron-Turing NLG work, and from what was published by OpenAI for GPT-3. So you'll notice that in the beginning of these runs we spend a lot of time trying to figure out how to bridge that gap.

In October we first started our training runs. The safest thing to do at the time was to go with the hyperparameters used for the existing language model setups within the NLP groups, since there was some empirical proof that they could work. The largest model trained up to that point was about a 13-billion-parameter dense model, so the hope was that these would transfer to the 175-billion scale without issues. Of course, that didn't really turn out to work. For the second run we increase weight decay from 0.01 to 0.1; that didn't really work out. For the next run we start shifting drastically towards the GPT-3 settings by setting a gradient norm clipping threshold of 1.0, reducing Adam beta2 from 0.98 to 0.95, and increasing Adam epsilon from 1e-8 to 1e-6, thinking that could help stabilize the runs.

It turns out none of these settings actually mattered. We realized afterwards that there was a bug in the code we used to implement tensor parallelism to scale to 175 billion parameters. We checked with a smaller-scale run at the time and noticed that it couldn't converge. So this is a very obvious lesson: we should have started very small before going to the largest run, but given the time crunch we short-circuited that, and it came back to bite us in the end.

While we were debugging this, we had to revert this codebase, causing the training runs to go a bit slower, and we figured we would at least check whether the hyperparameter settings would work without the really efficient tensor parallelism code. So here we go back to the old weight decay settings, thinking that could work fine, and that didn't really help. We start clipping gradients again and increase the weight decay; this is us trying to figure out how these settings interacted with one another at the 175-billion-parameter scale. We also increased warmup, thinking that could help increase the stability of the run. That wasn't enough.
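To make the two Adam changes mentioned above more concrete, here is a minimal sketch of what beta2 and epsilon control in a generic Adam-style update. It is not the team's code: the helper names are hypothetical, and momentum and bias correction are left out to keep the arithmetic visible.

```python
import math


def ema_horizon(beta):
    """Approximate number of recent steps an exponential moving average
    with decay `beta` effectively averages over."""
    return 1.0 / (1.0 - beta)


def adam_step_size(grad, second_moment, lr, eps):
    """Magnitude of a simplified Adam update: lr * |g| / (sqrt(v) + eps)."""
    return lr * abs(grad) / (math.sqrt(second_moment) + eps)


if __name__ == "__main__":
    # Lower beta2 -> the second-moment estimate forgets old gradients sooner.
    print(ema_horizon(0.98), ema_horizon(0.95))

    # A parameter whose gradients are tiny (g = 1e-7, so v = g**2 = 1e-14):
    # a larger epsilon shrinks its update instead of normalizing it to ~lr.
    for eps in (1e-8, 1e-6):
        print(eps, adam_step_size(1e-7, 1e-14, lr=1.0, eps=eps))
```

In this simplified picture, moving beta2 from 0.98 to 0.95 shortens the averaging window from roughly 50 steps to roughly 20, and raising epsilon damps updates for parameters whose gradient statistics are very small.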

By run six we had actually fixed our tensor parallel code, so we could train a bit faster, and we go back to the original settings but still keep gradient clipping, thinking that should be pretty safe to include. That still didn't work. For run seven we add weight decay again, increase warmup again, and also skip the last partial batch, in case computing the gradient on a smaller-than-normal batch was causing a bit of instability. We do some more fiddling around, and we also increase the batch size from 2 million to 4 million. None of these seemed to really help. Since we saw no noticeable improvement from the larger batch size, we decided to go with the lower batch size, the hope being that by taking more optimization steps we might converge to a better minimum.

By this point we thought we had settled on a few settings that we could all compromise on, and we were ready to launch our actual run. We were telling ourselves that by the 11th run we should start indexing with decimals, so this is run 11.0. For this we settled on a 2-million batch size. We keep our Adam states in FP32, the highest precision. We used tensor parallelism, sharding the model across 8 GPUs in parallel. We were doing data ablations; I won't go into detail about that here, but we were trying to figure out the ideal data composition. We also noticed some bugs: when we exported the dataset we had added a bunch of escape characters, which caused an artificially low loss while the model was just learning to memorize these extra escape characters, so we were trying to make sure the dataset was actually good to go.

We also used learned positional embeddings, but we weren't sure whether we wanted learned positional embeddings with a Gaussian initialization, similar to GPT-2, or a sinusoidal initialization, matching the original Transformer implementation. So we met in the middle by initializing these learned positional embeddings with a sinusoidal init. Similarly for weight decay, we weren't sure whether 0.01 or 0.1 was best, so we split the difference and used 0.05.
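For readers unfamiliar with the sinusoidal initialization mentioned above, the sketch below builds the table from the original Transformer formulation (sine on even dimensions, cosine on odd ones). In the setup described, such a table would only provide the starting values of a trainable embedding; the function name and the interleaved layout are assumptions, since the talk does not say how the team's codebase arranges the values.

```python
import math


def sinusoidal_table(num_positions, dim):
    """Return a num_positions x dim table of sinusoidal position encodings.

    Even columns hold sin(pos / 10000**(i / dim)), odd columns the matching
    cosine, where i is the even column index.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    init = sinusoidal_table(num_positions=4, dim=8)
    for row in init:
        print([round(value, 3) for value in row])
```

All entries lie in [-1, 1], which is a much larger scale than a small-standard-deviation Gaussian initialization would give.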

We also use a pretty high learning rate to start. Historically, in the implementation we had at FAIR, learning rates were usually set a lot higher than what was published externally. For GPT-3 the learning rate was 6e-5, I believe, and we used 3e-4, so quite a few times higher. We also don't apply dropout on embeddings, and we include NormFormer, which was some work that had come out of the lab a couple of months earlier. The thinking there was that adding a bunch of layer norms can help stabilize the run, as it did in some other settings.


Right at the beginning, this was already starting to look pretty unstable in the first few hundred steps; that's the first green run you can see at the top. We thought maybe the learning rate was just too high, so the first thing we do is cut the learning rate from 3e-4 to 7.5e-5, that is, quarter it. That didn't last very long; that's the yellow line up top in the loss curve. At this point we lower the gradient clipping threshold from 2.5 to 1.5, so now we clip more frequently, especially in the beginning, and that's the purple line above. This keeps training for quite some time; for us at this point, getting past a few hundred steps was a blessing. This is when we hit our first actual hardware issue, an uncorrectable ECC error, so we get rid of the machine and restart the run, and that's the gray line in the middle. We also noticed that we were validating too frequently, which was causing a 10% to 20% overhead, so we reduced the frequency of validation. While that gives us less visibility into the health of the run, we thought the speedup would be worth it. Over the next few runs the gradient norm starts spiking, for the green line, so we lower the clipping threshold again, from 1.5 to 1.0, and continue training from there.

After that point is when things start going really, really badly. This is the zoomed-in portion of what we were just looking at. You can see there are a ton of restarts in the middle, from us trying to figure out how to get this run back to where the loss could hopefully continue going down, but we don't make it very far. First, we tried remediating the instability by skipping batches when the gradient norm was too high, instead of just clipping; we thought maybe getting some new data into the mix would help stabilize the run. That didn't work out very well, so we roll back a bit further, and you see the next restart begins at an earlier checkpoint. Here we finally move towards changing some of the other hyperparameters: increasing weight decay, thinking that can help regularize the updates, and lowering beta2 so that we're averaging over fewer steps, thinking that maybe we can adapt to the gradients more quickly. That didn't last very long; that's the first pink wiggle there. For the next bit we keep beta2 at 0.95 but go back to figure out whether the weight decay change or the gradient clipping change was something we needed.

So you see this overarching theme of changing a few things at a time and maybe then bisecting. This is mostly motivated by the fact that each time we do this it costs a lot of manual overhead and it costs time, so we try to batch together a lot of the changes we think are safe. Of course it makes experimentation hard; you can't really disentangle the effect of one versus the other. But this is the cost of only having one chance to train this model and having to change a lot of these hyperparameters together.
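The two remedies discussed above, clipping the gradient norm and skipping a batch whose gradient norm is too high, can be sketched as follows. This is a generic illustration over a flat list of gradient values, not the team's implementation; the function names and the `skip_threshold` parameter are hypothetical.

```python
import math


def global_norm(grads):
    """L2 norm of all gradient values taken together."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their global norm is at most `max_norm`."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def should_skip_batch(grads, skip_threshold):
    """Alternative remedy: drop the update entirely when the norm spikes."""
    return global_norm(grads) > skip_threshold


if __name__ == "__main__":
    grads = [3.0, 4.0]  # global norm is 5.0
    for threshold in (2.5, 1.5, 1.0):  # thresholds quoted in the talk
        clipped = clip_by_global_norm(grads, threshold)
        print(threshold, clipped, global_norm(clipped))
    print("skip batch:", should_skip_batch(grads, skip_threshold=4.0))
```

Clipping keeps the direction of the update and only shrinks its length, whereas skipping discards the batch and moves on to new data, which is the difference the speaker describes.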

That still didn't work out too well for us. The next thing we do is finally go into this mode of lowering the learning rate. Obviously we can't do that indefinitely, and it still doesn't last very long; that's the orange line here.

At this point we're also watching what was happening in the open source community. The BigScience effort was starting up and they were trying to train a 100-billion-parameter model, and there was a pull request where they mentioned a potential numerical stability issue with the MHA calculation, the multi-head attention calculation. Specifically, it's a very simple thing: when you're multiplying by a factor of n, split it into two multiplications by the square root of n. In some cases, for large values of n and especially for large models, that could help improve stability. So we implemented this change and ended up restarting our run. We also noticed that the GELU activation has this x-cubed term, which could also cause instabilities for certain layers, so we just swap in ReLU instead so that we don't have to deal with the x-cubed factor. This trains a bit longer; this is the pink run. So clearly we can make it past some instabilities here.
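The "split the factor into two square roots" idea can be illustrated with a toy half-precision dot product. The sketch below emulates FP16 rounding with Python's `struct` half-float format and compares scaling an attention-style score after the dot product with scaling both operands by the square root of the factor beforehand. Where exactly the referenced pull request applies the split is not stated in the talk, so the placement here, the helper names, and the toy values are all assumptions.

```python
import math
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def fp16_dot(a, b):
    """Dot product with every multiply and add rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def scale_after(q, k, scale):
    """Compute (q . k) * scale: the unscaled dot product may overflow first."""
    return to_fp16(fp16_dot(q, k) * scale)


def scale_split(q, k, scale):
    """Compute (q * sqrt(scale)) . (k * sqrt(scale)) to keep values smaller."""
    root = to_fp16(math.sqrt(scale))
    q_scaled = [to_fp16(x * root) for x in q]
    k_scaled = [to_fp16(y * root) for y in k]
    return fp16_dot(q_scaled, k_scaled)


if __name__ == "__main__":
    head_dim = 128
    q = [64.0] * head_dim
    k = [64.0] * head_dim
    scale = 1.0 / math.sqrt(head_dim)
    print("scale after dot :", scale_after(q, k, scale))
    print("scale split     :", scale_split(q, k, scale))
```

With these toy values the unscaled dot product exceeds the FP16 maximum of 65504 and becomes infinite before the scale is applied, while the split version stays finite at the cost of some rounding error.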

But the problem was that it didn't seem like the run was actually going anywhere; it just kind of plateaus. On the side we were also doing some other ablations with initialization, thinking that maybe this gradient explosion or vanishing issue, whatever it was that we were dealing with, was a result of not initializing properly for such a large model. But none of these ablations turned out to be meaningful in any way.

So one thing to know here is that when we were training this run back in November 2021, even though we were using A100s, which had BF16, we were most familiar with FP16 training with no mixed precision. In order to make these runs converge, there is this extra factor that we call a loss scaler that is usually implemented. The thinking here is that when you're training with FP16 and no mixed precision, so you're not keeping any copy of the weights in FP32 or above, we use this loss scaling term to try to preserve small gradient values. When we haven't overflowed in a while, we scale the loss up so that you can surface the signal from small gradient values, and when we start overflowing, we scale the loss down. Usually what we see is that when this loss scaler crashes to near zero, effectively causing your loss and your gradients to become zero, your model stops updating. So that's also a very unhealthy signal, and when that happens, training kind of stops. For the pink run here, where things just never converged, you'll see that near the end the loss scale was very low, so we weren't really getting any kind of meaningful gradients flowing through.
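The loss scaling behaviour described here can be sketched roughly as follows. This is a toy illustration, not the scaler used in the run: the class name `DynamicLossScaler`, the growth interval, the factor of 2 and the floor value are all assumptions, and overflow is detected simply by checking whether the gradient norm is infinite or NaN.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler for FP16 training (illustrative only)."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=4,
                 factor=2.0, min_scale=2.0 ** -14):
        self.scale = init_scale
        self.growth_interval = growth_interval  # clean steps before growing
        self.factor = factor
        self.min_scale = min_scale
        self.steps_since_overflow = 0

    def update(self, grad_norm: float) -> bool:
        """Return True if the optimizer step should be applied."""
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            # Overflow: shrink the scale and skip this step.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.steps_since_overflow = 0
            return False
        self.steps_since_overflow += 1
        if self.steps_since_overflow >= self.growth_interval:
            # No overflow for a while: grow the scale again.
            self.scale *= self.factor
            self.steps_since_overflow = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
history = []
for norm in [1.0, 1.0, float("inf"), float("inf"), 1.0]:
    applied = scaler.update(norm)
    history.append((applied, scaler.scale))
print(history)
```

If overflows keep arriving faster than the scale can recover, the scale keeps halving, which is the "crash towards zero" pattern the talk treats as an unhealthy signal.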

So at this point we were looking at the clock. To recap, we know we need at least 33 days of training for a 175 billion parameter model on 300 billion tokens using 992 GPUs. Notice that we changed from 1024 to 992; this is after seeing a lot of hardware issues come up. We have to swap in GPUs when GPUs go down, so in order to have minimal downtime we have to train on a smaller subset of machines and leave a pool idle so that we can swap those in.
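As a rough sanity check on the 33-day figure, here is a back-of-the-envelope sketch. It assumes the model in question is the 175 billion parameter run and uses the common approximation of about 6 floating-point operations per parameter per token for a forward and backward pass; that approximation and the function name are assumptions, not something stated in the talk.

```python
def implied_flops_per_gpu(params, tokens, gpus, days, flops_per_param_token=6.0):
    """Sustained FLOP/s each GPU must deliver to finish in the given time."""
    total_flops = flops_per_param_token * params * tokens
    seconds = days * 24 * 60 * 60
    return total_flops / (gpus * seconds)


# Figures quoted in the talk: 300B tokens, 992 GPUs, 33 days.
per_gpu = implied_flops_per_gpu(params=175e9, tokens=300e9, gpus=992, days=33)
per_gpu_tflops = per_gpu / 1e12
print(f"{per_gpu_tflops:.0f} TFLOP/s per GPU")
```

Under these assumptions the schedule implies a sustained throughput on the order of 110 TFLOP/s per GPU with no downtime at all, which gives a sense of how little slack there was for restarts.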

We also knew that we had to benchmark the model before the end of the year, so we really didn't have much time to explore all the hyperparameter settings and try to get things to work. Looking at all the restarts from the previous lineage of experiments, there wasn't any strong signal that these settings would actually work out if we were to keep training with them. So at this point we decided to do a complete, drastic switch to the GPT-3 and Megatron codebase settings, since the two already seemed relatively consistent with one another and there was some evidence that these settings can successfully train models, even though we weren't using exactly the same codebase.

So specifically here, we updated our weight initialization. Overall the change was to reduce the standard deviation of all the weights, so that effectively you're initializing closer and closer to 0. We also removed a lot of extra layer norms from the NormFormer setup, so this now pretty much exactly mirrors what we think the GPT-3 architecture looks like. We removed embedding scaling; this was a term to scale the embedding to a Gaussian with a standard deviation of 1, which may be too high if you notice that in the GPT-2 codebase the standard deviations are 0.01 and 0.02, so still pretty small. We also went back to Gaussian initialization for the positional embeddings. We did a series of ablations separately and noticed that if we initialize with sinusoidal init for positional embeddings, they weren't actually being updated in any meaningful way, so we might as well just go back to the Gaussian initialization.

Weight decay: we finally go to 0.1 and just stay there. Gradient clipping, same thing: we stay at 1.0. Adam beta2 is set to 0.95. These are all pretty standard now; if you look at a lot of the work that's coming out, most folks do use an Adam beta2 of 0.95 with gradient clipping of 1.0, but at the time it wasn't clear that this was the go-to setup for these larger models.

And the learning rate here, to recap: we initially started with a learning rate of 3e-4 in the previous lineage of runs, and now we just do 2x the GPT-3 learning rate, which is 1.2e-4.
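The settings quoted in this part of the talk can be collected into a small configuration sketch. The class and field names are hypothetical and do not correspond to any particular codebase; the base learning rate of 0.6e-4 is not stated directly and is inferred from "2x the GPT-3 learning rate, which is 1.2e-4".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingHyperparams:
    """Hyperparameters quoted in the talk (field names are illustrative)."""

    weight_decay: float = 0.1
    grad_clip_norm: float = 1.0
    adam_beta2: float = 0.95
    base_lr: float = 0.6e-4      # assumption: inferred from 2x = 1.2e-4
    lr_multiplier: float = 2.0

    @property
    def learning_rate(self) -> float:
        # Peak learning rate after applying the 2x multiplier.
        return self.base_lr * self.lr_multiplier


hp = TrainingHyperparams()
print(hp.learning_rate)
```

Keeping the multiplier separate from the base value makes it easy to see later which restarts changed the learning rate relative to the reference setting.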

And so here for the first 15 restarts, and you'll notice that we're now indexing with two decimal places instead of one because we were prepared for up to 100 restarts, we actually just faced a lot of systems issues. This is when A100s were just coming online, these 80 GB machines, and so we faced a lot of lost GPU errors, CUDA errors, cases where the job would just stop updating, NCCL errors, things just slowing down, et cetera. While this was not a convergence issue, and we can generally work around it by figuring out which machine is bad, a lot of this tooling didn't exist for us at the time since this was new hardware. Of course we couldn't blame the hardware the whole time; we had some code issues ourselves as well. When we were doing checkpoint storage, in some cases that wasn't operating successfully or was taking too long given the kind of compute environment we were in. We also noticed that our loss scaling logic wasn't fully deterministic, so when we checkpoint we lose the state of what the loss scale was at the time.

And this is specifically for the FP16 training once again. Actually, this turned out to be a blessing in disguise: we ended up using this kind of nondeterminism later on to get through some instabilities as well.

So overall, when we see instabilities, these are the four metrics we look at. At the top left here is the last-layer L2 activation norm, and when that spikes we also see, to the right, the gradient norm for the overall model spike as well; the two are generally quite correlated. Of course we look at the loss curve or perplexity, and for FP16 training we also look at the loss scaler: when that crashes to zero, that's when everything else, activation norms and gradient norms, spikes. So these are all leading indicators of potential divergence, which doesn't necessarily get reflected in the loss until a little bit later.

So when we hit some instabilities here, the first thing we do is reduce clipping, so we go back to a clipping of 0.3. If you look at this previous plot, most of the gradient norms are around 0.2, so we decided to be pretty drastic and do a lot of clipping in case gradient spikes can lead to a kind of compounding effect.
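To make the effect of the 0.3 threshold concrete, here is a minimal sketch of clipping by global gradient norm on a flat list of numbers. The function name and example gradients are assumptions for illustration; a real training loop would apply the same rescaling across all parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    coef = max_norm / total_norm
    return [g * coef for g in grads], total_norm


typical = [0.12, -0.16]   # norm of about 0.2, like most steps in the plot
spike = [3.0, -4.0]       # norm 5.0, a hypothetical gradient spike

clipped_typical, _ = clip_by_global_norm(typical, max_norm=0.3)
clipped_spike, spike_norm = clip_by_global_norm(spike, max_norm=0.3)
print(clipped_typical, clipped_spike, spike_norm)
```

With typical norms around 0.2, a threshold of 0.3 leaves ordinary steps untouched and only rescales the outliers, whereas a threshold of 1.0 would still let through spikes several times larger than a normal step.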

And we had a backup plan of pretty much resetting the Adam state, in case that was also causing issues by mismatching with the current batch, and then repopulating the Adam states. We later explored why we would want to do this in a more recent work that came out back in April of this year, where we studied this up to a 546 billion parameter model in order to remediate instabilities.

And for the next 17 or so restarts, we go back to the whole hardware issues: ECC errors, lost GPUs, even high counts of errors that, even though they're not uncorrectable errors, can still cause issues with the kind of compute environment we had. There were also times when the job would just stop updating, so we had to manually trigger some restarts and so on.

So this is when we do something quite drastic. The thinking here was that all these instabilities were potentially linked to having our optimizer in a bad state, or maybe Adam, in the way that we had it, just for some reason wasn't scaling well. So we decided to test something quite drastic; it's kind of a bad idea, but we started switching to SGD without momentum to try to get rid of Adam entirely. First we tried to be clever by approximating SGD. We can do that by pretty much running Adam but setting beta1 to 0 and increasing epsilon to 100, pretty much wiping out the second-moment term, and also increasing the learning rate by the same factor. In all these runs it turned out that it wasn't working the way we thought it was: with the way we were reloading Adam states, the hyperparameters were not being reset, so these experiments didn't actually yield anything meaningful. So we did the honest thing of actually swapping the code entirely, and we used SGD. But in this implementation of SGD we also had our own bug of not actually applying weight decay, and we also weren't tuning the learning rate properly. There were some findings that came out a couple of years back that for SGD you might need a higher learning rate; we weren't really tuning that properly, so we didn't really see any useful signal.
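The "approximate SGD with Adam" trick described above can be checked numerically with a scalar sketch. This is a textbook Adam update written out by hand, not the optimizer code from the run; the function names, the gradient values and the base learning rate are assumptions. With beta1 set to 0 the first moment is just the gradient, and with epsilon at 100 the denominator is dominated by epsilon, so multiplying the learning rate by 100 roughly recovers a plain SGD step.

```python
import math


def adam_step(param, grad, state, lr, beta1, beta2, eps):
    """One scalar Adam update with bias correction (no weight decay)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def sgd_step(param, grad, lr):
    """One plain SGD update without momentum."""
    return param - lr * grad


sgd_lr = 1e-3                      # hypothetical SGD learning rate
eps = 100.0                        # large epsilon swamps sqrt(v_hat)
grads = [0.2, -0.1, 0.15]          # hypothetical gradient sequence

p_adam, p_sgd = 1.0, 1.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for g in grads:
    p_adam = adam_step(p_adam, g, state, lr=sgd_lr * eps,
                       beta1=0.0, beta2=0.95, eps=eps)
    p_sgd = sgd_step(p_sgd, g, sgd_lr)
print(p_adam, p_sgd)
```

The two trajectories agree to within a fraction of a percent of each step, because each Adam update is the SGD update scaled by eps / (sqrt(v_hat) + eps). The approximation only holds if the new hyperparameters actually take effect, which is the part the talk says went wrong when optimizer state was reloaded.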

So by this point, after all this fiddling, the thinking was that we probably shouldn't try to be clever anymore and should just do the honest thing of lowering the learning rate, so we proceeded to do that. The final learning rate curve here, if you squint, looks quite like inverse square root. We go with a linear decay, to just linearly interpolate our way down. The thinking is that you still want to keep the learning rate as high as possible without causing instability, but we didn't know exactly what that value was at the time. You can see there's this brief moment in the middle where we increase the learning rate. The thinking there was that maybe we had been too aggressive in lowering it, things looked too stable for once, and maybe just increasing it back could help us converge faster. That didn't last very long, and we still had to reduce it two more times afterwards to keep it training.

So overall, when we look at the health of these runs, what we stare at is a TensorBoard reflecting these metrics. The main plot up top here is the loss or perplexity curve, and it's very noisy, right? Even if we add smoothing, it's very hard to tell what direction the loss is going. You can vaguely see near the end that maybe things are diverging, but that's only about a few hundred steps, and given the noise it's still not definitively clear that's the case. But if you look at the other metrics, it's very obvious. For gradient norms, and this is the bottom row here, you see things spiking; for your loss scaler, things are crashing, kind of inversely related to gradient norms going up. Similarly for the activation norms: usually we see them decay over the course of training, and when they reverse direction like that and start increasing, it's a leading indicator that something is going wrong. In this case specifically, and this is still a zoomed-out version of what we were looking at earlier, just by lowering the learning rate and restarting from a checkpoint you can change the direction of where all these metrics are going. The top left here is the loss curve: the green line is restarting with a lower learning rate, and the loss keeps going down, so it's very clear that there's a directional change there. Similarly for gradient norms, the second plot in the middle of the top row here: instead of spiking suddenly as before, it keeps stabilizing and training around, like, 0.14. So you can see that a lot of metrics can shift just by fiddling with the learning rate.
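The idea of watching gradient norms and the loss scale as leading indicators could be automated along these lines. This is only a sketch under stated assumptions: the function name, window size, spike factor and loss-scale floor are all hypothetical, and the metric values are made up to resemble the pattern described (norms near 0.2, then a spike while the loss scale collapses).

```python
def divergence_warnings(grad_norms, loss_scales, window=3,
                        spike_factor=3.0, min_loss_scale=1.0):
    """Flag steps where gradient norms spike or the loss scale collapses."""
    warnings = []
    for step in range(window, len(grad_norms)):
        # Compare against the mean of the preceding `window` steps.
        baseline = sum(grad_norms[step - window:step]) / window
        if grad_norms[step] > spike_factor * baseline:
            warnings.append((step, "grad_norm_spike"))
        if loss_scales[step] < min_loss_scale:
            warnings.append((step, "loss_scale_collapse"))
    return warnings


grad_norms = [0.20, 0.21, 0.19, 0.20, 0.95, 1.40]
loss_scales = [128.0, 128.0, 128.0, 128.0, 4.0, 0.25]
alerts = divergence_warnings(grad_norms, loss_scales)
print(alerts)
```

A check like this fires on the auxiliary metrics well before a noisy loss curve makes the divergence obvious, which is the point the talk makes about restarting early from a checkpoint with a lower learning rate.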

So after 56 days of dealing with all these issues, hardware issues and these instabilities, we finally end up with a loss curve that looks something like this over the course of 300 billion tokens. It's very colorful; there were many, many restarts, and we're even missing some logs in the middle from when our entire cluster went away. But this kind of summarizes how manual this process was to start with, before we eventually started automating a lot of this work later on.

So I can stop here and take some questions. I also have some more slides for the 66 billion run; I don't know if that's helpful to share, but I can go into that as well if that's useful for folks to look at. It looks very similar to the 175 billion run.


For everyone online: if you have any questions, please type them in the comments section and I will pass them on to the speaker. Actually, Susan, I have a question.

Do you think that scaling laws still hold for your model during the research?

Scaling laws in which sense? Like for scaling up model parameters and data size, or just in general, whether scaling laws are useful?

Like for this research specifically.

Yeah, so in this case we were mostly focused on cluster stability and hardware issues, and on making sure that a lot of the tooling could capture all of the manual operations we were doing to restart the run and whatnot. So it wasn't so much scaling laws as it was the infrastructure side that we needed to build out, and a lot of the automation we had to build to, say, detect when things were diverging and how to remediate that.

So it's not quite the same thing, but it is definitely valuable to know what the optimal model size and dataset size to go for are, given a certain amount of compute.

Okay, I've gathered two questions online. The first is: any ideas about training a 1000 billion parameter language model?

Or like what that would look like, a 1 trillion parameter language model?

Yes, the parameters, yeah.

Yeah, so in that sense, right before this work happened, FAIR actually released a 1 trillion parameter mixture-of-experts model. That's not a dense model, right, but the parameter count matches what you're asking for. I do think that scaling to a 1 trillion parameter dense model could be quite interesting, but it also depends on how we want to spend compute, right? If we can feed in enough data, then it might make sense to do, but a lot of the datasets that we're looking at may not be sufficient for that.

Another question is: why choose to use NormFormer? For scaling to deep models, or for training stability?

Yeah, the thinking at the time was that the NormFormer work showed improved convergence and potentially added the ability to stabilize exploding gradients through the model. That was very exploratory, because it had been demonstrated at very small scales. When we started we were trying that out, but we eventually did remove it; it didn't seem like we wanted to take the chance of seeing whether NormFormer was actually useful at 175 billion parameters, in case it was actually causing issues.

Okay, what framework do you use in data processing and model training?

Yeah, so for data processing, at the time there wasn't really a defined framework. This was using existing datasets that had been processed by other teams for previous works. We started building out a cluster running Spark jobs; it's very standard CPU-intensive workloads there, but for us it was very ad hoc: for every new dataset we brought online, we would have custom logic to process it. For the training side, the code is open sourced: we use metaseq, under facebookresearch, to train this model. It's a fork of fairseq, which was a very common sequence modeling framework that folks were using within the lab, but we specifically forked the codebase to focus on decoder-only autoregressive dense transformers and to make sure it can scale to the kind of large-scale workloads we cared about.

Okay, there's a follow-up question: in the IFT period, do we have to use the open-source framework of the original author, or should we transfer to a more mature framework like DeepSpeed-Chat?

What's the IFT period? I'm not too familiar with that.

I'm not sure either, though it's like: should we use the open-source framework, or should we transfer to a more mature framework like DeepSpeed-Chat?

Okay, yeah, I'm not as familiar with DeepSpeed-Chat. I think the question here is what is already running on your hardware, your infrastructure and your systems, because a lot of the detail here is not so much which library you use but the integration with your actual compute environment. We had a lot of glue code there, like where you load data from and how you're doing your checkpoint storage; that's most of the complexity in adapting any codebase. So if you are able to swap out these frameworks, that's great, but a lot of the time we spend quite a bit of overhead making sure this code can work in any kind of compute environment.

Another question is: what do you think is the main cause of the difference in results between OPT and the BLOOM model?

Oh yeah, my understanding was that BLOOM is very multilingual, so if you were to focus on English-only results, I think BLOOM just did not see enough English data. So if you were to just look at English NLP benchmarks you'll see a delta, and I don't think that's really meaningful, given that a different training set was used for BLOOM.

Oh, there's a question asking: is OPT able to surpass the performance of the LLaMA model? Actually, the author of the LLaMA model came to our community to give a speech just a few...

Yeah, yeah, if you added more data for the OPT models. I mean, there's nothing fundamentally that different between the two, and I don't think the architecture differences are meaningful at all. Really, we just didn't have enough data, right? We had to train this model in a very finite amount of time, and only 180 billion tokens were available to us at the time. So if we 10x that dataset, I don't think there are going to be any issues with benchmarks. This also comes with a huge caveat that the benchmarks are English only; we did not filter our corpus for that either, since we were trying to see if there was multilingual behavior that we could capture with very little data.

Right. Any latest progress in your team related to enhancing the reasoning ability of smaller LLMs, or do we still not understand reasoning?

Yeah, I can't really comment on that either; there's a lot of internal work going on.

Okay. How would you compare the OPT and LLaMA models? Do you have any future plans to improve OPT? That's a pretty general question.

Yeah, so like I mentioned before, the main difference is data, right? We also worked on the data for LLaMA, and if you look at the details of the LLaMA paper they do mention that. The main thing we're focused on, and you can also see this from the Adam instability work, is that we are scaling to larger models, with the 546 billion parameter run. In that regard, if you feed it as much data as LLaMA or more, we can probably see a difference there.

Do you use any extrapolation techniques in OPT model training? If yes, do you think it's useful in improving context length?

Oh, extrapolation techniques. Is that like taking a pretrained model, increasing the context length, and then fine-tuning on a longer context window? I'm not quite sure what extrapolation means; it could mean something else too.

I'm sorry, I'm not sure about the details. I think the question is: do you use any extrapolation techniques in OPT model training, like... how to read this... ALiBi?

Oh, I see, yeah, ALiBi. We did not use ALiBi; we used learned absolute positional embeddings, without any ALiBi.

Do we have more general questions, instead of those technical details? Actually, as a member of BAAI, I have a question concerning the research mechanism. I wonder what it's like working in a research team at Meta AI, and how do you propose subjects to research? Or is it very flexible?

The research mechanism, I think... oh, like how we choose our research direction and how much flexibility we have within Meta for that?

I mean, what's it like doing research at an AI lab?

Yeah, so it's changed just in the short time I've been there. There is a subset of FAIR that is very open-ended research: you have a lot of flexibility in where you choose to take things, and there's a shared pool of compute that you can use. Then there's a portion of the lab that is very focused on specific projects. One of the examples is a lot of the BlenderBot work, which was done by a very specific team; similarly with recent releases like Cicero back in November, where a specific team worked on just solving the game of Diplomacy with a combination of RL and language modeling. So there are also some focused efforts where you can form these very specific project teams and just go towards that. But that's also slowly shifting, as there's a lot more focus on generative AI these days, so more resources are going into generative AI research as well.

There's one more technical question: do we have any other optimizers besides Adam to speed up or stabilize training?

Yeah, there have been a lot of efforts in trying this. I have yet to see much that can scale well past a few billion parameters where the gains are actually meaningful. I'm sure there is still work to be done in this space to improve this for large-scale training, but I just personally haven't been able to get many to work above, say, 30 billion parameters.

How does Meta think of LoRA for fine-tuning LLMs?

It is definitely useful, and there are many folks exploring applying LoRA to fine-tuning.

What's the purpose of training OPT? Where do you think the emergent abilities of LLMs come from? Actually, I want to ask about that too: what do you think about the emergent abilities of the model?

Yeah, so you have to remember the context here, back in August and September 2021, when nobody was releasing large models. There was this kind of aura of "it is risky", especially coming out of GPT-2, when they did a staged release just for a 1 billion parameter model. So a lot of the focus for OPT was a combination of testing out new hardware, making sure we could set ourselves up for more scaling efforts going forward and make things like LLaMA almost trivial to work on, as well as being able to open source this so that people can start prototyping at larger scales, making these larger models more inference efficient and so on. There's a lineage of work that came out of that, right: SparseGPT and others. These are all different works testing whether you can make these large models have the same performance as they had at full precision, but at, say, lower precision or with more sparsity, and seeing if that can actually run on commodity hardware.

So the release of OPT, especially with all the smaller models from 125 million parameters up to 66 billion, and then obviously the 175 billion parameter one, is to enable people to study scaling laws, seeing if we can actually scale up and transfer methodologies from the smaller scale to a larger scale where they'll work just as well. It also obviously showcases the complexity of actually operating at this scale on new hardware, by releasing a logbook showing the excruciating detail it takes to do this for the first time on new compute environments, and also by releasing the codebase as it was used for training these models. I think Megatron-Turing in theory sort of did that, but it was very hard to get that running anywhere, and it wasn't clear if that was exactly the codebase that was used to train their largest models, whereas we literally just open sourced the codebase that we used. This is one of those things where you see a lot of papers publish details about training and whatnot; for a lot of folks it's not intentional if they left out details, it's just that there are so many implementation details included in code that may not be captured in papers, and we just wanted to make sure that everything was out there.

Oh, and then I think there was another question, on emergence, right? That's the other part, where I think that's the bitter lesson of scale no matter what. If you have, say, unbounded data, and you have the compute to train a much larger model on the same amount of data as you would a smaller model, then for capabilities to emerge, whatever those capabilities look like, purely from the model having more capacity to model any kind of underlying pattern in the data, you will see more, quote-unquote, emergence in that sense. But a lot of that is also, I think, contingent on what you're using to measure it, and on whether that emergence is actually some kind of hockey stick or a smooth curve. So I think the ruler by which we're claiming there's emergence is also kind of an ill-defined term.

Okay, on the future technical path of large language models: do you think it's decoder-only, or rather encoder-decoder?

This is tricky. From a lot of discussions with researchers in the space, if you were to fiddle with the number of encoder layers and decoder layers, in the limit you see that if you reduce the number of encoder layers you actually get better performance in the generative case. Now of course I think that's subject to what you mean by performance. If you're trying to adapt this model, fine-tune it and shift the domain distribution of the pretrained model to some new data distribution, maybe you do need the encoder to help with that sort of tunability of the model, which is very different from, say, training the most capable model. So in the final use case it's also very unclear whether there's a material difference in architecture that results in better results as a function of the kind of data that you have. From all of our experimentation, for ease of scalability, decoder-only definitely scales much better on the hardware that we have, but I do see a world in which you can probably make encoder-decoder work just as well.

But I think in that sense, scale matters more than, say, the specific details of architecture here.

So decoder-only and encoder-decoder, they excel in different specific tasks, right?

From what I've seen so far, yes, and maybe that's also arbitrary, because we only have very few tasks that we look at here too.

Okay. Yes, we have a general question: any experience to share on developing and managing such a huge project, for example, how to set benchmarks and deadlines?

Yeah, so overall there was this tradeoff that you see between the research risks that you can take, trying out new things for the first time at scale, versus the engineering risks you can take in pretty much purely executing on what is already out there, not actually trying novel things. So there's that tradeoff: given some amount of compute with which you have to produce a model at the end, whatever that looks like, you probably don't have the luxury of trying a lot of novel architectures, new optimizers and whatnot, and most of the complexity comes in the form of actual engineering execution and getting things to run on new hardware. So I would say even in the haphazard way it looks when we're changing these hyperparameter settings, we're only fiddling with a very small number of variables. In theory that combinatorial explosion can be somewhat bounded, but still, we're not taking too many risks there, and that's mostly because each time we restart, and these tests are not cheap to run, we're still aiming towards getting the runs back on track as soon as possible.

I would say, in essence, a lot of this work comes down to minimizing unknowns and trying our best to get to the trained model as quickly as possible. It might look a little bit haphazard, but that's mostly because we have a very bounded amount of compute time to spend.

Okay. I can see the time is ticking, so let's take three more questions. The first is: we should make LLMs remember the knowledge in the pre-training phase. Does this mean we can't use dropout in the pre-training phase?

I think the two are pretty decoupled. Dropout was introduced as a form of regularization, and it helps break up correlations in the model so that maybe you can learn a more robust representation. Anecdotally, we see even from PaLM, the 540 billion parameter model that Google trained, that they started removing dropout and it seems like there was no difference. I believe LLaMA did as well; I'd have to double check if that's true. But even for our 546 billion run we took dropout out and we didn't really see a difference. So it's one of those things where I think it might make a significant difference at smaller scales. Similarly, there was the Primer paper that came out a couple of years ago that showed that ReLU squared is a great activation function that made things train a lot faster, but that's one of those things where, at scale, with numerical instability issues, it just wasn't feasible for us to actually apply. So things that look promising at smaller scales may not extrapolate to larger scales. So yeah, I would say dropout in this case is kind of unrelated to... I forgot what the other thing was, but yeah.

Okay.

What are the main challenges and limitations faced when applying autoregressive models such as GPT and LLaMA to African languages?

Oh, I don't have any experience with African languages, unfortunately, so I can't comment on this at all.

Okay, what do you think of generative models versus the joint embedding architectures mentioned by Yann LeCun?

Oh, yeah, I don't think I could comment on this either; I'm more of an engineer here than someone speculating on research direction.

Okay, so I think it's about time. Thank you again, Susan, for giving us this presentation, and next time I hope you can join our conference offline.

Yeah, I would love to. Thanks so much for having me. This is great, thank you so much.

Yes, so thank you again.

Our first talk today comes from Professor Ermon of Stanford, and the talk is on advances in generative models. To introduce him: Ermon is an associate professor at Stanford University. His research is on machine learning and AI, he has won multiple best paper and outstanding paper awards at top conferences, and he is very famous because of his work on generative models. So it's time; let's welcome him to give the talk.

Okay, can you share your screen?

Oh, thank you, thank you so much. Perfect, can you see my slides?

Oh yeah, yeah, okay, great.

And you can hear me okay?

Yeah, everything works.

Okay, perfect. Great, thanks for the introduction. It's a pleasure to be able to give this talk remotely. This is joint work with my former PhD student Yang Song, who is actually an alumnus of Tsinghua University and is going to be an assistant professor in computer science at Caltech soon. ... in order to do this. And here's another example of the kind of things you can do: perhaps you don't want to control the generative process through a caption.

Maybe you want to provide a sketch of what the painting should look like. This is the kind of image I would be able to provide myself, and then you can ask a generative model to create a painting, or a beautiful image, that is consistent with this kind of sketch provided by the user. You might get an image like the one you see at the bottom: again, you can see it's consistent with the advice provided by the user, but it is much more beautiful.

And underlying this technology is a model that is able to understand the structure of natural images, to understand which sequences of pixels are likely and reasonable and which ones are not. The underlying assumption is that there is some underlying data distribution: there is some function that assigns high probability to images that are consistent, where the pixels make sense, the objects have the right structure and they look reasonable, and they're physically consistent with the kind of things we might see in the real world or the kind of images that we can get on the internet.

And the issue is that this function is unknown, and the only thing we have access to is a large number of samples, or examples, of, let's say, images harvested from the internet. The goal is to come up with a model of this distribution: some kind of function that can be computed and that can essentially tell us which images have high probability and which ones don't.

And if we are able to come up with this model distribution, then we can do many of the interesting things we've seen before. We can sample from it: we can ask the computer to generate sequences of pixels that would be assigned high probability by the model, and by doing so we can generate new images that have the right structure. They are similar to the ones we've seen during training, but they are different, they are new; there is some aspect of creativity, of generating new content, here.

And another thing we can do is we can ask the model,

you know, whether a given input image is likely or not, and perhaps we can use this to detect adversarial attacks or figure out whether there's something wrong with the inputs that are provided to our machine learning system。So there are a lot of

interesting use cases for generative AI kinds of tools like a probabilistic model, a generative model of the data distribution over natural images。

Now, the problem, and why we've not been able to do this before, is that building a complex generative model is challenging, because this probability distribution, shown here as a pretty simple object that you see on the left, is in fact very complicated: it's a probability distribution over a very high dimensional space, if you think about the space of possible images。

images have many, many pixels and so there's many。

many different possible combinations of the colors that you can assign to the pixels in an image, and so what it means is that the model needs to be able to assign probabilities to an extremely large number of possible objects。

And that's challenging。And so the question is how do we construct a distribution that is sufficiently flexible to be able to capture the complexity of a complicated distribution like the one over natural images?

And so one thing we can do is we can pick simple statistical models like a Gaussian distribution, and you can think of a Gaussian distribution as some kind of really simple neural network that will essentially take as input an image x,

which you think of as a vector of pixels, and will map it to a scalar value,

which is the probability of that input according to a Gaussian distribution,

which is a relatively simple formula where you just subtract the mean and then you use the standard normal expression to calculate the probability of this data point x。

And you can think of this as a very shallow neural network。And, you know,

Gaussian distributions are great, but you can kind of imagine,

you know, they're not sufficiently flexible to represent something as complicated and complex as a data distribution of natural images。

And so ideally we need something more complicated: we need to introduce the power of deep learning, of deep neural networks, to represent this complicated function that takes an image as input and maps it to a probability value。And so the idea is that perhaps we can use a deep neural network to construct this probabilistic model, and that's kind of the whole idea behind deep generative models: using deep neural networks to construct a generative AI kind of tool。

Now the reason this is not entirely straightforward is that probability distributions and probability density functions are not arbitrary functions。

If I take an arbitrary neural network, the output that it might give you for a given input image x might be,

for example, a negative value, and we know that probabilities cannot be negative。So if we take an arbitrary neural network, let's call it f_theta, the output might be negative, so we have to somehow change the structure of the neural network to make sure that the output is not negative。That's relatively easy to do: for example, we can take an exponential or something like that to make sure that the outputs are non-negative。

The more complicated constraint is that we need to make sure that if we sum the probabilities of all possible inputs to this function,

if we sum the probability over all possible images, we get one。

And enforcing this normalization constraint is not easy。In order to make sure that this function is normalized,

we basically have to divide by this normalizing constant Z_theta,

which is just the integral of the unnormalized probability over all possible inputs。
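Written out, the construction described here looks roughly as follows. The notation is an assumption made for illustration: $f_\theta(x)$ stands for the raw network output and $Z_\theta$ for the normalizing constant, and the exponential is only one possible way to enforce non-negativity.

```latex
p_\theta(x) = \frac{e^{f_\theta(x)}}{Z_\theta},
\qquad
Z_\theta = \int e^{f_\theta(x)} \, dx
```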

And this object is easy to compute if you have something simple like a Gaussian distribution,

but in general, it's very hard to compute for an arbitrary neural network。

It involves an integral or sum over a very high dimensional space,

and this is provably computationally intractable。Even if you have a discrete input space,

this is #P-complete, which is believed to be even harder than NP-complete problems。And, you know,

there's been a lot of work in the past few centuries and from many different

fields and disciplines, including statistical physics and statistics;

a lot of smart people have thought about ways to deal with this partition function and come up with approximation algorithms and ways to compute this quantity or approximate this quantity efficiently。

And so the way that we are going to make progress is that, instead of working with probability density functions,

which have to be normalized, we're going to work with their gradient,

which is called the score function: just the gradient of the log density function。And so intuitively,

you can think of the probability density function as this function that takes an input x and maps it to a scalar。

which is the probability under the model。The score is just the gradient of the log of that function。

So if the density of x is a mixture of two Gaussians like you see here, where the color represents the likelihood under the model,

here I have two Gaussians, one on the top right, one on the bottom left。

The corresponding score is a vector field of gradients that are basically pointing towards the high probability regions。At every point, the score is giving you the direction that you should follow if you want to increase the probability most rapidly。
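As a rough illustration of this picture, the score of a two-Gaussian mixture can be computed in closed form. The sketch below assumes equal mixture weights and a shared isotropic variance, with hypothetical mode locations at (2, 2) and (-2, -2); none of these numbers come from the talk.

```python
import numpy as np

def mixture_score(x, means, var=1.0):
    """Score (gradient of the log density) of an equal-weight mixture of
    isotropic Gaussians with shared variance `var`, evaluated at one point x."""
    x = np.asarray(x, dtype=float)
    diffs = means - x                                   # (K, D): vectors pointing at each mode
    log_w = -np.sum(diffs**2, axis=1) / (2.0 * var)     # unnormalized log responsibilities
    log_w -= log_w.max()                                # stabilize before exponentiating
    resp = np.exp(log_w)
    resp /= resp.sum()                                  # how much each mode "owns" x
    return resp @ diffs / var                           # responsibility-weighted pull

# Hypothetical modes: one at the top right, one at the bottom left.
means = np.array([[2.0, 2.0], [-2.0, -2.0]])
print(mixture_score([1.0, 1.0], means))    # points towards the top-right mode
print(mixture_score([-1.0, -1.0], means))  # points towards the bottom-left mode
```

Each arrow is a weighted average of the directions towards the modes, so near a mode it points at that mode, and exactly halfway between two symmetric modes the pulls cancel.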

And the interesting thing is that, as we'll see, this will allow us to define a very flexible class of models and will allow us to directly bypass the issue of the normalization constant。At the same time it will allow us to get very high quality images, and this is kind of the technology behind a lot of the advances that we've seen in generative models of images,

video, speech, and so forth。And it will allow us to do a number of other interesting applications built on top of generative models, like evaluating probabilities, doing outlier detection,

controllable generation, and so forth。So let's start by looking at why modeling the score is a better alternative than directly modeling the underlying probability density function。

So as we discussed before, when you're modeling the probability density function,

you're really using a neural network to map inputs to probability values。And as we've seen,

this basically means that the integral over all possible inputs has to be one。In the one dimensional case, it means that you need to choose curves such that the area under the curve is fixed to be one。

And so in order to do it with an arbitrary neural network,

you basically have to divide by the area under the curve, the normalization constant,

which is typically hard to compute。Now the nice thing is that if you start looking at the score,

the gradient of the log density function, this object doesn't have to satisfy this kind of normalization constraint;

it's a much easier function to model。And we can see it mathematically:

we take the log of the expression on the left and we take the gradient of this quantity with respect to the input,

which in this case is just a standard derivative, because x is one dimensional。You know,

by taking the log of the expression on the left, we get two components: we get f_theta,

which is just the output of the neural network, and then we get the log of the partition function。

The interesting thing is that the log partition function is just a constant: it is the log of the area under the curve and does not depend on x。

It's the same value regardless of the value of x, so when we take the gradient of a constant with respect to x,

we get zero。So we see that by taking the gradient we've eliminated the dependency on the partition function, the normalization constant, and so we get an object that is going to be much easier to model using a neural network:

we are no longer constrained to make sure that the area is one, that the object is normalized; we can directly use an arbitrary neural network to model the function on the right。

And that's going to be the score model, and this is kind of the key innovation that has allowed us to use more powerful neural networks to develop probabilistic models of images;

this is really the key machinery that has enabled a lot of the success that we're seeing in score-based diffusion models。
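The step that removes the normalization constant can be written in one line. The notation is assumed for illustration: the model density is taken to be $p_\theta(x) = e^{f_\theta(x)} / Z_\theta$, and $s_\theta$ is a name introduced here for the resulting score model.

```latex
s_\theta(x) := \nabla_x \log p_\theta(x)
= \nabla_x f_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0}
= \nabla_x f_\theta(x)
```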

Now, the question is what to do when we're given some training data。

As usual, we have a training set of samples that are drawn from some unknown data distribution。

Typically, when we fit a generative model and we work directly with the density function,

we know how to choose the parameters of a density function to match the data distribution as closely as possible:

typically what you would do is maximum likelihood estimation,

you would try to choose the parameters theta such that the likelihood of the observed data points is as high as possible。

The question is, now that we no longer work with the density directly but with the gradient,

how do we estimate the score of the data distribution?

How do we come up with a score function that is a good approximation to the score function of the true data generating process when we only have access to samples?

We only have access to a bunch of training examples,

and we want to estimate the underlying vector field of gradients。It turns out that this can be done。

So if we're given a set of examples that are sampled IID from some unknown data distribution,

how do we estimate the score of the data generating process, the gradient of log p_data?

So the idea is that we're going to define a score model。

This is going to be, let's say, a neural network, which is a vector valued function: for every input,

it will give us a vector, which is an estimate of the gradient of the

true data generating process at that point。And what we are going to do is try to find parameters for this neural network so that this function approximates the true vector field of gradients as well as possible。

Now, in order to do that, we need to be able to compare

the vector field obtained by the model to the ground truth vector field。

And so you can imagine that there is a ground truth vector field of scores。

There is an estimated vector field of scores for a particular choice of theta, the parameters of the neural network。

How do we compare these two objects?A reasonable thing to do is to basically overlap the two vector fields。At every point there is going to be a true arrow,

a true gradient, and an estimated gradient; we can look at their difference, and if these differences are small,

then we have a good approximation of the true vector field of scores。And so in practice,

what we can do is look at all these errors and take the norm of each of them。

We can average them and we get a scalar value, which captures how similar these two objects are。

Mathematically, this is known as the Fisher divergence:

it's just the average squared difference between the ground truth score and the estimated score at every point。
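In symbols, the quantity being described is usually written as below. The notation is assumed for illustration ($s_\theta$ for the score model, $p_{\text{data}}$ for the data distribution), and the factor of one half is a common convention rather than something stated in the talk.

```latex
D_F\left(p_{\text{data}} \,\|\, p_\theta\right)
= \frac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}}
\left[ \left\| \nabla_x \log p_{\text{data}}(x) - s_\theta(x) \right\|_2^2 \right]
```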

And it's a reasonable metric to compare two vector fields of scores:

if you can drive this quantity to zero, then we have a perfect model for the true vector field of gradients。

The problem is that it looks like a quantity that we don't know how to evaluate,

and of course we don't know how to optimize it, because it depends on this ground truth vector field of gradients that we don't know。We only have access to samples from the data distribution;

the goal is to estimate the ground truth data score, so how can we evaluate this quantity?

How do we optimize this quantity when it depends on something that we don't have access to?

It turns out that if you do integration by parts, you can basically rewrite this objective function into a form that is equivalent up to a constant

but no longer depends on the unknown ground truth score; it only depends on s_theta,

which is our deep neural network model, and so we get an objective function that is equivalent,

and now this is something that we can evaluate: we can approximate it using our samples, and it only depends on s_theta。

As you can see, we're basically trying to minimize the norm of the estimated score evaluated at the different data points。

while at the same time, trying to minimize the trace of the Jacobian of the score evaluated at the data points in the training set。
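To make this objective concrete: with $s_\theta$ denoting the score model (notation assumed here), the rewritten loss is the sample average of $\tfrac{1}{2}\|s_\theta(x)\|^2 + \operatorname{tr}(\nabla_x s_\theta(x))$. The sketch below is a toy one-dimensional case with a hypothetical linear score model $s(x) = a x + b$, where the minimizer can be found by solving a small linear system. It is an illustration only, not the training code used in the work described.

```python
import numpy as np

def fit_linear_score(samples):
    """Fit s(x) = a*x + b by minimizing the empirical score matching
    objective mean(0.5 * s(x)**2 + ds/dx), where ds/dx = a in 1D."""
    x = np.asarray(samples, dtype=float)
    # Setting the derivatives with respect to a and b to zero gives:
    #   a * mean(x^2) + b * mean(x) = -1
    #   a * mean(x)   + b           =  0
    A = np.array([[np.mean(x**2), np.mean(x)],
                  [np.mean(x),    1.0]])
    rhs = np.array([-1.0, 0.0])
    a, b = np.linalg.solve(A, rhs)
    return a, b

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
data = rng.normal(mu, sigma, size=50_000)   # only samples are used, never the density
a, b = fit_linear_score(data)

# The true score of N(mu, sigma^2) is -(x - mu) / sigma^2,
# i.e. a = -1 / sigma^2 and b = mu / sigma^2.
print(a, b)
```

Note that the fit never touches the true score or the normalization constant; it uses samples alone, which is the point of the integration-by-parts rewrite.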

And so now this is a reasonable objective function, and the challenge is that it still depends on this trace of the Jacobian,

which is potentially expensive to compute when we're dealing with high dimensional data。The trace of the Jacobian is basically a sum of partial derivatives with respect to the different input dimensions, and evaluating this quantity exactly might require a lot of backpropagation passes through the network。Luckily, there are several ways to approximately compute this quantity。

One that I like is basically the idea that, instead of comparing the vector fields of gradients directly,

we can compare their random projections: we can pick a few directions, and if the vector fields match,

then the random projections should also match。And if we pick a

larger number of different directions, then this should become a pretty good approximation。

And by comparing random projections, we're now basically comparing one dimensional objects。

And this leads to an objective function that is much more computationally efficient。

Its cost basically does not depend on the dimensionality of the data anymore,

so it scales to very high dimensional data sets like images。

And it still retains a lot of the nice properties of score matching,

like consistency and asymptotic normality。I'm going to skip some of the details,

but the key takeaway is that there is a way to train a score model that only depends on the data and scales to high dimensional data sets, either through sliced score matching or other methods like denoising score matching that were developed by other researchers。
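The random projection idea can be illustrated on the expensive term alone. The sketch below shows that averaging $v^\top J v$ over random directions $v$ recovers the trace of a matrix $J$. Here `J` is a random stand-in for the Jacobian of a score model at one point, and the choice of random sign vectors is an assumption made for illustration. In a real network, each $v^\top J v$ would be obtained with automatic differentiation rather than by forming `J` explicitly.

```python
import numpy as np

def projected_trace(jacobian, num_projections, rng):
    """Estimate tr(J) as the average of v^T J v over random sign vectors v."""
    d = jacobian.shape[0]
    v = rng.choice([-1.0, 1.0], size=(num_projections, d))   # random directions
    # One scalar v^T J v per direction, then average over directions.
    return np.mean(np.einsum("nd,de,ne->n", v, jacobian, v))

rng = np.random.default_rng(0)
J = rng.normal(size=(5, 5))      # hypothetical Jacobian, for illustration only
estimate = projected_trace(J, num_projections=20_000, rng=rng)
exact = np.trace(J)
print(estimate, exact)
```

Each projection is a one dimensional quantity, so the cost per direction does not grow with the number of partial derivatives in the trace.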

Now that we have talked about bypassing the normalization constant by working with the score instead of the density directly。

and we've seen how we can estimate the score from data。

let's talk about how we use the score models to generate new samples to do controllable generation and solve other interesting tasks。

Now, what we've seen so far is that it's possible to take samples from a data distribution and come up with a good approximation of the underlying vector field of gradients。

Now, the question is, how do we use this object to let's say, generate new samples?

How do we use these estimated arrows to essentially generate new samples

when all we have access to is this vector field of gradients?And intuitively,

you could imagine a strategy where we initialize particles at random

and then we kind of follow the arrows to try to go towards high probability regions。

And intuitively this kind of works, but it's not quite a valid sampling strategy, because at some point you're going to get stuck in a local optimum, and that's not quite what we want when we want to sample from a distribution。

But it turns out that a simple modification of this strategy gives a valid sampling algorithm。

And in particular, there is a sampling procedure called Langevin dynamics or Langevin MCMC,

which essentially works by following the gradient and adding noise at every step。

And it turns out that if you do this, if instead of just following the gradient

you add a little bit of noise at every step and you do this for a sufficiently long time,

then asymptotically you're going to produce samples from the underlying distribution。And so to the extent that we have a good approximation of these arrows, so that we know which direction we should go, we know how to generate good samples from the model。
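A minimal sketch of this procedure, assuming the exact score of a one-dimensional Gaussian stands in for a learned score model. The step size, number of steps, and target distribution are illustrative choices, not values from the talk.

```python
import numpy as np

def langevin_sample(score, x0, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: follow the score and add Gaussian noise at every step."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score(x) + np.sqrt(2.0 * step_size) * noise
    return x

mu, sigma = 2.0, 1.0

def score(x):
    # Exact score of N(mu, sigma^2), standing in for a trained score model.
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x0 = rng.uniform(-10.0, 10.0, size=5_000)    # particles initialized at random
samples = langevin_sample(score, x0, step_size=0.01, num_steps=2_000, rng=rng)
print(samples.mean(), samples.std())          # should be close to mu and sigma
```

Without the noise term the particles would all collapse onto the mode; with it, their spread settles near the spread of the target distribution.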

The challenge is that if you try to do that directly, it doesn't work。Here's an example: if you learn a score based model on the CIFAR-10 data set

and then you try to run Langevin dynamics, you get the kind of samples that you see on the right。They don't even look like real images,

they look like just random gray images with some color patches,

images that don't have the right structure; they don't look at all like the ones we've used for training the model。

So what is going on here?Well, what is going on here is that if you think about the training objective that we have,

we are training the model based on samples。And most of the samples will come from high probability regions under the data distribution。

So we're kind of getting a pretty good estimate of what these arrows should look like

when we're close to high data density regions。But we don't have a good approximation of these arrows when we're far away from the high data density regions, because we've never seen training points in those regions;

all our samples are coming from the data distribution, so they are nice,

clean images, and those are the only samples that we know how to deal with;

we've never trained on random noise, for example。However, when we initialize our Langevin chain,

our samples are going to be initialized all over the place, and we rely on following the arrows to go towards high probability regions。

So here you see again an example where we have this simple data density with two modes that we've been using as an example throughout the talk,

one at the top right, one at the bottom left。If you do score matching, you can see the arrows are pretty accurate when we're close to the modes, but they are very inaccurate far away from them。If you compare those arrows, you can see that they're sort of pointing in the wrong direction; they don't quite match the ones in the middle figure。

The true ones and the estimated ones don't quite match, the arrows look different, and so what this means is that if you follow the arrows using Langevin dynamics,

you will have trouble exploring the low data density regions, and you will get lost, and you will not be able to produce good samples。

So one solution is to add noise to the data。If you add noise to the data,

then we have a new perturbed density that is now kind of supported over the whole space。

And because we're going to see samples from this perturbed density all over the space,

we can estimate its score more accurately。In practice, if you think about an image,

it means adding noise to the image: you take this dog and you add noise to it so that it becomes

a little bit fuzzier。If we do this, then we get a new density,

a perturbed data density, for which it's relatively easy to estimate the score。

Because we're going to see samples all over the space,

we're going to be able to estimate the scores all over the space,

which means that our Langevin dynamics procedure will be able to sample from this distribution。

However, we're no longer sampling from the clean data density;

we're now sampling from an approximation to it, from a data density that has been artificially perturbed with, let's say, some noise,

so we're not going to generate clean images of the dog like you see on the left,

we're going to be generating images plus noise like the ones you see on the right,

which is not what we want。So the solution is to consider multiple noise levels,

that is, to consider different views of the data distribution perturbed with increasingly large noise levels,

say sigma 1, sigma 2, sigma 3, and so forth,

which you can think of as adding increasingly large amounts of noise to, let's say,

images of this dog, until the structure in the image is completely destroyed。

What we can do now is try to jointly estimate the vector fields of gradients of all these data distributions perturbed with increasingly large amounts of noise。

And we can do that using a single score network, just like before:

a neural network that takes x as an input, takes the noise level sigma

as an input, and produces an estimate of the score at that x for that noise level。

We can train this score based model using a score matching objective just like before, by taking the Fisher divergence between the two distributions,

again basically just using the same score matching loss that we talked about before。

And now that we've estimated all these vector fields of gradients,

we can produce samples by essentially using a variant of the Langevin dynamics procedure that I talked about before。

What we can do is initialize our samples at random and start by following the gradients of the data distribution perturbed with a very large amount of noise。

These gradients will be pretty accurate, and so we're going to start moving towards high probability regions。

Then we can use these samples to initialize a second Langevin chain where we now reduce the amount of noise, and again we follow these arrows,

but now we follow the gradients corresponding to a data distribution that has been perturbed with a smaller amount of noise。

And again, we then use these samples to initialize another Langevin chain for a data distribution perturbed with an even smaller amount of noise,

until the level of noise is so small that we're essentially sampling from a distribution that is indistinguishable from the true clean data density。
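The chained procedure can be sketched on a toy problem. The example below assumes a one-dimensional data density with two narrow modes at +4 and -4, for which the score at every noise level is available in closed form and stands in for a trained noise-conditional score network. The noise levels, the number of steps per level, and the rule that the step size shrinks in proportion to the squared noise level are all illustrative assumptions.

```python
import numpy as np

def perturbed_score(x, sigma, mode=4.0, data_std=0.2):
    """Exact score of a two-mode data density (equal-weight Gaussians at +mode
    and -mode) after adding Gaussian noise with standard deviation sigma."""
    var = data_std**2 + sigma**2
    return (mode * np.tanh(mode * x / var) - x) / var

def annealed_langevin(score, sigmas, steps_per_level, eps, num_particles, rng):
    x = rng.uniform(-10.0, 10.0, size=num_particles)    # random initialization
    for sigma in sigmas:                                 # largest noise level first
        step = eps * (sigma / sigmas[-1]) ** 2           # assumed step-size schedule
        for _ in range(steps_per_level):
            noise = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * noise
        # the final particles of this chain initialize the chain at the next level
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, num=10)                # decreasing noise levels
samples = annealed_langevin(perturbed_score, sigmas, steps_per_level=100,
                            eps=2e-5, num_particles=2_000, rng=rng)
print(np.mean(samples > 0), np.mean(np.abs(samples)))
```

At the large noise levels the particles move freely between the two modes, which is what lets roughly half of them end up in each mode once the noise has been annealed away.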

And this procedure actually works。Here's an example of how it can generate samples on some common image data sets。

This was back in 2019, and we were very excited because we were finally able to get this procedure to work to generate MNIST digits or to generate pretty realistic CIFAR-10 kinds of images。You can see how the procedure is essentially able to go from noise to data: it's following these gradients, following these arrows, trying to push random noise towards high probability regions, and it's generating images with the right structure by following this procedure。

Now, scaling this up actually led to state of the art sample quality on CIFAR-10 back then, when we published this work in ICLR。

So for the first time, this kind of procedure was able to beat GANs, generative adversarial networks,

which were the state of the art in generative modeling for a few years。

Despite a lot of the engineering that went into GANs and a lot of investment from large tech companies,

we were actually able to beat GANs for the first time in terms of image quality on this academic data set,

CIFAR-10, which was very exciting because it was the first time a competing, very different class of models was able to beat generative adversarial networks。

And, you know, a bit further on, by scaling up to bigger data sets and

higher resolution images, we were able to generate, let's say,

faces like the ones you see here, and this is kind of the key technology that is behind things like Stable Diffusion or Imagen or DALL-E 2 or Midjourney:

all these excellent text-to-image generative models are at the core based on this idea of estimating the scores of a set of distributions,

that is, data distributions perturbed with increasingly large amounts of noise。

It started out as an academic project, and it eventually ended up having a deep impact in industry; now it's used by a lot of users all over the world, and it really has unlocked incredible new capabilities in terms of the kind of images we can generate with these models。

Now, I think I still have some time, so maybe I'll talk a little bit about how diffusion models are very useful because they allow us to control the generative process in a very natural way。

So let's say that we train a model that can generate images of cats and dogs。

Now let's suppose that we want to only generate images corresponding to the class label dog, and let's say that we have a classifier that can tell whether an image is an image of a dog or a cat。

Is it possible to sample from the posterior distribution of images given that the corresponding class y is the class of a dog?

You know, this is a well defined object, this inverse distribution;

this posterior distribution is basically just defined through Bayes' rule。

The posterior distribution of images given the class label is something that we obtain from the prior distribution of our images, p(x),

and the likelihood p(y | x), which could just be a classifier or some kind of way of assigning a label y to an image x,

and then we have to normalize by this denominator p(y), which is typically intractable to compute。

You can see that this behaves essentially like the normalization constant,

the partition function that we talked about before:

it is essentially a number that you have to divide the expression by to make sure that it's normalized and is a valid probability distribution。

Computing this denominator is what makes Bayesian inference so hard;

it's typically very difficult to compute because once again it involves some integration over a high dimensional space。

The good news comes if you look at what Bayes' rule is telling us when we start working with score functions instead of densities directly。

So if we take the log of the expression I have at the top and then we take the gradient with respect to x,

we see that we get several pieces, but the important thing is that the denominator p(y) does not depend on x。

So when we take the gradient with respect to x, that term that we didn't know how to compute disappears。

And so the score of the posterior distribution is just the score of the prior model p(x) plus the score of this likelihood p(y | x) that we might have access to directly。
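Spelled out, the argument is the following short derivation, which uses only Bayes' rule and the fact that $p(y)$ does not depend on $x$:

```latex
p(x \mid y) = \frac{p(x)\, p(y \mid x)}{p(y)}
\quad\Longrightarrow\quad
\nabla_x \log p(x \mid y)
= \nabla_x \log p(x) + \nabla_x \log p(y \mid x) - \underbrace{\nabla_x \log p(y)}_{=\,0}
```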

So if we have a pretrained score model that is telling us what the scores for images look like,

we can combine it with the score of any forward model that we want: it could be a classifier,

or it could be anything else。And we can get a score for the posterior distribution,

so just by adding up these two components, we can get a model that can sample from a posterior distribution。
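A small sanity check of this "add the two scores" recipe, in a toy setting chosen because everything is available in closed form: the prior is assumed to be a standard Gaussian and the forward model adds Gaussian noise with standard deviation `tau`. Both choices are illustrative assumptions, not the models from the talk.

```python
import numpy as np

def prior_score(x):
    return -x                                  # score of the prior N(0, 1)

def likelihood_score(x, y, tau):
    return (y - x) / tau**2                    # gradient in x of log N(y; x, tau^2)

def posterior_score_exact(x, y, tau):
    # For this Gaussian pair the posterior is also Gaussian, with known mean and variance.
    post_mean = y / (1.0 + tau**2)
    post_var = tau**2 / (1.0 + tau**2)
    return -(x - post_mean) / post_var

x = np.linspace(-3.0, 3.0, 13)
y, tau = 1.5, 0.7
combined = prior_score(x) + likelihood_score(x, y, tau)
print(np.allclose(combined, posterior_score_exact(x, y, tau)))
```

The denominator p(y) never appears anywhere in the computation, which is why the same recipe still applies when the prior score comes from a neural network and the posterior has no closed form.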

And this allows for many different applications。So for example, instead of a class label,

y could be a stroke painting。And now, given a pretrained generative model of images, we can combine it with a likelihood function

p(y | x), and we can get a model that can synthesize images from stroke paintings。Here you see some of the examples that we can get when we try to create an image that has the structure of the stroke painting provided by the user but is photorealistic。

Another example is language guided image generation。

If you have a good image captioning model that can tell you whether a caption y is consistent with an image x,

we can create a conditional generative model that will create an image given a caption,

and you can use this kind of machinery to build language guided image generation, that is, text-to-image

generative models: you can take an image captioning model,

you can combine it with a generative model of images,

an unconditional model, and you can get language guided image generation。

Another example is medical imaging。Here the idea is that we might want to reconstruct a medical image, in an MRI or CT scan kind of setting。

The machine in this case will get some kind of measurement of the patient。

And we can think about the medical imaging problem as that of reconstructing the cross-sectional image of the patient

given the measurement obtained from the machine。In this case,

the model that relates the cross-sectional image of the patient to the measurement given by the machine is given by some physical simulation,

but it doesn't matter to us: we have the forward model

p(y | x), we can combine it with a prior model p(x) over medical images, and we can get a powerful medical image reconstruction tool that is actually beating deep learning models that were trained specifically for this task。

So it outperforms deep learning methods that were trained specifically for this task, even though our method is fully general and is just obtained by combining a prior model with a likelihood, which means that it's much more general:

it can be applied to different numbers of projections and different kinds of measurements。

And it gives better performance even though it's more general and more powerful。And, you know,

this kind of technology led to state of the art results in a variety of other data sets: audio,

material design, text to speech, shape generation, and many more。Now,

I think I might have a few more minutes, so I'll talk briefly about why these models are also called diffusion models。

And that connection arises if we start thinking about what happens if we were to consider an infinite number of noise levels。

Recall that the underlying idea of this modeling framework is to not just model the data distribution,

but to model the data distribution perturbed with increasingly large amounts of noise。

And here I'm showing, let's say, the data distribution p0 and then several views of the data distribution perturbed with increasingly large amounts of noise,

sigma 1, sigma 2, sigma 3。And you can imagine what happens if we were to consider more perturbed distributions, an increasingly

fine grained set of data distributions perturbed with different noise intensities。

And in particular, you can think about what happens if you were to consider an infinite number of noise distributions that interpolate between the clean data distribution at time zero

and a data distribution perturbed with a maximum amount of noise at time t equals capital T。

So you can kind of imagine there is a sequence of distributions where we start with clean data on the left,

and then as we move towards the right, we get data distributions perturbed with increasingly large amounts of noise。

And so in particular, you can kind of imagine a process that

behaves according to this set of distributions marginally。

And you can think about what happens if we were to take clean data

and perturb it by adding increasingly large amounts of noise

until the structure in the data is completely destroyed。

This stochastic process can be described by a simple differential equation,

a stochastic differential equation where we basically just take data and add a little bit of noise at every step, until, after adding a sufficiently large amount of noise,

all the structure in the data is completely destroyed。

It turns out that it's possible to think about the reverse process,

where instead of going from data to noise, we go from noise to data。

So instead of going from time zero to T, we go from time T to zero。And if we go from noise to data,

this reverse process is basically solving the problem of generating data for us。

Going from noise to data is exactly the process of generative modeling。

And the interesting thing is that the reverse process of going from noise to data can again be described by a stochastic differential equation。

It's not important that you understand exactly what these equations mean;

the important bit is that this reverse stochastic differential equation can be described in terms of the score function。

So once again, if we can estimate the score functions of these data densities perturbed with increasingly large amounts of noise,

then we can describe the reverse differential equation that maps noise to data。
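The equations themselves are not in the transcript. In the notation commonly used for score-based models with stochastic differential equations, the pair is usually written as below. The symbols are assumptions introduced here: $f$ is a drift term, $g$ a noise scale, $w$ and $\bar{w}$ are Brownian motions running forward and backward in time, and $p_t$ is the perturbed data density at time $t$.

```latex
\begin{aligned}
\text{forward (data to noise):}\quad & dx = f(x, t)\, dt + g(t)\, dw \\
\text{reverse (noise to data):}\quad & dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}
\end{aligned}
```

The only unknown in the reverse equation is $\nabla_x \log p_t(x)$, which is the score that the network is trained to estimate.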

And that's essentially the connection with diffusion models.

It turns out that if we use score matching to estimate the score functions, we can reverse this stochastic differential equation, and we essentially get a diffusion model in which time is continuous, which we can use to generate samples. And I think I'm running out of time,

but I'll briefly mention that this stochastic differential equation perspective is very helpful because it allows us to use a lot of different techniques from the numerical methods literature.

In particular, it's possible to use very fast solvers for computing the solutions to these stochastic differential equations, and it turns out that it's possible to convert this to an ordinary differential equation (ODE).

So the process is no longer stochastic; it is deterministic. Again, the ordinary differential equation depends on the score function, and so to the extent that we can estimate the score function, we can define an ordinary differential equation that maps noise to data.
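A minimal sketch of this deterministic sampler, under stated assumptions: for the drift-free toy SDE dx = g dW, the corresponding ODE is dx/dt = -½ g² ∇ₓ log p_t(x). The Gaussian toy data, its exact score `score_gaussian`, and the function name `probability_flow_sample` are hypothetical; in a real model the score would come from a trained network.

```python
import numpy as np

def score_gaussian(x, t, m=1.0, s0=0.5, g=3.0):
    """Exact score of p_t = N(m, s0^2 + g^2 t), the marginal of dx = g dW
    when the clean data is N(m, s0^2). Stands in for a learned score model."""
    return -(x - m) / (s0**2 + g**2 * t)

def probability_flow_sample(x_T, score_fn, g=3.0, T=1.0, n_steps=1000):
    """Euler integration of dx/dt = -0.5 * g^2 * score(x, t) from t = T down to 0."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for k in range(n_steps):
        t = T - k * dt
        drift = -0.5 * g**2 * score_fn(x, t)
        x = x - dt * drift  # step backward in time; no noise is injected
    return x

rng = np.random.default_rng(0)
m, s0, g, T = 1.0, 0.5, 3.0, 1.0
# Start from the fully noised marginal at time T.
x_T = m + np.sqrt(s0**2 + g**2 * T) * rng.standard_normal(20000)
x_0 = probability_flow_sample(x_T, score_gaussian)
```

Because the map from `x_T` to `x_0` is deterministic, the same noise always yields the same sample. Lowering `n_steps` corresponds to taking larger steps along the time axis, trading accuracy for speed.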

And we can use numerical methods to solve this differential equation. Let me skip this, but we can use this differential equation to accelerate sampling.

We can try to solve this differential equation very efficiently using numerical methods that have been developed over decades to solve ODEs fast.

For example, we can coarsely discretize the time axis and take big steps, achieving very high speedups and comparable sample quality.

We can also use parallel ODE solving methods to accelerate sampling. We can use distillation: we can train a student model to do in one step what the ODE solver would do in two steps.

Then you can repeat this process many times recursively, until you essentially get a student model that can generate samples in one step.

And these samples are extremely good. Here you can see some examples of applying these techniques to Stable Diffusion.

We can actually generate very high-quality images in only one or two steps. And again, this is made possible by interpreting the sampling procedure as solving an ordinary differential equation and using some clever techniques to accelerate sampling. Maybe let's skip this, and yeah,

I think I'm out of time, so I'll conclude here. This is a high-level overview of diffusion models. The key ideas are using scores to model the distribution instead of likelihoods, which enables us to use essentially arbitrary neural networks to model these vector fields of gradients;

the fact that we can train these models without having to resort to adversarial methods and minimax objectives like in generative adversarial networks;

there is a stable proper scoring rule, based on the Fisher divergence, that we can use to train the models.

And we can use these models to do controllable generation.

We can use a lot of ideas from numerical methods to sample from these models very efficiently. One thing I skipped is that we can also get likelihoods out of these models: not only can we generate samples, but given a sample we can evaluate how likely it is under the model, which is useful for, say, anomaly detection and a variety of other applications. This is really the core technology behind a lot of the exciting advances we've seen in industry, and we're seeing more and more applications to other domains. I think one question that is still open is whether these models can be competitive with autoregressive ones on text. That's going to be a big open challenge, and hopefully we're going to see some progress on it soon. And yeah, I think that's it.

I'm happy to take questions.

[Host] Okay, thank you so much for your beautiful talk. Maybe you cannot see it through the camera, but the forum is full of people, and many people are standing on both sides. We have time for one or two questions in this session. Okay, can you pass the microphone?

[Audience] Do you think it is possible to build a general large model?

[Speaker] Could you repeat the question? Sorry.

[Audience] Is it possible or impossible to build a general large model?

[Speaker] A general large model? What's a general large model?

[Audience, in Chinese] No, I mean a general-purpose large model; I'm not sure of the translation, maybe "general" or AGI.

[Host] Okay, I can translate into English. The question is: do you believe that we could build a general-purpose model, like AGI?

[Speaker] Oh yeah, that's a good question. I think it's hard to say. I don't think there is any constraint; I don't feel like there is any impossibility result that will prevent us from getting there. Evolution has developed a system that can do it, and I feel like it should be possible to replicate it. I don't know how close we are. I personally don't think we are very close; I think there is going to be a lot of work to be done to get to AGI. But I don't think it's impossible. I think we'll get there eventually, and we just need to keep working on it. We might need new ideas,

we might need new models, we might need new methods. Nobody knows how close we are, but I do believe that we'll eventually get there. I don't think there is anything fundamentally preventing us from getting there.

[Host] Okay, thank you. We have another question. Okay, the last one. Thank you.

[Audience] My question is this: the success of score-based models depends on successful estimation of the score.

So does the score estimation rely heavily on the architecture of the neural network? I mean, currently most models use the U-Net. Is it possible for a U-Net to estimate the distribution of any general data?

Thank you.

[Speaker] It's a great question. I think a lot of the advances were actually enabled by the fact that we could use more complex architectures that don't have to be autoregressive or invertible. The fact that we can just plug in a U-Net really is one of the key things that enable these models to work so well.

As for whether there could be better architectures, I think they're probably out there and we just need to discover them. There were certainly a lot of architectures that didn't work that we tried before getting U-Nets to work. I know people have had some success with transformers too, so I believe there are probably better ways to do it, and depending on the modality you might want to use something else. It seems like U-Nets are pretty good

on images at least, but there might be other things. On graphs we've used graph neural networks, so it depends a little bit on the application. And yeah, it is deep learning, so

the architecture is super important. There's some beautiful math at the top, but the architecture matters, and there is some magic here in terms of these neural networks being able to estimate scores reliably, which is really enabling the success of these models. A priori, estimating the score could be hard, and there is some deep learning magic going on here for sure.

[Host] Okay, thank you. Thank you again for your time and the nice talk.

Next, we are honored to have invited Professor Zhao Zhou of Zhejiang University to present the latest progress in multimodal generative speech models.


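He has more than 8,000 Google Scholar citations and has done a lot of multimodal generative work, including speech models such as FastSpeech, as well as some well-known generative vision models and algorithms.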




I will mainly introduce three of our recent lines of work. The first is our speech synthesis models. The second is generating singing voice.


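Audio generation is also a kind of cross-modal generation: given a text, we generate its audio form. For example, we convert the text of "Moonlight over the Lotus Pond" (荷塘月色) into a speech signal. This has many applications.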




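The output is speech audio. Where does this talk focus? It focuses on generative approaches to the acoustic model. Several well-known earlier works are autoregressive TTS models, including [inaudible] Voice

and Transformer TTS. This talk mainly covers our generative models for the acoustic model, that is, the mapping from phonemes to spectrograms: the low-latency FastSpeech, the highly expressive DiffSinger, and open-domain generation.

Because audio can be generated in different ways, that last one is Make-An-Audio. First, let's look at FastSpeech.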


The original backbone is based on the Transformer. In 2019 the Transformer was applied to text-to-speech and achieved very good results, but it still has many shortcomings.





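Can our synthesis be well personalized? In other words, we want to generalize our model in a series of ways. So let's go step by step.

First, Transformer TTS is a very good piece of work. In 2019 it used the Transformer model for text-to-speech synthesis.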


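With the Transformer framework, inference is relatively slow. Second, because the Transformer predicts autoregressively, it can skip words.

To speed up inference and solve word skipping at the same time, we did a non-autoregressive work.

Its main idea is that at the decoder we do not predict autoregressively, one frame after another, but use a non-autoregressive form of prediction. An important block here is called the length regulator.

The length regulator maps the input modality to the output modality. That is, we predict the duration, the length, of each phoneme in the mel spectrogram.

In effect, this learns an alignment from one modality to the other.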
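The length regulator idea can be sketched as follows: each phoneme's hidden vector is repeated for as many mel-spectrogram frames as its predicted duration. This is a minimal illustration; the function name `length_regulator`, the array shapes, and the example durations are assumptions, and in a real system the durations would come from a trained duration predictor.

```python
import numpy as np

def length_regulator(phoneme_hidden, durations):
    """Expand phoneme-level hidden states to frame level.

    phoneme_hidden: array of shape (num_phonemes, hidden_dim)
    durations:      integer frame count for each phoneme
    Returns an array of shape (sum(durations), hidden_dim).
    """
    phoneme_hidden = np.asarray(phoneme_hidden)
    durations = np.asarray(durations, dtype=int)
    if phoneme_hidden.shape[0] != durations.shape[0]:
        raise ValueError("need exactly one duration per phoneme")
    # Repeat row i of phoneme_hidden durations[i] times along the time axis.
    return np.repeat(phoneme_hidden, durations, axis=0)

# Three phonemes with a 2-dimensional hidden state (toy values).
hidden = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
frames = length_regulator(hidden, [2, 1, 3])
```

Because the output length is fixed by the durations before decoding starts, all frames can be decoded in parallel, which is where the speedup over autoregressive decoding comes from.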

But we know that this alignment, the duration predictor, is relatively hard to train, because mapping from one modality to another is a one-to-many (multimodal) problem.


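We did not train directly on the ground truth. Instead, an autoregressive Transformer TTS serves as a teacher from which the durations are extracted.

Then the predictor is trained with an MSE loss. Now let's look at performance. On the left is the MOS, a score from 1 to 5; higher is better.

Transformer TTS is at 3.88, and [inaudible] is at 4.0. FastSpeech reaches 3.84, which is a very small drop. And FastSpeech has a 270x speedup.

The 270x speedup can be seen in the lower figure, which compares FastSpeech with the earlier Transformer TTS. That was the first problem. The second question is, once inference has been accelerated,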

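what remains to be solved? Even with the speedup, we still want the performance to keep getting better. For performance, we made an extension of the earlier design.

Previously there was an encoder followed by the length regulator. Here you can see the variance adaptor. Besides the length,

that is, the duration prediction, we also predict other attributes such as pitch and energy. After extending the predictor into the adaptor, we found two effects. First, the quality improves a lot.

Version 2 is a big improvement over version 1, and in evaluation its quality is even better than that of Transformer TTS.

As for inference, with autoregressive decoding, synthesizing one second of audio takes on the order of 10^-1.

With the non-autoregressive model this drops from 10^-1 to 10^-3, a clear inference improvement. There is still a remaining problem, so we kept digging.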

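We found that in many cases the model size needs to be as small as possible. While keeping the inference speed and the output quality, we still want to compress the size further.

Hence the PortaSpeech work. "Porta" means portable, and we want the model to be even smaller. Here there are two kinds of generative models. The first is the variational generative model (VAE).

Our experiments show that with a VAE, the parameter count is not large and it can capture the prosody of the synthesis, but because of its loss it carries some blurriness.

The second is the flow-based (invertible flow) model. With enough parameters the result can be quite realistic, but it needs many parameters. So what do we do?

We cascade the VAE-based and flow-based models, and that is PortaSpeech. How far does PortaSpeech go?

We ran two experiments with two versions, a normal version and a small version. In the normal version, the VAE and flow models are combined,

and the quality is better than the state of the art. Second, when we compress the size, we can see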








Since large language models are now very popular, there is also work that converts speech into tokens and models them with a language model.


We found that running a language model over discrete tokens of the prosody code performs better. So for generalization, as you can see here, this is a disentanglement of the mel spectrogram.

We apply the language model only to prosody. The timbre comes from the reference, and together they produce our TTS output. Let's try it and look at the demo.

This one is Obama's voice: "... workers lost their lives. 17 others were injured.

And soon nearly a mile beneath the surface of the ocean, oil began spewing into the water."

"Good afternoon, everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning,

the course of Carnegie Mellon University. In the first part of the course,

we will talk about the generative deep learning that is used to generate data that never existed in reality."

"Good afternoon, everyone. Today we are super excited to introduce you all to the deep learning course of Carnegie Mellon ..." [playback cut off]

Next, the singing voice model. The singing model is very interesting, and it is thanks to diffusion models.

What can we do by applying diffusion models here? We can do highly expressive voice synthesis.

First, look at the diffusion model. On the left is the diffusion model and on the right is our application. Of course, we can apply the diffusion model directly to voice synthesis. But what interesting idea did we find?

The first option is to generate directly from our condition. The second, following the idea from our speech work, is to cascade two models.

The first model captures the semantics, and the second captures the audio quality. It is the same here. What is the manifold M?

M is our original data, to which we add noise over a series of steps up to step T. And what is M'? M' is the spectrogram generated by another model,

an earlier model. The earlier models are FastSpeech and FastSpeech 2, which generate different spectrograms. When we diffuse them,

we find that they all converge to white noise. The interesting thing is that at step k

there is an overlap. So we can think about it differently. One way is to use one single model to do this.

The second way is to use two models. The first is an auxiliary model that generates M', for which we can use our earlier model to capture the semantic information.


This is its performance. We found something interesting. The first point concerns the strategy:

we can reduce the T-step denoising to k steps. Second, we found that with two models, following the idea from our earlier speech work,

the quality is higher than with a single model, so it is a coarse-to-fine process. Let's listen to the demo.


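In the speech we synthesize, the pitch, vibrato, and so on are not this expressive. Next we did M4Singer. What is M4Singer?

With M4Singer we extend our earlier DiffSinger to different applications. We have just shown synthesis; what else is there? There is pitch shifting. Let's listen to the original audio. [Sung in Chinese: 我说其实你很 ...]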





This one is an English song: "We're beautiful like diamonds in the sky. We're beautiful like diamonds in the sky." Beyond this,




Let's listen. This is a speaking voice. We can look at its alignment, a mapping from speech to [inaudible]. Since the earlier work is open source, let's take a look.





This is a search page; everyone can try it. There are various third-party tools built on this work. What you see at the top here is a musical score.








It is called Make-An-Audio, and it is thanks to diffusion models. What can we do with it? Starting from text, we can generate audio for dubbing.



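For short text, we generate one audio clip. But audio is analogous to video, and video has one very big issue:

it has temporal information. To account for this temporal information we have Make-An-Audio 2, where we control generation through the input text.

For example, first output a bird call, then a truck sound, then something else. It models this ordering information. Make-An-Audio 2 is an upgraded version of Make-An-Audio.

With Make-An-Audio we can already support different input modalities. On top of that we did Make-A-Voice. In Make-A-Voice, besides discretizing the text and the other inputs from Make-An-Audio,

we use a dual disentanglement of discretization and audio representation: we first map to the semantic model, then to the acoustic model, and then synthesize. It is both disentangled and discretized. After discretization,

the third work is AudioGPT. Whether the input is speech or text, there are different tasks,

and different tasks have different results: generating audio, singing, speech translation, or speech-to-talking-face synthesis.

After these tasks, this is AudioGPT 1. AudioGPT 2 takes AudioGPT 1 one step further. AudioGPT 1

is mainly based on GPT as the foundation model, which is used to orchestrate different generative models.

AudioGPT 2 builds a unified language model that supports translation and synthesis from different modalities to different modalities.

Let's first look at Make-An-Audio. It is a third track in audio generation, because it is a very broad, open-ended problem.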



We built a classifier-free guided model. For audio, the biggest problem is that we need a very large amount of data,

so that the generated audio can be more open-domain. But on the web there is not that much data. So what do we do? In Make-An-Audio 1,

we did a rule-based enhancement. We designed many rules under which audio and text can be repeatedly concatenated. For example, here is a bird call.


We did a pseudo prompt enhancement, which produces more data through concatenation. In the end, to train this model,

we used 3,000 hours and 1 million audio-text pairs. Let's look at Make-An-Audio 1.




Clicking here, this is fireworks, generated with a short video as input. It supports using a video as the prompt to produce the audio. The audio itself is not included here.



So there is a Make-An-Audio 2. It basically follows Make-An-Audio 1, but for data augmentation

it does not use rules; it uses current large language models to do the augmentation. That is the first point. Second, as we can see here, this is "man speaking"

first, then "dog barks", then "birds chirp". The generated caption is "A man speaking, then a dog barks, with birds chirping in the background."

This prompt is much more complex than before. This is a Make-An-Audio 2 prompt. In total it used 3.7K hours of data.

We will skip the performance numbers and look at a few examples. [Demo audio] The caption for this one starts with "A man", followed by several further events [unintelligible]. It is a more complex prompt than Make-An-Audio 1 supports. And here is another one with a vehicle engine.

[Demo audio] So it is an enhancement of Make-An-Audio 1. It uses many techniques for understanding, including understanding the temporal order in which events occur.


Not only text: speech is also converted into semantic tokens and acoustic tokens for a very good disentanglement. As in our earlier idea, the semantic tokens capture the semantic information,

and these capture the acoustic information. As with our TTS and speech work, we disentangle the semantic side and the acoustic side, and control is basically added through the acoustic condition,

while the semantic meaning stays fixed under this disentanglement.

Let's look at a demo: "With a start, then I remembered how I lived alone, writing bad poems and eating out." "The head of the Patchwork Girl was the most curious part of her."

It supports different tasks at the same time: zero-shot text-to-speech, zero-shot voice conversion,

and zero-shot singing voice synthesis. Let's first listen to singing. [Sung in Chinese: 梦也不自由] This is the input prompt. [unintelligible] This is zero-shot voice conversion. This is the source, and the prompt is: "Nothing is more lugubrious than the contemplation

thus in its nudity, in the broad light of thought, of the horrible swarming of slang."

"We wouldn't engineer alone." Right, that is Make-A-Voice. Finally, let me introduce

the most recent follow-up work, AudioGPT. AudioGPT integrates our previous works and can support different tasks, as you can see here.

These include audio to text, audio to audio, audio to event, audio to video, text to audio, image to audio, and music score to audio.

That is its capability. Let's play an example to show some of its abilities. Please play this example. [Sung in Chinese: 你说你不懂为何在这时牵手]

This one generates music, this one generates audio, and this one writes a caption about an audio clip.

[Demo audio] "I'm happy to help you. Here we go."

That was the demo. It goes from our text-to-speech synthesis models to a more generalized and more general-purpose model. In a speech-based dialogue scenario,

we can have it do different tasks. Our work is on GitHub, and you can also try it on Hugging Face. There is also a demo setup here. After that,




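the singing applications and the diffusion-based Make-An-Audio. For singing, there is work on different kinds of expressiveness; for audio, there is work on different open domains. Thank you all.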



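[Audience] ... here it goes through VQ, that is, VQ in the encoder. I was wondering whether VQ loses some information. And earlier, when you talked about Mega-TTS,

you found that tokens are suited to modeling the prosody of speech. [Speaker] Thank you for the question. One current approach is to convert speech directly into tokens, but we found a problem with that.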


But audio is more complex than text. It has semantic information and it has audio information, including prosody, duration, energy, pitch, and other attributes.





For example, for duration we still use our earlier FastSpeech framework to make the prediction. So in our work,

we first disentangle the different attributes. Duration is still predicted with duration prediction, while pitch and prosody are predicted with tokens.



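The next talk is from researcher Liu Guang of the NLP and Multimodal Research Center at the Beijing Academy of Artificial Intelligence (BAAI). Dr. Liu Guang is a core contributor to FlagAI. His main research directions are large language models and multimodal text-to-image generation.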




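But OpenAI did not open its source code or its model. So the many companies that followed this work were closed source, for example Baidu,

Google, and Midjourney. Their results were very good, and they attracted a large community of users who interacted with them and helped improve quality. But there was no open-source code.

In this situation, Stable Diffusion appeared out of nowhere. It released its model weights and all of its code, and the results were stunning. From last September until now,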


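So now many open-source communities have made a great many improvements based on Stable Diffusion, and people use products or models derived from Stable Diffusion for many interesting applications.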




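So the CLIP model actually provides a condition: it tells the image which direction to denoise in, so that the generated result matches our text input. Besides these two components,

there is also an autoencoder component, which compresses an image into a latent space and restores an image from the latent space. Those are the three main components. We have been following text-to-image generation for a long time.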















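As mentioned, there are two main components: the CLIP model and the U-Net, which is the denoising module. To train AltDiffusion, a multilingual version of diffusion,

we first trained a multilingual version of CLIP. We took the CLIP model and, through what can be called teacher learning or distillation, adapted a text-image representation model that would otherwise need a large number of image-text pairs for training.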






So we built a multilingual CLIP model. After that, we attached this CLIP model to the original diffusion model as an extension, in effect taking the original Stable Diffusion 2







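a well-known Beijing item called 胶圈 (as transcribed): if you translate these into English, the result may not carry the original meaning at all. Next, here is what we did once we had AltDiffusion-m18. Our analysis found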


We wanted to plug into the open-source ecosystem, so that our AltDiffusion-m18 can connect to ControlNet and to LoRA. We tested some cases, and it is fully and seamlessly compatible.












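We also integrated a multi-step controllable editing module. Let's look at the results. It does two things while editing. The first is [inaudible], which takes a complex multi-step instruction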

























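Previously there were methods like pix2pixHD, which take a semantic map as input, where each color indicates, say, that this region is a car and this one is a tree, and generate a street-scene image. Now there are also large models

such as DALL-E 2 and a series of text-to-image generators: you give a prompt and it generates the corresponding image. These results are all very good, but there are still many problems.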















Actually, our group did related work very early on. This network, VPN, was work I did when I was about to graduate from MIT, together with an intern, Pan Bowen, as his internship project.




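It can actually become good work. Now everyone doing BEV perception cites this paper of ours. That was a digression. Back to the BEV task we want to do here. Previous work

takes input images and generates the bird's-eye view. Here we want to do the inverse problem, BEV generation: the input is a bird's-eye view.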







Then we learn the bird's-eye view and the image generation separately, and then connect the two parts. So this model is somewhat similar to text-to-image generation such as DALL-E 2.



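So we made a small design: in the position encoding, we put the feature correlation across different views into the self-attention. This ensures consistency across views.

Here are some of our results. On the left is the top-down bird's-eye-view input, and on the right are the images generated for each of the six views. The results are quite good.

The decoder we use here is VQ-VAE-2. Our university lab does not have the resources to train diffusion models, so we used the VQ-VAE result directly, and the image quality still has many flaws.

We think that replacing the decoder with a better one would further improve the image quality. But that is not the focus of this work. The focus is to pose the problem first and establish a baseline.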



Here is a video. The top left is the bird's-eye-view input, the top row shows the three generated views, and the bottom row is the ground truth.





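The bird's-eye view is a fairly concise representation, and we can use a simulator to produce it. Here are two results. The one on the left comes from our in-house driving simulator, called MetaDrive.

I will introduce MetaDrive later. We take the bird's-eye-view input from the simulator, feed it into our model, and use the model to generate the first-person-view images.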






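NeRF was originally meant for reconstruction and has no generative ability. But a NeRF model carries a lot of 3D information, so we wanted to combine the two.

So we built this pipeline: once we have a 2D bird's-eye view, we generate a 3D structure, and then we do neural rendering from that 3D structure.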



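It indicates where the objects are, and we feed it into the object generator. We handle the background separately, so the foreground and background can be combined.

Then we use the volume rendering that is common in NeRF to render it out and sample the image. We also incorporated some GAN components.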





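We also compared with previous methods on CLEVR, 3D-FRONT, and the Waymo dataset. The baselines we chose include [inaudible] GAN, EG3D, and GIRAFFE.

Our results are the best in all of these scenes; in the 3D-aware generation subfield they are currently the best. Next, because this generative model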


The image here is produced by volume rendering, and we can freely change the structure of the scene. We can also explicitly edit the objects.



and we remove it from our input; then the car at the corresponding position is removed. If you only do inpainting on the 2D image, for example with a diffusion model,












For example, some [inaudible]-free designs, some [inaudible] training of its decoder, and some operations on its classifier. With these we can get a generated result that is visually acceptable.



















Our lab has been developing a driving simulator called MetaDrive. Compared with previous simulators, one of its strengths is that it is very efficient: on a single PC,


MetaDrive is now open source; interested students can take a look. Based on MetaDrive, we can import real-world data. Here we imported an [inaudible] driving dataset.
















and then generate its future trajectories. This lets us simulate the scene. We can further put this bird's-eye view into MetaDrive and get an interactive, physical simulation.


traffic flow. We also applied it to some existing editing tasks, such as inpainting. Inpainting here means extending trajectories that have been cut off.







Our TrafficGen model is also open source in the MetaDrive repo; interested students and teachers can follow it. Next, we made some plugins.




Interested students can follow this research direction of ours. I named it MetaDriverse, combining MetaDrive with universe and metaverse.


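[Host] Given the time, we have room for one question. Okay, the microphone goes to this person. [Audience] Hello, Professor Zhou, thank you very much for such good work. My question is this: many autonomous driving systems have corner cases.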


[Speaker] Just yesterday we submitted a paper on exactly this kind of corner case. We call it safety-critical scenario generation. We model it as a kind of adversarial generation.


[Host] Thanks again to Professor Zhou Bolei for this wonderful talk. Please follow MetaDriverse (sorry, I'm not sure I'm saying it right), a very innovative term. Thank you, Professor Zhou Bolei.


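His research direction is machine perception, reasoning, and interaction with the physical world, drawing inspiration from human knowledge. Before joining Stanford he was a visiting faculty researcher at Google Research. He received his PhD from MIT.

His advisors are all big names, including Bill Freeman. While earning his degree at Tsinghua University, he also collaborated very closely with Professor 图中恩 (name as transcribed). As we can see, Professor Wu Jiajun has studied under many leading figures.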



The title of the talk is "Understanding the Visual World Through Naturally Supervised Code". Let's welcome Professor Wu. [Speaker] Is this okay now? [Host] Yes, that works.





So today I'm going to talk about understanding the visual world through naturally supervised code.

The visual world, I guess, is easy to understand, right? We live in this world,

and we use our human vision to see all these patterns: geometry,

objects, textures. And code as well: every day we code using Python or whatever.

So I guess we all understand what code means, although hopefully by the end of this talk

I will be able to show that we can interpret code, or symbols, as people say in neurosymbolic AI, or programs, in a broader sense.

It's not just Python or for loops; we can actually have a much broader interpretation of what code is, as well as what natural supervision is. So what do we mean when we say code can be naturally supervised?

I'll give you a few examples throughout the talk. Hopefully we can have more clarity on that as well.

Okay. So the question is, how can we really leverage the kind of rich structure, symbols, and programs that exist in the natural world, in our visual world, for better perception

and better scene understanding? To begin with, there is a lot of rich structure in this visual world. If you look at scenes like these corridors or buildings, you realize it's not just pixels. Although generative models always model scenes as pixels, there is actually richer structure than just pixels. For example, if you look at these scenes, you realize the scenes are made of planes: there's a ceiling,

there's a floor, there are walls. There is symmetry: things are reflectionally symmetric. There are repetitions: you can see the lights at the top of the scene repeating themselves, and if you look at the buildings, the windows and floors repeat themselves.

So now the question is, is it possible for us to leverage this kind of structural information, beyond pixels, together

with pixels, for smart scene understanding and editing? So here,

let me give an example of what that means. Here's a video. We put it in this Adobe Photoshop GUI, but the underlying algorithm is ours.

What we can do is first run interactive segmentation: given the building, the user makes one interaction. This is standard, everyone can do it. You use interactive segmentation to get the building, and you can compute a vanishing point in 3D.

But then what we want the user to do, through just one more interaction, is ask: how would the building look if I want to make it taller? The user can just drag to make the building taller in one single step, or likewise make the building wider. The problem looks simple, but in practice it's not, because you really have to understand the scene at multiple levels of abstraction. At the lowest level, you have to understand that the scene has textures: what is the texture,

what is the color of the building, so that things look similar. At the middle level, you have to understand that there is 3D geometry: the buildings are in 3D, and every facade has its surface normal, facing a particular direction. So if you make the building taller, the face of the building should still face the same direction. At the highest level, you have to understand that there is repetition: the floors are repeating themselves. If you want to make the building taller, of course there's no perfect answer, but if I have to guess, I would say the floors just keep repeating themselves,

and the windows should keep repeating themselves. So these higher-level structures and repetitions should also be kept in your answer to this question.

So how can we do that? We're very much inspired by earlier work in which people tried to use program synthesis for visual data. They start with very simple images. If you have a sketch, one of these line drawings, there are clearly some patterns. What they did is first use a combination of learning and stochastic search to identify the entry-level primitives in the scene; in this case it's just lines and rectangles. Think about it as trying to vectorize the image. A raw image is rasterized: it's 200 by 200,

300 by 300 pixels, whatever, so it's a very high-dimensional space.

That poses a lot of challenges for program synthesis algorithms, because program synthesis methods usually work in very low dimensions and are hard to scale up.

So you say, okay, let's first vectorize it, so that you're turning a raster file, let's say, into a PDF file or an SVG file. You're vectorizing the image, turning this high-dimensional space into a much lower-dimensional space. Now the scene is just a collection of primitive lines and rectangles, and then you can use learning-based program synthesis methods to search for a program that explains this lower-dimensional space. Once you have the program, you can do smart things like extrapolation. This is earlier work, and you may feel the methods are old machine learning methods, but you can also imagine replacing all these things with GPT-4; you can do the same thing by prompting GPT. If you look at CVPR 2023, there are old ideas being reimplemented and realized with newer tools. But fundamentally you can do similar things: you can extrapolate these

patterns to make them larger. This is what they did back in 2018 on 2D line drawings. A clear limitation, as you may have noticed, is that it assumes you have the library of objects in the image. In this case, the world is just made of lines and rectangles: I find these lines and rectangles to vectorize the image, and then I can search for a program to explain this lower-dimensional space. But the world is not made of lines and rectangles. Suppose we want to generalize from these sketches to natural images, like a bowl of milk with a lot of cereal. There is clearly some structure; maybe your kid, or we ourselves, were playing with the cereal and made this triangular shape. But what is an object here? It is the cereal, but how would you represent it? It's not as simple as lines and rectangles.

So in particular, is it possible to find a way to identify these entry-level primitive objects,

even without requiring a lot of prior knowledge?

We were inspired by classic computer vision work on internal learning, or single-image learning,

which is led by Michal Irani from Israel, where they have been working on this topic for more than a decade, actually almost two decades. Internal learning, or single-image learning, relies on one key observation: if you look at a single image,

even just a single image, you realize the patches within that image are very likely to repeat themselves. Such repetition happens at the same scale, as in these red boxes,

but it can also happen across different scales, as in these green boxes. But why would this repetition happen?

Why would it exist? Because if you look at a scene like this,

with all this grass, you realize these blades of grass are fundamentally the same type of object. They have to look similar because they're the same species, in the same category;

they are just different instances of the same object category. But because the real world is in 3D

and you're seeing objects in a 2D image, there's perspective projection:

you have the 3D space, and you're perspective-projecting it into 2D.

That's why objects that are closer appear larger, and such repetition or similarity may happen across scales,

in addition to happening at the same scale.

So now that we have this observation, how can we leverage it? What people have tried is to combine it with neural networks. For example,

if you have a picture like this, where there are some repetitive patterns, you send this picture to a pretrained neural network,

let's say an ImageNet-pretrained network, and then you can take the feature maps,

these activation maps. Then, for every possible displacement, shifting the feature maps horizontally by x pixels and vertically by y pixels, you can compute how likely these feature maps are to be correlated with themselves: what is the self-correlation of these feature maps after each displacement?

And then you can do an argmax and find the x and y that maximize this self-correlation.

So what do these x and y mean? They probably just indicate the most likely gap between two neighboring repeating objects. If you shift the picture by a certain number of pixels

and the feature maps are likely to repeat themselves, that means that after the shift, the displaced objects look very similar.

That's why x and y are very likely to be the gap, or the distance, between two repeating objects.
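The displacement search described above can be sketched as follows. This is a simplified illustration, not the method from the talk: it works on any array standing in for a feature map, only searches non-negative shifts, and uses a plain normalized correlation. The function name `best_displacement` and the synthetic test pattern are assumptions.

```python
import numpy as np

def best_displacement(feat, max_dy, max_dx):
    """Return the (dy, dx) shift that maximizes the normalized self-correlation
    of a feature map of shape (H, W) or (H, W, C), plus that correlation."""
    feat = np.asarray(feat, dtype=float)
    H, W = feat.shape[:2]
    best, best_corr = None, -np.inf
    for dy in range(max_dy + 1):
        for dx in range(max_dx + 1):
            if dy == 0 and dx == 0:
                continue  # the zero shift trivially correlates perfectly
            a = feat[dy:, dx:]            # map shifted by (dy, dx)
            b = feat[:H - dy, :W - dx]    # overlapping part of the original
            a = a - a.mean()
            b = b - b.mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            if denom == 0:
                continue
            corr = (a * b).sum() / denom
            if corr > best_corr:
                best, best_corr = (dy, dx), corr
    return best, best_corr

# Synthetic "feature map": a random 12x5 tile repeated 8 times horizontally,
# so the true gap between repeating objects is 5 pixels along x.
rng = np.random.default_rng(0)
pattern = np.tile(rng.random((12, 5)), (1, 8))
best, corr = best_displacement(pattern, max_dy=3, max_dx=8)
```

On this synthetic pattern the search recovers the horizontal period of the tile. With real images, `feat` would be the activation maps of a pretrained network rather than raw values.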

And the reason you use these feature maps instead of just RGB pixels is that these recognition networks are trained to be invariant to all the noise in natural images: imaging noise, occlusions, lighting changes, and so on. So hopefully it is more robust. We take a version of this idea,

but we make some changes to it, which allows us to work without prior training. Well, there is

a little bit of prior training, because you take this pretrained network,

but without training on a particular dataset.

You can take these off-the-shelf methods and just test them on a single image, and from a single image you will be able to identify the centroids of these repeating objects.

You can think of this as trying to vectorize a natural image, which is kind of a crazy thing to do. You have these pixels, a rasterized natural picture, and you are trying to vectorize it, to represent the image in a much lower-dimensional space: the centroids of these objects.

Once you have this lower-dimensional space, you can do what people did before. Before, you searched for a program that explains the sketches; now you search for a program that explains where the objects are.

What is not captured, because you no longer assume the world is made of lines and rectangles, is what the object is. You can no longer say that at this point there is a line, or at this point there is a rectangle. At this point there is something, but what is that thing, and how do you parameterize it?

You can parameterize it with a neural network, a generative neural network. So you have a neuro-symbolic, generative representation of the picture, which allows you to do interesting things.

For example, suppose you have this picture with rows of crosses and a missing patch, and the question is how to fill in that missing patch. Intuitively, humans will say: there are all these crosses next to it, so I assume I should put a cross there too.

The power of this hybrid representation is that the programmatic structure, the code, tells you where to look. It tells you: these are the centroids of the other objects, and these are the patches you should look at. How to use these patches to fill in the missing region is delegated to a neural network. The neural network takes all these reference patches and smartly does the image inpainting.

If you look at the result, you find that it is not simply copying a neighboring patch; it is not copy and paste. It looks where it should look and uses the neural network to put in the low-level textures, so the output image looks realistic and also matches our intuition that we should put a cross there.
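The division of labor described here, where the program says where to look and a network decides what to paint, can be sketched as follows. All names are hypothetical, and `placeholder_inpaint` is only a per-pixel median standing in for the neural inpainting network from the talk; it is not the method itself.

```python
import numpy as np

def gather_reference_patches(image, centroids, missing, size):
    """Crop a size x size patch around every known centroid except the missing one."""
    half = size // 2
    patches = []
    for cy, cx in centroids:
        if (cy, cx) == tuple(missing):
            continue
        y0, x0 = cy - half, cx - half
        if y0 < 0 or x0 < 0 or y0 + size > image.shape[0] or x0 + size > image.shape[1]:
            continue  # skip patches that fall off the image
        patches.append(image[y0:y0 + size, x0:x0 + size])
    return np.stack(patches)

def placeholder_inpaint(image, patches, missing, size):
    """Stand-in for the neural inpainter: writes the per-pixel median of the references."""
    half = size // 2
    cy, cx = missing
    out = image.copy()
    out[cy - half:cy - half + size, cx - half:cx - half + size] = np.median(patches, axis=0)
    return out

# Toy image: a 4 x 4 grid of identical crosses, one of them erased.
tile = np.zeros((8, 8))
tile[3:5, :] = 1.0
tile[:, 3:5] = 1.0
image = np.tile(tile, (4, 4))
centroids = [(4 + 8 * i, 4 + 8 * j) for i in range(4) for j in range(4)]
missing = (12, 12)
image[8:16, 8:16] = 0.0
refs = gather_reference_patches(image, centroids, missing, size=8)
filled = placeholder_inpaint(image, refs, missing, size=8)
```

In the toy case every reference patch is identical, so the median restores the cross exactly; on natural images the references differ, which is why the talk delegates this step to a learned model.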

Then you can do extrapolation: you can add another row, not another row of rectangles but another row of crosses, in a natural image. You can do extrapolation on natural images.

And because we know there are no perfect pictures in the natural world, the program tells you where every cross is supposed to be, but of course there must be some slight deviations. You can identify these deviations and then magnify them, that is, magnify how far the objects are from where they are supposed to be. In other words, you can magnify the irregularities, which is potentially very useful for, say, defect detection in industrial production.

You can go back to the cereal example: you can find the centroids of the objects, you can do image inpainting to put back a missing piece of cereal, you can do extrapolation to add another column of cereal, and you can magnify the irregularities.

Okay, this all looks nice, but there is a big difference now that we are in natural images.

There is a big difference between that cereal picture and the corridor picture I showed very early on. For the cereal picture, sure, it is a natural picture, but you are assuming that everything lies on a single 2D plane and that you are seeing this plane from a top-down view.

But this is not the case for natural pictures in general. In this corridor, for example, not everything is on a single plane. Clearly there are multiple planes: there is a ceiling, there is a floor, there are two walls. So you have all these different planes.

So now the question is whether we can generalize this kind of structured representation from a single plane to multiple planes.

This shouldn't be too hard, because all you need is the following. You need the camera parameters, so you have to estimate them. Then you have to find a way to partition the image into multiple planes. And for every plane you have to estimate its pose, its 6-DoF pose, which is its position and its surface normal.

Once you have these things, you can rectify the plane. If you know where the plane is and what its surface normal is, you can rectify the image so that you are seeing the plane from a top-down view. Once you have that, you have reduced the problem to one you already know how to solve, so you can search for programs that explain it. That is how you generalize from a single-plane program to a multi-plane program.
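One standard way to get such a top-down view, once the intrinsics and the plane normal are known, is to apply the homography of a pure camera rotation that aligns the plane normal with the optical axis. The sketch below assumes a pinhole camera with intrinsic matrix `K` and a plane normal given in camera coordinates; the function names are hypothetical and the talk does not spell out its own rectification procedure.

```python
import numpy as np

def rotation_aligning(a, b):
    """Rotation matrix R with R @ a = b for unit vectors a and b (Rodrigues form)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(a @ b)
    if np.isclose(c, -1.0):
        raise ValueError("vectors are opposite; rotation axis is ambiguous")
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def rectifying_homography(K, normal):
    """Homography of a virtual camera rotation that makes the plane fronto-parallel."""
    R = rotation_aligning(np.asarray(normal, dtype=float), np.array([0.0, 0.0, 1.0]))
    return K @ R @ np.linalg.inv(K)

def apply_homography(H, pixels):
    homog = np.hstack([pixels, np.ones((len(pixels), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Toy check: a unit square on a slanted plane at distance 5 from the camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
normal = np.array([0.3, -0.2, 1.0])
normal /= np.linalg.norm(normal)
u = np.cross(normal, [1.0, 0.0, 0.0])
u /= np.linalg.norm(u)
v = np.cross(normal, u)
corners = np.array([5.0 * normal + a * u + b * v
                    for a, b in [(0, 0), (1, 0), (1, 1), (0, 1)]])
pixels = corners @ K.T
pixels = pixels[:, :2] / pixels[:, 2:3]
rectified = apply_homography(rectifying_homography(K, normal), pixels)
sides = np.linalg.norm(np.roll(rectified, -1, axis=0) - rectified, axis=1)
```

After rectification the four sides of the square have equal length in pixels (focal length divided by plane distance), which is the property the single-plane program search relies on.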

This is a hard problem. Of course we still rely on bottom-up visual cues. For example, we estimate where the vanishing point is, as well as the wireframes. 3D wireframe estimation is a really hard problem. It was not that robust back when we were doing this in 2020, and it is still not very robust right now, although I feel it is getting better. So we used 2D wireframes. 2D wireframes are better in the sense that they give you a lot of correct answers, but they also give you a lot of noise, mostly false positives. We will talk about that later.

Before moving on, I would like to draw an analogy. We always try to go in this bottom-up way: we start from raw pixels, identify some bottom-up, low-level visual cues, and then go all the way up to high-level structured programs and repetitions. This has been the case for line drawings, where you first identify the lines and shapes and then search for a program. For single-plane images it is the same: you go beyond lines and shapes and instead find the centroids of the repeating objects, and then you search for a program to explain them. And here there is just one more step in the bottom-up process, where you use vanishing points and wireframes to help with the plane partition.

So you can draw this analogy and use it to guide your thinking about how this is done.

But as we go from synthetic images to natural images, and then to multi-plane images, the problem gets harder and harder. Multiple possible explanations are now something that can happen. For example, in this particular image, the vanishing point is estimated pretty accurately, but the wireframes have many false positives. Based on these wireframes, there seem to be many different explanations of where the planes could be: where is the wall, where is the ceiling, where is the floor? There are numerous explanations.

So which one is correct? As humans we have a lot of prior knowledge, and we say candidate 2 is correct, because we have seen so many corridors, walls and floors and we know what they look like.

But that is not the case for machines, especially if the machine is given only this single image. If the machine has only seen this picture, how would it know that candidate 2 is better than candidate 1 or candidate 3, especially when they all satisfy the vanishing-point and wireframe constraints? So as we move to more and more complex pictures, these visual cues become more and more limited and the problem becomes harder, because the space is larger due to so many more uncertainties.

How could we address this problem? I think there is a fundamental understanding that is required. We have to think again about why this structure exists in the first place. Why would these objects or planes be regular? Why would this symmetry exist, why would this repetition exist, why would the lights repeat themselves?

It is fundamentally because of human preference. Humans introduce preferences, sometimes explicitly and sometimes very subtly. For this particular corridor, for example, it could be that because we like this kind of regularity and structure, when we constructed this building, this corridor, we introduced the prior that the whole thing has to be symmetric and the lights have to repeat at a fixed interval.

So it is because of this fundamental human preference that the structure exists.

What does that mean, and how does it help us solve this inverse problem? It means that if I can solve the low-level problem really well, then the high-level problem should also become easier to solve. Let me give you a concrete example. Take this example and go forward. Here I don't know which of the candidate partitions is the best. I cannot really tell, so let's just assume they are all good, proceed, and see what happens.

If I assume candidate 1 is the correct one, or 2, or 3, what happens? We can move forward assuming each is correct: we estimate the surface normals and positions, use the estimated surface normals and positions to rectify each plane, and then run the algorithm we already know to search for a program that explains them. And what you see is this.

If you have the correct plane partition, then the rectified planes will be very regular, because that is where human preferences exist. So if you search for a program that explains the rectified planes, the inferred program will be much simpler, as you can see in the middle one, and if you use the program to reconstruct the planes, the reconstruction will be much better.

On the other hand, if the plane estimation, the low-level problem, is not solved very well, you get incorrect planes and incorrectly estimated surface normals. After you rectify, the planes look awkward and you cannot get a good, simple program to explain them. The program you get will be much more complex, and the reconstruction will be much worse.

So the fundamental observation here, which I think is very deep, is that the bottom-up problems, things like surface normal estimation, vanishing point estimation, or plane partition in this particular case, are not independent of the high-level program search problem. Plane partition and program synthesis can and should help each other, because what connects them is human preference.

So the bottom-up visual perception problem and the top-down program synthesis and reasoning problem should really help each other. In this particular case, we can use program synthesis to tell us which plane partition is best, which here is candidate 2.
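The idea of letting program search score the candidate partitions can be illustrated with a deliberately tiny stand-in. Here each candidate is reduced to the 1D positions of some repeated element (say, lights) after rectification, and the "program" is either a three-parameter loop or an explicit list. The scoring rule, the tolerance and all names are assumptions made for this sketch, not the search procedure used in the work described.

```python
import numpy as np

def fit_loop_program(positions, tol=0.05):
    """Explain positions as `for k in range(n): place(start + k * step)` if possible.

    Returns (program, description_length, reconstruction_error).
    """
    y = np.sort(np.asarray(positions, dtype=float))
    k = np.arange(len(y))
    step, start = np.polyfit(k, y, 1)            # least-squares line: slope, intercept
    error = float(np.mean((start + step * k - y) ** 2))
    if error <= tol:
        return {"kind": "loop", "start": start, "step": step, "count": len(y)}, 3, error
    # Fall back to listing every element: exact, but as long as the data.
    return {"kind": "explicit", "positions": y.tolist()}, len(y), 0.0

def select_partition(candidates, weight=1.0):
    """Pick the candidate whose best program is shortest and reconstructs best."""
    scores = []
    for positions in candidates:
        _, length, error = fit_loop_program(positions)
        scores.append(length + weight * error)
    return int(np.argmin(scores)), scores

# Rectified positions under three hypothetical plane partitions.
candidates = [
    [0.0, 1.0, 3.5, 4.0, 9.0, 9.5],    # wrong planes: irregular after rectification
    [0.0, 2.0, 4.1, 6.0, 7.9, 10.0],   # correct planes: nearly evenly spaced
    [0.0, 0.5, 1.0, 6.0, 8.0, 10.0],   # wrong planes: irregular after rectification
]
best, scores = select_partition(candidates)
```

The candidate that admits a short loop program wins, which mirrors the argument that the correct partition is the one under which the scene looks most regular.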

And you will be able to get the right plane partition, get the right program for each plane, and then have this programmatic representation of the scene.

What you can do afterwards is ask: what would happen if I move forward? Compare with purely autoregressive methods. This was back in 2020, so now you can probably do a bit better, but even for very long-range, long-term prediction, you can see that the scene keeps its structure instead of just getting blurrier and blurrier. And beyond prediction, what about extrapolation? If you are standing in this corridor and you say, I am not moving forward, I am moving backward, what would you expect to see? You should expect the model to tell you that lights keep coming in: if you move backward, another light should come into view, because all these things repeat themselves. You expect another light coming in, instead of a blurry picture that just makes the existing scene look smaller and further away.

Or you can ask, given a single image, what would happen if I look around. If I turn around, what am I going to see? If I only see the front of a corridor and you ask me what is behind me, of course I don't know, and there are infinitely many possibilities. But if I have to pick one, I would say: maybe I will just produce an infinite corridor. Of course there are no infinite corridors in the world, but it seems the most plausible explanation I have. So I turn around and produce an infinite corridor where the lights and shadows and all these things keep repeating themselves, instead of producing something very blurry.

So you can go from this corridor example to a building example, where everything is the same, except that for the corridor you can think of it as a box with you standing inside, whereas for a building it is the same box but you are looking at it from the outside. Therefore, for corridors you have one vanishing point, and for buildings you have two vanishing points. Everything else is the same, so you can do extrapolation as we showed earlier.

Among existing methods, some keep the structure really well but do not respect the input that well, and some respect the input but the reconstruction does not look as good. With these structured methods you can do both.

One final thing I want to say. You might object that we now have diffusion models that do much better, and that is true. I think it is still very interesting to think about how these pixel-level methods can be more effectively integrated with structured representations. For example, a common issue with these very powerful generative models, especially in 3D, is that the outputs often have multiple heads. That is because of the bias in the picture data: people are very likely to take pictures from the front of the dog. So every picture you have of a dog is very likely to show a head, with the dog facing toward you, and you end up generating dogs in 3D with multiple heads. Having some kind of structured knowledge will hopefully help address this problem.

And I should say this was really done by two fantastic students, although they work on very different things now. I like the work a lot and I still talk about it. It was with Jiayuan Mao, who is now a PhD student at MIT, and Yikai Li, who is a PhD student at Stanford.

Okay, now I want to move on to some more recent work. We have talked about program synthesis for visual data: we started with sketches and moved to natural images with single-image learning, and then we went from a single plane to multiple planes. So what's next?

You can think of line drawings or single-plane images as doing everything in 2D. With multiple planes you still have 2D, but every plane has a surface normal, you are warping it, and you put the planes together. There is no detailed geometry, so it is not quite what I would call 2.5D, but you do have a kind of envelope representation of the scene. So it is arguably a little less than two and a half, but I will just call it 2.5D.

So naturally the next step is to move to 3D, and there is a fundamental difference I want to emphasize, one that shows up in human perception as well. With 2D and 2.5D, you often have a viewer-centric representation. It is all about scenes: the camera is at your eyes and the coordinate system is centered around your eyes as well; that is the world coordinate system. When we talk about 3D there is often a big change: now you center around objects. Objects have their own coordinate systems, and you see objects from a totally different perspective, where the origin is no longer at your eye but at the center of the object.

Objects in particular are very interesting. Scenes versus objects requires a kind of paradigm shift in a lot of the methods and representations we talked about before. Conceptually things should be transferable, but there are deep discrepancies between them, which I think warrant a lot of further study. In this particular case, we can look at 3D shapes, which often have very abstract, program-like structure. This is again because when humans make these objects, especially human artifacts, we really want them to be regular: the top of the table, the legs of the chair. We have a strong preference for them to be regular and repetitive. There are also pragmatic concerns: if the table legs are not equally long, the table won't be stable. You want it to be stable, and you want it to be cheap and efficient to make. All these considerations suggest that these shapes have to have this kind of structure.

Due to time constraints I won't be able to talk about how we do this in detail, but let me quickly show you the results. We use learning methods to take a shape and infer a programmatic representation of it. This was very much inspired by work in both program synthesis and computer graphics, where there is a huge line of work on procedural models for shapes.

We use neural networks for inference, and because there are very limited annotations for shape programs, we also have a neural network as a program executor, so that you can do mostly self-supervised training. But if you do this in a very simple way, saying that shapes have regular structure, the legs of a chair are cuboids and the top of a table is a cylinder, then you fall back into the limitation I talked about very early on. I said that line drawings are made of just rectangles and lines, but the world is not made of rectangles and lines; the world is much more complex.

In 3D you have the same issue. You have this table, and you say the table looks nice because the top is a cylinder and the legs are cuboids. That is kind of funny, because the world is not made of just cylinders and cuboids. If you look at the chair you are sitting on right now, it does have this kind of structure: it has to be stable, there are repetitions in the legs, and so on. But the detailed geometry of the legs, the back, or the bottom of the table or chair has these fine, beautiful curvatures and so on, which are not captured by simple geometric primitives.

So most recently, at NeurIPS, we had a more recent work where we integrate this shape program representation, which is highly structured and symbolic, with neural primitives, which are parameterized by implicit representations. Neural implicit representations are very popular now, in particular because of NeRF, where people use them for appearance, but before that people were already using implicit representations for geometry with deep networks, as in DeepSDF and that line of work. The concept is very similar to what we did before in 2D: you still have this structured program, but what an object is, what the cereal is, is parameterized by a neural network.

Here it is the same story. You have airplanes and chairs, and you have programmatic representations for them. What does that mean? Take the left wing and the right wing of the airplane. The program structure tells me they have to be the same thing, because the plane has to be symmetric; otherwise I would not want to sit in the airplane. That is what a programmatic representation can tell you: the repetition of these parts. But what is the wing? Just like the question of what the cereal is, the wing is not simply parameterized by lines or cuboids; it is parameterized by an implicit neural network. The implicit neural networks parameterize what the wing is or what the engine is, and the programmatic structure tells you that the wing should be reused, and the only thing that changes is its pose. So it is reused on the left and on the right, and the same goes for the engine.
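A minimal sketch of "one implicit part, reused under different poses" is shown below. To keep it runnable without a trained model, `wing_part` is an analytic signed distance function for a box standing in for a learned implicit network, and the "program" is just a mirrored reuse of that part; all names and dimensions are hypothetical.

```python
import numpy as np

def wing_part(points):
    """Stand-in for a learned implicit part: signed distance to a box.

    In the setting described above this would be a neural network; a box is
    used here only so the example runs on its own.
    """
    half = np.array([1.0, 0.2, 0.05])    # half-extents of the box
    center = np.array([1.5, 0.0, 0.0])   # where the right wing sits
    q = np.abs(points - center) - half
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(q.max(axis=-1), 0.0)
    return outside + inside

def mirror_x(points):
    """Pose change applied by the program: reflect across the x = 0 plane."""
    return points * np.array([-1.0, 1.0, 1.0])

def airplane_wings(points):
    """Program structure: the same part is evaluated twice, once per pose."""
    right = wing_part(points)
    left = wing_part(mirror_x(points))   # same primitive, mirrored pose
    return np.minimum(left, right)       # union of the two parts

rng = np.random.default_rng(0)
samples = rng.uniform(-3.0, 3.0, size=(1000, 3))
```

Because both wings query the same function, the symmetry holds by construction rather than being something the shape has to learn.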

So the program tells you the structure, the repetition and symmetry of the objects, while the neural primitives parameterize the actual detailed geometry. Of course you could do appearance as well, like NeRF, but here we just do the geometry of the parts.

Earlier work on symbolic representations for shapes mostly used different ways of parameterizing the basic primitives. If you instead use a neural network to learn the basic primitives, you get much higher fidelity, and the symbolic structure also enforces, for example, the symmetry of the airplanes and the regularity of the legs and so on. We are happy to talk about this work offline. I think we are running out of time, so I should move on to the final, and I feel the most exciting, piece of work.

I have been suggesting this a few times. We have all these visual programs, and we have been talking about how to get them, but there is a more fundamental question: why does this structure exist in the first place? I have been suggesting that a lot of this program-like structure originates from human preferences in the fabrication process.

For example, look at this vase. It is beautiful. This is an RGB picture, but the vase looks the way it does because there are intrinsic images: underlying components that are put together to produce the final RGB picture. These include, for example, the geometry of the object, its surface normals; the albedo, which is the texture; and the material, which is how the object reflects light. For example, this vase looks like porcelain because of the way it reflects light. So these are the underlying components that get put together and produce the image.

In computer graphics, this process is called rendering. That is why a large part of computer vision is what people call inverse graphics or inverse rendering: you try to invert this process and recover the underlying components of what is in the image.

But the problem is very hard. Think about it: you have a number of components A, B and C, and you know that when they are multiplied together, the final output is the image. Given only the picture, how can you tell what the underlying A, B and C are? That is just impossible. The problem is so ambiguous that you have to rely on different levels of inductive biases. What inductive biases might we have? I would say it is the program-like structure, the human preferences, that exist in these intrinsic images.

In the case of this particular vase, let's look at them one by one. Are the surface normals regular or not? I would say the surface normals have very strong regularity, because when we made the vase we wanted it to be rotationally symmetric. We have a strong preference for making things regular.

The same holds for the material: I know the object is homogeneous, made of the same material everywhere, so every point on the object reflects light in the same way. So the surface normals and the material of the object have very strong regularity: the shape is rotationally symmetric and the material is homogeneous. I will call this explicit regularity.

What about albedo, the texture of the object? The texture is implicitly regular, I would say. It is not that every pixel on the vase has the same color; it is not homogeneous, not a pure color. But it does look to me as if patches on the albedo map look similar: the vase has similar textures everywhere. So I would say it has implicit regularity. I don't know how to enforce it as equations, but it has regularities that are implicit to us.

Then there are the lighting components: the specular components, the environment maps. They may sometimes have a little regularity, but especially indoors it is just so complex that I would say no, they are not regular. Let's not worry about them.

So now we can see that even though these intrinsic images look so complex, we actually have some signals about their symmetries, their structures, the biases in them, which may allow us to solve this seemingly impossible problem: to disentangle and infer these intrinsic images with just an RGB image as input.

This work was really done by a very talented student, Shangzhe Wu. He was a PhD student at VGG at Oxford back then. That was when I was at Google, and he came and did an internship. We thought: is it possible to leverage that kind of structure for image de-rendering? That is, take this picture and leverage the symmetries, the structures we already know, the different levels of inductive biases we have about these intrinsic images, so that we can de-render, that is, infer them from the input. Once you have that, you can do fancy applications.

For example, you can do novel view synthesis: turn the vase from a single picture into 3D and see it from different views. And because you are modeling the material of the object, how it reflects light, you can relight the object: going from an unannotated picture of the vase, you can imagine how the vase would look under different lighting conditions.

So let me talk about how we enforce these different kinds of structure. The first one, the shape, is very simple. We have explicit regularity in the geometry: we know it is rotationally symmetric. So you can parameterize the shape using a solid-of-revolution representation, with the height, which is a scalar, and the radius, which is a vector: the radius at different heights. That gives you a structured parameterization of the object shape. Once you have that, you can render a silhouette and compare it with the ground-truth silhouette. That is your first loss, and that is how you enforce regularity in the geometry, in the surface normals.
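The solid-of-revolution parameterization can be sketched directly: a scalar height and a vector of radii determine the whole surface, and a silhouette of the result can be compared against a target mask. The functions below are hypothetical and simplified (an upright, orthographic side view with a hard, non-differentiable mask); a training setup would need a differentiable renderer and a posed camera.

```python
import numpy as np

def revolve(radii, height, n_angles=64):
    """Surface points of a solid of revolution about the z-axis.

    radii[i] is the radius at the i-th of len(radii) evenly spaced heights.
    Returns an array of shape (len(radii), n_angles, 3).
    """
    radii = np.asarray(radii, dtype=float)
    z = np.linspace(0.0, height, len(radii))
    theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    x = radii[:, None] * np.cos(theta)[None, :]
    y = radii[:, None] * np.sin(theta)[None, :]
    zz = np.repeat(z[:, None], n_angles, axis=1)
    return np.stack([x, y, zz], axis=-1)

def side_silhouette(radii, width=65):
    """Orthographic side-view mask: row i is filled where |x| <= radii[i]."""
    xs = np.linspace(-1.0, 1.0, width)   # assumes radii are scaled to [0, 1]
    return (np.abs(xs)[None, :] <= np.asarray(radii, dtype=float)[:, None]).astype(float)

def silhouette_loss(radii, target_mask):
    """Mean squared difference between the rendered and target silhouettes."""
    rendered = side_silhouette(radii, target_mask.shape[1])
    return float(np.mean((rendered - target_mask) ** 2))

profile = [0.3, 0.6, 0.8, 0.5, 0.2]      # a vase-like radius profile
surface = revolve(profile, height=2.0, n_angles=16)
target = side_silhouette(profile)
```

Every ring of `surface` sits at a constant distance from the axis, so rotational symmetry is built into the representation instead of being penalized after the fact.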

Once you have the shape, you can unwrap the object to get surface normals and texture. On the unwrapped maps, you can do the standard intrinsic image decomposition into lighting, material, and albedo.

Then you put it back together: during re-rendering you compute the lighting components and put back the albedo so that you can reconstruct the texture, then you put in the object pose and shape and reconstruct the original image. That is your second loss, a pixel-level photometric loss. Here you assume you can parameterize the object's material using the same parameters everywhere: every point on the object has the same material parameters and reflects light in the same way. This is how you enforce explicit regularity in the object's material.

These are all not that surprising. I think the harder problem is how to enforce the implicit regularity in albedo. The albedo looks similar everywhere, but how do we enforce that? It is not that this pixel has the same RGB value as that pixel.

We did it in two ways. The first: if the albedo looks similar everywhere, then if I compute the mean albedo and put it back, the picture reconstructed with the mean albedo should look similar to the original picture too.
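The mean-albedo check can be written as a small loss function. The sketch assumes a simple image model, image = albedo × diffuse shading + specular, which is a common simplification and not necessarily the exact rendering model used in the work; the function and argument names are hypothetical.

```python
import numpy as np

def mean_albedo_loss(albedo, shading, specular, image, mask):
    """Re-render with the object's mean albedo and compare with the image.

    albedo, image: (H, W, 3); shading, specular: (H, W); mask: (H, W) bool.
    """
    mean_color = albedo[mask].mean(axis=0)                    # one RGB value
    flat_albedo = np.where(mask[..., None], mean_color, 0.0)  # constant on the object
    recon = flat_albedo * shading[..., None] + specular[..., None]
    return float(np.abs(recon - image)[mask].mean())

rng = np.random.default_rng(0)
h, w = 16, 16
mask = np.ones((h, w), dtype=bool)
shading = rng.uniform(0.2, 1.0, size=(h, w))
specular = np.zeros((h, w))
uniform = np.ones((h, w, 3)) * np.array([0.6, 0.4, 0.2])
textured = uniform + rng.uniform(-0.2, 0.2, size=(h, w, 3))
loss_uniform = mean_albedo_loss(uniform, shading, specular,
                                uniform * shading[..., None], mask)
loss_textured = mean_albedo_loss(textured, shading, specular,
                                 textured * shading[..., None], mask)
```

The loss is near zero when the albedo really is constant and grows with texture variation, so as a regularizer it only encourages the albedo to stay close to its mean rather than forcing it to be flat.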

So that is an easy way of enforcing it. The more exciting way of enforcing it is this. When we say the albedo maps look similar to us, perceptually similar to humans, what does that mean? It means that no matter where on the object an albedo patch comes from, whether it comes from a highly specular region of the vase or from a non-specular region, the albedo should look similar to humans everywhere. And if the patches look similar to humans, they should also look similar to machines. So you can enforce that by sampling patches from the albedo map and sending them to what we call a self-supervised albedo discriminator, and the discriminator should not be able to tell where an albedo patch comes from.

Specifically, you predict the albedo maps and the specularity maps. The question is whether an albedo patch comes from a region that is highly specular or from a region that is not specular at all. If you send it to a discriminator whose goal is to classify where the albedo patch comes from, it should not be able to do very well, and the goal of your generator is to produce albedo maps that are consistent everywhere, regardless of specularity, so as to confuse the discriminator.

This is important because the generator has to solve a very challenging problem: intrinsic image decomposition in highly specular regions is really hard, because the region is overexposed, pure white. But if you tell the generator that even when the input is pure white it has to generate an albedo map that looks similar to the non-specular regions, that forces the system to have this implicit regularity and produce a consistent albedo map.

This was really important for us to achieve very good results.
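The training signal for such a discriminator can be sketched without committing to a network architecture: patches are labeled by whether they come from a specular region, the discriminator is trained to predict that label, and the generator is rewarded when the prediction is uninformative. Everything below is an assumption made for illustration, including the 0.5 specularity threshold and the choice to push predictions toward 0.5; the original formulation may differ.

```python
import numpy as np

def sample_labeled_patches(albedo, specularity, patch=8, n=32, thresh=0.5, rng=None):
    """Sample albedo patches and label each 1.0 if its region is mostly specular."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = specularity.shape
    patches, labels = [], []
    for _ in range(n):
        y = int(rng.integers(0, h - patch + 1))
        x = int(rng.integers(0, w - patch + 1))
        patches.append(albedo[y:y + patch, x:x + patch])
        labels.append(float(specularity[y:y + patch, x:x + patch].mean() > thresh))
    return np.stack(patches), np.array(labels)

def binary_cross_entropy(prob, target, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)).mean())

def discriminator_loss(prob_specular, labels):
    """Discriminator objective: guess which kind of region a patch came from."""
    return binary_cross_entropy(prob_specular, labels)

def generator_confusion_loss(prob_specular):
    """Generator objective (one possible choice): make the guess uninformative."""
    return binary_cross_entropy(prob_specular, np.full_like(prob_specular, 0.5))

rng = np.random.default_rng(0)
albedo = rng.uniform(size=(32, 32, 3))   # stand-in for a predicted albedo map
specularity = np.zeros((32, 32))         # stand-in for a predicted specularity map
specularity[:, 16:] = 1.0
patches, labels = sample_labeled_patches(albedo, specularity, rng=rng)
```

In a full system `prob_specular` would be the output of a small classifier over `patches`, and the two losses would be minimized alternately, as in standard adversarial training.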

And so here are the final results. By putting in different levels of inductive biases, or different intrinsic image components, we are able to work from just a single image at test time. During training we use a collection of images of vases, but there is no annotation, just a collection of images, with no 3D annotations at all.

So it is purely unsupervised: training on a collection of vases, and at test time, from a single vase, you can infer all these intrinsic image components, namely albedo, surface normal, diffuse and specular components, and materials. And at test time, going from a single image, you can visualize it, you can see the vase from different views, and you can relight it.
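To make the relighting step concrete, here is a minimal per-pixel sketch of how intrinsic components can be recombined into a color. It assumes a Phong-style model (Lambertian diffuse term plus a white specular highlight), which is a common textbook choice and not necessarily the model used in the talk. The names `shade`, `k_spec` and `shininess` are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shade(albedo, normal, light_dir, view_dir, k_spec=0.3, shininess=16.0):
    """Recompose one pixel from its intrinsic components.

    albedo: (r, g, b) base color; normal, light_dir, view_dir: 3-vectors.
    Returns albedo * diffuse shading + a white specular highlight.
    """
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    n_dot_l = dot(n, l)
    diffuse = max(0.0, n_dot_l)
    specular = 0.0
    if n_dot_l > 0.0:
        # Mirror reflection of the light direction about the normal.
        r = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
        specular = k_spec * max(0.0, dot(r, v)) ** shininess
    return tuple(a * diffuse + specular for a in albedo)
```

Under this assumed model, relighting amounts to calling `shade` again with a new `light_dir` while keeping the inferred albedo and normal fixed, and novel view synthesis changes `view_dir` (and the camera) instead.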

Here are more results from a different dataset, which is a bit more complex, with more background, but you can see that we can do equally well. And again, even for paintings, you can turn the picture into 3D and then see it from different views; you can do all of this view synthesis and you can do relighting.

Okay, finally, let me wrap up by going back to the original story. At first I was saying that you have line drawings, and line drawings have lines and rectangles, which are not general, so you want to generalize to natural images. Then you have 3D shapes, and 3D shapes made of rectangles and cylinders are not good enough; you want to handle general objects.

But now, for intrinsic image decomposition, I am saying: this is great, you have all these albedo components, but then you assume the thing has to be rotationally symmetric. That is cool, but not every object in the world is rotationally symmetric. There are very few objects that are purely rotationally symmetric. How can we go from this assumption that objects are symmetric to more general objects? If I show you, let's say, this beautiful bouquet of roses, the question is: how can we generalize what we can do from things that are rotationally symmetric to general objects?

Again, we have to really think about why these structures exist in the first place. Humans have a strong preference for making things regular and repetitive, but the roses here are not made by humans; they are made by nature. Nature also has this very strong inductive bias. As I said earlier, all these grasses look similar because they are the same species, and all these roses look similar and have regularities because they belong to the same object category. They are just different instances of the same object category.

So if you look at these roses, they have this similarity, but the similarity exists because fundamentally they are the same type of object. They are multiple instances from the same object category. There is a reason that we call them the same object class; there is a reason we give them a name. So what they really share is not a rotationally symmetric object representation. What they really share is something that belongs to them, which is their object intrinsics. Every rose is called a rose because they share this distribution of intrinsics, including their geometry (shape), their texture (how they look), their material (how they reflect light) and, of course, their physics (how heavy they are, and so on).

But here, for vision and graphics purposes, we care about how we can learn a generative distribution of their geometry, texture and material, so that you can get rid of the rotational symmetry assumption and really learn a generative distribution of object intrinsics.

So we basically adopt the pipeline we had before, but now we get rid of the assumption that everything has to be rotationally symmetric. We still have this regularity, or structure, or code. Where is the code coming from? It comes from natural supervision: these things all share the same intrinsic distributions, provided to us by nature, so they have to share the same distribution of geometry and albedo. You enforce that by learning to approximate that distribution generatively, coupled with extrinsics from the world, including object pose and lighting, so that you get the shape, shading and appearance representations and can produce a picture. The goal is that this picture should look like a natural picture from the real world.

So by enforcing this kind of more generalized constraint, or code, you are again able to learn from a single image. All we need is a single picture of a bouquet of roses. Maybe there are 20 or 30 roses, but just from a single image we are able to learn a distribution of their intrinsics, which includes geometry, texture and material. That allows us to do novel view synthesis, as in the row at the bottom, and novel view synthesis with relighting, as in the second row from the bottom. These are things we knew how to do before. But we also capture a generative distribution of these roses, shown in the middle row: you can see that we can sample roses of different sizes, because we have learned the generative distribution of their intrinsics. And at the same time we are not only sampling different roses with different geometry, but still doing novel view synthesis and relighting as before.

And here are more results on different types of objects, planes or cranes and so on. All you need is a single image: you learn a generative distribution of their intrinsics, you can do novel view synthesis, and you can do relighting.

Okay, so to wrap up: we talked about things in 2D, then 2.5D, then 3D, and about how we can generalize the things we can already do to 3D. But more fundamentally, why would this programmatic structure exist in the first place, and how can we exploit it to do even fancier things, such as relighting or capturing a generative distribution of intrinsics? Something I did not have time to talk about at all is time, and the clear programmatic structure in, for example, human motion, but that is an interesting topic to talk about offline.

So to summarize the key motivation behind this line of work: whenever people talk about anything that is not purely neural, it is some kind of neural network for recognition, and then a symbolic representation, or code, for generation. But here I want to say, first, that you have to have a very broad interpretation of what code is. At the beginning we said code can be for loops, but more importantly, code can just be different types of inductive biases that you put into a neural network. So the whole thing is still a neural network; it is still end-to-end trainable.

When people think about code, they often say, "Oh wait, that's not good, for loops are not very generalizable." For loops may indeed not be very generalizable, except in some specific domains. But code does not have to be for loops. Code can be very general inductive biases that you introduce into neural networks, while the whole thing is still a neural network and still end-to-end trainable, as we showed at the end in the rose example.

And now the question is: where is this code coming from? Sure, you can put in inductive biases, but which ones should I pick? I would say these are the things that should be naturally supervised. And when I say naturally supervised code, I would say there are fundamentally just two sources, or two reasons, for this code.

One is humans. Humans have strong preferences, and when we do fabrication, when we make things, we introduce our own preferences. That is the prior, or the code, coming from humans.

The second is that the code comes from nature; natural supervision comes from nature. If you believe in evolution, then when there are instances of a species, there is a reason that they share these fundamentally similar intrinsic properties, such as geometry and reflectance. And when the world is 3D and you are looking at pictures in 2D, there is perspective projection. These are fundamental, naturally supervised codes that we may consider incorporating, and we like them because they are universal: they are applicable everywhere. They should be applicable to every picture we take, so there is no reason to believe that they will hurt generalization; instead, they should really help generalization.
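Perspective projection, mentioned above as a universal prior, can be written down in a few lines. This is a minimal pinhole-camera sketch with assumed parameter names (`focal`, `cx`, `cy`); it is not taken from the talk.

```python
def project(point, focal=1.0, cx=0.0, cy=0.0):
    """Pinhole perspective projection of a camera-space 3D point (x, y, z)
    onto the image plane. Requires z > 0 (point in front of the camera)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by depth is what makes distant objects appear smaller.
    return (focal * x / z + cx, focal * y / z + cy)
```

Because the same division by depth applies to every photograph, building it into a model constrains the model without assuming anything about the particular objects in the scene.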

In the future, we can think about how to extend this to more complex scenes, with complex backgrounds and complex interactions between lighting, objects and background; some are more programmatic, some are less. And if you want to do more generalizable representation learning, what is really the role of symbols? If you do want something that is not purely neural, how can we have more efficient inference algorithms, given that this makes the optimization problems much harder? And how can we go from passive perception to interaction, to interacting with the scenes?

You can see that a lot of these questions are cognitively inspired, so how can we draw connections to human cognition and to natural language? Language, meaning how we refer to things and how we talk about things, is another important source of natural supervision. These are very interesting questions, but I don't have time to talk about them here.

That's it. Thank you.

Okay.





That is, we do have an assumption, but not entirely. For example, if you look at the roses, [inaudible]. It is really more of an assumption about the scene [inaudible]. If you look at our earlier work, there is some flexibility there. But what you raise is a very good future direction: how can we model whatever you care about, and at the same time capture the dynamics, that is, have an additional model of how things will evolve? How can we have a dynamic representation that captures these changes?


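That is, how can we learn articulated objects from a collection of images? From a video of, say, a horse, which has many different parts, how can we learn these different articulations? On top of that, people are recently looking at how to not only learn the parts but also animate them, to make them move. I think that is a very good future direction, or actually not even a future one; it is already an ongoing direction that many groups are looking at.

Okay, let us thank Professor Wu once again for the wonderful talk. I would ask Jiajun to stay, since we have a panel session right after this.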








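First of all, I would like to thank everyone for supporting the BAAI conference. I listened to several of the talks just now, including that talk, and also Jiajun's wonderful talk. Right, coming back to the question just now.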












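... will be very realistic. Right now video may not be that good yet, but in the future video will certainly get better and better, since there is a lot of video data online. There seems to be no reason why the dynamics, at least from the perspective of appearance, or rather from the perspective of looking real, would not be very good. So I think that, in the end, this is no longer a scientific question.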









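That is, do you believe we will have a next, more powerful generative modeling framework?

Right. This question is actually quite hard to answer. But if you ask whether I believe it, I certainly believe there will be one, because we are all working on this [inaudible]. It will be full of surprises. Before diffusion models came out, everyone said that GANs were already very good and dominated many things. Then suddenly one day diffusion came out and overtook GANs, and now most people are embracing diffusion models. But diffusion models are, in essence, not without limitations either. We are already seeing many problems, including in the talks people heard today.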








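Both of these approaches may currently have their problems, but each also has its own unique advantages. How do you see these two schools of thought, or their future prospects, especially in 3D?

Right. I don't think there is really a difference between them. I think they may have the same goal but start from different points. Take NeRF, for example: it has a notion of space. It has a strong prior that there is something called space, and that there is something like light transport, and you work according to that [inaudible]. So I don't think these things are necessarily different in essence. There is also some recent work that I find interesting on the question of what a concept is.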


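It is like what Professor Zhu Jun was saying: what exactly does this mean, and how do you turn Professor Li Xuan into Professor Li Hongxuan? There are some interesting concepts in there, and it has many applications. For example, say I want to generate something and I have some pictures of Professor Li Xuan.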


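So recently there have been several papers, several of them just today, that are closely related. You could say they share the same data-driven perspective, but in a sense they are to a large extent about the notion of a concept. Just like the notion of space earlier, it may be a very [inaudible] but very important inductive bias. And as I said earlier about so-called inductive bias, or structure: there is important structure, but that does not mean the structure has to be [inaudible]. For example, if I really want to write a program, [inaudible] it becomes more and more specific [inaudible].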









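We hope to put understanding and generation together in one model. That is, for the modalities we deal with in understanding, taking human-computer interaction as an example, our input includes a talking face and its [inaudible].

Then we can build a better multimodal dialogue, so that what we generate, including the interaction with people, is more realistic and also more engaging for people.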








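In the short term, I think what is likely to come next is, first, that I see no reason why there cannot be a [inaudible]. Also, how can we do human-machine interaction in an effective way, and what is the best handle for bridging human and machine? I think there is still a lot to do there. Of course, this also involves many social science questions, but that may be more long term.

Okay, thank you, Jiajun. Finally, let me ask Professor Zhu Xin to give an outlook and a summary.

It can't really count as a summary. I completely agree with what the two professors just said. Coming back to the question of what I find most [inaudible] for the future, or what I think is worth doing,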











