Some thoughts on BERT
BERT structure
First comes the embedding lookup: [batch, seq] --> [batch, seq, hidden].
Then the token-type (segment) embedding and the position embedding are added on top to form the final input representation (the input mask is not an embedding; it is applied later when masking padding in attention).
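A minimal PyTorch sketch of this embedding step (the module name and defaults below are my own; the sizes are the usual BERT-large settings: hidden 1024, vocab about 30k, 512 positions, 2 token types):

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Embedding lookup plus token-type and position embeddings."""
    def __init__(self, vocab_size=30522, hidden=1024, max_pos=512, type_vocab=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)   # [batch, seq] -> [batch, seq, hidden]
        self.type_emb = nn.Embedding(type_vocab, hidden)   # segment / token-type embedding
        self.pos_emb = nn.Embedding(max_pos, hidden)       # learned position embedding
        self.layer_norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        # sum the three embeddings, then layer norm + dropout
        x = self.word_emb(input_ids) + self.type_emb(token_type_ids) + self.pos_emb(positions)
        return self.dropout(self.layer_norm(x))

emb = BertEmbeddings()
ids = torch.randint(0, 30522, (2, 16))            # [batch=2, seq=16]
types = torch.zeros(2, 16, dtype=torch.long)
print(emb(ids, types).shape)                      # torch.Size([2, 16, 1024])
```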
Next is the transformer stack: 24 layers in BERT-large (12 in BERT-base), each layer consisting of self-attention plus a dense feed-forward block (intermediate dense, layer norm, residual connections).
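A rough sketch of one such encoder layer (post-layer-norm ordering, as in the original BERT; nn.MultiheadAttention stands in for BERT's own attention code, and the sizes are BERT-large defaults: hidden 1024, 16 heads, intermediate 4096):

```python
import torch
import torch.nn as nn

class BertLayer(nn.Module):
    """Self-attention -> add & norm -> intermediate dense -> add & norm."""
    def __init__(self, hidden=1024, heads=16, intermediate=4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(hidden)
        self.intermediate = nn.Linear(hidden, intermediate)  # "intermediate" dense
        self.output = nn.Linear(intermediate, hidden)
        self.out_norm = nn.LayerNorm(hidden)
        self.act = nn.GELU()

    def forward(self, x, key_padding_mask=None):
        # self-attention + residual + layer norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.attn_norm(x + attn_out)
        # feed-forward (intermediate) + residual + layer norm
        ffn_out = self.output(self.act(self.intermediate(x)))
        return self.out_norm(x + ffn_out)

# BERT-large stacks 24 of these layers on top of the embedding output
encoder = nn.ModuleList([BertLayer() for _ in range(24)])
```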
Now look at attention_layer.
First, query, key, and value all come from the same input tokens (self-attention); each is first passed through its own linear transformation, mapping [batch, seq, hidden] to [batch, seq, num_heads * head_size].
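A small sketch of those projections and the scaled dot-product attention that follows (the function and weight names are hypothetical; 16 heads of size 64 is the BERT-large split, so heads * head_size equals the hidden size 1024):

```python
import math
import torch
import torch.nn as nn

def attention_layer(x, w_q, w_k, w_v, heads=16, head_size=64):
    batch, seq, hidden = x.shape
    # linear transforms: [batch, seq, hidden] -> [batch, seq, heads * head_size]
    q, k, v = w_q(x), w_k(x), w_v(x)

    def split_heads(t):
        # -> [batch, heads, seq, head_size]
        return t.view(batch, seq, heads, head_size).transpose(1, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    # scaled dot-product attention over the sequence
    scores = q @ k.transpose(-1, -2) / math.sqrt(head_size)
    probs = scores.softmax(dim=-1)
    ctx = probs @ v                                   # [batch, heads, seq, head_size]
    # merge heads back: -> [batch, seq, heads * head_size]
    return ctx.transpose(1, 2).reshape(batch, seq, heads * head_size)

hidden = 1024
w_q, w_k, w_v = (nn.Linear(hidden, hidden) for _ in range(3))
out = attention_layer(torch.randn(2, 16, hidden), w_q, w_k, w_v)
print(out.shape)                                      # torch.Size([2, 16, 1024])
```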