BGE embedding implementation
First, let's look at RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder.
The above designs of RetroMAE are favorable to pre-training effectiveness for the following reasons. First, the auto-encoding task is made more demanding on encoding quality: conventional auto-regression may attend to the prefix during decoding, and conventional MLM only masks a small portion (15%) of the input tokens. By comparison, RetroMAE aggressively masks most of the input for decoding, so the reconstruction cannot be completed from the decoder's input alone and must rely heavily on the sentence embedding. This forces the encoder to capture the in-depth semantics of the input. Second, it ensures that the training signals are fully generated from the input sentence: conventional MLM-style methods derive training signals from only 15% of the input tokens, whereas RetroMAE derives them from the majority of the input.
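To make the asymmetric corruption concrete, here is a minimal PyTorch sketch: the same sentence is masked lightly for the encoder and aggressively for the decoder, so the decoder-side reconstruction has to lean on the sentence embedding. The mask ratios, token ids, and helper name below are assumptions for illustration, not values taken from the paper.

```python
import torch

def mask_tokens(input_ids, mask_ratio, mask_token_id=103, special_ids=(101, 102, 0)):
    """Replace a random fraction of non-special tokens with [MASK]; return masked ids and the mask."""
    is_special = torch.zeros_like(input_ids, dtype=torch.bool)
    for sid in special_ids:
        is_special |= input_ids == sid
    probs = torch.full(input_ids.shape, mask_ratio).masked_fill(is_special, 0.0)
    mlm_mask = torch.bernoulli(probs).bool()
    masked = input_ids.clone()
    masked[mlm_mask] = mask_token_id
    return masked, mlm_mask

# The same sentence is corrupted twice: lightly for the encoder, aggressively for the
# decoder, so reconstruction on the decoder side must depend on the sentence embedding.
input_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 6251, 102]])  # toy BERT-style ids
enc_input, enc_mask = mask_tokens(input_ids, mask_ratio=0.30)  # ratio assumed for illustration
dec_input, dec_mask = mask_tokens(input_ids, mask_ratio=0.70)  # ratio assumed for illustration
```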
Besides, since the decoder contains only a single layer, the paper further proposes enhanced decoding on top of two-stream attention (Yang et al., 2019) and a position-specific attention mask (Dong et al., 2019). As a result, 100% of the tokens can be used for reconstruction, and each token may sample a unique context for its reconstruction.
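A hedged sketch of what such a position-specific attention mask could look like in PyTorch: each row (the query reconstructing one token) gets its own randomly sampled context plus the position holding the sentence embedding, while its own position stays blocked, so every token can be reconstructed and each one sees a different context. The sampling scheme and `n_context` size are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def position_specific_mask(seq_len: int, n_context: int = 4) -> torch.Tensor:
    """Per-row attention mask for enhanced decoding (sampling scheme is an assumption).

    Returns a boolean matrix where True means attention is blocked. Row i may attend
    to position 0 (the sentence embedding) and to a random subset of other positions,
    but never to position i itself, so a token cannot trivially copy itself.
    """
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)      # start fully blocked
    for i in range(seq_len):
        candidates = torch.tensor([j for j in range(1, seq_len) if j != i])
        picked = candidates[torch.randperm(len(candidates))[:n_context]]
        mask[i, picked] = False                                 # row-specific sampled context
        mask[i, 0] = False                                      # sentence embedding always visible
    mask.fill_diagonal_(True)                                   # never attend to your own position
    return mask

# Convert to an additive bias for attention scores (-inf where blocked).
bias = torch.zeros(8, 8).masked_fill(position_specific_mask(8), float("-inf"))
```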