GPU下train 模型出现nan
When training on GPU, the error "Model diverged with loss = NaN" is often caused by a sotmax that's getting a symbol larger than vocab_size
When training on GPU, the error "Model diverged with loss = NaN" is often caused by a sotmax that's getting a symbol larger than vocab_size