Pretrained models for financial-domain classification tasks: a reference for applying large models

A classification layer is added on top of BERT:

Code implementation:

# Take the hidden state of the [CLS] token and feed it to a softmax classification layer
output = bert.model.output
output = Lambda(lambda x: x[:, 0], name='CLS-token')(output)
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)

model = keras.models.Model(bert.model.input, output)

 

Then it is just a matter of training the classification task on top of BERT's output.

The complete code is as follows:

#! -*- coding:utf-8 -*-
# FinWoBERT: an enhanced pretrained model for the Chinese financial domain
'''
Reference: Ming Kang. Pretraining Language Models in Deep Learning: A Case Study of Chinese
Sentiment Classification for Financial Text. Beijing: Tsinghua University Press, 2022.
'''
 
import os, json
import numpy as np
from bert4keras.backend import keras, set_gelu
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam,extend_with_piecewise_linear_lr
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from keras.layers import Lambda, Dense
 
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
 
import jieba_fast as jieba
jieba.initialize()
 
num_classes = 3
maxlen = 512
batch_size = 32
 
# BERT (WoBERT) configuration
# path = "/Users/sssdjj/bert_source/"
config_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'data/chinese_wobert_L-12_H-768_A-12/vocab.txt'
 
 
labels = {"其他": 0, "利多": 1, "利空": 2}  # other / bullish / bearish
 
stop_words = []
 
# load stop words (optional)
# with open("data/cn_stopwords.txt") as f:
#     for i in f:
#         stop_words.append(i.strip())
 
def load_data(filename):
    """加载数据
    单条格式:(文本, 标签id)
    """
    D = []
    with open(filename, encoding='utf-8') as f:
        for l in f:
            if len(l.strip().split('|||')) == 2:
                label,text = l.strip().split('|||')
                # remove stop words (optional)
                # for i in stop_words:
                    # text = str(text).replace(i," ")
                D.append((text, labels[label]))
    return D
 
path = "data/"
# load the train/test datasets
train_data = load_data(path+'train.txt')
valid_data = load_data(path+'test.txt')
 
# add a custom user dictionary: word.txt is the base word list, word_zhengf.txt adds positive/negative sentiment words
jieba.load_userdict(path+"word_zhengf_buzai_vocab.txt")
 
# build the tokenizer (jieba word-level pre-tokenization, as WoBERT expects)
tokenizer = Tokenizer(
    dict_path,
    do_lower_case=True,
    pre_tokenize=lambda s: jieba.cut(s, HMM=False)
)
 
 
class data_generator(DataGenerator):
    """数据生成器
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

# Encode every custom-dictionary word into its existing sub-token ids; these are passed
# to build_transformer_model as compound_tokens so the new domain words get their own
# embeddings (initialized from their sub-tokens by bert4keras).
ctokens = []
with open(path+"word_zhengf_buzai_vocab.txt") as f:
    for i in f:
        ctokens.append(tokenizer.encode(i.strip())[0][1:-1])
 
bert = build_transformer_model(
    config_path,
    checkpoint_path,
    return_keras_model=False,
    compound_tokens=ctokens
)
 
# classification head: take the [CLS] hidden state and apply a softmax layer
output = bert.model.output
output = Lambda(lambda x: x[:, 0], name='CLS-token')(output)
output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer
)(output)
 
model = keras.models.Model(bert.model.input, output)
model.summary()
# AdamLR = extend_with_piecewise_linear_lr(Adam, name='AdamLR')
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(learning_rate=1e-6),  # use a sufficiently small learning rate
    metrics=['accuracy'],
)
 
# wrap the datasets in data generators
train_generator = data_generator(train_data, batch_size)
valid_generator = data_generator(valid_data, batch_size)
 
def norm_index(y_true,y_pred):
    acc = accuracy_score(y_true, y_pred)
    macro_prec = precision_score(y_true, y_pred, average='macro')
    micro_prec = precision_score(y_true, y_pred, average='micro')
 
    macro_recall = recall_score(y_true, y_pred, average='macro')
    micro_recall = recall_score(y_true, y_pred, average='micro')
 
    macro_f1 = f1_score(y_true, y_pred, average='macro')
    micro_f1 = f1_score(y_true, y_pred, average='micro')
 
    cm = confusion_matrix(y_true, y_pred)
 
    return acc, macro_prec,micro_prec, macro_recall, micro_recall,macro_f1,micro_f1, cm
 
 
def evaluate(data):
    total, right = 0., 0.
    pred_list,true_list = [], []
    for x_true, y_true in data:
        y_pred = model.predict(x_true).argmax(axis=1)
        y_true = y_true[:, 0]
        # total += len(y_true)
        # right += (y_true == y_pred).sum()
        pred_list.extend(y_pred)
        true_list.extend(y_true)
    return norm_index(true_list,pred_list)
 
 
class Evaluator(keras.callbacks.Callback):
    def __init__(self):
        self.best_val_acc = 0.
 
    def on_epoch_end(self, epoch, logs=None):
        val_acc, macro_prec,micro_prec, macro_recall, micro_recall,macro_f1,micro_f1, cm = evaluate(valid_generator)
        if val_acc > self.best_val_acc:
            self.best_val_acc = val_acc
            model.save_weights('train/best_model_sentiment.weights')
         
        print(
            u'val_acc: %.15f, best_val_acc: %.15f,loss:%s\n' %
            (val_acc, self.best_val_acc,logs)
        )
        print(
            u'macro_prec: %.15f, micro_prec: %.15f\n' %
            (macro_prec, micro_prec)
        )
        print(
            u'macro_recall: %.15f, micro_recall: %.15f\n' %
            (macro_recall, micro_recall)
        )
        print(
            u'macro_f1: %.15f, micro_f1: %.15f\n' %
            (macro_f1, micro_f1)
        )
        print(cm)
 
 
if __name__ == '__main__':
 
    evaluator = Evaluator()
 
    model.fit_generator(
        train_generator.forfit(),
        steps_per_epoch=len(train_generator),
        epochs= 100,
        callbacks=[evaluator]
    )
 
else:
 
    model.load_weights('train/best_model_sentiment.weights')  # same path the Evaluator saves to

  

To further strengthen a large model for the financial domain, finance-specific corpora can also be added at the continued-pretraining stage, for example:
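
The book's actual continued-pretraining pipeline is more involved; the following is only a minimal sketch of how continued masked-language-model (MLM) pretraining on a finance corpus could look with bert4keras. The placeholder corpus line, the random_mask helper, the learning rate, and the choice to compute the loss over all positions (a real setup would restrict it to the masked positions and use the usual 80/10/10 replacement scheme) are simplifying assumptions of mine, not code from the book.

import numpy as np
from bert4keras.backend import keras
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.snippets import sequence_padding

config_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'data/chinese_wobert_L-12_H-768_A-12/bert_model.ckpt'
dict_path = 'data/chinese_wobert_L-12_H-768_A-12/vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)
mask_id = tokenizer.token_to_id('[MASK]')

# Reload WoBERT with its masked-language-model head so it can keep pretraining.
mlm_model = build_transformer_model(config_path, checkpoint_path, with_mlm=True)
mlm_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(1e-5),
)

def random_mask(token_ids, mask_rate=0.15):
    """Replace roughly 15% of the inner tokens with [MASK]."""
    masked = list(token_ids)
    for i in range(1, len(masked) - 1):  # keep [CLS] and [SEP] intact
        if np.random.rand() < mask_rate:
            masked[i] = mask_id
    return masked

# One illustrative mini-batch built from raw finance sentences (placeholder corpus line).
finance_texts = ['央行下调存款准备金率,市场流动性明显改善。']
batch_token_ids, batch_segment_ids, batch_targets = [], [], []
for text in finance_texts:
    token_ids, segment_ids = tokenizer.encode(text, maxlen=512)
    batch_token_ids.append(random_mask(token_ids))
    batch_segment_ids.append(segment_ids)
    batch_targets.append(token_ids)  # the model is trained to recover the original ids
x = [sequence_padding(batch_token_ids), sequence_padding(batch_segment_ids)]
y = sequence_padding(batch_targets)
mlm_model.train_on_batch(x, y)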

 

A key issue to handle here is catastrophic forgetting: continued pretraining on domain-specific data can erase what the model learned from the general-domain corpus. A simple mitigation sketch follows the reading list below.

A few papers worth reading in depth on this topic:

Yuqing Zhao, Divya Saxena, Jiannong Cao. Revisiting Parameter Reuse to Overcome Catastrophic Forgetting in Neural Networks. arXiv:2207.11005v1 [cs.LG], 2022.
Matteo Boschini, Lorenzo Bonicelli, Angelo Porrello, et al. Transfer without Forgetting // Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part I. Cham: Springer, 2022.
Yabin Wang, Zhiwu Huang, Xiaopeng Hong. S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning. arXiv:2207.12819v1 [cs.CV], 2022.
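
One lightweight mitigation worth noting (my own illustration, not a method taken from the papers above) is to freeze the embedding layers and the lower transformer blocks during domain fine-tuning, so only the upper layers and the classification head are updated and less of the general-domain knowledge is overwritten. The sketch below builds on the bert and model objects from the full code above; the layer-name prefixes follow bert4keras's naming convention, and freezing 6 of the 12 blocks is an arbitrary choice.

# Freeze embeddings and the bottom n_frozen transformer blocks, then re-compile
# so the changed trainable flags take effect.
n_frozen = 6
for layer in bert.model.layers:
    if layer.name.startswith('Embedding-') or any(
        layer.name.startswith('Transformer-%d-' % i) for i in range(n_frozen)
    ):
        layer.trainable = False

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(learning_rate=1e-6),
    metrics=['accuracy'],
)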

  

In addition, to improve interpretability, the pretraining corpus should be kept consistent with the corpus of the downstream classification task.

 

Finally, to make the model more robust, adversarial training with adversarially generated examples can also be added:

Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. arXiv:1412.6572v3 [stat.ML], 2015.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, et al. Intriguing Properties of Neural Networks. arXiv:1312.6199v4 [cs.CV], 2014.
TensorFlow. Adversarial example using FGSM. https://tensorflow.google.cn/tutorials/generative/adversarial_fgsm, 2021.
Nathan Inkawhich. Adversarial Example Generation. https://pytorch.org/tutorials/beginner/fgsm_tutorial.html, 2021.

The adversarial samples are generated with the method from the paper cited above (FGSM); a minimal sketch follows:
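
This is my own simplified illustration in the spirit of Goodfellow et al. and the TensorFlow tutorial cited above, not the book's code. It assumes a differentiable input tensor; for a text classifier like the one above, the perturbation is normally applied to the embedding outputs rather than to the discrete token ids. model, x, y and epsilon are placeholders.

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_example(model, x, y, epsilon=0.01):
    """Return x + epsilon * sign(dL/dx), the fast gradient sign adversarial example."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    return x + epsilon * tf.sign(grad)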

 

In addition, to reduce overfitting, L1 regularization can be added to the output layer, for example:
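
A minimal sketch of what this looks like in the classification head defined earlier; the penalty coefficient 1e-5 is an arbitrary illustrative value, not a tuned setting from the book.

from keras import regularizers

output = Dense(
    units=num_classes,
    activation='softmax',
    kernel_initializer=bert.initializer,
    kernel_regularizer=regularizers.l1(1e-5)  # L1 penalty on the output-layer weights
)(output)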

 

 

 

 
