Applying topic models in Python: a first try at LDA

Notes:

1. Data source: WoS (Web of Science) bibliographic records.

2. Python reads the data stored in an Excel file.

3. The text in TI (title) and AB (abstract) is analyzed after sentence segmentation, tokenization, stop-word removal, and lemmatization.

4. Libraries available for LDA include sklearn and gensim (the one used in this post). Both implementations are actually based on variational inference (sklearn uses variational EM; gensim's LdaModel uses online variational Bayes); Gibbs-sampling MCMC implementations exist elsewhere, e.g. through gensim's MALLET wrapper. For comparison, a minimal sklearn sketch follows below.
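For readers who prefer sklearn, here is a minimal sketch of the equivalent call. It is not part of the original workflow; the toy document list and the vectorizer settings are illustrative assumptions.

#minimal sklearn LDA sketch (illustrative only; toy documents)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["supply chain integration", "customer and supplier integration"]  # toy examples
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)  # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=0)
doc_topic = lda.fit_transform(X)  # document-topic distributions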

Working with the data in the Excel file

#read the Excel data
from pprint import pprint
import xlrd
path = r"D:\02-1python\2020.08.11-lda\data\2010-2011\usa\us1.xlsx"#adjust the path
data = xlrd.open_workbook(path)
#first column: titles; second column: abstracts
sheet_1_by_index = data.sheet_by_index(0)
title = sheet_1_by_index.col_values(0)
abstract = sheet_1_by_index.col_values(1)
n_of_rows = sheet_1_by_index.nrows
doc_set = []#empty list
for i in range(1, n_of_rows):#read row by row (row 0 holds the column headers)
    doc_set.append(title[i] + '. ' + abstract[i])
doc_set[0]
'The impact of supply chain integration on performance: A contingency and configuration approach.This study extends the developing body of literature on supply chain integration (SCI), which is the degree to which a manufacturer strategically collaborates with its supply chain partners and collaboratively manages intra- and inter-organizational processes, in order to achieve effective and efficient flows of products and services, information, money and decisions, to provide maximum value to the customer. The previous research is inconsistent in its findings about the relationship between SCI and performance. We attribute this inconsistency to incomplete definitions of SCI, in particular, the tendency to focus on customer and supplier integration only, excluding the important central link of internal integration. We study the relationship between three dimensions of SCI, operational and business performance, from both a contingency and a configuration perspective. In applying the contingency approach, hierarchical regression was used to determine the impact of individual SCI dimensions (customer, supplier and internal integration) and their interactions on performance. In the configuration approach, cluster analysis was used to develop patterns of SCI, which were analyzed in terms of SCI strength and balance. Analysis of variance was used to examine the relationship between SCI pattern and performance. The findings of both the contingency and configuration approach indicated that SCI was related to both operational and business performance. Furthermore, the results indicated that internal and customer integration were more strongly related to improving performance than supplier integration. (C) 2009 Elsevier B.V. All rights reserved.'

The list can now be used directly. If you also want to save the text to a local .txt file, see below.

#save as a .txt file under the given path
file_path = 'D:/02-1python/2020.08.11-lda/data/2010-2011/china/2695.txt'
with open(file_path, 'a') as file_handle:  # the .txt file need not be created by hand; the code creates it
    file_handle.write(str(doc_set[0:]))  # write the data
    file_handle.write('\n')  # when writing inside a loop, append a newline so each record starts on a new line instead of overwriting the previous one
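The snippet above dumps the whole list as a single string. If one record per line is wanted instead, a minimal sketch (the file name here is hypothetical) writes each document separately:

#sketch: one document per line (hypothetical file name)
with open('docs.txt', 'w', encoding='utf-8') as file_handle:
    for doc in doc_set:
        file_handle.write(doc + '\n')  # each record on its own line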

Data preprocessing

import nltk
#sentence segmentation
from nltk.tokenize import sent_tokenize
#tokenization
from nltk.tokenize import word_tokenize
#stop-word removal
from nltk.corpus import stopwords
#lemmatization
from nltk.stem import WordNetLemmatizer
#stemming
from nltk.stem.porter import PorterStemmer
english_stopwords = stopwords.words("english")
#custom list of English punctuation marks
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*', "''"]
texts = []
#process each document
for doc in doc_set:
    #tokenize
    text_list = nltk.word_tokenize(doc)
    #stop-word removal, pass 1: standard list
    text_list0 = [word for word in text_list if word not in english_stopwords]
    #stop-word removal, pass 2: my own additions, words I decided to drop (e.g. years)
    english_stopwords2 = ['c', 'also', '2009', '2010', '2011', "'s"]#adjust the custom stop words as needed
    text_list1 = [word for word in text_list0 if word not in english_stopwords2]
    #remove punctuation
    text_list2 = [word for word in text_list1 if word not in english_punctuations]
    #lemmatize
    text_list3 = [WordNetLemmatizer().lemmatize(word) for word in text_list2]
    #stem
    text_list4 = [PorterStemmer().stem(word) for word in text_list3]
    #store the fully processed result in texts
    texts.append(text_list4)
#number of documents (here, the number of papers)
M = len(texts)
print('Number of documents: %d' % M)
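If NLTK has never been set up on the machine, the tokenizer, stop-word, and WordNet resources used above must be downloaded once first (a one-off step, not shown in the original post):

#one-off downloads of the NLTK resources used above
import nltk
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')    # dictionary for the WordNet lemmatizer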

The LDA part

#build the document-term matrix with the gensim library
import gensim
from gensim import corpora
#build the dictionary, storing all the tokens just processed
dictionary = corpora.Dictionary(texts)
#build the document-term matrix; the result is a bag-of-words corpus (TF-IDF weighting could be applied on top, but is not used here; see the sketch after the matrix preview below)
corpus = [dictionary.doc2bow(text) for text in texts]
print('\nDocument-term matrix:')
#pprint(corpus)
pprint(corpus[0:19])
#for c in corpus:
    #print(c)
#convert to a dense document-term matrix
from gensim.matutils import corpus2dense
corpus_matrix = corpus2dense(corpus, len(dictionary))
corpus_matrix.T

The transposed matrix looks something like [0 1 3 0 2 2; ···].
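As noted above, TF-IDF weighting can be layered on top of the bag-of-words corpus before training. A minimal sketch with gensim's TfidfModel, not part of the original workflow:

#optional: TF-IDF weighting instead of raw counts (not used in this post)
from gensim.models import TfidfModel
tfidf = TfidfModel(corpus)                     # fit IDF weights on the bag-of-words corpus
corpus_tfidf = [tfidf[bow] for bow in corpus]  # re-weighted corpus in the same sparse format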

#use gensim to create the LDA model object
Lda = gensim.models.ldamodel.LdaModel
#run and train the LDA model on the document-term matrix
num_topics = 10#number of topics; adjustable parameter
ldamodel = Lda(corpus, num_topics=num_topics, id2word=dictionary, passes=100)#adjustable hyperparameters: number of topics and number of passes
doc_topic = [doc_t for doc_t in ldamodel[corpus]]
print('Document-topic matrix:\n')
#pprint(doc_topic)
pprint(doc_topic[0:19])
#for doc_topic in ldamodel.get_document_topics(corpus):
    #print(doc_topic)
print('Topic-word distributions:\n')
for topic_id in range(num_topics):
    print('Topic', topic_id)
    pprint(ldamodel.show_topic(topic_id))
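To make the document-topic output easier to read, each document can be assigned its single most probable topic. A small sketch of my own, using only the standard gensim output computed above:

#assign each document its most probable topic (first few documents only)
for i, topics in enumerate(doc_topic[0:5]):
    best_topic, best_prob = max(topics, key=lambda t: t[1])
    print('doc %d -> topic %d (p=%.3f)' % (i, best_topic, best_prob))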

Which number of topics fits best? Coherence scores or perplexity can be used to decide.

#coherence score
print('Coherence score:\n')
coherence_model_lda = gensim.models.CoherenceModel(model=ldamodel, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
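Perplexity is available directly from the trained model, and the coherence computation can be wrapped in a loop to compare candidate topic numbers. A sketch; the candidate range and the passes value are illustrative choices of mine, not from the original post:

#per-word log-likelihood bound reported by gensim (perplexity itself is 2**(-bound))
print('Perplexity bound: ', ldamodel.log_perplexity(corpus))
#scan candidate topic numbers and compare coherence
for k in range(5, 21, 5):
    model_k = Lda(corpus, num_topics=k, id2word=dictionary, passes=20)
    cm = gensim.models.CoherenceModel(model=model_k, texts=texts, dictionary=dictionary, coherence='c_v')
    print('num_topics=%d  coherence=%.4f' % (k, cm.get_coherence()))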

