【346】TF-IDF
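Ref: Text Mining Preprocessing: TF-IDF
Ref: sklearn.feature_extraction.text.CountVectorizer
Ref: Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction
Ref: Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles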
    >>> from sklearn.feature_extraction.text import TfidfTransformer
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> corpus = ["I come to China to travel",
    ...           "This is a car polupar in China",
    ...           "I love tea and Apple ",
    ...           "The work is to write some papers in science"]
    >>> vectorizer = CountVectorizer()
    >>> transformer = TfidfTransformer()
    >>> tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    >>> print(tfidf)
      (0, 16)    0.4424621378947393
      (0, 15)    0.697684463383976
      (0, 4)     0.4424621378947393
      (0, 3)     0.348842231691988
      (1, 14)    0.45338639737285463
      (1, 9)     0.45338639737285463
      (1, 6)     0.3574550433419527
      (1, 5)     0.3574550433419527
      (1, 3)     0.3574550433419527
      (1, 2)     0.45338639737285463
      (2, 12)    0.5
      (2, 7)     0.5
      (2, 1)     0.5
      (2, 0)     0.5
      (3, 18)    0.3565798233381452
      (3, 17)    0.3565798233381452
      (3, 15)    0.2811316284405006
      (3, 13)    0.3565798233381452
      (3, 11)    0.3565798233381452
      (3, 10)    0.3565798233381452
      (3, 8)     0.3565798233381452
      (3, 6)     0.2811316284405006
      (3, 5)     0.2811316284405006
    >>> print(vectorizer.get_feature_names())  # scikit-learn >= 1.2: use get_feature_names_out()
    ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']
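Note: (0, 16) refers to the first text (row 0) and the term at column index 16, which is "travel"; the other entries read the same way.
Continuing from the session above, we can look up the TF-IDF value of each term. The tfidf variable holds a matrix of shape (4, 19): one row per sentence and one column per term.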
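To make the column indices less mysterious, here is a minimal standard-library sketch of how the vocabulary could be rebuilt by hand. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, columns in alphabetical order); the names `TOKEN_PATTERN`, `tokenize` and `vocabulary` are illustrative and not part of scikit-learn.

```python
import re

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science",
]

# Assumption: mirrors CountVectorizer's default token pattern, which keeps
# only tokens of two or more word characters (so "I" and "a" are dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# Column indices follow the alphabetically sorted set of all tokens.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

print(len(vocabulary))      # 19
print(vocabulary[16])       # travel
print(tokenize(corpus[0]))  # ['come', 'to', 'china', 'to', 'travel']
```

This also shows why single-character words such as "I" and "a" never appear among the feature names.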
    >>> tfidf_array = tfidf.toarray()  # dense array; each row is converted to a list below
    >>> names_list = vectorizer.get_feature_names()  # feature names (get_feature_names_out() in scikit-learn >= 1.2)
    >>> for i in range(0, len(corpus)):
    ...     print(corpus[i], '\n')
    ...     tmp_list = tfidf_array[i].tolist()
    ...     for j in range(0, len(names_list)):
    ...         if tmp_list[j] != 0:
    ...             if len(names_list[j]) >= 7:
    ...                 print(names_list[j], '\t', tmp_list[j])
    ...             else:
    ...                 print(names_list[j], '\t\t', tmp_list[j])
    ...     print('')
    ...
    I come to China to travel

    china            0.348842231691988
    come             0.4424621378947393
    to               0.697684463383976
    travel           0.4424621378947393

    This is a car polupar in China

    car              0.45338639737285463
    china            0.3574550433419527
    in               0.3574550433419527
    is               0.3574550433419527
    polupar          0.45338639737285463
    this             0.45338639737285463

    I love tea and Apple

    and              0.5
    apple            0.5
    love             0.5
    tea              0.5

    The work is to write some papers in science

    in               0.2811316284405006
    is               0.2811316284405006
    papers           0.3565798233381452
    science          0.3565798233381452
    some             0.3565798233381452
    the              0.3565798233381452
    to               0.2811316284405006
    work             0.3565798233381452
    write            0.3565798233381452

    >>>
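The weights above can be reproduced by hand, which helps in seeing where numbers like 0.697 or 0.5 come from. The sketch below assumes TfidfTransformer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row. It uses only the standard library, and the helper names (`tokenize`, `idf`, `tfidf_row`) are made up for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science",
]


def tokenize(text):
    """Assumed equivalent of CountVectorizer's default tokenisation."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Raw term counts per document.
docs = [Counter(tokenize(text)) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in doc)


def idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_row(doc):
    """Multiply counts by idf, then scale the row to unit L2 norm."""
    raw = {term: count * idf(term) for term, count in doc.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row0 = tfidf_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

For the third sentence every term occurs once and in no other document, so all four raw weights are equal and normalisation gives 1 / sqrt(4) = 0.5 each, matching the output above.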
Getting the TF (term frequency), i.e. the raw term counts
    >>> X = vectorizer.fit_transform(corpus)
    >>> X.toarray()
    array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
           [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
           [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]], dtype=int64)
    >>> vector_array = X.toarray()
    >>> for i in range(0, len(corpus)):
    ...     print(corpus[i], '\n')
    ...     tmp_list = vector_array[i].tolist()
    ...     for j in range(0, len(names_list)):
    ...         if tmp_list[j] != 0:
    ...             if len(names_list[j]) >= 7:
    ...                 print(names_list[j], '\t', tmp_list[j])
    ...             else:
    ...                 print(names_list[j], '\t\t', tmp_list[j])
    ...     print('')
    ...
    I come to China to travel

    china            1
    come             1
    to               2
    travel           1

    This is a car polupar in China

    car              1
    china            1
    in               1
    is               1
    polupar          1
    this             1

    I love tea and Apple

    and              1
    apple            1
    love             1
    tea              1

    The work is to write some papers in science

    in               1
    is               1
    papers           1
    science          1
    some             1
    the              1
    to               1
    work             1
    write            1

    >>>
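For comparison, the same count matrix can be rebuilt without scikit-learn. This is only a sketch under the assumption that the default tokenisation is "lowercase, tokens of two or more word characters"; `tokenize`, `vocabulary` and `count_row` are hypothetical helper names.

```python
import re
from collections import Counter

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science",
]


def tokenize(text):
    """Assumed equivalent of CountVectorizer's default tokenisation."""
    return re.findall(r"\b\w\w+\b", text.lower())


# One column per distinct token, in alphabetical order.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})


def count_row(text):
    """Return the raw count of every vocabulary term in one document."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


matrix = [count_row(text) for text in corpus]
for row in matrix:
    print(row)
```

Each printed row should line up with the corresponding row of `X.toarray()` above; for instance, the 2 in the first row sits in the column for "to", which occurs twice in the first sentence.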