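[464] Converting text to bag-of-words vectors
This is done with CountVectorizer from sklearn.feature_extraction.text:
- First, collect all the texts.
- Then map each distinct word in the texts to an integer index starting from 0.
- Finally, obtain the transformed count vector for each text.
See the code below: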
```python
>>> text_01 = "My name is Alex Lee."
>>> text_02 = "I like singing and playing basketball."
>>> text_03 = "I also like swimming during leisure time."
>>> texts = [text_01, text_02, text_03]
>>> texts
['My name is Alex Lee.',
 'I like singing and playing basketball.',
 'I also like swimming during leisure time.']
>>> import sklearn
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # fit() learns the vocabulary; transform() builds the count matrix
>>> vect = CountVectorizer().fit(texts)
>>> x = vect.transform(texts)
>>> x
<3x15 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>
>>> # Note: get_feature_names() was removed in scikit-learn 1.2;
>>> # on newer versions call get_feature_names_out() instead.
>>> vect.get_feature_names()
['alex', 'also', 'and', 'basketball', 'during', 'is', 'lee', 'leisure',
 'like', 'my', 'name', 'playing', 'singing', 'swimming', 'time']
>>> vect.vocabulary_
{'my': 9, 'name': 10, 'is': 5, 'alex': 0, 'lee': 6, 'like': 8,
 'singing': 12, 'and': 2, 'playing': 11, 'basketball': 3, 'also': 1,
 'swimming': 13, 'during': 4, 'leisure': 7, 'time': 14}
>>> x
<3x15 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>
>>> print(x)
  (0, 0)	1
  (0, 5)	1
  (0, 6)	1
  (0, 9)	1
  (0, 10)	1
  (1, 2)	1
  (1, 3)	1
  (1, 8)	1
  (1, 11)	1
  (1, 12)	1
  (2, 1)	1
  (2, 4)	1
  (2, 7)	1
  (2, 8)	1
  (2, 13)	1
  (2, 14)	1
>>> x.toarray()
array([[1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)
```
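To make the three steps concrete, here is a minimal pure-Python sketch of the same idea. It is only an illustration, not how scikit-learn is implemented internally. The helper names `tokenize` and `bag_of_words` are made up for this example. The tokenizing rule (lowercase, keep only words of two or more characters) is assumed to match CountVectorizer's default settings, which is why "I" never appears in the vocabulary.

```python
import re

# Assumed to mirror CountVectorizer's default tokenization:
# lowercase the text and keep only words of 2+ characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Hypothetical helper: split one text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(texts):
    """Hypothetical helper: return (vocabulary, count vectors) for the texts."""
    # Step 1: collect the tokens of every text.
    tokenized = [tokenize(t) for t in texts]

    # Step 2: give each distinct word an index starting from 0,
    # in alphabetical order (as in the vocabulary_ output above).
    words = sorted({word for doc in tokenized for word in doc})
    vocabulary = {word: index for index, word in enumerate(words)}

    # Step 3: count how often each vocabulary word occurs in each text.
    vectors = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for word in doc:
            row[vocabulary[word]] += 1
        vectors.append(row)
    return vocabulary, vectors


texts = [
    "My name is Alex Lee.",
    "I like singing and playing basketball.",
    "I also like swimming during leisure time.",
]
vocabulary, vectors = bag_of_words(texts)
print(vocabulary)
for row in vectors:
    print(row)
```

For these three sentences the rows should agree with the `x.toarray()` result above. Unlike this sketch, CountVectorizer returns a sparse matrix, which saves memory when the vocabulary is large and most counts are zero.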