【389】Implement N-grams using NLTK
Ref: n-grams in python, four, five, six grams?
Ref: "Elegant n-gram generation in Python"
    import nltk

    sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""

    # 1 gram: the word tokens themselves
    tokens = nltk.word_tokenize(sentence)
    print("1 gram:\n", tokens, "\n")

    # 2 grams
    n = 2
    tokens_2 = nltk.ngrams(tokens, n)
    print("2 grams:\n", [i for i in tokens_2], "\n")

    # 3 grams
    n = 3
    tokens_3 = nltk.ngrams(tokens, n)
    print("3 grams:\n", [i for i in tokens_3], "\n")

    # 4 grams
    n = 4
    tokens_4 = nltk.ngrams(tokens, n)
    print("4 grams:\n", [i for i in tokens_4], "\n")

outputs:

    1 gram:
     ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

    2 grams:
     [('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')]

    3 grams:
     [('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')]

    4 grams:
     [('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]
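For comparison, n-grams can also be built without NLTK by zipping shifted slices of the token list, which is the kind of approach the "Elegant n-gram generation in Python" reference title suggests. The following is only a minimal sketch: the function name `zip_ngrams` is made up for illustration, and the token list is copied by hand from the 1-gram output above so that the snippet runs without the tokenizer.

```python
def zip_ngrams(tokens, n):
    """Return all n-grams over `tokens` as a list of tuples (hypothetical helper)."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped together;
    # zip stops at the shortest slice, giving len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


# Tokens copied from the 1-gram output above (assumption: no tokenizer is run here).
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

bigrams = zip_ngrams(tokens, 2)
print(bigrams[:3])
# [('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on')]
```

For this sentence the sketch yields 12 bigrams, 11 trigrams and 10 four-grams, the same counts as the nltk.ngrams output above.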
Another way to print the results is to join each n-gram tuple into a single space-separated string:
    import nltk

    sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""

    # 1 gram: the word tokens themselves
    tokens = nltk.word_tokenize(sentence)
    print("1 gram:\n", tokens, "\n")

    # 2 grams
    n = 2
    tokens_2 = nltk.ngrams(tokens, n)
    print("2 grams:\n", [' '.join(list(i)) for i in tokens_2], "\n")

    # 3 grams
    n = 3
    tokens_3 = nltk.ngrams(tokens, n)
    print("3 grams:\n", [' '.join(list(i)) for i in tokens_3], "\n")

    # 4 grams
    n = 4
    tokens_4 = nltk.ngrams(tokens, n)
    print("4 grams:\n", [' '.join(list(i)) for i in tokens_4], "\n")

outputs:

    1 gram:
     ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

    2 grams:
     ['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .']

    3 grams:
     ["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .']

    4 grams:
     ["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']
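The three near-identical blocks for n = 2, 3 and 4 could also be folded into one loop. The sketch below is one possible way to do that in plain Python; `joined_ngrams` is a hypothetical helper, not an NLTK function, and the tokens are again copied by hand from the 1-gram output so that no tokenizer data is needed.

```python
def joined_ngrams(tokens, n):
    """Return every n-gram of `tokens` as one space-separated string (hypothetical helper)."""
    # A window of width n slides over the tokens; if n exceeds the number
    # of tokens, the range is empty and so is the result.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tokens copied from the 1-gram output above (assumption: no tokenizer is run here).
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
          'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

for n in range(2, 5):
    print(f"{n} grams:\n", joined_ngrams(tokens, n), "\n")
```

Changing the bounds of `range(2, 5)` is then enough to get five- or six-grams as well.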
Extract the capitalized words and phrases from a piece of text
    import nltk
    from nltk.corpus import stopwords

    a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."

    tokens = nltk.word_tokenize(a)
    caps = []
    # Check every 1-, 2- and 3-gram
    for i in range(1, 4):
        for eles in nltk.ngrams(tokens, i):
            length = len(list(eles))
            for j in range(length):
                # Reject the n-gram as soon as one token does not start with a capital letter
                if eles[j][0].islower() or not eles[j][0].isalpha():
                    break
                # Every token passed the check: keep the n-gram
                elif j == length - 1:
                    caps.append(' '.join(list(eles)))

    # Remove duplicates, then drop stop words such as "I", "We" and "The"
    caps = list(set(caps))
    caps = [c for c in caps if c.lower() not in stopwords.words('english')]
    print(caps)

outputs (the order may vary, because the list is built from a set):

    ['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']
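The listing above returns every capitalized 1- to 3-gram, so "Alex", "Lee" and "Alex Lee" all appear. If only the longest runs are wanted, one alternative (a different technique from the n-gram scan above, shown only as a sketch) is to group consecutive capitalized tokens with itertools.groupby. The token list is written out by hand as word_tokenize would be expected to split the example text, and `STOPWORDS` is a tiny hypothetical stand-in for stopwords.words('english').

```python
from itertools import groupby

# Hand-written token list (assumption: matches what word_tokenize returns for the example text).
tokens = ['I', 'am', 'Alex', 'Lee', '.', 'I', 'am', 'from', 'Denman', 'Prospect',
          'and', 'I', 'love', 'this', 'place', 'very', 'much', '.', 'We', 'do',
          "n't", 'like', 'apple', '.', 'The', 'big', 'one', 'is', 'good', '.']

# Hypothetical stand-in for NLTK's English stop word list.
STOPWORDS = {'i', 'we', 'the'}


def is_capitalized(token):
    """Same test as the listing above: first character is a letter and not lowercase."""
    return token[0].isalpha() and not token[0].islower()


def capitalized_runs(tokens, stopwords=STOPWORDS):
    """Return maximal runs of consecutive capitalized tokens, joined by spaces."""
    runs = []
    for is_cap, group in groupby(tokens, key=is_capitalized):
        if is_cap:
            phrase = ' '.join(group)
            # Drop runs that are just a stop word, such as a lone "I" or "The"
            if phrase.lower() not in stopwords:
                runs.append(phrase)
    return runs


print(capitalized_runs(tokens))
# ['Alex Lee', 'Denman Prospect']
```

Unlike the n-gram version, this keeps the order of appearance and has no length limit of three tokens, but a stop word that directly precedes a name (for example "The Alex") would stay attached to the phrase.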
Category: Python Study
Tags: NLP