alex_bn_lee

【389】Implement N-grams using NLTK

Ref: Natural Language Toolkit

Ref: n-grams in python, four, five, six grams?

Ref: "Elegant n-gram generation in Python"

import nltk

# word_tokenize needs the Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

# 1-grams: the word tokens themselves
tokens = nltk.word_tokenize(sentence)
print("1 gram:\n", tokens, "\n")

# 2-grams: nltk.ngrams returns an iterator of tuples, so materialize it with list()
tokens_2 = nltk.ngrams(tokens, 2)
print("2 grams:\n", list(tokens_2), "\n")

# 3-grams
tokens_3 = nltk.ngrams(tokens, 3)
print("3 grams:\n", list(tokens_3), "\n")

# 4-grams
tokens_4 = nltk.ngrams(tokens, 4)
print("4 grams:\n", list(tokens_4), "\n")

Output:
1 gram:
 ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
 
2 grams:
 [('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning'), ('morning', 'Arthur'), ('Arthur', 'did'), ('did', "n't"), ("n't", 'feel'), ('feel', 'very'), ('very', 'good'), ('good', '.')]
 
3 grams:
 [('At', 'eight', "o'clock"), ('eight', "o'clock", 'on'), ("o'clock", 'on', 'Thursday'), ('on', 'Thursday', 'morning'), ('Thursday', 'morning', 'Arthur'), ('morning', 'Arthur', 'did'), ('Arthur', 'did', "n't"), ('did', "n't", 'feel'), ("n't", 'feel', 'very'), ('feel', 'very', 'good'), ('very', 'good', '.')]
 
4 grams:
 [('At', 'eight', "o'clock", 'on'), ('eight', "o'clock", 'on', 'Thursday'), ("o'clock", 'on', 'Thursday', 'morning'), ('on', 'Thursday', 'morning', 'Arthur'), ('Thursday', 'morning', 'Arthur', 'did'), ('morning', 'Arthur', 'did', "n't"), ('Arthur', 'did', "n't", 'feel'), ('did', "n't", 'feel', 'very'), ("n't", 'feel', 'very', 'good'), ('feel', 'very', 'good', '.')]
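
For comparison, the same tuples can be produced without NLTK by zipping shifted views of the token list. This is only a minimal sketch: the helper name `zip_ngrams` is hypothetical, and the token list is copied by hand from the 1-gram output above rather than produced by `nltk.word_tokenize`.

```python
def zip_ngrams(tokens, n):
    """Return the n-grams of `tokens` as a list of tuples (hypothetical helper)."""
    # Build n views of the list, each shifted one position further than the
    # previous one, then zip them; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))


# Token list copied from the 1-gram output above (an assumption, not recomputed).
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur',
          'did', "n't", 'feel', 'very', 'good', '.']

for n in (2, 3, 4):
    grams = zip_ngrams(tokens, n)
    # A sequence of k tokens yields k - n + 1 n-grams when k >= n.
    print(n, len(grams), grams[0])
```

With 13 tokens this gives 12, 11, and 10 n-grams for n = 2, 3, and 4, which matches the lengths of the lists printed above.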

Another way to print the results, joining each n-gram tuple into a single space-separated string:

import nltk

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

# 1-grams: the word tokens themselves
tokens = nltk.word_tokenize(sentence)
print("1 gram:\n", tokens, "\n")

# 2-grams: ' '.join accepts the tuple directly, so no list() conversion is needed
tokens_2 = nltk.ngrams(tokens, 2)
print("2 grams:\n", [' '.join(i) for i in tokens_2], "\n")

# 3-grams
tokens_3 = nltk.ngrams(tokens, 3)
print("3 grams:\n", [' '.join(i) for i in tokens_3], "\n")

# 4-grams
tokens_4 = nltk.ngrams(tokens, 4)
print("4 grams:\n", [' '.join(i) for i in tokens_4], "\n")

Output:
1 gram:
 ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
 
2 grams:
 ['At eight', "eight o'clock", "o'clock on", 'on Thursday', 'Thursday morning', 'morning Arthur', 'Arthur did', "did n't", "n't feel", 'feel very', 'very good', 'good .']
 
3 grams:
 ["At eight o'clock", "eight o'clock on", "o'clock on Thursday", 'on Thursday morning', 'Thursday morning Arthur', 'morning Arthur did', "Arthur did n't", "did n't feel", "n't feel very", 'feel very good', 'very good .']
 
4 grams:
 ["At eight o'clock on", "eight o'clock on Thursday", "o'clock on Thursday morning", 'on Thursday morning Arthur', 'Thursday morning Arthur did', "morning Arthur did n't", "Arthur did n't feel", "did n't feel very", "n't feel very good", 'feel very good .']
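
The four near-identical blocks above could also be collapsed into a loop over n. The sketch below is one possible way to do that using plain list slicing instead of `nltk.ngrams`; the helper name `joined_ngrams` is hypothetical, and the token list is again copied from the 1-gram output rather than recomputed.

```python
def joined_ngrams(tokens, n):
    """Return each n-gram of `tokens` as one space-separated string (hypothetical helper)."""
    # Slide a window of width n over the list; range() is empty when n > len(tokens).
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Token list copied from the 1-gram output above (an assumption, not recomputed).
tokens = ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur',
          'did', "n't", 'feel', 'very', 'good', '.']

# One entry per n, so the printing code is written only once.
results = {n: joined_ngrams(tokens, n) for n in range(1, 5)}
for n, grams in results.items():
    print(f"{n} grams:\n", grams, "\n")
```

Keeping the results in a dictionary keyed by n makes it easy to reuse a particular list later, for example `results[3]` for the 3-grams.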

 

Extract the capitalized words and phrases from a passage of text

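import nltk
from nltk.corpus import stopwords

# Requires the tokenizer data and the 'stopwords' corpus (nltk.download('stopwords')).
a = "I am Alex Lee. I am from Denman Prospect and I love this place very much. We don't like apple. The big one is good."
tokens = nltk.word_tokenize(a)

# Collect every 1-, 2-, and 3-gram in which each token starts with a letter
# that is not lowercase.
caps = []
for i in range(1, 4):
    for eles in nltk.ngrams(tokens, i):
        if all(t[0].isalpha() and not t[0].islower() for t in eles):
            caps.append(' '.join(eles))

# Deduplicate, then drop capitalized stop words such as "I", "We", and "The".
# The stop-word list is loaded once instead of once per candidate.
stop_words = set(stopwords.words('english'))
caps = [c for c in set(caps) if c.lower() not in stop_words]
print(caps)

Output (the order may vary, because a set is unordered):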
['Denman', 'Prospect', 'Alex Lee', 'Lee', 'Alex', 'Denman Prospect']
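
If only the longest capitalized phrases are wanted (for example 'Alex Lee' without the separate 'Alex' and 'Lee'), one alternative is to group consecutive capitalized tokens directly rather than enumerating n-grams. This is a sketch of a different technique from the one above, not a drop-in replacement: the helper names are hypothetical, the token list is written out by hand to mimic what `nltk.word_tokenize` would return, and the three-word stop-word set is a stand-in for `stopwords.words('english')`. Unlike the n-gram version, it does not cap phrases at three tokens.

```python
from itertools import groupby


def starts_capitalized(token):
    """True if the token starts with a letter that is not lowercase."""
    first = token[0]
    return first.isalpha() and not first.islower()


def capitalized_runs(tokens, stop_words):
    """Return maximal runs of consecutive capitalized tokens (hypothetical helper)."""
    runs = []
    # groupby splits the token stream into alternating capitalized / other runs.
    for is_cap, group in groupby(tokens, key=starts_capitalized):
        if not is_cap:
            continue
        phrase = ' '.join(group)
        # Skip stop words and phrases that were already seen, preserving order.
        if phrase.lower() not in stop_words and phrase not in runs:
            runs.append(phrase)
    return runs


# Hand-written tokens (assumed to match the tokenizer's output for the text above).
tokens = ['I', 'am', 'Alex', 'Lee', '.', 'I', 'am', 'from', 'Denman', 'Prospect',
          'and', 'I', 'love', 'this', 'place', 'very', 'much', '.', 'We', 'do',
          "n't", 'like', 'apple', '.', 'The', 'big', 'one', 'is', 'good', '.']

# Tiny stand-in stop-word set; the original uses NLTK's English list.
stop_words = {'i', 'we', 'the'}

print(capitalized_runs(tokens, stop_words))
# → ['Alex Lee', 'Denman Prospect']
```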

 

Posted by McDelfino
