【448】NLP, NER, PoS
目录:
- 停用词 —— stopwords
- 介词 —— prepositions —— part of speech
- Named Entity Recognition (NER)
3.1 Stanford NER
3.2 spaCy
3.3 NLTK - 句子中单词提取(Word extraction)
ref: Removing stop words with NLTK in Python
ref: Remove Stop Words
1 2 3 4 5 6 7 | import nltk # nltk.download('stopwords') from nltk.corpus import stopwords print (stopwords.words( 'english' )) output: [ 'i' , 'me' , 'my' , 'myself' , 'we' , 'our' , 'ours' , 'ourselves' , 'you' , "you're" , "you've" , "you'll" , "you'd" , 'your' , 'yours' , 'yourself' , 'yourselves' , 'he' , 'him' , 'his' , 'himself' , 'she' , "she's" , 'her ', ' hers ', ' herself ', ' it ', "it' s", 'its' , 'itself' , 'they' , 'them' , 'their' , 'theirs' , 'themselves' , 'what' , 'which' , 'who' , 'whom' , 'this' , 'that' , "that'll" , 'these ', ' those ', ' am ', ' is ', ' are ', ' was ', ' were ', ' be ', ' been ', ' being ', ' have ', ' has ', ' had ', ' having ', ' do ', ' does ', ' did ', ' doing ', ' a ', ' an ', ' the ', ' and ', ' but ', ' if ', ' or ', ' because ', ' as ', ' until ', ' while ', ' of ', ' at ', ' by ', ' for ', ' with ', ' about ', ' against ', ' between ', ' into ', ' through ', ' during ', ' before ', ' after ', ' above ', ' below ', ' to ', ' from ', ' up ', ' down ', ' in ', ' out ', ' on ', ' off ', ' over ', ' under ', ' again ', ' further ', ' then ', ' once ', ' here ', ' there ', ' when ', ' where ', ' why ', ' how ', ' all ', ' any ', ' both ', ' each ', ' few ', ' more ', ' most ', ' other ', ' some ', ' such ', ' no ', ' nor ', ' not ', ' only ', ' own ', ' same ', ' so ', ' than ', ' too ', ' very ', ' s ', ' t ', ' can ', ' will ', ' just ', ' don ', "don' t", 'should' , "should've" , 'now ', ' d ', ' ll ', ' m ', ' o ', ' re ', ' ve ', ' y ', ' ain ', ' aren ', "aren' t", 'couldn' , "couldn't" , 'didn ', "didn' t", 'doesn' , "doesn't" , 'hadn ', "hadn' t", 'hasn' , "hasn't" , 'haven ', "haven' t", 'isn' , "isn't" , 'ma ', ' mightn ', "mightn' t", 'mustn' , "mustn't" , 'needn ', "needn' t", 'shan' , "shan't" , 'shouldn ', "shouldn' t", 'wasn' , "wasn't" , 'weren ', "weren' t", 'won' , "won't" , 'wouldn ', "wouldn' t"] |
2. 介词(prepositions, part of speech)
ref: How do I remove verbs, prepositions, conjunctions etc from my text? [closed]
ref: Alphabetical list of part-of-speech tags used in the Penn Treebank Project:
1 2 3 4 5 6 7 8 9 10 11 | >>> import nltk >>> sentence = """At eight o'clock on Thursday morning ... Arthur didn't feel very good.""" >>> tokens = nltk.word_tokenize(sentence) >>> tokens [ 'At' , 'eight' , "o'clock" , 'on ', ' Thursday ', ' morning', 'Arthur' , 'did' , "n't" , 'feel ', ' very ', ' good ', ' .'] >>> tagged = nltk.pos_tag(tokens) >>> tagged[ 0 : 6 ] [( 'At' , 'IN' ), ( 'eight' , 'CD' ), ( "o'clock" , 'JJ '), (' on ', ' IN'), ( 'Thursday' , 'NNP' ), ( 'morning' , 'NN' )] |
3. Named Entity Recognition (NER)
ref: Introduction to Named Entity Recognition
ref: Named Entity Recognition with NLTK and SpaCy
- Standford NER
- spaCy
- NLTK
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 | article = ''' Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the European Union over Brexit, British Prime Minister Theresa May said on Monday.''' import nltk from nltk.tag import StanfordNERTagger print ( 'NTLK Version: %s' % nltk.__version__) stanford_ner_tagger = StanfordNERTagger( r "D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\classifiers\english.muc.7class.distsim.crf.ser.gz" , r "D:\Twitter Data\Data\NER\stanford-ner-2018-10-16\stanford-ner-3.9.2.jar" ) results = stanford_ner_tagger.tag(article.split()) print ( 'Original Sentence: %s' % (article)) for result in results: tag_value = result[ 0 ] tag_type = result[ 1 ] if tag_type ! = 'O' : print ( 'Type: %s, Value: %s' % (tag_type, tag_value)) output: NTLK Version: 3.4 Original Sentence: Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a sharp drop in oil prices and political risks in Europe pushed the dollar to 16 - month highs as investors dumped riskier assets. MSCI’s broadest index of Asia - Pacific shares outside Japan dropped 1.7 percent to a 1 - 1 / 2 week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $ 1.286 after three straight sessions of losses took it to the lowest since Nov. 1 as there were still considerable unresolved issues with the European Union over Brexit, British Prime Minister Theresa May said on Monday. Type : DATE, Value: Tuesday Type : LOCATION, Value: Europe Type : ORGANIZATION, Value: Asia - Pacific Type : LOCATION, Value: Japan Type : PERCENT, Value: 1.7 Type : PERCENT, Value: percent Type : ORGANIZATION, Value: Nikkei Type : PERCENT, Value: 3.1 Type : PERCENT, Value: percent Type : LOCATION, Value: European Type : LOCATION, Value: Union Type : PERSON, Value: Theresa Type : PERSON, Value: May |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | import spacy from spacy import displacy from collections import Counter import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp(article) for X in doc.ents: print ( 'Value: %s, Type: %s' % (X.text, X.label_)) output: Value: Asian, Type : NORP Value: Tuesday, Type : DATE Value: Europe, Type : LOC Value: MSCI’s, Type : ORG Value: Asia - Pacific, Type : LOC Value: Japan, Type : GPE Value: 1.7 percent, Type : PERCENT Value: 1 - 1 / 2 , Type : CARDINAL Value: Australian, Type : NORP Value: 1.6 percent, Type : PERCENT Value: Japan, Type : GPE Value: 3.1 percent, Type : PERCENT Value: Apple, Type : ORG Value: 1.286 , Type : MONEY Value: three, Type : CARDINAL Value: Nov. 1 , Type : NORP Value: the European Union, Type : ORG Value: Brexit, Type : GPE Value: British, Type : NORP Value: Theresa May, Type : PERSON Value: Monday, Type : DATE |
标签含义:https://spacy.io/api/annotation#pos-tagging
Type | Description |
---|---|
PERSON |
People, including fictional. |
NORP |
Nationalities or religious or political groups. |
FAC |
Buildings, airports, highways, bridges, etc. |
ORG |
Companies, agencies, institutions, etc. |
GPE |
Countries, cities, states. |
LOC |
Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT |
Objects, vehicles, foods, etc. (Not services.) |
EVENT |
Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART |
Titles of books, songs, etc. |
LAW |
Named documents made into laws. |
LANGUAGE |
Any named language. |
DATE |
Absolute or relative dates or periods. |
TIME |
Times smaller than a day. |
PERCENT |
Percentage, including ”%“. |
MONEY |
Monetary values, including unit. |
QUANTITY |
Measurements, as of weight or distance. |
ORDINAL |
“first”, “second”, etc. |
CARDINAL |
Numerals that do not fall under another type. |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | import nltk from nltk import word_tokenize, pos_tag, ne_chunk nltk.download( 'words' ) nltk.download( 'averaged_perceptron_tagger' ) nltk.download( 'punkt' ) nltk.download( 'maxent_ne_chunker' ) def fn_preprocess(art): art = nltk.word_tokenize(art) art = nltk.pos_tag(art) return art art_processed = fn_preprocess(article) print (art_processed) output: [( 'Asian' , 'JJ' ), ( 'shares' , 'NNS' ), ( 'skidded' , 'VBN' ), ( 'on' , 'IN' ), ( 'Tuesday' , 'NNP' ), ( 'after' , 'IN' ), ( 'a' , 'DT' ), ( 'rout' , 'NN' ), ( 'in' , 'IN' ), ( 'tech' , 'JJ' ), ( 'stocks' , 'NNS' ), ( 'put' , 'VBD' ), ( 'Wall' , 'NNP' ), ( 'Street' , 'NNP' ), ( 'to' , 'TO' ), ( 'the' , 'DT' ), ( 'sword' , 'NN' ), ( ',' , ',' ), ( 'while' , 'IN' ), ( 'a' , 'DT' ), ( 'sharp' , 'JJ' ), ( 'drop' , 'NN' ), ( 'in' , 'IN' ), ( 'oil' , 'NN' ), ( 'prices' , 'NNS' ), ( 'and' , 'CC' ), ( 'political' , 'JJ' ), ( 'risks' , 'NNS' ), ( 'in' , 'IN' ), ( 'Europe' , 'NNP' ), ( 'pushed' , 'VBD' ), ( 'the' , 'DT' ), ( 'dollar' , 'NN' ), ( 'to' , 'TO' ), ( '16-month' , 'JJ' ), ( 'highs' , 'NNS' ), ( 'as' , 'IN' ), ( 'investors' , 'NNS' ), ( 'dumped' , 'VBD' ), ( 'riskier' , 'JJR' ), ( 'assets' , 'NNS' ), ( '.' , '.' ), ( 'MSCI' , 'NNP' ), ( '’' , 'NNP' ), ( 's' , 'VBD' ), ( 'broadest' , 'JJS' ), ( 'index' , 'NN' ), ( 'of' , 'IN' ), ( 'Asia-Pacific' , 'NNP' ), ( 'shares' , 'NNS' ), ( 'outside' , 'IN' ), ( 'Japan' , 'NNP' ), ( 'dropped' , 'VBD' ), ( '1.7' , 'CD' ), ( 'percent' , 'NN' ), ( 'to' , 'TO' ), ( 'a' , 'DT' ), ( '1-1/2' , 'JJ' ), ( 'week' , 'NN' ), ( 'trough' , 'NN' ), ( ',' , ',' ), ( 'with' , 'IN' ), ( 'Australian' , 'JJ' ), ( 'shares' , 'NNS' ), ( 'sinking' , 'VBG' ), ( '1.6' , 'CD' ), ( 'percent' , 'NN' ), ( '.' , '.' ), ( 'Japan' , 'NNP' ), ( '’' , 'NNP' ), ( 's' , 'VBD' ), ( 'Nikkei' , 'NNP' ), ( 'dived' , 'VBD' ), ( '3.1' , 'CD' ), ( 'percent' , 'NN' ), ( 'led' , 'VBN' ), ( 'by' , 'IN' ), ( 'losses' , 'NNS' ), ( 'in' , 'IN' ), ( 'electric' , 'JJ' ), ( 'machinery' , 'NN' ), ( 'makers' , 'NNS' ), ( 'and' , 'CC' ), ( 'suppliers' , 'NNS' ), ( 'of' , 'IN' ), ( 'Apple' , 'NNP' ), ( '’' , 'NNP' ), ( 's' , 'VBD' ), ( 'iphone' , 'NN' ), ( 'parts' , 'NNS' ), ( '.' , '.' ), ( 'Sterling' , 'NN' ), ( 'fell' , 'VBD' ), ( 'to' , 'TO' ), ( '$' , '$' ), ( '1.286' , 'CD' ), ( 'after' , 'IN' ), ( 'three' , 'CD' ), ( 'straight' , 'JJ' ), ( 'sessions' , 'NNS' ), ( 'of' , 'IN' ), ( 'losses' , 'NNS' ), ( 'took' , 'VBD' ), ( 'it' , 'PRP' ), ( 'to' , 'TO' ), ( 'the' , 'DT' ), ( 'lowest' , 'JJS' ), ( 'since' , 'IN' ), ( 'Nov.1' , 'NNP' ), ( 'as' , 'IN' ), ( 'there' , 'EX' ), ( 'were' , 'VBD' ), ( 'still' , 'RB' ), ( 'considerable' , 'JJ' ), ( 'unresolved' , 'JJ' ), ( 'issues' , 'NNS' ), ( 'with' , 'IN' ), ( 'the' , 'DT' ), ( 'European' , 'NNP' ), ( 'Union' , 'NNP' ), ( 'over' , 'IN' ), ( 'Brexit' , 'NNP' ), ( ',' , ',' ), ( 'British' , 'NNP' ), ( 'Prime' , 'NNP' ), ( 'Minister' , 'NNP' ), ( 'Theresa' , 'NNP' ), ( 'May' , 'NNP' ), ( 'said' , 'VBD' ), ( 'on' , 'IN' ), ( 'Monday' , 'NNP' ), ( '.' , '.' )] |
ref: An introduction to Bag of Words and how to code it in Python for NLP
1 2 3 4 5 6 7 8 9 10 11 12 | import re def word_extraction(sentence): ignore = [ 'a' , "the" , "is" ] words = re.sub( "[^\w]" , " " , sentence).split() cleaned_text = [w.lower() for w in words if w not in ignore] return cleaned_text a = "alex is. good guy." print (word_extraction(a)) output: [ 'alex' , 'good' , 'guy' ] |
分类:
AI Related / NLP
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!
· .NET10 - 预览版1新功能体验(一)