jieba
import jieba
with open("/Users/war/Desktop/NLP/Experiment2/English.txt", encoding="utf-8") as f:
    Eng = f.read()
with open("/Users/war/Desktop/NLP/Experiment2/Chinese.txt", encoding="utf-8") as f:
    Ch = f.read()
print(Eng)
Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.
print(Ch)
央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册“鲜土”、注册“好土”商标,让消费者误以为是“土鸡蛋”。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。
Precise mode
seg_list = jieba.cut(Eng, cut_all=False)
print(" ".join(seg_list))
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.704 seconds.
Prefix dict has been built successfully.
Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School . He was appointed president of his family ' s real estate business in 1971 , renamed it The Trump Organization , and expanded it from Queens and Brooklyn into Manhattan . The company built or renovated skyscrapers , hotels , casinos , and golf courses . Trump later started various side ventures , including licensing his name for real estate and consumer products . He managed the company until his 2017 inauguration . He co - authored several books , including The Art of the Deal . He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015 , and he produced and hosted The Apprentice , a reality television show , from 2003 to 2015 . Forbes estimates his net worth to be $ 3.1 billion .
seg_list = jieba.cut(Ch, cut_all=False)
print(" ".join(seg_list))
央视 315 晚会 曝光 湖北省 知名 的 神丹 牌 、 莲田牌 “ 土 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 上 玩 猫腻 , 分别 注册 “ 鲜土 ” 、 注册 “ 好土 ” 商标 , 让 消费者 误以为 是 “ 土 鸡蛋 ” 。 3 月 15 日 晚间 , 新 京报 记者 就 此事 致电 湖北 神丹 健康 食品 有限公司 方面 , 其 工作人员 表示 不知情 , 需要 了解 清楚 情况 , 截至 发稿 暂未 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限公司 为 农业 产业化 国家 重点 龙头企业 、 高新技术 企业 , 此前 曾 因涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
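Precise mode returns one non-overlapping segmentation, as in the output above. One common way to choose among candidate splits is to maximize the product of word probabilities with dynamic programming. The sketch below illustrates that idea only: the function name and the frequency table are made up for this example, it is not jieba's implementation, and it ignores unknown-word handling.

```python
import math

def precise_mode(text, freq):
    """Pick the segmentation with the highest total log-probability (toy sketch)."""
    total = sum(freq.values())
    n = len(text)
    # best[i] = (score, end): best score for text[i:] and where its first word ends
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Single characters are always allowed, with a pseudo-count of 1.
            if j == i + 1 or word in freq:
                logp = math.log(freq.get(word, 1) / total)
                candidates.append((logp + best[j][0], j))
        best[i] = max(candidates)
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(text[i:j])
        i = j
    return tokens

# Hypothetical counts, chosen only for illustration.
freq = {"有限": 50, "公司": 200, "有限公司": 100, "方面": 150}
print(" ".join(precise_mode("有限公司方面", freq)))
```

With these counts the single word 有限公司 scores higher than 有限 + 公司, which mirrors the precise-mode output for the same phrase.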
Full mode
seg_list = jieba.cut(Eng, cut_all=True)
print(" ".join(seg_list))
Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School . He was appointed president of his family ' s real estate business in 1971 , renamed it The Trump Organization , and expanded it from Queens and Brooklyn into Manhattan . The company built or renovated skyscrapers , hotels , casinos , and golf courses . Trump later started various side ventures , including licensing his name for real estate and consumer products . He managed the company until his 2017 inauguration . He co - authored several books , including The Art of the Deal . He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015 , and he produced and hosted The Apprentice , a reality television show , from 2003 to 2015 . Forbes estimates his net worth to be $ 3 . 1 billion .
seg_list = jieba.cut(Ch, cut_all=True)
print(" ".join(seg_list))
央视 315 晚会 曝光 湖北 湖北省 知名 的 神丹 牌 、 莲 田 牌 “ 土鸡 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 标上 玩 猫腻 , 分别 注册 “ 鲜 土 ”、 注册 “ 好 土 ” 商标 , 让 消费 消费者 误以为 以为 是 “ 土鸡 鸡蛋 ”。 3 月 15 日 晚间 , 新 京报 记者 就此 此事 致电 湖北 神丹 健康 食品 有限 有限公司 公司 方面 , 其 工作 工作人员 作人 人员 表示 不知 不知情 知情 , 需要 了解 清楚 情况 , 截至 发稿 暂 未取 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限 有限公司 公司 为 农业 农业产业 产业 产业化 国家 重点 龙头 龙头企业 企业 、 高新 高新技术 技术 企业 , 此前 曾 因涉嫌 涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
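The full-mode output above lists every dictionary word found in the text, including overlapping ones such as 有限, 有限公司 and 公司. A minimal sketch of that behaviour follows, assuming a tiny hypothetical vocabulary; `full_mode` and `vocab` are illustrative names, and this is a simplification rather than jieba's own code.

```python
def full_mode(text, vocab):
    """Emit every vocabulary word found in text, overlaps included (toy sketch)."""
    tokens = []
    covered_until = -1  # index of the last character covered by an emitted word
    for start in range(len(text)):
        ends = [end for end in range(start + 2, len(text) + 1)
                if text[start:end] in vocab]
        if ends:
            for end in ends:
                tokens.append(text[start:end])
                covered_until = max(covered_until, end - 1)
        elif start > covered_until:
            # No multi-character word starts here and nothing covers this
            # character yet, so keep it as a single-character token.
            tokens.append(text[start])
            covered_until = start
    return tokens

# Hypothetical mini-dictionary.
vocab = {"有限", "有限公司", "公司", "方面", "农业", "农业产业", "产业", "产业化"}
print(" ".join(full_mode("有限公司方面", vocab)))
print(" ".join(full_mode("为农业产业化", vocab)))
```

For these two phrases the sketch produces the same tokens, in the same order, as the corresponding parts of the full-mode output.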
Search engine mode
seg_list = jieba.cut_for_search(Eng)
print(" ".join(seg_list))
Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School . He was appointed president of his family ' s real estate business in 1971 , renamed it The Trump Organization , and expanded it from Queens and Brooklyn into Manhattan . The company built or renovated skyscrapers , hotels , casinos , and golf courses . Trump later started various side ventures , including licensing his name for real estate and consumer products . He managed the company until his 2017 inauguration . He co - authored several books , including The Art of the Deal . He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015 , and he produced and hosted The Apprentice , a reality television show , from 2003 to 2015 . Forbes estimates his net worth to be $ 3.1 billion .
seg_list = jieba.cut_for_search(Ch)
print(" ".join(seg_list))
央视 315 晚会 曝光 湖北 湖北省 知名 的 神丹 牌 、 莲田牌 “ 土 鸡蛋 ” 实为 普通 鸡蛋 冒充 , 同时 在 商标 上 玩 猫腻 , 分别 注册 “ 鲜土 ” 、 注册 “ 好土 ” 商标 , 让 消费 消费者 以为 误以为 是 “ 土 鸡蛋 ” 。 3 月 15 日 晚间 , 新 京报 记者 就 此事 致电 湖北 神丹 健康 食品 有限 公司 有限公司 方面 , 其 工作 作人 人员 工作人员 表示 不知 知情 不知情 , 需要 了解 清楚 情况 , 截至 发稿 暂未 取得 最新 回应 。 新 京报 记者 还 查询 发现 , 湖北 神丹 健康 食品 有限 公司 有限公司 为 农业 产业 产业化 国家 重点 龙头 企业 龙头企业 、 高新 技术 高新技术 企业 , 此前 曾 涉嫌 因涉嫌 虚假 宣传 “ 中国 最大 的 蛋品 企业 ” 而 被 罚 6 万元 。
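In the search-engine output above, each long word from the precise segmentation is preceded by the shorter dictionary words it contains (for example 工作 作人 人员 工作人员). The sketch below reproduces that pattern on an already segmented word list. The helper name and the small vocabulary are assumptions for illustration, not part of jieba.

```python
def search_mode_expand(words, vocab):
    """Add the 2- and 3-character sub-words of long words before each word (sketch)."""
    tokens = []
    for w in words:
        if len(w) > 2:
            tokens += [w[i:i + 2] for i in range(len(w) - 1) if w[i:i + 2] in vocab]
        if len(w) > 3:
            tokens += [w[i:i + 3] for i in range(len(w) - 2) if w[i:i + 3] in vocab]
        tokens.append(w)
    return tokens

# Hypothetical mini-dictionary of sub-words.
vocab = {"工作", "作人", "人员", "不知", "知情"}
print(" ".join(search_mode_expand(["其", "工作人员", "表示", "不知情"], vocab)))
```

Because the shorter words are added rather than substituted, a search index built on these tokens can match both 工作人员 and 人员.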
Custom dictionary
jieba.load_userdict("/Users/war/Desktop/NLP/Experiment2/userdict.txt")
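The contents of userdict.txt are not shown here. jieba's user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. The sketch below writes such a file. The entries, frequencies and tags are hypothetical, picked from words that were split apart in the outputs above, and `format_userdict` is an illustrative helper.

```python
import os
import tempfile

def format_userdict(entries):
    """Render (word, freq, tag) entries, one per line; freq and tag may be None."""
    lines = []
    for word, freq, tag in entries:
        fields = [word]
        if freq is not None:
            fields.append(str(freq))
        if tag is not None:
            fields.append(tag)
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"

# Hypothetical entries, not the author's actual dictionary.
entries = [("神丹牌", 10, "nz"), ("土鸡蛋", 5, None),
           ("新京报", None, "nz"), ("莲田牌", None, None)]
userdict_text = format_userdict(entries)

# Written to a temporary location instead of the real userdict.txt.
path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(userdict_text)
# jieba.load_userdict(path) would then load these entries.
print(userdict_text, end="")
```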
SnowNLP
from snownlp import SnowNLP
s_ch = SnowNLP(Ch)
s_eng = SnowNLP(Eng)
print(s_eng.words)
['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School.', 'He', 'was', 'appointed', 'president', 'of', 'his', "family's", 'real', 'estate', 'business', 'in', '1971,', 'renamed', 'it', 'The', 'Trump', 'Organization,', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan.', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers,', 'hotels,', 'casinos,', 'and', 'golf', 'courses.', 'Trump', 'later', 'started', 'various', 'side', 'ventures,', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products.', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration.', 'He', 'co-authored', 'several', 'books,', 'including', 'The', 'Art', 'of', 'the', 'Deal.', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015,', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice,', 'a', 'reality', 'television', 'show,', 'from', '2003', 'to', '2015.', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '$3.1', 'billion.']
print(s_ch.words)
['\ufeff', '央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '、', '莲', '田', '牌', '“', '土', '鸡蛋', '”', '实', '为', '普通', '鸡蛋', '冒充', ',', '同时', '在', '商标', '上', '玩猫', '腻', ',', '分别', '注册', '“', '鲜', '土', '”、', '注册', '“', '好', '土', '”', '商标', ',', '让', '消费者', '误', '以为', '是', '“', '土', '鸡蛋', '”。3', '月', '15', '日', '晚间', ',', '新京', '报', '记者', '就', '此事', '致电', '湖北', '神', '丹', '健康', '食品', '有限公司', '方面', ',', '其', '工作', '人员', '表示', '不', '知情', ',', '需要', '了解', '清楚', '情况', ',', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '。', '新京', '报', '记者', '还', '查询', '发现', ',', '湖北', '神', '丹', '健康', '食品', '有限公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '、', '高新技术', '企业', ',', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '“', '中国', '最', '大', '的', '蛋品', '企业', '”', '而', '被', '罚', '6', '万', '元', '。']
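The first token in the output above, '\ufeff', is a byte order mark carried over from the start of Chinese.txt, not part of the text. A minimal sketch of dropping it at read time, assuming the file is UTF-8 saved with a BOM; the temporary file is a stand-in for the real one.

```python
import os
import tempfile

# Stand-in for Chinese.txt: UTF-8 text saved with a byte order mark (assumption).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("\ufeff央视315晚会".encode("utf-8"))
    path = f.name

with open(path, encoding="utf-8") as f:
    raw = f.read()    # the BOM survives as the first character
with open(path, encoding="utf-8-sig") as f:
    clean = f.read()  # the utf-8-sig codec strips a leading BOM

os.remove(path)
print(repr(raw[0]), repr(clean[0]))
```

For text that is already in memory, `text.lstrip("\ufeff")` has the same effect.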
THULAC
import thulac
thu = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
s_ch = thu.cut(Ch)  # segment the Chinese text
print(s_ch)
Model loaded succeed
[['\ufeff央', ''], ['视', ''], ['315', ''], ['晚会', ''], ['曝光', ''], ['湖北省', ''], ['知名', ''], ['的', ''], ['神丹牌', ''], ['、', ''], ['莲田牌', ''], ['“', ''], ['土鸡蛋', ''], ['”', ''], ['实', ''], ['为', ''], ['普通', ''], ['鸡蛋', ''], ['冒充', ''], [',', ''], ['同时', ''], ['在', ''], ['商标', ''], ['上', ''], ['玩', ''], ['猫腻', ''], [',', ''], ['分别', ''], ['注册', ''], ['“', ''], ['鲜土', ''], ['”', ''], ['、', ''], ['注册', ''], ['“', ''], ['好', ''], ['土', ''], ['”', ''], ['商标', ''], [',', ''], ['让', ''], ['消费者', ''], ['误', ''], ['以为', ''], ['是', ''], ['“', ''], ['土鸡蛋', ''], ['”', ''], ['。', ''], ['3月', ''], ['15日', ''], ['晚间', ''], [',', ''], ['新', ''], ['京报', ''], ['记者', ''], ['就', ''], ['此事', ''], ['致电', ''], ['湖北', ''], ['神丹', ''], ['健康', ''], ['食品', ''], ['有限公司', ''], ['方面', ''], [',', ''], ['其', ''], ['工作', ''], ['人员', ''], ['表示', ''], ['不', ''], ['知', ''], ['情', ''], [',', ''], ['需要', ''], ['了', ''], ['解', ''], ['清楚', ''], ['情况', ''], [',', ''], ['截至', ''], ['发稿', ''], ['暂', ''], ['未', ''], ['取得', ''], ['最新', ''], ['回应', ''], ['。', ''], ['新', ''], ['京报', ''], ['记者', ''], ['还', ''], ['查询', ''], ['发现', ''], [',', ''], ['湖北', ''], ['神丹', ''], ['健康', ''], ['食品', ''], ['有限公司', ''], ['为', ''], ['农业', ''], ['产业化', ''], ['国', ''], ['家', ''], ['重点', ''], ['龙头', ''], ['企业', ''], ['、', ''], ['高新技术', ''], ['企业', ''], [',', ''], ['此前', ''], ['曾', ''], ['因', ''], ['涉嫌', ''], ['虚假', ''], ['宣传', ''], ['“', ''], ['中国', ''], ['最', ''], ['大', ''], ['的', ''], ['蛋品', ''], ['企业', ''], ['”', ''], ['而', ''], ['被', ''], ['罚', ''], ['6万', ''], ['元', ''], ['。', '']]
s_eng = thu.cut(Eng)  # segment the English text
print(s_eng)
[['Trump', ''], [' ', ''], ['was', ''], [' ', ''], ['born', ''], [' ', ''], ['and', ''], [' ', ''], ['raised', ''], [' ', ''], ['in', ''], [' ', ''], ['the', ''], [' ', ''], ['New', ''], [' ', ''], ['York', ''], [' ', ''], ['City', ''], [' ', ''], ['borough', ''], [' ', ''], ['of', ''], [' ', ''], ['Queens', ''], [' ', ''], ['and', ''], [' ', ''], ['received', ''], [' ', ''], ['an', ''], [' ', ''], ['economics', ''], [' ', ''], ['degree', ''], [' ', ''], ['from', ''], [' ', ''], ['the', ''], [' ', ''], ['Wharton', ''], [' ', ''], ['School', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['was', ''], [' ', ''], ['appointed', ''], [' ', ''], ['president', ''], [' ', ''], ['of', ''], [' ', ''], ['his', ''], [' ', ''], ['family', ''], ["'", ''], ['s', ''], [' ', ''], ['real', ''], [' ', ''], ['estate', ''], [' ', ''], ['business', ''], [' ', ''], ['in', ''], [' ', ''], ['1971', ''], [',', ''], [' ', ''], ['renamed', ''], [' ', ''], ['it', ''], [' ', ''], ['The', ''], [' ', ''], ['Trump', ''], [' ', ''], ['Organization', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['expanded', ''], [' ', ''], ['it', ''], [' ', ''], ['from', ''], [' ', ''], ['Queens', ''], [' ', ''], ['and', ''], [' ', ''], ['Brooklyn', ''], [' ', ''], ['into', ''], [' ', ''], ['Manhatta', ''], ['n', ''], ['.', ''], [' ', ''], ['The', ''], [' ', ''], ['company', ''], [' ', ''], ['built', ''], [' ', ''], ['o', ''], ['r', ''], [' ', ''], ['renovated', ''], [' ', ''], ['skyscrapers', ''], [',', ''], [' ', ''], ['hotels', ''], [',', ''], [' ', ''], ['casinos', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['golf', ''], [' ', ''], ['courses', ''], ['.', ''], [' ', ''], ['Trump', ''], [' ', ''], ['later', ''], [' ', ''], ['started', ''], [' ', ''], ['various', ''], [' ', ''], ['side', ''], [' ', ''], ['ventures', ''], [',', ''], [' ', ''], ['including', ''], [' ', ''], ['licens', ''], ['ing', ''], [' ', ''], ['his', ''], [' ', ''], ['name', ''], [' ', ''], ['for', ''], [' ', ''], ['real', ''], 
[' ', ''], ['estate', ''], [' ', ''], ['and', ''], [' ', ''], ['cons', ''], ['umer', ''], [' ', ''], ['products', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['managed', ''], [' ', ''], ['the', ''], [' ', ''], ['company', ''], [' ', ''], ['until', ''], [' ', ''], ['his', ''], [' ', ''], ['2017', ''], [' ', ''], ['inauguration', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['c', ''], ['o', ''], ['-', ''], ['authored', ''], [' ', ''], ['several', ''], [' ', ''], ['books', ''], [',', ''], [' ', ''], ['including', ''], [' ', ''], ['The', ''], [' ', ''], ['Art', ''], [' ', ''], ['of', ''], [' ', ''], ['the', ''], [' ', ''], ['Deal', ''], ['.', ''], [' ', ''], ['He', ''], [' ', ''], ['owned', ''], [' ', ''], ['the', ''], [' ', ''], ['Miss', ''], [' ', ''], ['Universe', ''], [' ', ''], ['and', ''], [' ', ''], ['Miss', ''], [' ', ''], ['USA', ''], [' ', ''], ['beauty', ''], [' ', ''], ['pageants', ''], [' ', ''], ['from', ''], [' ', ''], ['1996', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['2015', ''], [',', ''], [' ', ''], ['and', ''], [' ', ''], ['he', ''], [' ', ''], ['produced', ''], [' ', ''], ['and', ''], [' ', ''], ['hosted', ''], [' ', ''], ['The', ''], [' ', ''], ['Apprentice', ''], [',', ''], [' ', ''], ['a', ''], [' ', ''], ['reality', ''], [' ', ''], ['television', ''], [' ', ''], ['show', ''], [',', ''], [' ', ''], ['from', ''], [' ', ''], ['2003', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['2015', ''], ['.', ''], [' ', ''], ['Forbes', ''], [' ', ''], ['estimates', ''], [' ', ''], ['his', ''], [' ', ''], ['net', ''], [' ', ''], ['worth', ''], [' ', ''], ['t', ''], ['o', ''], [' ', ''], ['be', ''], [' ', ''], ['$', ''], ['3', ''], ['.', ''], ['1', ''], [' ', ''], ['billion', ''], ['.', '']]
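With seg_only=True, the result above is a list of [word, tag] pairs whose tags are empty, and for the English text every space becomes its own token. A small sketch for reducing such a result to a plain word list; `words_only` is an illustrative helper and the sample is copied from the start of the English output.

```python
def words_only(pairs):
    """Keep the word from each [word, tag] pair, dropping whitespace-only tokens."""
    return [word for word, _tag in pairs if word.strip()]

sample = [['Trump', ''], [' ', ''], ['was', ''], [' ', ''], ['born', '']]
print(words_only(sample))  # ['Trump', 'was', 'born']
```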
PyNLPIR
import pynlpir
pynlpir.open()
pynlpir.segment(Ch, pos_tagging=False)
['央',
'视',
'315',
'晚会',
'曝光',
'湖北省',
'知名',
'的',
'神',
'丹',
'牌',
'、',
'莲',
'田',
'牌',
'“',
'土',
'鸡蛋',
'”',
'实',
'为',
'普通',
'鸡蛋',
'冒充',
',',
'同时',
'在',
'商标',
'上',
'玩',
'猫腻',
',',
'分别',
'注册',
'“',
'鲜',
'土',
'”',
'、',
'注册',
'“',
'好',
'土',
'”',
'商标',
',',
'让',
'消费者',
'误',
'以为',
'是',
'“',
'土',
'鸡蛋',
'”',
'。',
'3月',
'15日',
'晚间',
',',
'新京报',
'记者',
'就',
'此事',
'致电',
'湖北',
'神',
'丹',
'健康',
'食品',
'有限公司',
'方面',
',',
'其',
'工作',
'人员',
'表示',
'不',
'知',
'情',
',',
'需要',
'了解',
'清楚',
'情况',
',',
'截至',
'发稿',
'暂',
'未',
'取得',
'最新',
'回应',
'。',
'新京报',
'记者',
'还',
'查询',
'发现',
',',
'湖北',
'神',
'丹',
'健康',
'食品',
'有限公司',
'为',
'农业',
'产业化',
'国家',
'重点',
'龙头',
'企业',
'、',
'高新技术',
'企业',
',',
'此前',
'曾',
'因',
'涉嫌',
'虚假',
'宣传',
'“',
'中国',
'最',
'大',
'的',
'蛋品',
'企业',
'”',
'而',
'被',
'罚',
'6万',
'元',
'。']
pynlpir.segment(Eng, pos_tagging=False)
['Trump',
'was',
'born',
'and',
'raised',
'in',
'the',
'New',
'York',
'City',
'borough',
'of',
'Queens',
'and',
'received',
'an',
'economics',
'degree',
'from',
'the',
'Wharton',
'School',
'.',
'He',
'was',
'appointed',
'president',
'of',
'his',
'family',
"'s",
'real',
'estate',
'business',
'in',
'1971',
',',
'renamed',
'it',
'The',
'Trump',
'Organization',
',',
'and',
'expanded',
'it',
'from',
'Queens',
'and',
'Brooklyn',
'into',
'Manhattan',
'.',
'The',
'company',
'built',
'or',
'renovated',
'skyscrapers',
',',
'hotels',
',',
'casinos',
',',
'and',
'golf',
'courses',
'.',
'Trump',
'later',
'started',
'various',
'side',
'ventures',
',',
'including',
'licensing',
'his',
'name',
'for',
'real',
'estate',
'and',
'consumer',
'products',
'.',
'He',
'managed',
'the',
'company',
'until',
'his',
'2017',
'inauguration',
'.',
'He',
'co',
'-',
'authored',
'several',
'books',
',',
'including',
'The',
'Art',
'of',
'the',
'Deal',
'.',
'He',
'owned',
'the',
'Miss',
'Universe',
'and',
'Miss',
'USA',
'beauty',
'pageants',
'from',
'1996',
'to',
'2015',
',',
'and',
'he',
'produced',
'and',
'hosted',
'The',
'Apprentice',
',',
'a',
'reality',
'television',
'show',
',',
'from',
'2003',
'to',
'2015',
'.',
'Forbes',
'estimates',
'his',
'net',
'worth',
'to',
'be',
'$',
'3.1',
'billion',
'.']
stanfordcorenlp
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP("/Users/war/Desktop/NLP/Experiment2/stanford-corenlp-4.2.0")
nlp.word_tokenize(Eng)
['Trump',
'was',
'born',
'and',
'raised',
'in',
'the',
'New',
'York',
'City',
'borough',
'of',
'Queens',
'and',
'received',
'an',
'economics',
'degree',
'from',
'the',
'Wharton',
'School',
'.',
'He',
'was',
'appointed',
'president',
'of',
'his',
'family',
"'s",
'real',
'estate',
'business',
'in',
'1971',
',',
'renamed',
'it',
'The',
'Trump',
'Organization',
',',
'and',
'expanded',
'it',
'from',
'Queens',
'and',
'Brooklyn',
'into',
'Manhattan',
'.',
'The',
'company',
'built',
'or',
'renovated',
'skyscrapers',
',',
'hotels',
',',
'casinos',
',',
'and',
'golf',
'courses',
'.',
'Trump',
'later',
'started',
'various',
'side',
'ventures',
',',
'including',
'licensing',
'his',
'name',
'for',
'real',
'estate',
'and',
'consumer',
'products',
'.',
'He',
'managed',
'the',
'company',
'until',
'his',
'2017',
'inauguration',
'.',
'He',
'co-authored',
'several',
'books',
',',
'including',
'The',
'Art',
'of',
'the',
'Deal',
'.',
'He',
'owned',
'the',
'Miss',
'Universe',
'and',
'Miss',
'USA',
'beauty',
'pageants',
'from',
'1996',
'to',
'2015',
',',
'and',
'he',
'produced',
'and',
'hosted',
'The',
'Apprentice',
',',
'a',
'reality',
'television',
'show',
',',
'from',
'2003',
'to',
'2015',
'.',
'Forbes',
'estimates',
'his',
'net',
'worth',
'to',
'be',
'$',
'3.1',
'billion',
'.']
nlp.word_tokenize(Ch)  # the wrapper defaults to lang='en', so the Chinese text is split only at punctuation
['央视315晚会曝光湖北省知名的神丹牌',
'、',
'莲田牌',
'“',
'土鸡蛋',
'”',
'实为普通鸡蛋冒充',
',',
'同时在商标上玩猫腻',
',',
'分别注册',
'“',
'鲜土',
'”',
'、',
'注册',
'“',
'好土',
'”',
'商标',
',',
'让消费者误以为是',
'“',
'土鸡蛋',
'”',
'。',
'3月15日晚间',
',',
'新京报记者就此事致电湖北神丹健康食品有限公司方面',
',',
'其工作人员表示不知情',
',',
'需要了解清楚情况',
',',
'截至发稿暂未取得最新回应',
'。',
'新京报记者还查询发现',
',',
'湖北神丹健康食品有限公司为农业产业化国家重点龙头企业',
'、',
'高新技术企业',
',',
'此前曾因涉嫌虚假宣传',
'“',
'中国最大的蛋品企业',
'”',
'而被罚6万元',
'。']
NLTK
import nltk
tokens_eng = nltk.word_tokenize(Eng)
print(tokens_eng)
['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', '.', 'He', 'was', 'appointed', 'president', 'of', 'his', 'family', "'s", 'real', 'estate', 'business', 'in', '1971', ',', 'renamed', 'it', 'The', 'Trump', 'Organization', ',', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan', '.', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', ',', 'hotels', ',', 'casinos', ',', 'and', 'golf', 'courses', '.', 'Trump', 'later', 'started', 'various', 'side', 'ventures', ',', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', '.', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', '.', 'He', 'co-authored', 'several', 'books', ',', 'including', 'The', 'Art', 'of', 'the', 'Deal', '.', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', ',', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', ',', 'a', 'reality', 'television', 'show', ',', 'from', '2003', 'to', '2015', '.', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '$', '3.1', 'billion', '.']
tokens_ch = nltk.word_tokenize(Ch)  # word_tokenize targets English-style text, so the Chinese text is barely segmented
print(tokens_ch)
['\ufeff央视315晚会曝光湖北省知名的神丹牌、莲田牌', '“', '土鸡蛋', '”', '实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册', '“', '鲜土', '”', '、注册', '“', '好土', '”', '商标,让消费者误以为是', '“', '土鸡蛋', '”', '。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传', '“', '中国最大的蛋品企业', '”', '而被罚6万元。']
SpaCy
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp(Eng))  # printing a Doc shows its original text; iterate over the Doc to get the tokens
Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.
print(nlp(Ch))  # en_core_web_sm is an English pipeline; printing the Doc only echoes the input
央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充,同时在商标上玩猫腻,分别注册“鲜土”、注册“好土”商标,让消费者误以为是“土鸡蛋”。3月15日晚间,新京报记者就此事致电湖北神丹健康食品有限公司方面,其工作人员表示不知情,需要了解清楚情况,截至发稿暂未取得最新回应。新京报记者还查询发现,湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业,此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。