About Japanese word segmentation (Japanese segmenters)

1. Main package: JapaneseTokenizer

https://pypi.org/project/JapaneseTokenizer/

Installation: pip install JapaneseTokenizer

Supported Tokenizers

1.1 MeCab

For installation instructions, see: https://www.dazhuanlan.com/2020/02/13/5e45085eac4da/

Installing MeCab

  1. First, download the MeCab source (mecab-0.996.tar.gz).
  2. Then open a terminal and create an installation directory:

  

$ sudo mkdir /usr/local/mecab

  3. Extract, configure, compile, and install:

$ cd $HOME/Downloads
$ tar xvfz mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --enable-utf8-only --prefix=/usr/local/mecab
$ make
$ sudo make install
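
If the build and install succeeded, the mecab binary now lives under the prefix configured above. A quick sanity check is to print its version (tokenization itself will not work until a dictionary is installed in the next step); the path below assumes the --prefix=/usr/local/mecab used earlier:

$ /usr/local/mecab/bin/mecab --version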

Installing the IPA dictionary

The IPA dictionary is built from the IPA corpus, with parameters estimated by CRF; it is the standard dictionary and must be installed.

  1. First, download the IPA dictionary source (mecab-ipadic-2.7.0-20070801.tar.gz).
  2. Extract, configure, compile, and install:
$ cd $HOME/Downloads
$ tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --prefix=/usr/local/mecab --with-mecab-config=/usr/local/mecab/bin/mecab-config --with-charset=utf8
$ make
$ sudo make install
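
With MeCab and the IPA dictionary both installed, a quick end-to-end check is to pipe a short Japanese sentence into the mecab command. It should print one token per line with part-of-speech information and end with an EOS line; the sentence below is just a common test phrase chosen for illustration:

$ echo "すもももももももものうち" | /usr/local/mecab/bin/mecab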

 

 

Example

First make the MeCab installation above visible on PATH (run this in the shell before starting Python, so JapaneseTokenizer can locate the mecab binary):

export PATH=/usr/local/mecab/bin:$PATH

import JapaneseTokenizer
input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
# ipadic is well-maintained dictionary #
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
print(mecab_wrapper.tokenize(input_sentence).convert_list_object())

# neologd is automatically-generated dictionary from huge web-corpus #
mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())
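
As a small follow-up sketch: judging from the print output above, convert_list_object() returns a plain Python list of token strings, so the result can be joined into a whitespace-separated string for downstream tools that expect pre-tokenized text. This assumes the list-of-strings return value and is not part of the original example:

# Assumption: convert_list_object() yields a list of surface-form strings.
tokens = mecab_wrapper.tokenize(input_sentence).convert_list_object()
print(' '.join(tokens))  # space-separated tokens for downstream use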

 
