About Japanese segmenters
1. Main package: JapaneseTokenizer
https://pypi.org/project/JapaneseTokenizer/
Installation: pip install JapaneseTokenizer
Supported Tokenizers
1.1 MeCab
For installation details, see: https://www.dazhuanlan.com/2020/02/13/5e45085eac4da/
Installing MeCab
- First, download the MeCab source (mecab-0.996.tar.gz)
- Next, open a Terminal and create a directory:
$ sudo mkdir /usr/local/mecab
- Unpack, configure, build, and install:
$ cd $HOME/Downloads
$ tar xvfz mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --enable-utf8-only --prefix=/usr/local/mecab
$ make
$ sudo make install
Installing the IPA dictionary
The IPA dictionary is built from the IPA corpus, with parameters estimated by CRF (installation required)
- First, download the IPA dictionary source (mecab-ipadic-2.7.0-20070801.tar.gz)
- Unpack, configure, build, and install:
$ cd $HOME/Downloads
$ tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --prefix=/usr/local/mecab --with-mecab-config=/usr/local/mecab/bin/mecab-config --with-charset=utf8
$ make
$ sudo make install
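With MeCab and the IPA dictionary installed, the `mecab` command prints one token per line in the format `surface<TAB>features` and terminates each sentence with `EOS`. A minimal sketch of parsing that default output in Python (the sample line below is illustrative, not captured output):

```python
# Parse MeCab's default output format: "surface\tfeature1,feature2,..."
# with an "EOS" line closing each sentence.
def parse_mecab_output(text):
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        # The first comma-separated feature is the coarse part of speech.
        tokens.append({"surface": surface, "pos": features.split(",")[0]})
    return tokens

# Illustrative stand-in for `echo すもも | mecab` output
sample = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nEOS"
print(parse_mecab_output(sample))
```

This is only needed when calling the `mecab` binary directly; JapaneseTokenizer (below) wraps this parsing for you.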
Example
export PATH=/usr/local/mecab/bin:$PATH
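The `export` above only affects the current shell, so it is worth confirming from Python that the `mecab` binary is actually reachable. A small sanity-check sketch (the directory argument is an assumption matching the install prefix used above):

```python
import os
import shutil

# Return True if a "mecab" executable is visible on PATH, also trying
# the install prefix from the steps above as a fallback.
def mecab_on_path(extra_dir="/usr/local/mecab/bin"):
    search = os.environ.get("PATH", "") + os.pathsep + extra_dir
    return shutil.which("mecab", path=search) is not None

print(mecab_on_path())
```

If this prints False, revisit the `export PATH` line before running the tokenizer example below.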
import JapaneseTokenizer

input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'

# ipadic is a well-maintained dictionary
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
print(mecab_wrapper.tokenize(input_sentence).convert_list_object())

# neologd is an automatically-generated dictionary built from a huge web corpus
mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())
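Assuming `convert_list_object()` returns a plain list of surface strings, as the example above suggests, downstream processing needs no extra dependencies. A sketch of a token frequency count (the sample tokens are an illustrative stand-in, not real tokenizer output):

```python
from collections import Counter

# Count surface-form frequencies in a token list and return the
# n most common (token, count) pairs.
def top_tokens(tokens, n=3):
    return Counter(tokens).most_common(n)

tokens = ["の", "中居", "正広", "の", "図書館", "の"]  # illustrative stand-in
print(top_tokens(tokens))
```

The same pattern applies to the output of either the ipadic or the neologd wrapper.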