ahocorasick: from installation to usage

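Introduction:

pyahocorasick is a Python module that implements two data structures: a trie and an Aho-Corasick automaton.

A trie is a dictionary indexed by strings; looking up an item takes time proportional to the length of the key string.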

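An Aho-Corasick (AC) automaton finds all occurrences of every string in a given set in a single pass over the input. In effect it applies the KMP idea to a trie, which lets it match many patterns at once.
(Recommended reading: http://blog.csdn.net/niushuai666/article/details/7002823 and http://www.cnblogs.com/kuangbin/p/3164106.html)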
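
To make the "KMP on a trie" idea concrete, here is a minimal pure-Python sketch of the technique. It is an illustration only, not how pyahocorasick is implemented (that is a C extension); the names build_automaton and find_all are made up for this example.

```python
from collections import deque


def build_automaton(words):
    """Build a trie plus failure links (illustrative, not pyahocorasick's code)."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end when this state is reached

    # Step 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Step 2: breadth-first pass to compute failure links, the KMP-like part.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the fallback state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (end_index, word) for every match, scanning the text once."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i, word))
    return matches


automaton = build_automaton("he her hers she".split())
print(find_all("_hershe_", automaton))
# [(2, 'he'), (3, 'her'), (4, 'hers'), (6, 'she'), (6, 'he')]
```

The failure links are what let the search continue from the longest matching suffix instead of restarting at the root, so the text is read only once regardless of how many patterns there are.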

Author

Wojciech Muła, wojciech_mula@poczta.onet.pl
Official page:
https://pypi.python.org/pypi/pyahocorasick/

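Installing with pip failed for me with an error like this:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"

Because no ready-made whl file was available, pip had to compile the extension, so the only option was to install the required build tools. Installing them with the default options and restarting is enough.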
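
After installing the build tools and rerunning pip, a quick way to confirm that the module is importable is a check like the one below. This is only a sketch; is_installed is a hypothetical helper name, and it relies solely on the standard library's importlib.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ahocorasick"):
    print("ahocorasick can be imported")
else:
    print("ahocorasick not found; install the build tools, then run: pip install pyahocorasick")
```

Note that the package is installed as pyahocorasick but imported as ahocorasick.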

Here is a Baidu Netdisk link:

Link: https://pan.baidu.com/s/13Fv3dPYQNq6u_ErO9nqCAw
Extraction code: nhll

The following two examples show how to use it in detail:

import ahocorasick
A = ahocorasick.Automaton()

# Add words to the trie
for index, word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))
# Signature: add_word(word, [value]) => bool
# What value may be depends on the store argument passed to the Automaton constructor:
# 1. STORE_LENGTH: value must not be passed; len(word) is stored
# 2. STORE_INTS: value is optional but must be an int; the default is len(automaton)
# 3. STORE_ANY (the default): value is required and may be of any type

# Test whether a word is in the trie
if "he" in A:
    print(True)
else:
    print(False)
A.get("he")
# (0, 'he')
A.get("cat", "<not exists>")
# '<not exists>'
A.get("dog")
# raises KeyError

# Convert the trie into an Aho-Corasick automaton
A.make_automaton()

# Find all matches; each item is (end_index, value)
for item in A.iter("_hershe_"):
    print(item)
# (2, (0, 'he'))
# (3, (1, 'her'))
# (4, (2, 'hers'))
# (6, (3, 'she'))
# (6, (0, 'he'))

import ahocorasick
A = ahocorasick.Automaton()

# Add words
for index, word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word, (index, word))

# Keys that share a prefix
list(A.keys("cat"))
## ['cat', 'catastropha']

list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH))
## ['bat', 'cat', 'rat']

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
## ['bat', 'cat', 'rat', 'rate']

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
## ['rate']
## How keys() works
## keys([prefix, [wildcard, [how]]]) => yield strings
## If prefix (a string) is given, then only words sharing this prefix are yielded.
## If wildcard (a single character) is given, then prefix is treated as a simple pattern with the selected wildcard.
## The optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]: Only strings with the same length as the pattern are yielded. In other words, the pattern is matched literally.
## MATCH_AT_LEAST_PREFIX: Strings whose length is greater than or equal to the pattern's length are yielded.
## MATCH_AT_MOST_PREFIX: Strings whose length is less than or equal to the pattern's length are yielded.

 

posted @ 2020-11-24 23:45  博二爷