【天天数据结构和算法】PHP中trie数据结构的使用场景和代码实例

一、trie介绍

Trie树，又称字典树，单词查找树或者前缀树，是一种用于快速检索的多叉树结构，如英文字母的字典树是一个26叉树，数字的字典树是一个10叉树。

Trie一词来自retrieve，发音为/tri:/ “tree”，也有人读为/traɪ/ “try”。

Trie树可以利用字符串的公共前缀来节约存储空间。如下图所示，该trie树用10个节点保存了6个字符串tea，ten，to，in，inn，int。

在该trie树中，字符串in，inn和int的公共前缀是“in”，因此可以只存储一份“in”以节省空间。当然，如果系统中存在大量字符串且这些字符串基本没有公共前缀，则相应的trie树将非常消耗内存，这也是trie树的一个缺点。

Trie树的基本性质可以归纳为：

（1）根节点不包含字符，除根节点意外每个节点只包含一个字符。

（2）从根节点到某一个节点，路径上经过的字符连接起来，为该节点对应的字符串。

（3）每个节点的所有子节点包含的字符串不相同。

二、trie的优点

1.查找或匹配字符串时时间复杂度只和树的深度有关，和节点数量无关。

2.所以，在查找海量数据，或匹配数据，或过滤数据中有很好的实现。

三、php代码实现

Trie.php

<?php

/**
 * Created by PhpStorm.
 * User: jysdhr
 * Date: 2017/7/4
 * Time: 9:57
 * Description:PHP实现trie字典数据结构
 */
include "TrieNode.php";

class Trie
{
    private $root;

    public function __construct()
    {

        $this->root = new TrieNode();
    }

    public function foreach_trie()
    {
        echo "<pre>";
        print_r($this->root);
    }

    public function insert($str)
    {
        $this->__insert($this->root, $str);
    }

    public function search(string $str): bool
    {
        return $this->__search($this->root, $str);
    }

    private function __insert(&$node, $str)
    {
        if (strlen($str) == 0)
            return;
        //第一个字符，插入哪一个分叉
        $k = ord(substr($str, 0, 1)) - ord('a');
        if (!isset($node->childs[$k]) || $node->childs[$k] == NULL) {
            //如果分叉不存在则重新开辟分叉
            $node->childs[$k] = new TrieNode();
            //记录字符
            $node->childs[$k]->nodeChar = $k;
            $node->childs[$k]->is_end = strlen($str) == 1 ? true : false;
        }
        $nextWord = substr($str, 1);
        $this->__insert($node->childs[$k], $nextWord);
    }

    /**
     * @Description:查找str是否存在树中
     * @User:jysdhr
     */
    private function __search($node, $str)
    {
        if (strlen($str) == 0)
            return false;
        //首先对str进行拆分
        $k = ord(substr($str, 0, 1)) - ord('a');
        if (isset($node->childs[$k])) {
            $nextWord = substr($str, 1);
            if (strlen($str) == 1) {
                //匹配最后一个字符
                if ($node->childs[$k]->is_end)
                    return true;
            }
            return $this->__search($node->childs[$k], $nextWord);
        }
        return false;
    }

}

TrieNode.php

<?php

/**
 * Created by PhpStorm.
 * User: jysdhr
 * Date: 2017/7/4
 * Time: 10:01
 * Description:
 */
class TrieNode
{
    public  $nodeChar,$childs,$is_end;
    public function __construct()
    {
        $this->childs = array();
    }
}

testTrie.php

<?php
// 测试文件demo.php
include "Trie.php";
$str = file_get_contents('bbe.txt');//将整个文件内容读入到一个字符串中
$badword = explode(" ", $str);//转换成数组
$trie = new Trie();
foreach ($badword as $word)
    $trie->insert($word);

// array_combine() 函数通过合并两个数组来创建一个新数组，其中的一个数组是键名，另一个数组的值为键值。如果其中一个数组为空，或者两个数组的元素个数不同，则该函数返回 false。
// array_fill() 函数用给定的值填充数组，返回的数组有 number 个元素，值为 value。返回的数组使用数字索引，从 start 位置开始并递增。如果 number 为 0 或小于 0，就会出错。
//$badword1 = array_combine($badword,array_fill(0,count($badword),'*'));

$test_str = 'knowledgeasdad';
$start_time = microtime(true);

var_dump(in_array($test_str,$badword));
$end_time = microtime(true);
echo ($end_time-$start_time).'</br>';

$start_time1 = microtime(true);
var_dump($trie->search($test_str));
$end_time1 = microtime(true);

echo ($end_time1-$start_time1).'</br>';

$start_time2 = microtime(true);
foreach ($badword as $value){
    if ($value == $test_str){
        echo '1';
        break;
    }
}
$end_time2 = microtime(true);

echo ($end_time2-$start_time2).'</br>';

?>

以上是匹配圣经中的一个单词的实例，in_array是遍历匹配，当匹配单词越晚出现，耗时越久，当匹配单词不存在时，耗时不可接受。

trie的表现比较稳定，无论是否存在，或匹配单词出现的早晚都与运行时间无太大影响，只与单词长度有关。（效果最好，最稳定）

遍历查找基本与in_array效果相当。

四、总结

所以在项目开发中，在做到敏感词屏蔽，统计词频，字典等功能时，可以考虑运用trie数据结构生成一个trie树序列化存储起来，待更新词库时重新维护trie树，是不是觉得也不是很困难呢，加油。

posted @ 2017-07-19 16:51 jysdhr 阅读(675) 评论(0) 编辑收藏举报

刷新页面返回顶部

jysdhr

【天天数据结构和算法】PHP中trie数据结构的使用场景和代码实例

公告