Implementing an Inverted Index

An inverted index is commonly used to speed up searching. A typical use case looks like this:

index	content
1	A, B, C
2	A, D
3	B, E
4	C, D

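In the table above, finding which rows contain D is cumbersome: it requires scanning every row and cannot be done with a simple lookup. Suppose we convert the table into the following form: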

character	indexes
A	1,2
B	1,3
C	1,4
D	2,4
E	3
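
The conversion from the first table to this one can be sketched in plain Python before bringing in pandas. This is only a minimal illustration: the names forward and build_inverted_index are made up for this example, and the data is the sample table from above.

```python
from collections import defaultdict

# The forward table from the example: row index -> characters in that row.
forward = {
    1: ["A", "B", "C"],
    2: ["A", "D"],
    3: ["B", "E"],
    4: ["C", "D"],
}


def build_inverted_index(rows):
    """Map each character to the sorted list of row indexes that contain it."""
    inverted = defaultdict(set)
    for row_index, characters in rows.items():
        for character in characters:
            inverted[character].add(row_index)
    return {character: sorted(indexes) for character, indexes in inverted.items()}


inverted = build_inverted_index(forward)
print(inverted["D"])  # [2, 4]
```

Once the index is built, finding the rows that contain D is a single dictionary lookup instead of a scan over every row.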

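This way, an index can be built directly on character, which speeds up the search. It is a standard space-for-time trade-off. The Demo below shows how to do this with pandas.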

import pandas as pd


def inverted_index_test():
    df1 = pd.DataFrame({"A": ['a c a', 'b d a'], "B": ["江苏", "浙江"]})
    print(df1)
    print("Split column A")
    # Split each string on spaces; expand=True puts every token in its own column.
    df2 = df1['A'].str.split(' ', expand=True)
    print(df2)
    print("Stack")
    # Stack the token columns into one Series indexed by (row, token position).
    df3 = df2.stack()
    print(df3)
    print("Trim the index")
    # Drop index level 1 (the token position) so only the original row number remains.
    df4 = df3.reset_index(level=1, drop=True)
    df4 = pd.DataFrame(df4)
    print(df4.index)
    print(df4)
    print("Join")
    # Join on the index to bring back the columns of the original rows.
    df5 = df4.join(df1)
    print(df5)
    print("Group by")

    def print_inverted_index(df):
        # Collect column B for every row in one token's group; repeats are kept.
        return [str(b) for b in df["B"]]

    # Column 0 holds the tokens, so each group is the posting list of one token.
    inverted_index_series = df5.groupby(0).apply(print_inverted_index)
    print(inverted_index_series)


inverted_index_test()

Output:

       A   B
0  a c a  江苏
1  b d a  浙江
Split column A
   0  1  2
0  a  c  a
1  b  d  a
Stack
0  0    a
   1    c
   2    a
1  0    b
   1    d
   2    a
dtype: object
Trim the index
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')
   0
0  a
0  c
0  a
1  b
1  d
1  a
Join
   0      A   B
0  a  a c a  江苏
0  c  a c a  江苏
0  a  a c a  江苏
1  b  b d a  浙江
1  d  b d a  浙江
1  a  b d a  浙江
Group by
0
a    [江苏, 江苏, 浙江]
b            [浙江]
c            [江苏]
d            [浙江]
dtype: object
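
Note that in the output above the entry for a lists 江苏 twice, because a appears twice in the first row. If duplicates are not wanted, a more compact variant is possible. The following is a sketch, not the Demo's original method: it assumes a pandas version that provides DataFrame.explode (0.25 or later), and the function name build_inverted_index_explode and the column name token are made up for illustration.

```python
import pandas as pd


def build_inverted_index_explode(df, text_col="A", value_col="B"):
    """Map each token in text_col to a list of the unique value_col entries."""
    # Split on whitespace: each cell becomes a list of tokens.
    tokens = df[text_col].str.split()
    # explode turns every list element into its own row, repeating the other columns.
    exploded = df.assign(token=tokens).explode("token")
    # Keep one row per (token, value) pair so repeated tokens do not repeat values.
    deduped = exploded.drop_duplicates(subset=["token", value_col])
    return deduped.groupby("token")[value_col].agg(list)


df1 = pd.DataFrame({"A": ["a c a", "b d a"], "B": ["江苏", "浙江"]})
index = build_inverted_index_explode(df1)
print(index["a"])  # ['江苏', '浙江']
```

Here explode replaces the split, stack, reset_index and join steps of the Demo, and drop_duplicates removes the repeated entry.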

posted @ 2023-01-16 09:42 身带吴钩
