prefixspan python

Source: https://github.com/chuanconggao/PrefixSpan-py

 

API Usage

Alternatively, you can use the algorithms via API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.
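As a rough illustration of what the numbers in the results mean, the sketch below counts the support of a pattern by hand. It assumes that a pattern is supported by a sequence when its items appear in the same order with gaps allowed, which is consistent with the outputs listed in this post. The helpers `is_subsequence` and `support` are hypothetical names used for illustration only; they are not part of the prefixspan package.

```python
from typing import Sequence


def is_subsequence(pattern: Sequence[int], sequence: Sequence[int]) -> bool:
    """Return True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" consumes the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)


def support(pattern: Sequence[int], db: Sequence[Sequence[int]]) -> int:
    """Number of sequences in db that contain pattern (hypothetical helper)."""
    return sum(is_subsequence(pattern, seq) for seq in db)


db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

print(support([1, 2], db))     # 3
print(support([1, 3, 4], db))  # 2
```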

Setting length limits:

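ps = PrefixSpan(db)
ps.minlen = 3  # report only patterns with at least 3 items
ps.maxlen = 5  # report only patterns with at most 5 items
print("-" * 66)  # separator line in the console output
print(ps.frequent(2))
# Note: the listings below match the upstream README, which runs the calls
# with the default length limits. With minlen = 3 as set above, patterns
# shorter than 3 items should be absent from the result.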
# [(2, [0]),
#  (4, [1]),
#  (3, [1, 2]),
#  (2, [1, 2, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4]),
#  (2, [1, 4]),
#  (2, [1, 1]),
#  (2, [1, 1, 1]),
#  (3, [2]),
#  (2, [2, 2]),
#  (2, [3]),
#  (2, [3, 4]),
#  (2, [4])]

print(ps.topk(5))
# [(4, [1]),
#  (3, [2]),
#  (3, [1, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4])]
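For readers curious how such a listing can be produced, here is a minimal sketch of the prefix-projection idea in plain Python. It is not the library's implementation: the function name `prefixspan_frequent` is hypothetical, there are no length limits, and the patterns come out in a different order than in the listing above.

```python
from collections import defaultdict


def prefixspan_frequent(db, minsup):
    """Return (support, pattern) pairs for every pattern with support >= minsup."""
    results = []

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: one entry per
        # sequence containing the prefix, starting right after its earliest match.
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = db[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:  # only the first occurrence per sequence counts
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item in sorted(occurrences):
            entries = occurrences[item]
            if len(entries) >= minsup:
                pattern = prefix + [item]
                results.append((len(entries), pattern))
                mine(pattern, entries)  # grow the pattern inside its projected database

    mine([], [(i, 0) for i in range(len(db))])
    return results


db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

for sup, pattern in prefixspan_frequent(db, 2):
    print(sup, pattern)
```

On this database the sketch finds the same 14 patterns and supports as the frequent(2) listing.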


print(ps.frequent(2, closed=True))

print(ps.topk(5, closed=True))


print(ps.frequent(2, generator=True))

print(ps.topk(5, generator=True))

Closed Patterns and Generator Patterns

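A frequent sequential pattern is a pattern that appears in at least minsup sequences of a sequence database, where the minimum support minsup is a parameter set by the user.

A frequent closed sequential pattern is a frequent sequential pattern that is not included in another sequential pattern having exactly the same support.

Algorithms such as PrefixSpan find frequent sequential patterns. Algorithms such as BIDE+ find frequent closed sequential patterns. BIDE+ is usually much faster than PrefixSpan because it uses pruning techniques to avoid generating all sequential patterns. In addition, the set of closed patterns is usually much smaller than the set of all sequential patterns, so BIDE+ is also more memory-efficient.

Another important point is that closed sequential patterns are a compact and lossless representation of all sequential patterns. The set of closed sequential patterns is usually much smaller, yet the whole set of sequential patterns can be recovered from it without any loss of information, which is very convenient.

Here is a simple example.

Consider the following four sequences: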

a  b  c  d  e
a  b  d
b  e  a  
b  c  d  e

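Let minsup = 2.

b c is a frequent sequential pattern because it appears in two sequences (its support is 2). b c is not a closed sequential pattern because it is contained in a larger sequential pattern, b c d, that has the same support.

b c d also has a support of 2. It is not a closed sequential pattern either, because it is contained in a larger sequential pattern, b c d e, that has the same support. b c d e is a closed sequential pattern because it is not contained in any other sequential pattern with the same support.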
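The example can be checked with a small brute-force sketch. It enumerates every subsequence of the four sequences, keeps those with support of at least minsup, and then drops any pattern that has a longer super-pattern with the same support. This is only practical for tiny inputs and is not how PrefixSpan or BIDE+ work internally; all helper names are hypothetical.

```python
from itertools import combinations

sequences = ["abcde", "abd", "bea", "bcde"]
minsup = 2


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(ch in it for ch in pattern)


def support(pattern):
    return sum(is_subsequence(pattern, s) for s in sequences)


# Candidate patterns: every non-empty subsequence of every input sequence.
candidates = {
    "".join(s[i] for i in idx)
    for s in sequences
    for n in range(1, len(s) + 1)
    for idx in combinations(range(len(s)), n)
}

supports = {p: support(p) for p in candidates}
frequent = {p: sup for p, sup in supports.items() if sup >= minsup}

# A pattern is closed if no longer frequent pattern contains it with the same support.
closed = {
    p: sup
    for p, sup in frequent.items()
    if not any(
        len(q) > len(p) and frequent[q] == sup and is_subsequence(p, q)
        for q in frequent
    )
}

print(frequent["bc"], "bc" in closed)      # 2 False
print(frequent["bcd"], "bcd" in closed)    # 2 False
print(frequent["bcde"], "bcde" in closed)  # 2 True
```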

The closed patterns are much more compact due to the smaller number.

  • A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

The generator patterns are even more compact due to both the smaller number and the shorter lengths.

  • A pattern is a generator if there is no sub-pattern with the same frequency.

  • Due to the high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2
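The three command outputs above can be cross-checked with a brute-force sketch on the same four-sequence database. One assumption is made explicit here: the empty pattern, which every sequence contains, is counted as a sub-pattern in the generator test. That assumption is inferred from the listed output (pattern 1 with frequency 4 is missing from the generator list), not taken from the library's code. All names are hypothetical.

```python
from itertools import combinations

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]
minsup = 2


def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern):
    return sum(is_subsequence(pattern, seq) for seq in db)


# Enumerate every non-empty subsequence as a candidate (tiny inputs only).
candidates = {
    tuple(seq[i] for i in idx)
    for seq in db
    for n in range(1, len(seq) + 1)
    for idx in combinations(range(len(seq)), n)
}
frequent = {p: s for p, s in ((p, support(p)) for p in candidates) if s >= minsup}

# Assumption: the empty pattern (frequency = number of sequences) is a valid sub-pattern.
with_empty = {**frequent, (): len(db)}

closed = {
    p for p, s in frequent.items()
    if not any(len(q) > len(p) and frequent[q] == s and is_subsequence(p, q)
               for q in frequent)
}
generators = {
    p for p, s in frequent.items()
    if not any(len(q) < len(p) and with_empty[q] == s and is_subsequence(q, p)
               for q in with_empty)
}

print(sorted(closed))
print(sorted(generators))
print(sorted(closed & generators))
```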

Note: there are many algorithms for pattern mining.

SPMF offers implementations of the following data mining algorithms.

Sequential Pattern Mining

These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

Sequential Rule Mining

These algorithms discover sequential rules in a set of sequences.

Sequence Prediction

These algorithms predict the next symbol(s) of a sequence based on a set of training sequences.

Itemset Mining

These algorithms discover interesting itemsets (sets of values) that appear in a transaction database (database records containing symbolic data). For a good overview of itemset mining, please read this survey paper.

  • algorithms for discovering frequent itemsets in a transaction database.
  • algorithms for discovering frequent closed itemsets in a transaction database.
  • algorithms for recovering all frequent itemsets from frequent closed itemsets:
    • the LevelWise algorithm (Pasquier et al., 1999)
    • the DFI-Growth algorithm (___ et al., 2018)
  • algorithms for discovering frequent maximal itemsets in a transaction database.
    • the FPMax algorithm (Grahne and Zhu, 2003)
    • the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
  • algorithms for mining frequent itemsets with multiple minimum supports
  • algorithms for mining generator itemsets in a transaction database
    • the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
    • the Pascal algorithm for mining frequent itemsets and identifying at the same time which ones are generators (Bastide et al., 2002)
    • the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
  • algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
    • the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Rountree, 2005)
    • the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
    • the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
    • the RP-Growth algorithm for mining rare itemsets (Tsang et al., 2011)
  • algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
    • the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
    • the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Fournier-Viger, 2013)
  • algorithms to discover frequent itemsets in a stream
    • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
    • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
    • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
  • the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
  • algorithms to discover fuzzy frequent itemsets in a quantitative transaction database

Periodic Pattern Mining

These algorithms discover patterns that periodically appear in a sequence of complex events (also called a transaction database).

  • the PFPM algorithm (Fournier-Viger et al, 2016a) for mining frequent periodic patterns in a sequence of transactions (a transaction database)
  • the PHM algorithm (Fournier-Viger et al, 2016b) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information

Episode Mining

These algorithms discover episodes that appear in a single sequence of complex events.

  • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
  • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information

High-Utility Pattern Mining

These algorithms discover patterns having a high utility (importance) in different kinds of data. For a good overview of high utility itemset mining, you may read this survey paper, and the high utility-pattern mining book.

  • algorithms for mining high-utility itemsets in a transaction database having profit information
  • algorithm for efficiently mining high-utility itemsets with length constraints in a transaction database
  • algorithm for mining correlated high-utility itemsets in a transaction database
  • algorithm for mining high-utility itemsets in a transaction database containing negative unit profit values
  • algorithm for mining frequent high-utility itemsets in a transaction database
  • algorithm for mining on-shelf high-utility itemsets in a transaction database containing information about time periods of items
  • algorithm for incremental high-utility itemset mining in a transaction database
  • algorithm for mining concise representations of high-utility itemsets in a transaction database
  • algorithm for mining the skyline high-utility itemsets in a transaction database
  • algorithm for mining the top-k high-utility itemsets in a transaction database
  • algorithms for mining the top-k high utility itemsets from a data stream with a window
  • algorithm for mining frequent skyline utility patterns in a transaction database
  • algorithm for mining quantitative high utility itemsets in a transaction database
  • algorithm for mining high-utility sequential rules in a sequence database 
  • algorithm for mining high-utility sequential patterns in a sequence database 
    • the USPAN algorithm (Yin et al. 2012)
  • algorithm for mining high-utility probability sequential patterns in a sequence database 
  • algorithm for mining high-utility itemsets in a transaction database using evolutionary algorithms
  • algorithm for mining high average-utility itemsets in a transaction database
    • the HAUI-Miner algorithm for mining high average-utility itemsets (Lin et al, 2016)
    • the EHAUPM algorithm for mining high average-utility itemsets (Lin et al, 2017)
    • the HAUI-MMAU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2016)
    • the MEMU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2018)
  • algorithms for mining high utility episodes in a sequence of complex events (a transaction database)
    • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
    • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information
  • algorithms for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
  • algorithms for discovering irregular high utility itemsets (non periodic patterns) in a transaction database with utility information
    • the PHM_irregular algorithm, which is a simple variation of the PHM algorithm
  • algorithm for discovering local high utility itemsets in a database with utility information and timestamps
  • algorithm for discovering peak high utility itemsets in a database with utility information and timestamps

Association Rule Mining

These algorithms discover interesting associations between symbols (values) in a transaction database (database records with binary attributes).

  • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
  • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
  • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
  • an algorithm for mining perfectly sporadic association rules (Koh & Rountree, 2005)
  • an algorithm for mining closed association rules (Szathmary et al. 2006).
  • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
  • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et al. 2006)
  • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
  • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b)
  • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules, which is described in Fournier-Viger, 2012b)
  • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d)
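To make the measures mentioned in this list concrete, the following sketch computes the support, confidence and lift of a single rule on a made-up transaction database. The transactions and the helper names are hypothetical; this is the textbook definition of the measures, not code from SPMF.

```python
from fractions import Fraction

# Hypothetical toy transaction database (each transaction is a set of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    count = sum(itemset <= t for t in transactions)
    return Fraction(count, len(transactions))


def rule_measures(antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    sup = support(antecedent | consequent)
    conf = sup / support(antecedent)
    lift = conf / support(consequent)
    return sup, conf, lift


sup, conf, lift = rule_measures({"bread"}, {"milk"})
print(sup, conf, lift)  # 1/2 2/3 8/9
```

A lift below 1, as here, indicates that the two sides of the rule occur together less often than they would if they were independent.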

Stream pattern mining

These algorithms discover various kinds of patterns in a stream (an infinite sequence of database records, i.e. transactions).

  • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
  • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
  • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • algorithms for mining the top-k high utility itemsets from a data stream with a window

Clustering

These algorithms automatically find clusters in different kinds of data

  • the original K-Means algorithm (MacQueen, 1967)
  • the Bisecting K-Means algorithm (Steinbach et al, 2000)
  • algorithms for density-based clustering
    • the DBScan algorithm (Ester et al., 1996)
    • the Optics algorithm to extract a cluster ordering of points, which can then be used to generate DBScan style clusters and more (Ankerst et al, 1999)
  • hierarchical clustering algorithm
  • a tool called Cluster Viewer for visualizing clusters
  • a tool called Instance Viewer for visualizing the input of clustering algorithms

Time series mining

These algorithms perform various tasks to analyze time series data

    • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series. Note that converting a set of time series with SAX yields a sequence database, to which the traditional algorithms for sequential rule mining and sequential pattern mining can then be applied (SAX, 2007).
    • algorithms for calculating the prior moving average of a time series (to remove noise)
    • algorithms for calculating the cumulative moving average of a time series (to remove noise)
    • algorithms for calculating the central moving average of a time series (to remove noise)
    • an algorithm for calculating the median smoothing of a time series (to remove noise)
    • an algorithm for calculating the exponential smoothing of a time series (to remove noise)
    • an algorithm for calculating the min max normalization of a time series
    • an algorithm for calculating the autocorrelation function of a time series
    • an algorithm for calculating the standardization of a time series
    • an algorithm for calculating the first and second order differencing of a time series
    • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
    • an algorithm for calculating the linear regression of a time series (using the least squares method)
    • an algorithm for splitting a time series into segments of a given length
    • an algorithm for splitting a time series into a given number of segments
    • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
    • a tool called Time Series Viewer for visualizing time series
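Three of the transformations named in this list are simple enough to sketch in a few lines of Python. The code below follows the usual textbook definitions of min-max normalization, first-order differencing and piecewise aggregate approximation. It is not SPMF's Java code, the function names are hypothetical, and SPMF may treat edge cases (constant series, lengths not divisible by the segment count) differently.

```python
def min_max_normalize(series):
    """Rescale values linearly to [0, 1]; assumes the series is not constant."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]


def first_order_difference(series):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]


def piecewise_aggregate(series, segments):
    """Replace each of `segments` equal-length chunks by its mean.

    Assumes len(series) is divisible by segments.
    """
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


series = [2.0, 4.0, 6.0, 8.0, 10.0, 4.0]

print(min_max_normalize(series))       # [0.0, 0.25, 0.5, 0.75, 1.0, 0.25]
print(first_order_difference(series))  # [2.0, 2.0, 2.0, 2.0, -6.0]
print(piecewise_aggregate(series, 3))  # [3.0, 7.0, 7.0]
```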
 
posted @ 2019-04-12 15:49  bonelee