prefixspan python
from:https://github.com/chuanconggao/PrefixSpan-py
API Usage
Alternatively, you can use the algorithms via API.
from prefixspan import PrefixSpan
db = [
[0, 1, 2, 3, 4],
[1, 1, 1, 3, 4],
[2, 1, 2, 2, 0],
[1, 1, 1, 2, 2],
]
ps = PrefixSpan(db)
For details of each parameter, please refer to the PrefixSpan
class in prefixspan/api.py
.
设置长度限制:
ps = PrefixSpan(db)
ps.minlen = 3
ps.maxlen = 5
print("?"*66)
------------------
print(ps.frequent(2))
# [(2, [0]),
# (4, [1]),
# (3, [1, 2]),
# (2, [1, 2, 2]),
# (2, [1, 3]),
# (2, [1, 3, 4]),
# (2, [1, 4]),
# (2, [1, 1]),
# (2, [1, 1, 1]),
# (3, [2]),
# (2, [2, 2]),
# (2, [3]),
# (2, [3, 4]),
# (2, [4])]
print(ps.topk(5))
# [(4, [1]),
# (3, [2]),
# (3, [1, 2]),
# (2, [1, 3]),
# (2, [1, 3, 4])]
print(ps.frequent(2, closed=True))
print(ps.topk(5, closed=True))
print(ps.frequent(2, generator=True))
print(ps.topk(5, generator=True))
Closed Patterns and Generator Patterns
一个 频繁的顺序模式 是一种出现在序列数据库的至少“minsup”序列中的模式,其中 最小支持度 是用户设置的参数。
一个 频繁闭合序列模式 是一种频繁的顺序模式,使得它不包括在具有完全相同支持的另一顺序模式中。
算法如 的PrefixSpan 找到频繁的顺序模式。算法如 BIDE+找到频繁的闭合序列模式。 BIDE +通常比PrefixSpan快得多,因为它使用修剪技术来避免生成所有顺序模式。此外,闭合模式集通常比连续模式集小得多,因此BIDE +也更具存储效率。
另一个重要的事情是,闭合序列模式是所有序列模式的紧凑和无损表示。这意味着闭合序列模式的集合通常要小得多,但它是无损的,这意味着它允许恢复整个连续模式集(没有信息丢失),这非常方便。
我可以举个简单的例子。
让我们考虑4个序列:
a b c d e
a b d
b e a
b c d e
让我们说minsup = 2。
b c
是一种频繁的序列模式,因为它出现在两个序列中(它支持2)。 b c
不是一个封闭的顺序模式,因为它包含在一个更大的顺序模式中 b c d
得到同样的支持。
b c d
它也是一个支持2.它也不是一个封闭的顺序模式,因为它包含在一个更大的顺序模式中 b c d e
得到同样的支持。 b c d e
是一个封闭的顺序模式,因为它没有包含在具有相同支持的任何其他顺序模式中。
The closed patterns are much more compact due to the smaller number.
- A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat
0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2
The generator patterns are even more compact due to both the smaller number and the shorter lengths.
-
A pattern is generator if there is no sub-pattern with the same frequency.
-
Due to the high compactness, generator patterns are useful as features for classification, etc.
prefixspan-cli frequent 2 --generator test.dat
0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2
There are patterns that are both closed and generator.
prefixspan-cli frequent 2 --closed --generator test.dat
0 : 2
备注:模式挖掘有很多算法。
SPMF offers implementations of the following data mining algorithms.
Sequential Pattern Mining
These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.
- algorithms for mining sequential patterns in a sequence database
- the CM-SPADE algorithm (Fournier-Viger et al, 2014, powerpoint)
- the CM-SPAM algorithm (Fournier-Viger et al, 2014, powerpoint)
- the FAST algorithm (Salvemini et al, 2011)
- the GSP algorithm (Srikant et al., 1996)
- the LAPIN (aka LAPIN-SPAM) algorithm (Yang et al., 2005)
- the PrefixSpan algorithm (Pei et al., 2004)
- the SPADE algorithm (Zaki et al., 2001)
- the SPAM algorithm (Ayres et al., 2002)
- algorithms for mining closed sequential patterns in a sequence database
- the ClaSP algorithm (Gomariz et al., 2013)
- the CM-ClaSP algorithm (Fournier-Viger et al, 2014, powerpoint)
- the CloFAST algorithm (Fumarola et al, 2016)
- the CloSpan algorithm (Yan et al., 2003)
- the BIDE+ algorithm(Wang et al., 2007)
- algorithms for mining maximal sequential patterns in a sequence database
- the VMSP algorithm (Fournier-Viger et al, 2014, powerpoint)
- the MaxSP algorithm (Fournier-Viger et al., 2013, powerpoint).
- algorithms for mining the top-k sequential patterns in a sequence database
- the TKS algorithm (Fournier-Viger et al., 2013, powerpoint).
- the TSP algorithm (Tzvetkoz et al., 2003).
- the Skopus algorithm for mining the top-k sequential patterns using leverage and significance (Petijean et al., 2016)
- algorithms for mining sequential generator patterns in a sequence database
- the VGEN algorithm (Fournier-Viger et al, 2014)
- the FEAT algorithm (Gao et al., 2008).
- the FSGP algorithm (Yi et al., 2011).
- algorithms for mining compressing sequential patterns
- the GoKrimp and SeqKrimp algorithms (Lam et al., 2012; Lam et al., 2014)
- algorithms for mining multidimensional sequential patterns in a multidimensional sequence database
- the SeqDIM algorithm for mining frequent multidimensional sequential patterns in a multi-dimensional sequence database (Pinto et al., 2001)
- the Songram et al. algorithm for mining frequent closed multidimensional sequential patterns in a multi-dimensional sequence database (Songram et al. 2006)
- the Fournier-Viger et al. algorithm, a sequential pattern mining algorithm that combines several features from well-known sequential pattern mining algorithms and also proposes some original features (Fournier-Viger et al., 2008):
- mining sequences with minimum support by database-projection (based on PrefixSpan, Pei et al., 2004)
- mining sequences with min/max time interval between events and min/max time length of a sequence (based on Hirate-Yamana, 2006)
- mining closed sequences (based on the BIDE+ algorithm by Wang et al. 2007)
- mining multi-dimensional sequences (based on Pinto et al. 2001)
- mining closed multi-dimensional sequences (based on Songram et al. 2006 and Pasquier et al., 1999)
- mining sequences with items having integer values and performing automatic clustering of these values (original extension described in Fournier-Viger et al., 2008)
- algorithm for mining high-utility sequential patterns in a sequence database
- the USPAN algorithm (Yin et al. 2012)
- algorithm for mining high-utility probability sequential patterns in a sequence database
- the PHUSPM algorithm (Zhang et al. 2018)
- the UHUSPM algorithm (Zhang et al. 2018)
- algorithm for progressive sequential pattern mining with convergence guarantees
- the ProSecCo algorithm (Servan-Schreiber et al. 2018)
- the Occur algorithm for finding all occurrences of some sequential patterns in sequences by post-processing.
Sequential Rule Mining
These algorithms discover sequential rules in a set of sequences.
- algorithms for mining sequential rules in a sequence database
- the ERMiner algorithm (Fournier-Viger et al., 2014)
- the RuleGrowth algorithm (Fournier-Viger et al., 2011, Fournier-Viger et al., 2015, powerpoint, video)
- the CMRules algorithm (Fournier-Viger et al., 2010, powerpoint)
- the CMDeo algorithm (Fournier-Viger et al., 2010)
- the RuleGen algorithm (Zaki et al, 2001)
- algorithms for mining sequential rules in a sequence database with the window size constraint
- the TRuleGrowth algorithm (Fournier-Viger, 2012a, Fournier-Viger et al., 2015)
- algorithms for mining top-k sequential rules in a sequence database
- the TopSeqRules algorithm for mining the top-k sequential rules (Fournier-Viger et al., 2011, powerpoint)
- the TopSeqClassRules algorithm for mining the top-k class sequential rules (a variation of Fournier-Viger et al., 2011)
- the TNS algorithm for mining the top-k non-redundant sequential rules (Fournier-Viger 2013)
- algorithm for mining high-utility sequential rules in a sequence database
- the HUSRM algorithm (Zida et al., 2015)
Sequence Prediction
These algorithms predict the next symbol(s) of a sequence based on a set of training sequences
- algorithms for predicting the next symbol of a sequence based on a set of training sequences
- the Compact Prediction Tree+ (CPT+) algorithm (Gueniche et al., 2015, powerpoint)
- the Compact Prediction Tree (CPT) algorithm (Gueniche et al., 2013)
- the First order Markov Chains (PPM - order 1) (Clearly et al, 1984)
- the Dependency Graph (DG) (Padmanabhan, 1996)
- the All-k-Order Markov Chains (AKOM) (Pitkow, 1999)
- the TDAG (Laird & Saul, 1994)
- the LZ78 (Ziv, 1978)
Itemset Mining
These algorithms discover interesting itemsets (sets of values) that appear in a transaction database (database records containing symbolic data). For a good overview of itemset mining, please read this survey paper.
- algorithms for discovering frequent itemsets in a transaction database.
- the Apriori algorithm (Agrawal & Srikant, 1994)
- the AprioriTID algorithm (Agrawal & Srikant, 1994)
- the FP-Growth algorithm (Han et al., 2004)
- the Eclat algorithm (Zaki, 2000)
- the dEclat algorithm (Zaki and Gouda, 2001, 2003)
- the Relim algorithm (Borgelt, 2005)
- the H-Mine algorithm (Pei et al., 2007)
- the LCMFreq algorithm (Uno et al., 2004)
- the PrePost and PrePost+ algorithms (Deng et al., 2012, Deng et Lv, 2015)
- the FIN algorithm (Deng et al., 2014)
- the DFIN algorithm (Deng et al., 2016)
- the NegFIN algoritm (Aryabarzan et al., 2018)
- algorithms for discovering frequent closed itemsets in a transaction database.
- the FPClose algorithm (Grahne and Zhu, 2005)
- the Charm algorithm (Zaki and Hsiao, 2002)
- the dCharm algorithm (Zaki and Gouda, 2001)
- the DCI_Closed algorithm (Lucchese et al, 2004)
- the LCM algorithm (Uno et al., 2004)
- the AprioriClose aka Close algorithm (Pasquier et al., 1999)
- the AprioriTID Close algorithm (Pasquier et al., 1999, Agrawal & Srikant, 1994)
- algorithms for recovering all frequent itemsets from frequent closed itemsets:
- the LevelWise algorithm (Pasquier et al., 1999)
- the DFI-Growth algorithm (___ et al., 2018)
- algorithms for discovering frequent maximal itemsets in a transaction database.
- the FPMax algorithm (Grahne and Zhu, 2003)
- the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
- algorithms for mining frequent itemsets with multiple minimum supports
- the MSApriori algorithm (Liu et al, 1999)
- the CFPGrowth++ algorithm (Uday & Reddy, 2011, Hu & Chen, 2006)
- algorithms for mining generator itemsets in a transaction database
- the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
- the Pascal algorithm for mining frequent itemsets, and identifying at the same time which one are generators (Bastide et al., 2002)
- the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
- algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
- the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Roundtree, 2005)
- the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
- the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
- the RP-Growth algorithm for mining rare itemsets (Tsang et al., 2011)
- algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
- the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
- the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Fournier-Viger, 2013, powerpoint)
- algorithms to discover frequent itemsets in a stream
- the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
- the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
- the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
- the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
- the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
- algorithms to discover fuzzy frequent itemsets in a quantitative transaction database
- the FFI-Miner algorithm for mining fuzzy itemsets (Lin et al., 2015)
- the MFFI-Miner algorithm for mining multiple fuzzy itemsets (Lin et al., 2016)
Periodic Pattern Mining
These algorithms discover patterns that periodically appear in a sequence of complex events (also called a transaction database)
- the PFPM algorithm (Fournier-Viger et al, 2016a, powerpoint, video ) for mining frequent periodic patterns in a sequence of transactions (a transaction database))
- the PHM algorithm (Fournier-Viger et al, 2016b, powerpoint) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
Episode Mining
These algorithms discover episodes that appear in a single sequence of complex events.
- the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
- the US-SPAN algorithm (Wu et al., 2013 ) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information
High-Utility Pattern Mining
These algorithms discover patterns having a high utility (importance) in different kinds of data. For a good overview of high utility itemset mining, you may read this survey paper, and the high utility-pattern mining book.
- algorithms for mining high-utility itemsets in a transaction database having profit information
- the EFIM algorithm (Zida et al. 2016, Zida et al., 2015, powerpoint)
- the FHM algorithm (Fournier-Viger et al., 2014, powerpoint)
- the HUI-Miner algorithm (Liu & Qu, 2012)
- the HUP-Miner algorithm (Krishnamoorthy, 2014)
- the mHUIMiner algorithm (Peng et al., 2017)
- the HMiner algorithm (Krishnamoorty, 2017)
- the ULB-Miner algorithm (Duong et al, 2018)
- the UFH algorithm (Dawar et al, 2017)
- the IHUP algorithm (Ahmed et al., 2009)
- the Two-Phase algorithm (Liu et al., 2005)
- the UP-Growth algorithm (Tseng et al., 2011)
- the UP-Growth+ algorithm (Tseng et al., 2013)
- the UP-Hist algorithm (Dawar et al., 2015)
- the d2HUP algorithm (Liu et al, 2012)
- algorithm for efficiently mining high-utility itemsets with length constraints in a transaction database
- the FHM+ algorithm (Fournier-Viger et al, 2016, powerpoint)
- algorithm for mining correlated high-utility itemsets in a transaction database
- the FCHM_bond algorithm, to use the bond measure (Fournier-Viger et al, 2016, powerpoint, Fournier-Viger 2018 et al., to appear, video )
- the FCHM_allconfidence algorithm, to use the all-confidence measure (Fournier-Viger et al, 2016, powerpoint, Fournier-Viger 2018 et al., to appear)
- algorithm for mining high-utility itemsets in a transaction database containing negative unit profit values
- the FHN algorithm (Fournier-Viger et al., 2014, powerpoint)
- the HUINIV-Mine algorithm (Chu et al., 2009)
- algorithm for mining frequent high-utility itemsets in a transaction database
- the FHMFreq algorithm, a variation of the FHM algorithm (Fournier-Viger et al., 2014)
- algorithm for mining on-shelf high-utility itemsets in a transaction database containing information about time periods of items
- the FOSHU algorithm (Fournier-Viger et al., 2015, powerpoint)
- the TS-HOUN algorithm (Lan et al., 2014)
- algorithm for incremental high-utility itemset mining in a transaction database
- the EIHI algorithm (Fournier-Viger et al., 2015, powerpoint)
- the HUI-LIST-INS algorithm (Lin et al., 2014)
- algorithm for mining concise representations of high-utility itemsets in a transaction database
- the HUG-Miner algorithm (Fournier-Viger et al., 2014, powerpoint) for mining high-utility generators
- the GHUI-Miner algorithm (Fournier-Viger et al., 2014, powerpoint) for mining generators of high-utility itemsets
- the MinFHM algorithm (Fournier-Viger et al., 2016, powerpoint, video ) for mining minimal high-utility itemsets
- the EFIM-Closed algorithm (Fournier-Viger et al., 2016, powerpoint) for mining closed high-utility itemsets
- the CHUI-Miner algorithm (Wu et al., 2015) for mining closed high-utility itemsets
- the CHUD algorithm for mining closed high-utility itemsets (Tseng et al., 2011/2015)
- the CHUI-Miner(Max) algorithm for mining maximal high utility itemsets (Wu et al., 2019).
- algorithm for mining the skyline high-utility itemsets in a transaction database
- the SkyMine algorithm (Goyal et al., 2015)
- algorithm for mining the top-k high-utility itemsets in a transaction database
- the TKU algorithm (Tseng et al., 2015), obtained from UP-Miner under GPL license
- the TKO-Basic algorithm (Tseng et al., 2015)
- algorithms for mining the top-k high utility itemsets from a data stream with a window
- the FHMDS and FHMDS-Naive algorithms (Dawar et al. 2017)
- algorithm for mining frequent skyline utility patterns in a transaction database
- the SFUPMinerUemax algorithms (Lin et al, 2016)
- algorithm for mining quantitative high utility itemsets in a transaction database:
- the VHUQI algorithm (Wu et al., 2014)
- algorithm for mining high-utility sequential rules in a sequence database
- the HUSRM algorithm (Zida et al., 2015)
- algorithm for mining high-utility sequential patterns in a sequence database
- the USPAN algorithm (Yin et al. 2012)
- algorithm for mining high-utility probability sequential patterns in a sequence database
- the PHUSPM algorithm (Zhang et al. 2018)
- the UHUSPM algorithm (Zhang et al. 2018)
- algorithm for mining high-utility itemsets in a transaction database using evolutionary algorithms
- the HUIM-GA algorithm (Kannimuthu et al., 2014)
- the HUIM-BPSO algorithm (Lin et al, 2016)
- the HUIM-GA-tree algorithm (Lin et al, 2016)
- the HUIM-BPSO-tree algorithm (Lin et al, 2016)
- the HUIF-PSO algorithm (Song et al., 2018)
- the HUIF-GA algorithm (Song et al., 2018)
- the HUIF-BA algorithm (Song et al., 2018)
- algorithm for mining high average-utility itemsets in a transaction database
- the HAUI-Miner algorithm for mining high average-utility itemsets (Lin et al, 2016)
- the EHAUPM algorithm for mining high average-utility itemsets (Lin et al, 2017)
- the HAUI-MMAU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2016)
- the MEMU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2018)
- algorithms for mining high utility episodes in a sequence of complex events (a transaction database)
- the TUP algorithm (Rathore et al., 2016) for mining frequent periodic patterns in a sequence of transactions (a transaction database))
- the UP-SPAN algorithm (Wu et al., 2013 ) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
- algorithms for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
- the PHM algorithm (Fournier-Viger et al, 2016b, powerpoint)
- algorithms for discovering irregular high utility itemsets (non periodic patterns) in a transaction database with utility information
- the PHM_irregular algorithm, which is a simple variation of the PHM algorithm
- algorithm for discovering local high utility itemsets in a database with utility information and timestamps
- the LHUI-Miner algorithm (Fournier-Viger et al., 2019)
- algorithm for discovering peak high utility itemsets in a database with utility information and timestamps
- the PHUI-Miner algorithm (Fournier-Viger et al., 2019)
Association Rule Mining
These algorithms discover interesting associations between symbols (values) in a transaction database (database records with binary attributes).
- an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
- an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
- an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
- an algorithm for mining perfectly sporadic association rules (Koh & Roundtree, 2005)
- an algorithm for mining closed association rules (Szathmary et al. 2006).
- an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
- the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et 2006)
- the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
- the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b, powerpoint)
- the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules. This latter is described in Fournier-Viger, 2012b, powerpoint)
- the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d, powerpoint)
Stream pattern mining
These algorithms discovers various kinds of patterns in a stream (an infinite sequence of database records (transactions))
- the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
- the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
- the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
- algorithms for mining the top-k high utility itemsets from a data stream with a window
- the FHMDS and FHMDS-Naive algorithms (Dawar et al. 2017)
Clustering
These algorithms automatically find clusters in different kinds of data
- the original K-Means algorithm (MacQueen, 1967)
- the Bisecting K-Means algorithm (Steinbach et al, 2000)
- algorithms for density-based clustering
- the DBScan algorithm (Ester et al., 1996)
- the Optics algorithm to extract a cluster ordering of points, which can then be use to generate DBScan style clusters and more (Ankerst et al, 1999)
- a hierarchical clustering algorithm
- a tool called Cluster Viewer for visualizing clusters
- a tool called Instance Viewer for visualizing the input of clustering algorithms
Time series mining
These algorithms perform various tasks to analyze time series data
- an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series. Note that if one converts a set of time series with SAX, he will obtain a sequence database, which allows to then apply traditional algorihtms for sequential rule mining and sequential pattern mining on time series (SAX, 2007).
- algorithms for calculating the prior moving average of a time series (to remove noise)
- algorithms for calculating the cumulative moving average f a time series (to remove noise)
- algorithms for calculating the central moving average of a time series (to remove noise)
- an algorithm for calculating the median smoothing of a time series (to remove noise)
- an algorithm for calculating the exponential smoothing of a time series (to remove noise)
- an algorithm for calculating the min max normalization of a time series
- an algorithm for calculating the autocorrelation function of a time series
- an algorithm for calculating the standardization of a time series
- an algorithm for calculating the first and second order differencing of a time series
- an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
- an algorithm for calculating the linear regression of a time series (using the least squares method)
- an algorithm for splitting a time series into segments of a given length
- an algorithm for splitting a time series into a given number of segments
- algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
- a tool called Time Series Viewer for visualizing time series