lesson02_Action01_market_apriori
Mining association rules from the data
!pip install efficient_apriori
!pip install fptools
Requirement already satisfied: efficient_apriori in c:\users\aoc\anaconda3\lib\site-packages (1.1.1)
Collecting fptools
Downloading fptools-1.0-py2.py3-none-any.whl (5.2 kB)
Installing collected packages: fptools
Successfully installed fptools-1.0
import pandas as pd
import numpy as np
from efficient_apriori import apriori as EA
from mlxtend.frequent_patterns import apriori, association_rules
Load the data:
df = pd.read_csv('./MarketBasket/Market_Basket_Optimisation.csv',header=None)
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7496 | butter | light mayo | fresh bread | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7497 | burgers | frozen vegetables | eggs | french fries | magazines | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7498 | chicken | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7499 | escalope | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7500 | eggs | frozen smoothie | yogurt cake | low fat yogurt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7501 rows × 20 columns
Association analysis with efficient_apriori:
Store the data in transactions, a list holding each order's items
# transactions = []
# for i in range(0, df.shape[0]):
#     temp = set()
#     for j in range(0, df.shape[1]):
#         item = str(df.values[i, j])
#         if item != 'nan':
#             temp.add(item)
#     transactions.append(temp)
Deduplicate the items in each order, then run the analysis
# Test: drop the NaNs from the second row and convert it to a list
df.loc[1].dropna().to_list()
['burgers', 'meatballs', 'eggs']
# Equivalent to the commented-out for loop above
# Holds the items of each order
transactions = []
# Iterate over every row
for i in range(0, df.shape[0]):
    # One order per row; the outer set() deduplicates items within the order
    temp = set(df.loc[i].dropna().to_list())
    transactions.append(temp)
transactions
Compute the frequent itemsets and association rules
%%time
itemsets, rules = EA(transactions, min_support=0.04, min_confidence=0.2)
Wall time: 304 ms
print("频繁项集:\n",itemsets)
print("关联规则:\n",rules)
Frequent itemsets:
{1: {('olive oil',): 494, ('honey',): 356, ('salmon',): 319, ('shrimp',): 536, ('frozen smoothie',): 475, ('mineral water',): 1788, ('low fat yogurt',): 574, ('green tea',): 991, ('eggs',): 1348, ('burgers',): 654, ('turkey',): 469, ('milk',): 972, ('whole wheat rice',): 439, ('french fries',): 1282, ('soup',): 379, ('spaghetti',): 1306, ('frozen vegetables',): 715, ('cookies',): 603, ('cooking oil',): 383, ('champagne',): 351, ('chicken',): 450, ('chocolate',): 1229, ('tomatoes',): 513, ('pancakes',): 713, ('grated cheese',): 393, ('fresh bread',): 323, ('escalope',): 595, ('ground beef',): 737, ('herb & pepper',): 371, ('cake',): 608}, 2: {('milk', 'mineral water'): 360, ('eggs', 'mineral water'): 382, ('mineral water', 'spaghetti'): 448, ('ground beef', 'mineral water'): 307, ('chocolate', 'mineral water'): 395}}
Association rules:
[{mineral water} -> {milk}, {milk} -> {mineral water}, {mineral water} -> {eggs}, {eggs} -> {mineral water}, {spaghetti} -> {mineral water}, {mineral water} -> {spaghetti}, {ground beef} -> {mineral water}, {mineral water} -> {chocolate}, {chocolate} -> {mineral water}]
# There are 9 association rules
len(rules)
9
Deduplication appears to leave the results unchanged; this is verified again below on a small test dataset.
# This order contains a duplicate item
df.loc[4494].dropna()
0 ham
1 eggs
2 honey
3 gums
4 light cream
5 ham
Name: 4494, dtype: object
# This order contains a duplicate item
df.loc[4394].dropna()
0 burgers
1 ham
2 eggs
3 whole wheat rice
4 ham
5 french fries
6 cookies
7 green tea
Name: 4394, dtype: object
Test whether duplicate items within an order affect efficient_apriori's results:
transactions_ = [('eggs', 'bacon', 'soup', 'soup'),
('eggs', 'bacon', 'apple', 'bacon'),
('soup', 'bacon', 'banana')]
itemsets, rules = EA(transactions_,min_support=0.5,min_confidence=1)
print("频繁项集:\n",itemsets)
print("关联规则:\n",rules)
Frequent itemsets:
{1: {('eggs',): 2, ('soup',): 2, ('bacon',): 3}, 2: {('bacon', 'eggs'): 2, ('bacon', 'soup'): 2}}
Association rules:
[{eggs} -> {bacon}, {soup} -> {bacon}]
Summary: this example shows that duplicate items within the same tuple (i.e., the same order) do not affect the frequent-itemset counts, because support counts the number of orders in which an item appears, not the number of times it appears within a single order.
Association analysis with mlxtend
Two different ways of preparing the df are tried
Method 1: merge the column fields of each row, then one-hot encode
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
1 | burgers | meatballs | eggs | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | chutney | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | turkey | avocado | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | mineral water | milk | energy bar | whole wheat rice | green tea | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Create an empty DataFrame to hold the concatenated strings:
df_new = pd.DataFrame(columns=['items'])
df_new
| items |
|---|
Concatenate the items in each row of df into a single string and store it in the new DataFrame (a faster vectorized alternative is sketched a few cells below):
%%time
for i in range(df.shape[0]):
    df_new.loc[i] = df.loc[i].str.cat(sep='/')
Wall time: 14.9 s
df_new
| | items |
|---|---|
0 | shrimp/almonds/avocado/vegetables mix/green gr... |
1 | burgers/meatballs/eggs |
2 | chutney |
3 | turkey/avocado |
4 | mineral water/milk/energy bar/whole wheat rice... |
... | ... |
7496 | butter/light mayo/fresh bread |
7497 | burgers/frozen vegetables/eggs/french fries/ma... |
7498 | chicken |
7499 | escalope/green tea |
7500 | eggs/frozen smoothie/yogurt cake/low fat yogurt |
7501 rows × 1 columns
type(df_new.loc[0, 'items'])
str
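As an aside, the row-by-row df_new.loc[i] = … assignment is what makes the loop above take ~15 s; building the whole column in one pass should be considerably faster. A minimal sketch of that alternative, assuming the same df (untimed here, so the speedup is an expectation rather than a measurement):

```python
# Join each row's non-null items with '/' in a single pass,
# then wrap the resulting Series in a one-column DataFrame.
items = df.apply(lambda row: '/'.join(row.dropna().astype(str)), axis=1)
df_new = pd.DataFrame({'items': items})
```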
One-hot encode the new DataFrame:
one_hot_df = df_new['items'].str.get_dummies(sep="/")
one_hot_df
| | asparagus | almonds | antioxydant juice | asparagus | avocado | babies food | bacon | barbecue sauce | black tea | blueberries | ... | turkey | vegetables mix | water spray | white wine | whole weat flour | whole wheat pasta | whole wheat rice | yams | yogurt cake | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7496 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7498 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7499 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
7501 rows × 120 columns
# The column maxima sum to 120, i.e., every column's maximum is 1, as expected for one-hot encoding
one_hot_df.max(axis=0).sum()
120
Compute the frequent itemsets from the one-hot encoded DataFrame. For reference, mlxtend's apriori signature:
apriori(
df,
min_support=0.5,
use_colnames=False,
max_len=None,
verbose=0,
low_memory=False,
)
%%time
itemsets = apriori(one_hot_df,min_support=0.04,use_colnames=True)
Wall time: 74.8 ms
itemsets.sort_values(by=['support'], ascending=False)
| | support | itemsets |
|---|---|---|
20 | 0.238368 | (mineral water) |
7 | 0.179709 | (eggs) |
26 | 0.174110 | (spaghetti) |
9 | 0.170911 | (french fries) |
4 | 0.163845 | (chocolate) |
14 | 0.132116 | (green tea) |
19 | 0.129583 | (milk) |
15 | 0.098254 | (ground beef) |
12 | 0.095321 | (frozen vegetables) |
22 | 0.095054 | (pancakes) |
0 | 0.087188 | (burgers) |
1 | 0.081056 | (cake) |
5 | 0.080389 | (cookies) |
8 | 0.079323 | (escalope) |
18 | 0.076523 | (low fat yogurt) |
24 | 0.071457 | (shrimp) |
27 | 0.068391 | (tomatoes) |
21 | 0.065858 | (olive oil) |
11 | 0.063325 | (frozen smoothie) |
28 | 0.062525 | (turkey) |
3 | 0.059992 | (chicken) |
34 | 0.059725 | (mineral water, spaghetti) |
29 | 0.058526 | (whole wheat rice) |
30 | 0.052660 | (mineral water, chocolate) |
13 | 0.052393 | (grated cheese) |
6 | 0.051060 | (cooking oil) |
31 | 0.050927 | (eggs, mineral water) |
25 | 0.050527 | (soup) |
16 | 0.049460 | (herb & pepper) |
33 | 0.047994 | (mineral water, milk) |
17 | 0.047460 | (honey) |
2 | 0.046794 | (champagne) |
10 | 0.043061 | (fresh bread) |
23 | 0.042528 | (salmon) |
32 | 0.040928 | (mineral water, ground beef) |
Extract the association rules that satisfy the thresholds from the frequent itemsets.
Use mlxtend's association_rules for the analysis; its signature, for reference (the meaning of each returned column is shown in the output below):
association_rules(
df,
metric='confidence',
min_threshold=0.8,
support_only=False,
)
%%time
rules = association_rules(itemsets,metric='lift',min_threshold=1)
Wall time: 4.99 ms
# Sort the rules by lift in descending order
rules.sort_values(by=['lift'],ascending=False)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
4 | (mineral water) | (ground beef) | 0.238368 | 0.098254 | 0.040928 | 0.171700 | 1.747522 | 0.017507 | 1.088672 |
5 | (ground beef) | (mineral water) | 0.098254 | 0.238368 | 0.040928 | 0.416554 | 1.747522 | 0.017507 | 1.305401 |
6 | (mineral water) | (milk) | 0.238368 | 0.129583 | 0.047994 | 0.201342 | 1.553774 | 0.017105 | 1.089850 |
7 | (milk) | (mineral water) | 0.129583 | 0.238368 | 0.047994 | 0.370370 | 1.553774 | 0.017105 | 1.209650 |
8 | (mineral water) | (spaghetti) | 0.238368 | 0.174110 | 0.059725 | 0.250559 | 1.439085 | 0.018223 | 1.102008 |
9 | (spaghetti) | (mineral water) | 0.174110 | 0.238368 | 0.059725 | 0.343032 | 1.439085 | 0.018223 | 1.159314 |
1 | (chocolate) | (mineral water) | 0.163845 | 0.238368 | 0.052660 | 0.321400 | 1.348332 | 0.013604 | 1.122357 |
0 | (mineral water) | (chocolate) | 0.238368 | 0.163845 | 0.052660 | 0.220917 | 1.348332 | 0.013604 | 1.073256 |
2 | (eggs) | (mineral water) | 0.179709 | 0.238368 | 0.050927 | 0.283383 | 1.188845 | 0.008090 | 1.062815 |
3 | (mineral water) | (eggs) | 0.238368 | 0.179709 | 0.050927 | 0.213647 | 1.188845 | 0.008090 | 1.043158 |
'mineral water' in one_hot_df.columns
True
Method 2: read each line of the CSV in as a single field
This approach feels better suited to the apriori implementation in the mlxtend package.
Load the data:
df = pd.read_csv('./MarketBasket/Market_Basket_Optimisation.csv', sep='\t',header=None)
df
| | 0 |
|---|---|
0 | shrimp,almonds,avocado,vegetables mix,green gr... |
1 | burgers,meatballs,eggs |
2 | chutney |
3 | turkey,avocado |
4 | mineral water,milk,energy bar,whole wheat rice... |
... | ... |
7496 | butter,light mayo,fresh bread |
7497 | burgers,frozen vegetables,eggs,french fries,ma... |
7498 | chicken |
7499 | escalope,green tea |
7500 | eggs,frozen smoothie,yogurt cake,low fat yogurt |
7501 rows × 1 columns
One-hot encode df:
%%time
df_one_hot = df[0].str.get_dummies(sep=',')
Wall time: 695 ms
df_one_hot
| | asparagus | almonds | antioxydant juice | asparagus | avocado | babies food | bacon | barbecue sauce | black tea | blueberries | ... | turkey | vegetables mix | water spray | white wine | whole weat flour | whole wheat pasta | whole wheat rice | yams | yogurt cake | zucchini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7496 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7498 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7499 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
7501 rows × 120 columns
# Check that the maximum of every column is 1
df_one_hot.max().sum()
120
Compute the frequent itemsets:
%%time
frequence_items = apriori(df_one_hot,min_support=0.04,use_colnames=True)
Wall time: 61.8 ms
%%time
frequence_items.sort_values(by=['support'],ascending=False)
Wall time: 997 µs
| | support | itemsets |
|---|---|---|
20 | 0.238368 | (mineral water) |
7 | 0.179709 | (eggs) |
26 | 0.174110 | (spaghetti) |
9 | 0.170911 | (french fries) |
4 | 0.163845 | (chocolate) |
14 | 0.132116 | (green tea) |
19 | 0.129583 | (milk) |
15 | 0.098254 | (ground beef) |
12 | 0.095321 | (frozen vegetables) |
22 | 0.095054 | (pancakes) |
0 | 0.087188 | (burgers) |
1 | 0.081056 | (cake) |
5 | 0.080389 | (cookies) |
8 | 0.079323 | (escalope) |
18 | 0.076523 | (low fat yogurt) |
24 | 0.071457 | (shrimp) |
27 | 0.068391 | (tomatoes) |
21 | 0.065858 | (olive oil) |
11 | 0.063325 | (frozen smoothie) |
28 | 0.062525 | (turkey) |
3 | 0.059992 | (chicken) |
34 | 0.059725 | (mineral water, spaghetti) |
29 | 0.058526 | (whole wheat rice) |
30 | 0.052660 | (mineral water, chocolate) |
13 | 0.052393 | (grated cheese) |
6 | 0.051060 | (cooking oil) |
31 | 0.050927 | (eggs, mineral water) |
25 | 0.050527 | (soup) |
16 | 0.049460 | (herb & pepper) |
33 | 0.047994 | (mineral water, milk) |
17 | 0.047460 | (honey) |
2 | 0.046794 | (champagne) |
10 | 0.043061 | (fresh bread) |
23 | 0.042528 | (salmon) |
32 | 0.040928 | (mineral water, ground beef) |
Derive the association rules from the frequent itemsets:
%%time
rules = association_rules(frequence_items,metric='lift',min_threshold=1)
rules.sort_values(by='lift',ascending=False)
Wall time: 3.99 ms
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
4 | (mineral water) | (ground beef) | 0.238368 | 0.098254 | 0.040928 | 0.171700 | 1.747522 | 0.017507 | 1.088672 |
5 | (ground beef) | (mineral water) | 0.098254 | 0.238368 | 0.040928 | 0.416554 | 1.747522 | 0.017507 | 1.305401 |
6 | (mineral water) | (milk) | 0.238368 | 0.129583 | 0.047994 | 0.201342 | 1.553774 | 0.017105 | 1.089850 |
7 | (milk) | (mineral water) | 0.129583 | 0.238368 | 0.047994 | 0.370370 | 1.553774 | 0.017105 | 1.209650 |
8 | (mineral water) | (spaghetti) | 0.238368 | 0.174110 | 0.059725 | 0.250559 | 1.439085 | 0.018223 | 1.102008 |
9 | (spaghetti) | (mineral water) | 0.174110 | 0.238368 | 0.059725 | 0.343032 | 1.439085 | 0.018223 | 1.159314 |
1 | (chocolate) | (mineral water) | 0.163845 | 0.238368 | 0.052660 | 0.321400 | 1.348332 | 0.013604 | 1.122357 |
0 | (mineral water) | (chocolate) | 0.238368 | 0.163845 | 0.052660 | 0.220917 | 1.348332 | 0.013604 | 1.073256 |
2 | (eggs) | (mineral water) | 0.179709 | 0.238368 | 0.050927 | 0.283383 | 1.188845 | 0.008090 | 1.062815 |
3 | (mineral water) | (eggs) | 0.238368 | 0.179709 | 0.050927 | 0.213647 | 1.188845 | 0.008090 | 1.043158 |
Comparing the apriori implementations of efficient_apriori and mlxtend:
Beyond efficient_apriori being fast with a lean return value and mlxtend being slower with a richer one:
1. efficient_apriori returns frequent itemsets and association rules from a single call, and rules can only be filtered by minimum confidence;
2. mlxtend provides two separate functions, one to compute frequent itemsets and one to compute association rules, and the latter supports eight metrics (minimum confidence, minimum support, lift, etc.) for selecting rules;
3. mlxtend's results are clearer and more intuitive;
4. with efficient_apriori, each rule's support and other metrics are hidden inside the Rule object and only surface when the Rule is printed or its attributes are accessed, which is less direct than mlxtend's DataFrame output (see the sketch below).
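To surface those hidden metrics, each Rule's attributes can be read directly. A minimal sketch, assuming the rules list returned earlier by EA(transactions, ...) (the name rules was later reused for mlxtend's DataFrame, so that cell would need re-running first) and efficient_apriori's documented Rule attributes lhs, rhs, support, confidence, and lift:

```python
# Sort efficient_apriori Rule objects by lift and print their metrics.
# lhs/rhs are tuples of items; support/confidence/lift are floats.
for rule in sorted(rules, key=lambda r: r.lift, reverse=True):
    print(f"{rule.lhs} -> {rule.rhs}: "
          f"support={rule.support:.3f}, "
          f"confidence={rule.confidence:.3f}, "
          f"lift={rule.lift:.2f}")
```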
Trying the FP-Growth algorithm
The fptools package
import fptools as fp
transactions
%%time
tree = fp.build_tree(transactions,minsup=300)
Wall time: 39.9 ms
tree
(<fptools.FPTree at 0x2114bfede50>,
{'salmon': 0,
'fresh bread': 1,
'champagne': 2,
'honey': 3,
'herb & pepper': 4,
'soup': 5,
'cooking oil': 6,
'grated cheese': 7,
'whole wheat rice': 8,
'chicken': 9,
'turkey': 10,
'frozen smoothie': 11,
'olive oil': 12,
'tomatoes': 13,
'shrimp': 14,
'low fat yogurt': 15,
'escalope': 16,
'cookies': 17,
'cake': 18,
'burgers': 19,
'pancakes': 20,
'frozen vegetables': 21,
'ground beef': 22,
'milk': 23,
'green tea': 24,
'chocolate': 25,
'french fries': 26,
'spaghetti': 27,
'eggs': 28,
'mineral water': 29})
tree[0].nodes
len(tree[0].rank)
30
# Header table (item ranks)
tree[0].rank
{'salmon': 0,
'fresh bread': 1,
'champagne': 2,
'honey': 3,
'herb & pepper': 4,
'soup': 5,
'cooking oil': 6,
'grated cheese': 7,
'whole wheat rice': 8,
'chicken': 9,
'turkey': 10,
'frozen smoothie': 11,
'olive oil': 12,
'tomatoes': 13,
'shrimp': 14,
'low fat yogurt': 15,
'escalope': 16,
'cookies': 17,
'cake': 18,
'burgers': 19,
'pancakes': 20,
'frozen vegetables': 21,
'ground beef': 22,
'milk': 23,
'green tea': 24,
'chocolate': 25,
'french fries': 26,
'spaghetti': 27,
'eggs': 28,
'mineral water': 29}
mineral_water_node = tree[0].nodes['mineral water'][0]
mineral_water_node
<fptools.FPNode at 0x2114c00f8b0>
# Child nodes
mineral_water_node.children
defaultdict(fptools.FPNode,
{'green tea': <fptools.FPNode at 0x2114bfdff40>,
'eggs': <fptools.FPNode at 0x2114bb133d0>,
'salmon': <fptools.FPNode at 0x2114c5611f0>,
'spaghetti': <fptools.FPNode at 0x2114c561850>,
'ground beef': <fptools.FPNode at 0x2114c5641f0>,
'cake': <fptools.FPNode at 0x2114c564250>,
'chicken': <fptools.FPNode at 0x2114c564670>,
'chocolate': <fptools.FPNode at 0x2114c5671f0>,
'french fries': <fptools.FPNode at 0x2114c5675b0>,
'olive oil': <fptools.FPNode at 0x2114c56d130>,
'frozen vegetables': <fptools.FPNode at 0x2114c562130>,
'turkey': <fptools.FPNode at 0x2114c562a90>,
'shrimp': <fptools.FPNode at 0x2114c562e50>,
'fresh bread': <fptools.FPNode at 0x2114c57d970>,
'frozen smoothie': <fptools.FPNode at 0x2114c101790>,
'honey': <fptools.FPNode at 0x2114c106550>,
'cookies': <fptools.FPNode at 0x2114c106c10>,
'tomatoes': <fptools.FPNode at 0x2114c106e50>,
'soup': <fptools.FPNode at 0x2114c10a6d0>,
'grated cheese': <fptools.FPNode at 0x2114c10ab50>,
'milk': <fptools.FPNode at 0x2114c10da00>,
'cooking oil': <fptools.FPNode at 0x2114c118850>,
'low fat yogurt': <fptools.FPNode at 0x2114c120070>,
'escalope': <fptools.FPNode at 0x2114c125670>,
'pancakes': <fptools.FPNode at 0x2114c12edf0>,
'burgers': <fptools.FPNode at 0x2114c308550>,
'whole wheat rice': <fptools.FPNode at 0x2114c31a0d0>,
'herb & pepper': <fptools.FPNode at 0x2114c342970>,
'champagne': <fptools.FPNode at 0x2114c3fb130>})
mineral_water_node.count
1788
mineral_water_node.item
'mineral water'
green_tea_node = mineral_water_node.children['green tea']
green_tea_node
<fptools.FPNode at 0x2114bfdff40>
green_tea_node.count
78
green_tea_node.parent.item
'mineral water'
%%time
items = [i for i in fp.fpgrowth(tree[0], 3000)]
Wall time: 18 ms
print(len(items))
items
30
[['mineral water'],
['green tea'],
['low fat yogurt'],
['shrimp'],
['olive oil'],
['frozen smoothie'],
['honey'],
['salmon'],
['eggs'],
['burgers'],
['turkey'],
['milk'],
['whole wheat rice'],
['french fries'],
['soup'],
['spaghetti'],
['frozen vegetables'],
['cookies'],
['cooking oil'],
['champagne'],
['chocolate'],
['chicken'],
['tomatoes'],
['pancakes'],
['grated cheese'],
['fresh bread'],
['ground beef'],
['escalope'],
['herb & pepper'],
['cake']]
# Frequent itemsets by minimum support count; returns a generator
generate = fp.frequent_itemsets(transactions,minsup=300)
generate
<generator object frequent_itemsets at 0x000002114C1CD9E0>
itemsets = [s for s in generate]
print(len(itemsets))
itemsets
35
[['mineral water'],
['green tea'],
['low fat yogurt'],
['shrimp'],
['olive oil'],
['frozen smoothie'],
['honey'],
['salmon'],
['eggs'],
['eggs', 'mineral water'],
['burgers'],
['turkey'],
['milk'],
['milk', 'mineral water'],
['whole wheat rice'],
['french fries'],
['soup'],
['spaghetti'],
['spaghetti', 'mineral water'],
['frozen vegetables'],
['cookies'],
['cooking oil'],
['champagne'],
['chocolate'],
['chocolate', 'mineral water'],
['chicken'],
['tomatoes'],
['pancakes'],
['grated cheese'],
['fresh bread'],
['ground beef'],
['ground beef', 'mineral water'],
['escalope'],
['herb & pepper'],
['cake']]
type(itemsets[0][0])
str
one_hot_df.sum(axis=0).sort_values(ascending=False)[:30]
mineral water 1788
eggs 1348
spaghetti 1306
french fries 1282
chocolate 1229
green tea 991
milk 972
ground beef 737
frozen vegetables 715
pancakes 713
burgers 654
cake 608
cookies 603
escalope 595
low fat yogurt 574
shrimp 536
tomatoes 513
olive oil 494
frozen smoothie 475
turkey 469
chicken 450
whole wheat rice 439
grated cheese 393
cooking oil 383
soup 379
herb & pepper 371
honey 356
champagne 351
fresh bread 323
salmon 319
dtype: int64
Summary: in this FP-Growth implementation I only found methods that mine frequent itemsets by a minimum support threshold. It is indeed fast to compute. Note that its minsup is a minimum occurrence count, unlike the minimum support proportion used by the two previous algorithms. Also, the frequent itemsets it returns carry no counts or support values.
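If support values are wanted alongside the itemsets, mlxtend also ships an FP-Growth implementation with the same interface as its apriori. A minimal sketch, assuming mlxtend.frequent_patterns.fpgrowth is available in the installed mlxtend version:

```python
from mlxtend.frequent_patterns import fpgrowth

# A count threshold of 300 over 7501 orders corresponds to a
# support proportion of 300 / 7501 ≈ 0.04.
fp_items = fpgrowth(df_one_hot, min_support=300 / len(df_one_hot),
                    use_colnames=True)
print(fp_items.sort_values(by='support', ascending=False))
```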
A few Q&As for this section:
Q1: What do support, confidence, and lift represent in association rules, and how are they computed?
A1:
Support:
- A percentage: the number of orders containing a given item combination divided by the total number of orders. The higher the support, the more frequently the combination occurs.
- Support(A) = (orders containing item A) / (total orders)
Confidence:
- A conditional probability: given that the antecedent X occurs, the probability that the rule "X → Y" yields Y as well, i.e., how likely Y is among itemsets containing X. For example, how likely item B is to appear in orders that purchased item A:
- Confidence(A → B) = (orders containing both A and B) / (orders containing A)
Lift:
- The degree to which the appearance of item A raises the probability that item B appears:
- Lift(A → B) = Confidence(A → B) / Support(B)
- Three possible cases (a worked example follows this list):
- Lift(A → B) > 1: B's standalone support is smaller than the confidence of A and B together, so A's presence raises the probability that B appears; the two can be placed together;
- Lift(A → B) = 1: B's standalone support equals the confidence of A and B together, so A's presence has no effect on B; placing them together is indifferent;
- Lift(A → B) < 1: B's standalone support exceeds the confidence of A and B together, so A's presence lowers the probability that B appears; they should not be placed together.
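A small worked example, recomputing the three metrics for the rule (mineral water → ground beef) directly from the transactions list built earlier; the numbers should match the mlxtend table above:

```python
n = len(transactions)                                  # 7501 orders
n_a = sum('mineral water' in t for t in transactions)  # orders with A
n_b = sum('ground beef' in t for t in transactions)    # orders with B
n_ab = sum('mineral water' in t and 'ground beef' in t
           for t in transactions)                      # orders with both

support_b = n_b / n                   # ≈ 0.0983
support_ab = n_ab / n                 # ≈ 0.0409
confidence_ab = n_ab / n_a            # Confidence(A → B) ≈ 0.1717
lift_ab = confidence_ab / support_b   # ≈ 1.7475
print(support_ab, confidence_ab, lift_ab)
```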
Q2: What is the difference between association rules and collaborative filtering?
A2:
Collaborative filtering relies on user preference information, usually expressed as ratings; association rule mining (also called basket analysis) digs valuable correlations between data items out of large volumes of data.
The differences:
- Association rules operate on whole transactions, while collaborative filtering focuses on individual users' preferences (ratings);
- Association rules find item combinations via basket analysis, e.g., the Apriori algorithm, while collaborative filtering computes similarities;
- Association rules make no use of "user preference"; they mine frequent itemsets from purchase orders alone;
- The strategies behind association rules and collaborative filtering are of entirely different types;
- When recommendations should be based only on the current (most recent) purchase or click, association rules fit better;
- When recommendations should be based on a user's historical behavior, building a preference ranking over some period and recommending from it throughout that period, collaborative filtering fits better.
Q3: Why do we need multiple recommendation algorithms (association rules vs. collaborative filtering)?
A3:
- Generally, association rules are classed as dynamic recommendation, whereas collaborative filtering is more often regarded as static recommendation.
- Dynamic recommendation bases its suggestion solely on the current (most recent) purchase or click. For instance, when a user views a beer on the site, the system finds the association rules involving that beer and recommends from them. Static recommendation, by contrast, first analyzes the user, builds a preference ranking valid for some period, and then keeps recommending according to that ranking throughout the period. The strategies behind association rules and collaborative filtering are thus of entirely different types.
- The two algorithms think along different dimensions and suit different scenarios; having multiple recommendation algorithms lets us cover more of them. In many cases we combine the results of several methods into a hybrid recommendation.
Q4: How should the minimum support and minimum confidence in association rule mining be determined?
A4:
- Minimum support and minimum confidence are found experimentally; they are hyperparameters.
- Different datasets call for different minimum support and confidence values; the best approach is iterative testing.
- An empirical reference range for minimum support is 0.01 to 0.5. If you want the Top 20 itemsets, you can list the supports of the top 20 itemsets in descending order as a reference. In general, the smaller the dataset, the larger the minimum support threshold should be, otherwise every itemset may come out frequent; with large datasets the threshold should be smaller, otherwise nothing may be returned. Minimum support relates to item occurrence counts, so I have tried sr.value_counts() or one_hot_df.sum(axis=0) to count and sort frequencies as a reference for the minimum support (see the sketch after this list).
- Minimum confidence: likely a value between 0.5 and 1; start with a smaller value to obtain more association rules, then raise it to a suitable level.
- Lift measures the multiplier the association rule provides, i.e., the ratio of confidence to expected confidence; it should at least exceed 1.
- For a minimum lift I have no quick DataFrame or Series trick to get a reference value. My take: if the goal is just to see whether there is any lift at all, the first run can use a value slightly below 1 to see how many rules appear, because if you set the minimum lift to exactly 1 and get nothing back, you cannot tell whether the program is at fault or all frequent itemsets simply have lift below 1.
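A minimal sketch of that frequency-based heuristic, assuming the one_hot_df built earlier: convert the k-th largest item count into a candidate support threshold.

```python
# Item frequencies, descending
counts = one_hot_df.sum(axis=0).sort_values(ascending=False)

# To keep roughly the top 20 single items as frequent, use the
# 20th count (divided by the number of orders) as min_support.
candidate_min_support = counts.iloc[19] / len(one_hot_df)
print(candidate_min_support)  # ≈ 0.0625 here (turkey: 469 / 7501)
```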
Q5: What are the common regression analysis methods, and what evaluation metrics are used?
A5:
Common regression analysis methods:
- By the number of variables involved: univariate linear regression, y = kx + b, versus multivariate linear regression, y = β₁x₁ + β₂x₂ + … + b;
- By the number of independent variables: simple regression analysis (y = kx + b) versus multiple regression analysis (y = β₁x₁ + β₂x₂ + … + b);
- By the type of relationship between the independent and dependent variables: linear regression analysis versus nonlinear regression analysis (e.g., polynomial regression).
Common regression models implementing the above (a sketch follows the metrics below):
- Ordinary least squares, via LinearRegression();
- Least squares with L2 regularization: ridge regression, Ridge();
- L1-regularized regression: Lasso regression, Lasso();
- ElasticNet regression;
- Stepwise regression;
- There is also a special case, logistic regression: despite the name, it is better suited to classification problems.
Evaluation metrics:
- Loss functions that measure model quality:
- Mean squared error, MSE (mean_squared_error)
- Mean absolute error, MAE (mean_absolute_error)
- The coefficient of determination, R² (r2_score), evaluates how well the model fits the observed data; R² is not differentiable, so it serves as an evaluation metric rather than a training loss.
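A minimal sketch of those models and metrics on synthetic data, assuming scikit-learn is installed (it is not used elsewhere in this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic linear data: y = 1.5*x1 - 2.0*x2 + 0.5*x3 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Fit each model and report the three metrics named above
for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.01), ElasticNet(alpha=0.01)):
    y_pred = model.fit(X, y).predict(X)
    print(type(model).__name__,
          f"MSE={mean_squared_error(y, y_pred):.4f}",
          f"MAE={mean_absolute_error(y, y_pred):.4f}",
          f"R2={r2_score(y, y_pred):.4f}")
```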