Learning R: Beginner's Notes 06 - EleanorInHarbin

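----Difficulties with traditional regression models----
#Why must the model be linear, or some particular nonlinear form?
#They rely too heavily on the analyst's experience
#Non-continuous (discrete) data are hard to handle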

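----The grid method----

#The Science paper "Detecting Novel Associations in Large Data Sets"
#Outline of the method: use grids to judge how concentrated the data are; the degree of concentration indicates whether an association exists
#The method is general: it works however the data are distributed and is not limited to a specific type of association function
#The method is equitable: the score it computes depends on the noise level, not on the type of relationship
#MIC: the Maximal Information Coefficient
#MINE: Maximal Information-based Nonparametric Exploration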

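----Computing MIC----
#The coordinate plane is partitioned into an x-by-y grid G (the cells need not be equally wide), where xy<n^0.6 and n is the number of sample points
#G induces a "natural probability density function" p(x,y): its value in any box is the fraction of all sample points that fall inside that box
#Compute the mutual information I_G under the grid partition G

#Build the characteristic matrix {m_xy} with entries m_xy=max{I_G}/log min{x,y}, where the max runs over all possible x-by-y grids G
#MIC=max{m_xy}, where the max runs over all admissible (x,y) pairs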
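
#A minimal sketch of the quantities above for one fixed grid, not the authors' MINE program. The helper name grid_mi, the toy data and the choice of grid lines are all assumptions made for illustration; a real MIC computation would also maximize over every grid of each shape.

```r
# Mutual information I_G (in nats) of the "natural density" induced by a grid.
# x_breaks / y_breaks are the grid lines; they need not be equally spaced.
grid_mi <- function(x, y, x_breaks, y_breaks) {
  bx <- cut(x, breaks = x_breaks, include.lowest = TRUE)
  by <- cut(y, breaks = y_breaks, include.lowest = TRUE)
  p  <- unclass(table(bx, by)) / length(x)  # share of points in each box
  expected <- outer(rowSums(p), colSums(p)) # product of the two marginals
  nz <- p > 0                               # 0 * log(0) is taken as 0
  sum(p[nz] * log(p[nz] / expected[nz]))
}

# Hypothetical toy data: 20 points lying exactly on the line y = x
x <- (1:20 - 0.5) / 20
y <- x

# One 2-by-2 grid with grid lines at 0.5 on both axes (2 * 2 < 20^0.6)
i_g  <- grid_mi(x, y, c(0, 0.5, 1), c(0, 0.5, 1))
m_22 <- i_g / log(min(2, 2))                # normalized score for this grid
print(c(I_G = i_g, m_22 = m_22))
```

#Here every point falls in one of the two diagonal boxes, so I_G equals log 2 and the normalized score for this grid is 1.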

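#Computing m_xy is the hard part; the authors devised an approximation algorithm to make it efficient
http://www.sciencemag.org/content/suppl/2011/12/14/334.6062.1518.DC1
#The MINE programs for computing MIC (Java and R) and the test data sets can be downloaded from the authors' website
http://www.exploredata.net/Downloads
#Experiments: the WHO data set, the baseball data set…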

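----Properties of MIC----
#If the variable pair x,y follows a noise-free functional relationship, MIC tends to 1 as the sample size grows
#If the variable pair x,y can be traced by a curve given by parametric equations c(t)=[x(t),y(t)], MIC also tends to 1 as the sample size grows
#If the variable pair x,y is statistically independent, MIC tends to 0 as the sample size grows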
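
#To get a feel for the first and third properties, here is a rough sketch that searches only equal-frequency (quantile) grids. It is a crude stand-in for MIC, not the MINE algorithm: because it skips most grids it can only underestimate the true maximum. The function names norm_mi and approx_mic, the sample size and the seed are assumptions.

```r
# Normalized mutual information on an equal-frequency grid with kx * ky boxes.
# Assumes x and y contain no ties, so the quantile breaks are unique.
norm_mi <- function(x, y, kx, ky) {
  qx <- quantile(x, probs = seq(0, 1, length.out = kx + 1))
  qy <- quantile(y, probs = seq(0, 1, length.out = ky + 1))
  bx <- cut(x, breaks = qx, include.lowest = TRUE)
  by <- cut(y, breaks = qy, include.lowest = TRUE)
  p  <- unclass(table(bx, by)) / length(x)
  e  <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / e[nz])) / log(min(kx, ky))
}

# Best score over all grid shapes with kx * ky < n^0.6 (needs n^0.6 >= 4)
approx_mic <- function(x, y) {
  limit <- length(x)^0.6
  best <- 0
  for (kx in 2:floor(limit / 2)) {
    for (ky in 2:floor(limit / 2)) {
      if (kx * ky < limit) best <- max(best, norm_mi(x, y, kx, ky))
    }
  }
  best
}

set.seed(1)
n <- 500
x <- runif(n, -1, 1)
mic_function    <- approx_mic(x, x^2)       # noise-free functional relationship
mic_independent <- approx_mic(x, runif(n))  # y drawn independently of x
print(c(functional = mic_function, independent = mic_independent))
```

#The parabola should score close to 1 and the independent pair close to 0, even though a linear correlation coefficient would be near 0 in both cases.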

#Observations on MIC
#MIC compared with linear regression models

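----Exploring the Spellman gene data set----
#The data set contains data for 6223 genes
#MINE is markedly better at identifying associations than earlier methods: for example, both approaches found HTB1, but MINE also uncovered HSP12, which had not been found before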

----Data mining: association rule mining----
#Example: market basket analysis

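----Terminology----
#Data set being mined: market basket data
#Mining target: association rules
#Association rule: milk=>eggs [support=2%, confidence=60%]
#Support: 2% of all transactions under analysis contain both milk and eggs
#Confidence: 60% of the transactions that contain milk also contain eggs
#Minimum support threshold and minimum confidence threshold: set by the analyst or a domain expert
#Itemset: a set of items (products)
#k-itemset: an itemset made up of k items
#Frequent itemset: an itemset that meets the minimum support; the set of frequent k-itemsets is usually written Lk
#Strong association rule: a rule that meets both the minimum support threshold and the minimum confidence threshold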
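
#The definitions of support and confidence can be checked by hand on a tiny example. The five baskets, the helper names support and confidence, and the two thresholds below are hypothetical and only serve to illustrate the terms.

```r
# Hypothetical basket data: each element is one transaction
baskets <- list(
  c("milk", "eggs", "bread"),
  c("milk", "eggs"),
  c("milk", "bread"),
  c("eggs", "bread"),
  c("milk", "eggs", "butter")
)

# Share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

# confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

supp_rule <- support(c("milk", "eggs"), baskets)
conf_rule <- confidence("milk", "eggs", baskets)

# A rule is "strong" when it clears both (hypothetical) thresholds
is_strong <- supp_rule >= 0.5 && conf_rule >= 0.7
print(c(support = supp_rule, confidence = conf_rule))
```

#Three of the five baskets contain both milk and eggs (support 0.6), and three of the four baskets with milk also contain eggs (confidence 0.75).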

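----Roadmap for association rule mining----
#A two-step process: find all frequent itemsets, then generate strong association rules from them
#Algorithm: Apriori
#Example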

----How the Apriori algorithm works----

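----The steps explained----
#Scan the transaction database D and count every item to build the candidate 1-itemsets C1
#Set the minimum support count to 2 and derive the frequent 1-itemsets L1 from C1
#Join L1 with itself (L1xL1) to generate the candidate 2-itemsets C2
#Scan D, count every itemset in C2, and derive the frequent 2-itemsets L2
#Compute L2xL2 and apply the Apriori property (every subset of a frequent itemset must itself be frequent) to delete the candidates that have an infrequent subset, which gives C3; counting the support of C3 then gives L3
#So the Apriori algorithm is a loop that repeats two steps: join and prune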
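
#The join, prune and count loop can be sketched in base R as follows. The nine-transaction database D is an assumption, chosen so that its counts agree with the ones quoted in these notes. The join step is also simplified: it unions any two frequent k-itemsets that yield a (k+1)-itemset instead of using the textbook prefix join, and relies on the prune step to discard the extra candidates.

```r
# Hypothetical database D (assumed, not taken from the original figure)
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
min_count <- 2  # minimum support count, as in the steps above

# Number of transactions that contain every item of `itemset`
count_of <- function(itemset) sum(sapply(D, function(t) all(itemset %in% t)))
# Canonical text label for an itemset, e.g. "I1,I2"
key <- function(itemset) paste(sort(itemset), collapse = ",")

# C1 -> L1: count single items and keep the frequent ones
L  <- list()
Lk <- Filter(function(s) count_of(s) >= min_count,
             as.list(sort(unique(unlist(D)))))
k <- 1
while (length(Lk) > 0) {
  L[[k]] <- Lk
  frequent_keys <- sapply(Lk, key)
  # Join: union pairs of frequent k-itemsets that give a (k+1)-itemset
  cand <- list()
  for (a in Lk) for (b in Lk) {
    u <- sort(union(a, b))
    if (length(u) == k + 1) cand[[key(u)]] <- u
  }
  # Prune: drop candidates that have an infrequent k-subset
  cand <- Filter(function(u) {
    all(sapply(seq_along(u), function(i) key(u[-i]) %in% frequent_keys))
  }, cand)
  # Scan D to count the surviving candidates and keep the frequent ones
  Lk <- unname(Filter(function(u) count_of(u) >= min_count, cand))
  k <- k + 1
}

for (i in seq_along(L)) {
  cat(sprintf("L%d:", i), paste0("{", sapply(L[[i]], key), "}"), "\n")
}
```

#On this data the loop stops after L3, because the only candidate 4-itemset {I1,I2,I3,I5} contains the infrequent subset {I2,I3,I5} and is pruned before any counting.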

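----Extracting association rules from frequent itemsets----
#Example: we have found the frequent itemset {I1,I2,I5}; which rules can be extracted from it?
#I1^I2=>I5: {I1,I2,I5} occurs 2 times and {I1,I2} occurs 4 times, so the confidence is 2/4=50%. The confidence of the other rules is computed in the same way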
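
#The remaining rules from {I1,I2,I5} can be enumerated mechanically. The database D below is an assumption: a nine-transaction example chosen because it reproduces the two counts quoted above ({I1,I2,I5} twice, {I1,I2} four times). The helper name count_of is also hypothetical.

```r
# Hypothetical database D (assumed; matches the counts quoted in the notes)
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
# Number of transactions that contain every item of `itemset`
count_of <- function(itemset) sum(sapply(D, function(t) all(itemset %in% t)))

itemset <- c("I1", "I2", "I5")
rules <- data.frame(lhs = character(), rhs = character(),
                    confidence = numeric(), stringsAsFactors = FALSE)

# Every non-empty proper subset of the itemset can be a left-hand side
for (m in 1:(length(itemset) - 1)) {
  for (lhs in combn(itemset, m, simplify = FALSE)) {
    rhs <- setdiff(itemset, lhs)
    rules <- rbind(rules, data.frame(
      lhs = paste(lhs, collapse = "^"),
      rhs = paste(rhs, collapse = "^"),
      # confidence = count(whole itemset) / count(left-hand side)
      confidence = count_of(itemset) / count_of(lhs),
      stringsAsFactors = FALSE
    ))
  }
}
print(rules)
```

#With a minimum confidence of, say, 70% (a hypothetical threshold), only the three rules whose confidence is 100% would count as strong on this data.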

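----Market basket analysis in R----
#Install the arules package and load it
#Use the built-in Groceries data set
>library(arules) #load the arules package
>data(Groceries) #load the data set
>inspect(Groceries) #view the transactions in the data set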

#Find the frequent itemsets
>frequentsets=eclat(Groceries,parameter=list(support=0.05,maxlen=10))
Eclat

parameter specification:
 tidLists support minlen maxlen            target   ext
    FALSE    0.05      1     10 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 491

create itemset ...
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [28 item(s)] done [0.00s].
creating sparse bit matrix ... [28 row(s), 9835 column(s)] done [0.00s].
writing ... [31 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].

#View the frequent itemsets
>inspect(frequentsets[1:10])
     items                         support
[1]  {whole milk,yogurt}           0.05602440
[2]  {whole milk,rolls/buns}       0.05663447
[3]  {other vegetables,whole milk} 0.07483477
[4]  {whole milk}                  0.25551601
[5]  {other vegetables}            0.19349263
[6]  {rolls/buns}                  0.18393493
[7]  {yogurt}                      0.13950178
[8]  {soda}                        0.17437722
[9]  {root vegetables}             0.10899847
[10] {tropical fruit}              0.10493137

>inspect(sort(frequentsets,by="support")[1:10]) #sort the frequent itemsets by support and view the top 10
     items              support
[1]  {whole milk}       0.25551601
[2]  {other vegetables} 0.19349263
[3]  {rolls/buns}       0.18393493
[4]  {soda}             0.17437722
[5]  {yogurt}           0.13950178
[6]  {bottled water}    0.11052364
[7]  {root vegetables}  0.10899847
[8]  {tropical fruit}   0.10493137
[9]  {shopping bags}    0.09852567
[10] {sausage}          0.09395018

#Extract association rules with the apriori function
#Build the model with apriori
>rules=apriori(Groceries,parameter=list(support=0.01,confidence=0.5))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.5    0.1    1 none FALSE            TRUE       5    0.01      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

#List the association rules
>summary(rules) #view a summary of the rules that were found
set of 15 rules

rule length distribution (lhs + rhs):sizes
 3
15

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      3       3       3       3       3       3

summary of quality measures:
    support          confidence          lift
 Min.   :0.01007   Min.   :0.5000   Min.   :1.984
 1st Qu.:0.01174   1st Qu.:0.5151   1st Qu.:2.036
 Median :0.01230   Median :0.5245   Median :2.203
 Mean   :0.01316   Mean   :0.5411   Mean   :2.299
 3rd Qu.:0.01403   3rd Qu.:0.5718   3rd Qu.:2.432
 Max.   :0.02227   Max.   :0.5862   Max.   :3.030

mining info:
      data ntransactions support confidence
 Groceries          9835    0.01        0.5

>inspect(rules)
     lhs                                      rhs                support    confidence lift
[1]  {curd,yogurt}                         => {whole milk}       0.01006609 0.5823529  2.279125
[2]  {other vegetables,butter}             => {whole milk}       0.01148958 0.5736041  2.244885
[3]  {other vegetables,domestic eggs}      => {whole milk}       0.01230300 0.5525114  2.162336
[4]  {yogurt,whipped/sour cream}           => {whole milk}       0.01087951 0.5245098  2.052747
[5]  {other vegetables,whipped/sour cream} => {whole milk}       0.01464159 0.5070423  1.984385
[6]  {pip fruit,other vegetables}          => {whole milk}       0.01352313 0.5175097  2.025351
[7]  {citrus fruit,root vegetables}        => {other vegetables} 0.01037112 0.5862069  3.029608
[8]  {tropical fruit,root vegetables}      => {other vegetables} 0.01230300 0.5845411  3.020999
[9]  {tropical fruit,root vegetables}      => {whole milk}       0.01199797 0.5700483  2.230969
[10] {tropical fruit,yogurt}               => {whole milk}       0.01514997 0.5173611  2.024770
[11] {root vegetables,yogurt}              => {other vegetables} 0.01291307 0.5000000  2.584078
[12] {root vegetables,yogurt}              => {whole milk}       0.01453991 0.5629921  2.203354
[13] {root vegetables,rolls/buns}          => {other vegetables} 0.01220132 0.5020921  2.594890
[14] {root vegetables,rolls/buns}          => {whole milk}       0.01270971 0.5230126  2.046888
[15] {other vegetables,yogurt}             => {whole milk}       0.02226741 0.5128806  2.007235

#Filter the association rules as needed
>x=subset(rules,subset=rhs%in%"whole milk"&lift>=1.2) #keep only the subset of rules we want
>inspect(sort(x,by="support")[1:5]) #sort that subset by support and view the top 5
     lhs                                      rhs          support    confidence lift
[1]  {other vegetables,yogurt}             => {whole milk} 0.02226741 0.5128806  2.007235
[2]  {tropical fruit,yogurt}               => {whole milk} 0.01514997 0.5173611  2.024770
[3]  {other vegetables,whipped/sour cream} => {whole milk} 0.01464159 0.5070423  1.984385
[4]  {root vegetables,yogurt}              => {whole milk} 0.01453991 0.5629921  2.203354
[5]  {pip fruit,other vegetables}          => {whole milk} 0.01352313 0.5175097  2.025351

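#Here lift=P(L,R)/(P(L)P(R)) is a measure loosely analogous to a correlation coefficient. lift=1 means L and R are independent; a value below 1 means they appear together less often than independence would predict.
#The larger the value, the stronger the evidence that L and R landing in the same basket is not a coincidence.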
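
#As a quick arithmetic check, the lift of one rule can be recomputed from numbers printed earlier in this transcript. The variable names are assumptions; the three input values are copied from the output for {other vegetables,yogurt} => {whole milk} and for the frequent itemset {whole milk}.

```r
# Values copied from the output above
supp_rule <- 0.02226741  # P(L,R): support of the rule
conf_rule <- 0.5128806   # P(L,R) / P(L): confidence of the rule
supp_rhs  <- 0.25551601  # P(R): support of {whole milk}

# Recover P(L) from the confidence, then apply the definition of lift
supp_lhs <- supp_rule / conf_rule
lift     <- supp_rule / (supp_lhs * supp_rhs)
print(round(lift, 4))
```

#The result agrees with the lift column for that rule (about 2.007), which shows that lift is the same thing as confidence divided by the support of the right-hand side.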

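----Applications of market basket analysis----
#Designing shelf layouts in supermarkets
#Cross-selling recommendations on e-commerce sites
#Reading/listening recommendations for websites or programs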

posted on 2018-01-12 10:42 EleanorInHarbin