July 23: Further refinements to hierarchical clustering in R

1. The database-reading step is unchanged

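## Connect to the database and read the data
# Load the package
library(RMySQL)
# Open the connection
conn <- dbConnect(dbDriver("MySQL"), dbname = "eswp", user = "root", password = "root")
# Read the table sixclasscleaned
# Keep only column 2 (the MeSH terms); change the first index to change how many rows are read (594 here)
text = dbReadTable(conn, "sixclasscleaned")[1:594, 2]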

2. When building the term-document matrix, I added the parameter that disables lowercase conversion (tolower = FALSE).

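# Load the tm package
library(tm)
# Build the corpus
corpus = Corpus(VectorSource(text))
# Build the term-document matrix from the corpus, weighted by tf-idf.
# stopwords = stopwords("mesh") uses a custom MeSH stopword list;
# the list lives in the stopwords folder of the tm package and currently contains only the word "aged".
# tolower = FALSE means uppercase is not converted to lowercase (conversion is the default).
tdm = TermDocumentMatrix(corpus, control = list(stopwords = stopwords("mesh"), weighting = weightTfIdf, tolower = FALSE))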
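
To make the tf-idf weighting concrete, here is a small base-R sketch on a made-up term-document count matrix. It assumes the weight is the term frequency normalized by document length, multiplied by log2(N / df); that is my understanding of what weightTfIdf does by default, so treat it as an approximation rather than the exact tm implementation. The matrix `counts` and its terms are invented for illustration.

```r
# Toy term-document count matrix: rows are terms, columns are documents.
# The terms and counts are made up for illustration.
counts <- matrix(
  c(2, 0, 1,
    1, 1, 1,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("thyroid_gland", "Humans", "child"),
                  c("doc1", "doc2", "doc3"))
)

n_docs <- ncol(counts)
# Term frequency, normalized by the number of terms in each document
tf <- sweep(counts, 2, colSums(counts), "/")
# Inverse document frequency: log2(number of docs / docs containing the term)
idf <- log2(n_docs / rowSums(counts > 0))
# tf-idf weight of every term in every document (idf is applied row by row)
tfidf <- tf * idf

print(round(tfidf, 3))
```

A term that occurs in every document ("Humans" in this toy matrix) gets weight 0 everywhere, so it contributes nothing to the distances used later for clustering.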

3. Dimensionality reduction uses the removeSparseTerms function.

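## There are too many terms, so filter them here with removeSparseTerms()
# The threshold needs repeated testing; the right value depends on how sparse the original matrix is
tdm_removed = removeSparseTerms(tdm, 0.99)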
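
The threshold is easier to tune once its meaning is clear. As far as I understand it, removeSparseTerms(tdm, s) keeps a term only if it occurs in more than N * (1 - s) of the N documents. The sketch below mimics that rule in base R on a made-up matrix; the helper `keep_dense_terms` is hypothetical and not part of tm.

```r
# Hypothetical helper (not part of tm): keep the rows (terms) of a
# term-document matrix that appear in more than n_docs * (1 - sparse) documents.
keep_dense_terms <- function(m, sparse) {
  doc_freq <- rowSums(m != 0)  # number of documents containing each term
  m[doc_freq > ncol(m) * (1 - sparse), , drop = FALSE]
}

# Made-up matrix: 3 terms x 10 documents
m <- rbind(
  common = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),  # occurs in 8 documents
  medium = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),  # occurs in 3 documents
  rare   = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)   # occurs in 1 document
)

print(rownames(keep_dense_terms(m, 0.85)))  # cutoff 1.5 documents
print(rownames(keep_dense_terms(m, 0.6)))   # cutoff 4 documents

# With the 594 documents read earlier and sparse = 0.99 the cutoff is about 5.94,
# so under this rule a term must occur in at least 6 documents to survive.
print(594 * (1 - 0.99))
```

Lowering the threshold raises the cutoff and removes more terms, which is why the value has to be matched to how sparse the matrix is.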

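4. The distance computation for clustering works the same way as before, but I tried every combination of method settings, that is, the distance between terms and the distance between clusters. I ended up with the combination below, though it will not necessarily suit your data.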

# dist() supports euclidean, maximum, manhattan, canberra, minkowski, binary; the Canberra distance is used here
dist_tdm_removed <- dist(tdm_removed, method = 'canberra')

# Run hierarchical clustering on the distances; the linkage used here is mcquitty.
# Other options are single, complete, median, average, centroid and ward (named ward.D / ward.D2 in R >= 3.1.0)
hc <- hclust(dist_tdm_removed, method = 'mcquitty')
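
The post does not say how the method combinations were compared. One common numeric check is the cophenetic correlation, which measures how well the tree heights reproduce the original distances. The sketch below loops over a few distance/linkage pairs on random data standing in for tdm_removed; it is offered as one possible aid, not as the criterion actually used here, and all variable names are illustrative.

```r
set.seed(1)
# Random stand-in for as.matrix(tdm_removed): 30 "terms" x 8 "documents"
x <- matrix(runif(30 * 8), nrow = 30,
            dimnames = list(paste0("term", 1:30), NULL))

dist_methods <- c("euclidean", "maximum", "manhattan", "canberra")
link_methods <- c("single", "complete", "average", "mcquitty")

# Try every distance/linkage pair and record the cophenetic correlation
results <- expand.grid(dist = dist_methods, link = link_methods,
                       stringsAsFactors = FALSE)
results$coph <- mapply(function(d, l) {
  dm <- dist(x, method = d)
  tree <- hclust(dm, method = l)
  cor(dm, cophenetic(tree))  # agreement between tree heights and original distances
}, results$dist, results$link)

# Combinations with the highest agreement first
print(head(results[order(-results$coph), ]))
```

A high cophenetic correlation only says the dendrogram is faithful to the distances; whether the resulting clusters make sense for MeSH terms still has to be judged by reading them.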

5. Cut the hierarchical clustering result with the cutree function and write formatted output.

cutNum = 25 # number of clusters to cut the tree into
# Cut the tree
ct = cutree(hc, k = cutNum)
write(paste("Split into", cutNum, "clusters"), "data.txt", append = FALSE) # header line; overwrites any old file
write("----------------", "data.txt", append = TRUE)
write("\n", "data.txt", append = TRUE)

# Output each cluster
# To the screen:
#for(i in 1:cutNum){print(paste("Cluster", i, "has", sum(ct == i), "terms")); print(names(ct)[ct == i]); print("----------------")}
# To an external file:
for (i in 1:cutNum) {
  write(paste("Cluster", i, "has", sum(ct == i), "terms"), "data.txt", append = TRUE)
  write("----------------", "data.txt", append = TRUE)
  write(names(ct)[ct == i], "data.txt", append = TRUE)
  write("----------------", "data.txt", append = TRUE)
  write("\n", "data.txt", append = TRUE)
}

The output looks like this:

Split into 25 clusters
----------------

Cluster 1 has 16 terms
----------------
adenocarcinoma,follicular
biopsy,fine-needle
carcinoma,papillary
carcinoma,squamous_cell
fluorodeoxyglucose_f18
iodine_radioisotopes
lymph_node_excision
lymph_nodes
lymphatic_metastasis
positron-emission_tomography
prospective_studies
radiopharmaceuticals
thyroid_gland
thyroid_neoplasms
thyroid_nodule
thyroidectomy
----------------

Cluster 2 has 14 terms
----------------
adolescent
asian_continental_ancestry_group
case-control_studies
child
child,preschool
china
cohort_studies
genetic_predisposition_to_disease
incidence
polymorphism,single_nucleotide
precursor_cell_lymphoblastic_leukemia-lymphoma
risk_factors
smoking
survival_analysis
----------------
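
The same per-cluster summary can also be produced without an explicit loop. The sketch below uses the built-in USArrests data as a stand-in, since the MeSH data is not available here; `hc_demo`, `ct_demo`, `sizes` and `members` are illustrative names only.

```r
# Stand-in data: the built-in USArrests set replaces the MeSH term matrix
hc_demo <- hclust(dist(USArrests, method = "canberra"), method = "mcquitty")
ct_demo <- cutree(hc_demo, k = 4)

# Cluster sizes: the equivalent of sum(ct == i) for every i at once
sizes <- table(ct_demo)
print(sizes)

# Member names per cluster: the equivalent of names(ct)[ct == i]
members <- split(names(ct_demo), ct_demo)
print(members[[1]])
```

With the real data, `table(ct)` is a quick way to spot one oversized cluster or many singletons before deciding whether cutNum needs adjusting.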

6. The plotting step is unchanged

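# With many clusters the labels overlap and become unreadable, so draw a high-resolution image instead
png("test.png", width = 3500, height = 3000) # switch the output device to png; make it as large as practical, but going too large tends to cause problems

# cex sets the label size; cex.axis sets the size of the numbers on the axis, and cex.lab sets the size of the matrix name below the plot

# cex.main sets the size of the title at the top, cex.sub sets the size of the clustering-method name at the bottom, and lwd is the line width; the image is saved to the working directory
plot(hc, cex = 2, cex.axis = 3, cex.lab = 3, cex.main = 3, cex.sub = 3, lwd = 1.5)
rect.hclust(hc, k = cutNum, border = "red") # mark the clusters; k = cutNum (the original used k = 30) so the boxes match the clusters written to data.txt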
dev.off()

posted @ 2012-07-23 18:58 todoit