GOSemSim

GOSemSim: GO-terms Semantic Similarity Measures

 

Installation

Installing GOSemSim is easy; follow the guide on the Bioconductor page:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
## biocLite("BiocUpgrade") ## you may need this
biocLite("GOSemSim")
 

GO ID

Find the Gene Ontology (GO) section and tick the following options:

 

Then download the Gene Ontology data for all yeast proteins.

 Extract the Gene Ontology IDs

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 28 19:04:38 2016
@author: sun
"""
import re

import pandas as pd

# Columns of the exported table that hold the GO annotations
GO_COLUMNS = [
    'Gene ontology (cellular component)',
    'Gene ontology (molecular function)',
    'Gene ontology (biological process)',
]

yeast = pd.read_csv('yeast.csv')
for col in GO_COLUMNS:
    # Keep only the GO IDs (e.g. GO:0005634), joined with ';';
    # proteins without an annotation get an empty string
    yeast[col] = yeast[col].fillna('').apply(
        lambda s: ';'.join(re.findall(r'GO:\d{7}', s)))
yeast.to_csv('go.csv', index=False, columns=['Entry'] + GO_COLUMNS)
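To make the extraction step concrete, here is a minimal sketch that applies the same regular expression to a single annotation string: the cellular component entry of P00546, as it is listed later in this post. The helper name extract_go_ids is made up for illustration.

```python
import re

# Same pattern as in the script above: "GO:" followed by exactly seven digits
GO_ID = re.compile(r"GO:\d{7}")


def extract_go_ids(annotation):
    """Return the GO IDs found in an annotation string, joined with ';'.

    extract_go_ids is a hypothetical helper name, not part of any library.
    """
    return ";".join(GO_ID.findall(annotation))


# Cellular component annotation of P00546, copied from this post
raw = ("cellular bud neck [GO:0005935]; "
       "cyclin-dependent protein kinase holoenzyme complex [GO:0000307]; "
       "cytoplasm [GO:0005737]; endoplasmic reticulum [GO:0005783]; "
       "nucleus [GO:0005634]")

ids = extract_go_ids(raw)
print(ids)
print(repr(extract_go_ids("")))  # a missing annotation yields an empty string
```

The term names are dropped and only the IDs survive, which is the form mgoSim expects later on.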
 

Getting the Gene Ontology

 

Get the Gene Ontology of the gold_yeast protein pairs

yeast_gold_protein_pair.csv
 
import pandas as pd

yeast = pd.read_csv('yeast_gold_protein_pair.csv')
go = pd.read_csv('go.csv', index_col=0)
# Look up the GO annotations of both proteins of every pair, keeping pair order
protein_a = go.loc[yeast.idA, :]
protein_b = go.loc[yeast.idB, :]
protein_a.to_csv('GOProteinA.csv')
protein_b.to_csv('GOProteinB.csv')
GOProteinA.csv
GOProteinB.csv
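The lookup above relies on go.loc accepting a list of labels. Recent pandas versions raise a KeyError when any label in that list is missing from the index, so it can help to check for unknown IDs first. Below is a dependency-free sketch of the same lookup using only the standard library. The in-memory CSV text, the ID Q00000, and the function name lookup_pairs are made up for illustration; the two annotation strings are taken from this post.

```python
import csv
import io

# Tiny stand-ins for go.csv and yeast_gold_protein_pair.csv (illustrative only)
GO_CSV = """Entry,Gene ontology (cellular component)
P00546,GO:0005935;GO:0000307;GO:0005737;GO:0005783;GO:0005634
P25302,GO:0000790;GO:0033309
"""
PAIRS_CSV = """idA,idB
P00546,P25302
P25302,Q00000
"""


def lookup_pairs(go_text, pairs_text, column):
    """Return (rows_a, rows_b, missing) for one GO column.

    rows_a[i] and rows_b[i] hold (protein, annotation) for the i-th pair.
    Proteins without an entry get '' and are reported in `missing`.
    """
    go = {row["Entry"]: row[column]
          for row in csv.DictReader(io.StringIO(go_text))}
    rows_a, rows_b, missing = [], [], set()
    for pair in csv.DictReader(io.StringIO(pairs_text)):
        for key, rows in (("idA", rows_a), ("idB", rows_b)):
            protein = pair[key]
            if protein not in go:
                missing.add(protein)
            rows.append((protein, go.get(protein, "")))
    return rows_a, rows_b, missing


rows_a, rows_b, missing = lookup_pairs(
    GO_CSV, PAIRS_CSV, "Gene ontology (cellular component)")
print(rows_a)
print(rows_b)
print("missing:", sorted(missing))
```

Keeping rows_a and rows_b aligned by position matters, because the R script later compares row i of GOProteinA.csv with row i of GOProteinB.csv.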
 
 

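Installing R

 
Official site: https://www.r-project.org/
 USTC mirror: https://mirrors.ustc.edu.cn/CRAN/
 
 Nothing special here: just double-click the installer. Note: do not install R into a directory whose path contains spaces.

Note: the steps below may go more smoothly through a proxy that bypasses the firewall.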

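 Installing RStudio, a graphical front end for R

 
Download: https://www.rstudio.com/products/rstudio/download/
 
 
Just pick the installer for your operating system.
Again, nothing special: double-click to install.

The interface looks like this: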

 

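Then install GOSemSim by running the installation code from the beginning of this post.

If all goes well, GOSemSim is now installed.

Without a proxy, you will usually get the following error:

 Fix:
    Go to the R installation directory and edit etc/Rprofile.site
 
 
Add options(download.file.method="libcurl"), then restart RStudio.
 
In Tools->Global Options...->Packages->CRAN mirror, select the USTC mirror, as shown in the figure.
If it still fails, download the packages manually; the required packages are listed below.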
 

 

GOSemSim documentation

 
 
3.3 Supported organisms

For IC-based methods, the information content (IC) of a GO term is species specific. We need to calculate IC for all GO terms of a species before we measure semantic similarity. GOSemSim supports all organisms that have an OrgDb object available.
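As a rough illustration of why IC is species specific: the information content of a term is usually defined as the negative log of the fraction of annotated genes that carry the term or any of its descendants, so it changes with each organism's annotation set. The sketch below computes it on a made-up three-term ontology. The term names, gene names, and function names are all hypothetical, and GOSemSim's own implementation may differ in details.

```python
import math


def genes_per_term(children, direct):
    """Map each term to the genes annotated to it or to any descendant."""
    cache = {}

    def collect(term):
        if term not in cache:
            genes = set(direct.get(term, ()))
            for child in children.get(term, ()):
                genes |= collect(child)
            cache[term] = genes
        return cache[term]

    for term in set(children) | set(direct):
        collect(term)
    return cache


def information_content(children, direct, root):
    """IC(t) = -log(p(t)), computed as log(total / n_t).

    n_t is the number of genes under term t, total the number under the root.
    Terms without any annotated gene are skipped.
    """
    genes = genes_per_term(children, direct)
    total = len(genes[root])
    return {t: math.log(total / len(g)) for t, g in genes.items() if g}


# Hypothetical toy ontology: root has two children, a and b
children = {"root": ["a", "b"]}
direct = {"a": {"g1", "g2", "g3"}, "b": {"g4"}}

ic = information_content(children, direct, "root")
for term in ("root", "a", "b"):
    print(term, round(ic[term], 3))
```

The rarely used term b ends up with a higher IC than the frequently used term a, and the root carries no information at all. With a different set of annotated genes the same terms would get different values.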

Bioconductor has already provided OrgDb packages for about 20 species; see http://bioconductor.org/packages/release/BiocViews.html#___OrgDb.

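First, download the OrgDb database for yeast.

 Open RStudio and install the org.Sc.sgd.db_3.4.0.tar.gz file you just downloaded:
Tools->Install Packages
Install from: Package Archive File (.zip; .tar.gz)
Click Browse... and select the org.Sc.sgd.db_3.4.0.tar.gz file you just downloaded.
 All the files we need are now installed.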

Once we have the OrgDb, we can build the annotation data needed by GOSemSim via the godata function.

library(GOSemSim)
hsGO <- godata('org.Hs.eg.db', ont="MF")
## [1] "preparing gene to GO mapping data..."
## [1] "preparing IC data..."

Users can set computeIC=FALSE if they only want to use Wang’s method.
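Wang's method needs no IC because it scores terms purely from the structure of the GO graph: each ancestor of a term receives a semantic contribution that decays along the edges, and two terms are compared through the ancestors they share. The sketch below follows that idea on a made-up graph. The term names and function names are hypothetical, and the edge weights 0.8 (is_a) and 0.6 (part_of) are the values commonly quoted for this method; they are assumed here, not read from GOSemSim.

```python
IS_A, PART_OF = 0.8, 0.6  # assumed edge weights


def s_values(term, parents):
    """Semantic contribution of `term` and of each ancestor to `term`.

    parents maps a term to a list of (parent, edge_weight) pairs.
    S(term) = 1; S(ancestor) = largest product of edge weights on any path.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, weight in parents.get(current, []):
            candidate = s[current] * weight
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # revisit: its ancestors may improve too
    return s


def wang_similarity(a, b, parents):
    """Shared contributions divided by the total contributions of both terms."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = set(sa) & set(sb)
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Hypothetical toy graph: A and B are children of R, C is part of A
parents = {"A": [("R", IS_A)], "B": [("R", IS_A)], "C": [("A", PART_OF)]}

print(round(wang_similarity("A", "B", parents), 3))
print(wang_similarity("A", "A", parents))
print(round(wang_similarity("C", "A", parents), 3))
```

Nothing in this calculation depends on how often a term is used, which is why godata can skip the IC step when only Wang's measure is needed.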


 

 Introduction to goSim and mgoSim

In GOSemSim, we implemented all these IC-based and graph-based methods. goSim function calculates semantic similarity between two GO terms, while mgoSim function calculates semantic similarity between two sets of GO terms.

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
## [1] 0.155
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
## [1] 0.158
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
##            GO:0009055 GO:0005515
## GO:0004022      0.205      0.158
## GO:0004024      0.185      0.141
## GO:0004174      0.205      0.158
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.192
Execution result
> library(GOSemSim)
> hsGO <- godata('org.Hs.eg.db', ont="MF")
[1] "preparing gene to GO mapping data..."
[1] "preparing IC data..."
> goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
[1] 0.155
> goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
[1] 0.158
> go1 = c("GO:0004022","GO:0004024","GO:0004174")
> go2 = c("GO:0009055","GO:0005515")
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
           GO:0009055 GO:0005515
GO:0004022      0.205      0.158
GO:0004024      0.185      0.141
GO:0004174      0.205      0.158
> mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
[1] 0.192
 
The test is now complete!
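To see how combine="BMA" turns the matrix above into a single number: the best-match average takes, for every term in either set, its best score against the other set, and averages those maxima. A minimal sketch, using the rounded matrix printed above as input (the function name bma is made up, and GOSemSim itself works on the unrounded scores):

```python
def bma(matrix):
    """Best-match average of a similarity matrix given as a list of rows."""
    row_max = [max(row) for row in matrix]
    col_max = [max(col) for col in zip(*matrix)]
    return (sum(row_max) + sum(col_max)) / (len(row_max) + len(col_max))


# Rows: GO:0004022, GO:0004024, GO:0004174; columns: GO:0009055, GO:0005515
scores = [
    [0.205, 0.158],
    [0.185, 0.141],
    [0.205, 0.158],
]

print(round(bma(scores), 3))  # 0.192, the same value mgoSim reported
```

Here the three row maxima (0.205, 0.185, 0.205) and the two column maxima (0.205, 0.158) sum to 0.958, and dividing by 5 gives 0.1916.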

 

Computing GO features from the gold_yeast data

The code is as follows:
library(GOSemSim)
library(org.Sc.sgd.db)

GOProteinA <- read.csv("GOProteinA.csv", stringsAsFactors = FALSE)
GOProteinB <- read.csv("GOProteinB.csv", stringsAsFactors = FALSE)
n <- nrow(GOProteinA)
cc <- numeric(n)
mf <- numeric(n)
bp <- numeric(n)

# Wang's method is graph-based, so IC data is not needed (computeIC = FALSE).
# Column 2: cellular component (CC)
scGo <- godata('org.Sc.sgd.db', ont = "CC", computeIC = FALSE)
for (i in seq_len(n)) {
  go1 <- strsplit(GOProteinA[i, 2], split = ";")[[1]]
  go2 <- strsplit(GOProteinB[i, 2], split = ";")[[1]]
  cc[i] <- mgoSim(go1, go2, semData = scGo, measure = "Wang")
}

# Column 3: molecular function (MF)
scGo <- godata('org.Sc.sgd.db', ont = "MF", computeIC = FALSE)
for (i in seq_len(n)) {
  go1 <- strsplit(GOProteinA[i, 3], split = ";")[[1]]
  go2 <- strsplit(GOProteinB[i, 3], split = ";")[[1]]
  mf[i] <- mgoSim(go1, go2, semData = scGo, measure = "Wang")
}

# Column 4: biological process (BP)
scGo <- godata('org.Sc.sgd.db', ont = "BP", computeIC = FALSE)
for (i in seq_len(n)) {
  go1 <- strsplit(GOProteinA[i, 4], split = ";")[[1]]
  go2 <- strsplit(GOProteinB[i, 4], split = ";")[[1]]
  bp[i] <- mgoSim(go1, go2, semData = scGo, measure = "Wang")
}

GOFeature <- data.frame(GOProteinA$Entry, GOProteinB$Entry, cc, mf, bp)
write.csv(GOFeature,
          'GOFeature.csv',
          na = '0',  # write NA values as 0
          row.names = FALSE)
GOFeature.csv
 
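Compared with the data provided by the original paper (Ensemble learning prediction of protein–protein interactions using proteins functional annotations),
the results are different. Why? The figure below shows the data from the original paper.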
 
 

Checking for errors

  • Take the first protein pair as an example and first fetch the GO IDs of P00546 and P25302 individually.

P00546 

Gene ontology (cellular component): 5 CC terms in total
GO:0005935;GO:0000307;GO:0005737;GO:0005783;GO:0005634
  1. cellular bud neck [GO:0005935];
  2. cyclin-dependent protein kinase holoenzyme complex [GO:0000307];
  3. cytoplasm [GO:0005737];
  4. endoplasmic reticulum [GO:0005783];
  5. nucleus [GO:0005634]
Gene ontology (molecular function): 5 MF terms in total
GO:0005524;GO:0004693;GO:0042393;GO:0004674;GO:0000993
  1. ATP binding [GO:0005524];
  2. cyclin-dependent protein serine/threonine kinase activity [GO:0004693];
  3. histone binding [GO:0042393];
  4. protein serine/threonine kinase activity [GO:0004674];
  5. RNA polymerase II core binding [GO:0000993]
Gene ontology (biological process): 35 BP terms in total
GO:0006370;GO:0051301;GO:0000706;GO:1990758;GO:2001033;GO:0051447;GO:0045930;GO:0045875;GO:0045892;GO:0007070;
GO:0018105;GO:0018107;GO:0070816;GO:0051446;GO:0045931;GO:0010571;GO:0010696;GO:0045893;GO:0045944;GO:0010898;
GO:1990139;GO:0034504;GO:1902002;GO:1990802;GO:1990804;GO:1990801;GO:1990803;GO:0010568;GO:0010569;GO:0010570;
GO:0060303;GO:0090169;GO:0032210;GO:0007130;GO:0016192
  1. 7-methylguanosine mRNA capping [GO:0006370];
  2. cell division [GO:0051301];
  3. meiotic DNA double-strand break processing [GO:0000706];
  4. mitotic sister chromatid biorientation [GO:1990758];
  5. negative regulation of double-strand break repair via nonhomologous end joining [GO:2001033];
  6. negative regulation of meiotic cell cycle [GO:0051447];
  7. negative regulation of mitotic cell cycle [GO:0045930];
  8. negative regulation of sister chromatid cohesion [GO:0045875];
  9. negative regulation of transcription, DNA-templated [GO:0045892];
  10. negative regulation of transcription from RNA polymerase II promoter during mitosis [GO:0007070];
  11. peptidyl-serine phosphorylation [GO:0018105];
  12. peptidyl-threonine phosphorylation [GO:0018107];
  13. phosphorylation of RNA polymerase II C-terminal domain [GO:0070816];
  14. positive regulation of meiotic cell cycle [GO:0051446];
  15. positive regulation of mitotic cell cycle [GO:0045931];
  16. positive regulation of nuclear cell cycle DNA replication [GO:0010571];
  17. positive regulation of spindle pole body separation [GO:0010696];
  18. positive regulation of transcription, DNA-templated [GO:0045893];
  19. positive regulation of transcription from RNA polymerase II promoter [GO:0045944];
  20. positive regulation of triglyceride catabolic process [GO:0010898];
  21. protein localization to nuclear periphery [GO:1990139];
  22. protein localization to nucleus [GO:0034504];
  23. protein phosphorylation involved in cellular protein catabolic process [GO:1902002];
  24. protein phosphorylation involved in DNA double-strand break processing [GO:1990802];
  25. protein phosphorylation involved in double-strand break repair via nonhomologous end joining [GO:1990804];
  26. protein phosphorylation involved in mitotic spindle assembly [GO:1990801];
  27. protein phosphorylation involved in protein localization to spindle microtubule [GO:1990803];
  28. regulation of budding cell apical bud growth [GO:0010568];
  29. regulation of double-strand break repair via homologous recombination [GO:0010569];
  30. regulation of filamentous growth [GO:0010570]; 
  31. regulation of nucleosome density [GO:0060303];
  32. regulation of spindle assembly [GO:0090169]; 
  33. regulation of telomere maintenance via telomerase [GO:0032210];
  34. synaptonemal complex assembly [GO:0007130]; 
  35. vesicle-mediated transport [GO:0016192]
 

P25302

Gene ontology (cellular component): 2 CC terms in total
GO:0000790;GO:0033309
  1. nuclear chromatin [GO:0000790];
  2. SBF transcription complex [GO:0033309]
Gene ontology (molecular function): 3 MF terms in total
GO:0003677;GO:0042802;GO:0001077
  1. DNA binding [GO:0003677];
  2. identical protein binding [GO:0042802];
  3. transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding [GO:0001077]
Gene ontology (biological process): 2 BP terms in total
GO:0061408;GO:0071931
  1. positive regulation of transcription from RNA polymerase II promoter in response to heat stress [GO:0061408];
  2. positive regulation of transcription involved in G1/S transition of mitotic cell cycle [GO:0071931]
The term counts computed by the program are correct.
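The count check can also be scripted. A small sketch, using the ID strings listed above for the two proteins (the dictionary layout and the function name count_terms are made up for illustration; the 35 BP IDs of P00546 are left out to keep it short):

```python
def count_terms(go_string):
    """Count the GO IDs in a ';'-separated string; '' counts as zero."""
    return len([term for term in go_string.split(";") if term])


# ID strings copied from the listings above
go_ids = {
    "P00546": {
        "cc": "GO:0005935;GO:0000307;GO:0005737;GO:0005783;GO:0005634",
        "mf": "GO:0005524;GO:0004693;GO:0042393;GO:0004674;GO:0000993",
    },
    "P25302": {
        "cc": "GO:0000790;GO:0033309",
        "mf": "GO:0003677;GO:0042802;GO:0001077",
        "bp": "GO:0061408;GO:0071931",
    },
}

counts = {protein: {cat: count_terms(ids) for cat, ids in cats.items()}
          for protein, cats in go_ids.items()}
print(counts)
```

Filtering out empty pieces matters for proteins with no annotation in a category, since "".split(";") returns one empty string, not an empty list.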

 

  • Compute CC, MF, and BP separately
library(GOSemSim)

# GOProteinA and GOProteinB are the data frames loaded in the script above
ccA <- strsplit(GOProteinA[1, 2], split = ";")[[1]]
ccB <- strsplit(GOProteinB[1, 2], split = ";")[[1]]
mfA <- strsplit(GOProteinA[1, 3], split = ";")[[1]]
mfB <- strsplit(GOProteinB[1, 3], split = ";")[[1]]
bpA <- strsplit(GOProteinA[1, 4], split = ";")[[1]]
bpB <- strsplit(GOProteinB[1, 4], split = ";")[[1]]

scGo <- godata('org.Sc.sgd.db', ont = "CC")
cc <- mgoSim(ccA, ccB, semData = scGo, measure = "Wang")
scGo <- godata('org.Sc.sgd.db', ont = "MF")
mf <- mgoSim(mfA, mfB, semData = scGo, measure = "Wang")
scGo <- godata('org.Sc.sgd.db', ont = "BP")
bp <- mgoSim(bpA, bpB, semData = scGo, measure = "Wang")
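Running the single pair gives the same result as the original program.
 

 No error found, so continue

  • Check the description in the original paper.
It does not seem to give any further details either.

Conclusion

  • One thing is certain: the program is correct.
  • The data has probably been updated. The original paper is from 2014, and two years of GO updates could explain the mismatch.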
 
 
 
 
 
 
 
 
 
 






Posted 2016-11-16 11:50 by 春暖夏微凉。