1. 整理表达矩阵
下载的文件是按样本存放的,每个tsv文件中都记录着一个样本的基因表达量,需要将所有tsv文件合并,得到所有样本的基因表达量的表格。
转录组数据合并
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | rm (list = ls ()) library (stringr) library (jsonlite) library (progress) options (stringsAsFactors = F) cancer_type <- "TCGA-COAD" # 设置肿瘤类型(需修改) setwd ( "D:/BioInformationAnalyze/TCGAdata/CRC" ) # 设置工作目录(需修改) ## 获取每个样品id和相应tsv文件名的对应关系: json <- jsonlite:: fromJSON ( "metadata.cart.2022-12-24.json" ) # 读取Metadata文件中每个样品的信息(需修改matadata文件的名称) View (json) sample_id <- sapply (json$associated_entities, function (x){x[,1]}) # 提取所有样品id,如:TCGA-W5-AA2X-01A-11R-A41I-07 file_sample <- data.frame (sample_id, file_name = json$file_name) # 获取样品id和相应tsv文件名的对应关系 ## 合并所有样品id的基因表达量: count_file <- list.files ( "expdata" , pattern = "*.tsv$" , recursive = TRUE ) # 获取expdata文件夹下的所有tsv表达文件的“路径+文件名” count_file_name <- strsplit (count_file, split= "/" ) %>% sapply ( function (x){x[2]}) # 在count_file中分割出文件名 ## 新建空白矩阵,行数与tsv文件中基因名行数形同(需查看并修改) matrix = data.frame ( matrix (nrow = 60660, ncol = 0)) ## 取每个文件中的一列数据(count或tpm),然后合并 pb <- progress_bar$ new (format = "(:spin) [:bar] :percent 已用时::elapsedfull 预计剩余时间::eta" , total = length (count_file), clear = FALSE , width = 80) # 设置进度条 for (i in 1: length (count_file)){ data<- paste0 ( "expdata/" , count_file[i]) %>% read.delim (fill = TRUE , header = FALSE , row.names = 1) data <- data[- c (1:6),] data <- data[3] # [3]是count,[6]是tpm,[7]是fpkm。一般用tpm,但差异分析R包需要用count。生存分析用tpm。保存3和6两套数据。 colnames (data) <- file_sample$sample_id[ which (file_sample$file_name == count_file_name[i])] matrix <- cbind (matrix, data) pb$ tick () # 显示进度条 Sys.sleep (1 / 100) # 用于计算时间 } # 得到了表达矩阵 matrix :行名为样品id,列名为Ensembl ID,值为基因的表达量 |
表达矩阵 matrix 长这样:
转录组数据拆分,mRNA,lncRNA
表达矩阵是很多种基因类型的混合,我们需要将分析常用的mRNA(protein_coding)、lncRNA分别挑出来。
虽然表达矩阵中也有miRNA,但其gene_name不是常规命名。若要分析mirna,需在TCGA-GDC网站另外下载,其数量与表达矩阵中是一致的。
1 2 3 4 5 6 7 8 9 | data <- paste0 ( "expdata/" , count_file[1]) %>% read.delim (fill = TRUE , header = FALSE , row.names = 1) %>% as.matrix () # 随便读取一个tsv文件 gene_name <- data[- c (1:6),1] # 挑出gene_name gene_type <- data[- c (1:6),2] # 挑出gene_type table (gene_type) matrix <- cbind (gene_name, gene_type, matrix) # 将 gene_name 和 gene_type 添加到表达矩阵 mateix 的最前面 mRNA_matrix <- matrix[matrix$gene_type == "protein_coding" ,] ; dim (mRNA_matrix) # 拆分得到mRNA的表达矩阵 mRNA_matrix lnc_matrix <- matrix[matrix$gene_type == "lncRNA" ,] ; dim (lnc_matrix) # 拆分得到lncRNA的表达矩阵 lncRNA_matrixm |
mRNA_matrix 和 lncRNA_matrix 长这样:
设置Gene Symbol为表达矩阵的列名(上一步得到的列名是Ensembl ID)
1 2 3 4 5 6 | ## 按gene_name列去除重复的基因,保留每个基因最大表达量结果 matrix0 <- aggregate ( . ~ gene_name, data = lnc_matrix, max) # 需要lnc就写lnc,需要mRNA就写mRNA。这里以lnc为例。这一步运行时间较长。 # 函数解析:aggregate函数中,“. ~ gene_name”表示以gene_name为分类,分别计算其他列的最大值。 rownames (matrix0) <- matrix0[,1] # 将“gene_name”列设为列名 matrix0 <- matrix0[, - c (1,2)] matrix0 <- as.data.frame ( lapply (matrix0, as.numeric), row.names = rownames (matrix0), check.names = F) |
matrix0 长这样:
过滤掉在很多样本种表达量都为0基因
表达矩阵整理完成,需要过滤掉在很多样本中表达量都为0的基因。过滤标准不唯一。
判断“matrix0”中的每个基因,是否在至少44个样本中>1。保留符合条件的基因,剔除不符合条件的基因。样本数“>44”根据总样本量修改,可以是数值,比如大于正常样本数“>=40”,也可以是百分比,如大于总样本数的75%“>=0.75*(ncol(matrix0))”
1 2 | exp <- matrix0[ apply (matrix0, 1, function (x) sum (x >= 1) >= 44), ] # counts用这个。x = 1表示对 matrix0 的每一行都执行 function 函数 # exp = matrix0[apply(matrix0, 1, function(x)sum(x > 0.1) >= 44), ] # tpm用这个。 |
过滤后的表达矩阵 exp 长这样:
定义表达矩阵分组信息
1 2 3 | group_list <- ifelse ( as.numeric ( str_sub ( colnames (matrix0), 14, 15)) < 10, "tumor" , "normal" ) %>% factor (levels = c ( "normal" , "tumor" )) # 将正常样品和肿瘤样品分开,并设置因子,便于后续分析 table (group_list) |
保存转录组数据的整理结果
1 2 | save (exp, group_list, cancer_type, file = paste0 (cancer_type, "_01.1.lncRNA_Expdata_count.Rdata" )) # save(exp, group_list, cancer_type, file = paste0(cancer_type, "_01.2.lncRNA_Expdata_tpm.Rdata")) # 换成 tpm 再运行、保存一次 |
2. 整理临床信息
1 2 3 4 5 6 7 | rm (list = ls ()) library (stringr) library (jsonlite) library (XML) options (stringsAsFactors = F) cancer_type <- "TCGA-COAD" # 设置肿瘤类型(需修改) setwd ( "D:/BioInformationAnalyze/TCGAdata/CRC" ) # 设置工作目录(需修改) |
先对单个文件进行检查(备用)
确认文件无误时可不检查
1 2 3 4 5 6 7 8 | result <- xmlParse ( "./clinical/00b9eb65-04c6-4d62-9cf6-6b77d75ab79b/nationwidechildrens.org_clinical.TCGA-DM-A285.xml" ) # 读取单个xml文件,需修改文件目录和文件名 rootnode <- xmlRoot (result) # 将根节点从xml文件中移除 rootsize <- xmlSize (rootnode) # 查找根中的节点数 print (rootnode[1]) # 查看第一个节点 print (rootnode[2]) # 查看第二个节点,可以确认第二个节点为临床信息的保存位置 xmldataframe <- xmlToDataFrame (rootnode[2]) # 将xml中第二个节点转换为数据框 head ( t ( xmlToDataFrame (rootnode[2]))) # 转置并查看该数据框 |
读取所有临床数据文件
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | ## 读取所有临床数据文件路径 xmls <- dir ( "clinical/" , pattern = "*.xml$" , recursive = T) # 读取目录为“clinical/”,读取模式为以“*.xml”结尾的文件,recursive表示列表是否应重新进入目录 ## 构建读取函数 td = function (x){ result <- xmlParse ( file.path ( "clinical/" , x)) rootnode <- xmlRoot (result) xmldataframe <- xmlToDataFrame (rootnode[2]) return ( t (xmldataframe)) } ## 读取并整合所有临床信息 cl = lapply (xmls, td) # 对于xmls中的每一个文件,执行td函数,输出每一个病人的临床信息,并整合到列表中 cl_df <- t ( do.call (cbind,cl)) # 合并列表中的每一个元素为数据框,并将数据框转置(转置的输出结果为矩阵) cl_df[1:3,1:3] clinical = data.frame (cl_df) # 将转置的结果重新转变为数据框 clinical[1:4,1:4] |
保存临床数据的整理结果
1 | save (clinical, cancer_type, file = paste0 (cancer_type, "_02.lncRNA_Clinical.Rdata" )) |
临床数据整理补充方法:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | 用perl整理数据(补充方法二,推荐) ---- ## 将”getClinical.pl“文件复制到clinical文件夹下 ## 在clinical文件夹下运行terminal,运行命令“perl getClinical.pl”。 ## 即可在clinical文件夹下得到“clinical.xls”文件 ## 其中包含:Id futime fustat Age Gender Grade Stage T M N ## 以下为getClinical.pl的perl脚本,复制到txt文件中重命名为.pl即可 use strict; #use warnings; use XML::Simple; opendir (RD, "." ) or die $!; my @dirs = readdir (RD); closedir (RD); open (WF, ">clinical.xls" ) or die $!; print WF "Id\tfutime\tfustat\tAge\tGender\tGrade\tStage\tT\tM\tN\n" ; foreach my $dir ( @dirs ){ #print $dir . "\n"; next if ( $dir eq '.' ); next if ( $dir eq '..' ); #print $dir . "\n"; if (-d $dir ){ opendir (RD, "$dir" ) or die $!; while ( my $xmlfile = readdir (RD)){ if ( $xmlfile =~/\.xml$/){ #print "$dir\\$xmlfile\n"; my $userxs = XML::Simple->new( KeyAttr => "name" ); my $userxml = "" ; if (-f "$dir/$xmlfile" ){ $userxml = $userxs ->XMLin( "$dir/$xmlfile" ); } else { $userxml = $userxs ->XMLin( "$dir\$xmlfile" ); } # print output #open(WF,">dumper.txt") or die $!; #print WF Dumper($userxml); #close(WF); my $disease_code = $userxml ->{ 'admin:admin' }{ 'admin:disease_code' }{ 'content' }; #get disease code my $disease_code_lc = lc ( $disease_code ); my $patient_key = $disease_code_lc . ':patient' ; #ucec:patient my $follow_key = $disease_code_lc . ':follow_ups' ; my $patient_barcode = $userxml ->{ $patient_key }{ 'shared:bcr_patient_barcode' }{ 'content' }; #TCGA-AX-A1CJ my $gender = $userxml ->{ $patient_key }{ 'shared:gender' }{ 'content' }; #male/female my $age = $userxml ->{ $patient_key }{ 'clin_shared:age_at_initial_pathologic_diagnosis' }{ 'content' }; my $race = $userxml ->{ $patient_key }{ 'clin_shared:race_list' }{ 'clin_shared:race' }{ 'content' }; #white/black my $grade = $userxml ->{ $patient_key }{ 'shared:neoplasm_histologic_grade' }{ 'content' }; #G1/G2/G3 my $clinical_stage = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:clinical_stage' }{ 'content' }; #stage I my $clinical_T = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:clinical_categories' }{ 'shared_stage:clinical_T' }{ 'content' }; my $clinical_M = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:clinical_categories' }{ 'shared_stage:clinical_M' }{ 'content' }; my $clinical_N = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:clinical_categories' }{ 'shared_stage:clinical_N' }{ 'content' }; my $pathologic_stage = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:pathologic_stage' }{ 'content' }; #stage I my $pathologic_T = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:pathologic_categories' }{ 'shared_stage:pathologic_T' }{ 'content' }; my $pathologic_M = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:pathologic_categories' }{ 'shared_stage:pathologic_M' }{ 'content' }; my $pathologic_N = $userxml ->{ $patient_key }{ 'shared_stage:stage_event' }{ 'shared_stage:tnm_categories' }{ 'shared_stage:pathologic_categories' }{ 'shared_stage:pathologic_N' }{ 'content' }; $gender =( defined $gender )? $gender : "unknow" ; $age =( defined $age )? $age : "unknow" ; $race =( defined $race )? $race : "unknow" ; $grade =( defined $grade )? $grade : "unknow" ; $clinical_stage =( defined $clinical_stage )? $clinical_stage : "unknow" ; $clinical_T =( defined $clinical_T )? $clinical_T : "unknow" ; $clinical_M =( defined $clinical_M )? $clinical_M : "unknow" ; $clinical_N =( defined $clinical_N )? $clinical_N : "unknow" ; $pathologic_stage =( defined $pathologic_stage )? $pathologic_stage : "unknow" ; $pathologic_T =( defined $pathologic_T )? $pathologic_T : "unknow" ; $pathologic_M =( defined $pathologic_M )? $pathologic_M : "unknow" ; $pathologic_N =( defined $pathologic_N )? $pathologic_N : "unknow" ; my $survivalTime = "" ; my $vital_status = $userxml ->{ $patient_key }{ 'clin_shared:vital_status' }{ 'content' }; my $followup = $userxml ->{ $patient_key }{ 'clin_shared:days_to_last_followup' }{ 'content' }; my $death = $userxml ->{ $patient_key }{ 'clin_shared:days_to_death' }{ 'content' }; if ( $vital_status eq 'Alive' ){ $survivalTime = "$followup\t0" ; } else { $survivalTime = "$death\t1" ; } for my $i ( keys %{ $userxml ->{ $patient_key }{ $follow_key }}){ eval { $followup = $userxml ->{ $patient_key }{ $follow_key }{ $i }{ 'clin_shared:days_to_last_followup' }{ 'content' }; $vital_status = $userxml ->{ $patient_key }{ $follow_key }{ $i }{ 'clin_shared:vital_status' }{ 'content' }; $death = $userxml ->{ $patient_key }{ $follow_key }{ $i }{ 'clin_shared:days_to_death' }{ 'content' }; }; if ($@){ for my $j (0..5){ #假设最多有6次随访 my $followup_for = $userxml ->{ $patient_key }{ $follow_key }{ $i }[ $j ]{ 'clin_shared:days_to_last_followup' }{ 'content' }; my $vital_status_for = $userxml ->{ $patient_key }{ $follow_key }{ $i }[ $j ]{ 'clin_shared:vital_status' }{ 'content' }; my $death_for = $userxml ->{ $patient_key }{ $follow_key }{ $i }[ $j ]{ 'clin_shared:days_to_death' }{ 'content' }; if ( ( $followup_for =~ /\d+/) || ( $death_for =~ /\d+/) ){ $followup = $followup_for ; $vital_status = $vital_status_for ; $death = $death_for ; my @survivalArr = split (/\t/, $survivalTime ); if ( $vital_status eq 'Alive' ){ if ( $followup > $survivalArr [0]){ $survivalTime = "$followup\t0" ; } } else { if ( $death > $survivalArr [0]){ $survivalTime = "$death\t1" ; } } } } } my @survivalArr = split (/\t/, $survivalTime ); if ( $vital_status eq 'Alive' ){ if ( $followup > $survivalArr [0]){ $survivalTime = "$followup\t0" ; } } else { if ( $death > $survivalArr [0]){ $survivalTime = "$death\t1" ; } } } print WF "$patient_barcode\t$survivalTime\t$age\t$gender\t$grade\t$pathologic_stage\t$pathologic_T\t$pathologic_M\t$pathologic_N\n" ; } } close (RD); } } close (WF); |
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!