转：cufflinks的用法之cufflinks

转自：http://yangl.net/2016/06/03/cufflinks/

Cufflinks下主要包含cufflinks,cuffmerge,cuffcompare和cuffdiff等几支主要的程序。主要用于基因表达量的计算和差异表达基因的寻找。

1. Cufflinks简介

Cufflinks程序主要根据Tophat的比对结果，依托或不依托于参考基因组的GTF注释文件，计算出(各个gene的)isoform的FPKM值，并给出trascripts.gtf注释结果(组装出转录组)。

注意：

1. fragment的长度的估测，若为pair-end测序，则cufflinks自己会有一套算法，算出结果。若为single-end测序，则cufflinks默认的是高斯分布，或者你自己提供相关的参数设置。

2. cufflinks计算multi-mapped reads，一般a read map到10个位置，则每个位置记为10%。a read mapping to 10 positions will count as 10% of a read at each position.

3. 一般不推荐用cufflinks拼接细菌的转录组，推荐 Glimmer。但是，若有注释文件，可以用cufflinks和cuffdiff来检测基因的表达和差异性。

4. cufflinks/cuffdiff不能计算出exon或splicing event的FPKM

5.cuffdiff处理时间序列data：采用参数-t

6.当你使用cufflinks时，在最后出现了99%，然后一直不动。因为cuffdiff需要更多的CPU来处理一些匹配很多reads的loci。而这些位点一般要等其他位点全部解决了后，才由cuffdiff来处理。可以用参数-M来提供相关的文件，过滤掉rRNA或者线粒体RNA。

7. 当使用cufflinks或cuffdiff出现了“crash with a ‘bad_alloc’ error”，cuffdiff和cufflinks运行了很长时间才结束————这表明计算机拼接一个高表达的基因或定量分析一个高表达的基因，运行的内存使用玩尽了！解决方法：修改选项“-max-bundle-frags”，可以先尝试500000，若错误依旧在，可以继续下调！

8. cuffdiff报道的结果里面所有的基因和转录本的FPKM=0，这表明GTF中的染色体名字和BAM里的名字不匹配。

9. cuffdiff和cufflinks的缺点：存在一定的假基因和转录本（原因：测序深度，测序质量，测序样本的测序次数，以及注释的错误）

10. large fold change表达量不代表数据的明显性（这些基因的isform多或这些基因测序测到的少，整体较低的表达）。cuffdiff中明显表达倍数改变的基因，存在不确定性。

11. 通过cufflinks产生的结果中transcript.gtf文件中cuff标识的转录本就是新的转录本。相应的，其他模块输出中CUFF标识代表着新的转录本。

12. 若出现了如下错误：

You are using Cufflinks v2.2.1, which is the most recent release.
open: No such file or directory
File 30 doesn’t appear to be a valid BAM file, trying SAM…
Error: cannot open alignment file 30 for reading
这表明，你的参数有问题。例如“–min-intron-length”,你设置为了：“-min-intron-length”

2. 使用方法

$ cufflinks [options]* 

一个常用的例子： $ cufflinks -p 8 -G transcript.gtf --library-type fr-unstranded -o cufflinks_output tophat_out/accepted_hits.bam

普通参数：

-o | --output-dir default: ./ 设置输出的文件夹名称

  -p | --num-threads default: 1 用于比对reads的CPU线程数

  -G | --GTF 提供一个GFF文件，以此来计算isoform的表达。此时，将不会组装新的transcripts， 程序会忽略和reference transcript不兼容的比对结果

  -g | --GTF-guide 提供GFF文件，以此来指导转录子组装(RABT assembly)。此时，输出结果会包含reference transcripts和novel genes and isforms。

  -M | --mask-file 提供GFF文件。Cufflinks将忽略比对到该GTF文件的transcripts中的reads。该 文件中常常是rRNA的注释，也可以包含线立体和其它希望忽略的transcripts的注释。将这些不需要的RNA去除后，对计算mRNA的表达量是有利的。

-b | --frag-bias-correct 提供一个fasta文件来指导Cufflinks运行新的bias detection and correction algorithm。这样能明显提高转录子丰度计算的精确性。

  -u | --multi-read-correct 让Cufflinks来做initial estimation步骤，从而更精确衡量比对到genome多个位点的reads。

  --library-type default:fr-unstranded 处理的reads具有链特异性。比对结果中将会有个XS标签。一般Illumina数据的lib rary-type为 fr-unstranded。
--library-norm-method    具体参考官网,三种方式：classic-fpkm  默认的方式。geometric  针对DESeq。quartile  计算时，fragments和总的map的count取75%

丰度评估参数：

-m | --frag-len-mean default: 200 插入片段的平均长度。不过现在Cufflinks能learns插入片段的平均长度，因此不推荐自主 设置此值。

  -s | --frag-len-std-dev default: 80 插入片段长度的标准差。不过现在Cufflinks能learns插入片段的平均长度，因此不推荐自 主设置此值。

  -N | --upper-quartile-form 使用75%分为数的值来代替总的值(比对到单一位点的fragments的数值)，作normalize。这样有利于在低丰度基因和转录子中寻找差异基因。

  --total-hits-norm default: TRUE Cufflinks在计算FPKM时,算入所有的fragments和比对上的reads。和下一个参数 对立。默认激活该参数。

  --compatible-hits-norm Cufflinks在计算FPKM时，只针对和reference transcripts兼容的fragments以及比对上的reads。该参数默认不激活，只能在有 --GTF 参数下有效，并且作 RABT 或 ab initio 的时候无效。

--max-mle-iterations  进行极大似然法时选择的迭代次数，默认为：5000

--max-bundle-frags  一个skipped locus/loci在别skipped前可以拥有的最大的fragment片段。默认为1000000

--no-effective-length-correction   Cufflinks will not employ its "effective" length normalization to transcript FPKM.Cufflinks将不会使用它的“effective” 长度标准化去计算转录的FPKM

--no-length-correction   Cufflinks将根本不会使用转录本的长度去标准化fragment的数目。当fragment的数目和the features being quantified的size是独立的，可以使用（例如for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length).小心使用

组装常用参数：

-L | --label default: CUFF Cufflink以GTF格式来报告转录子片段(transfrags),该参数是GTF文件的前缀

-F/--min-isoform-fraction <0.0-1.0>  在计算一个基因的isoform 丰度后，过滤了丰度极低的转录本，因为这些转录本不可以信任。也可以过滤一些read匹配极低的外显子。默认为0.1或者10% of the most abundant isoform (the major isoform) of the gene.（一个基因的主要isoform的丰度的10%）

-j/--pre-mrna-fraction <0.0-1.0>   内含子被aligment覆盖的最低深度。若小于这个值则那些内含子的alignments被忽略掉。默认为15%。 The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.

-I/--max-intron-length  内含子的最大长度。若大于该值的内含子，cufflinks不会报告。默认为300000.Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this. The default is 300,000.

-a/--junc-alpha <0.0-1.0>    剪接比对过滤中假阳性的二项检验中的 alpha value。默认为 0.001

-A/--small-anchor-fraction <0.0-1.0>  在junction中一个reads小于自身长度的这个百分比，会被怀疑，可能会在拼接前被过滤掉。默认为0.09

--min-frags-per-transfrag default: 10 组装出的transfrags被支持的RNA-seq的fragments数少于该值则不被报道。

--overhang-tolerance  当决定一个reads或转录本与某个转录本兼容或匹配的时候，允许的能加入该转录本的外显子的延伸长度。默认是8bp和bowtie/tophat默认的一致。

--max-bundle-length  Maximum genomic length allowed for a given bundle. The default is 3,500,000bp.

--min-intron-length default: 50 最小的intron大小。

--trim-3-avgcov-thresh  最小的3‘端的平均覆盖程度。小于该值，则删除其3’端序列。默认10  Minimum average coverage required to attempt 3' trimming. The default is 10.

--trim-3-dropoff-frac   最低百分比的拼接的转录本的3‘端的平均覆盖程度。默认0.1  The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.

--max-multiread-fraction <0.0-1.0>   若一个转录本Transfrags的reads能匹配到基因组的多个位置，其中该转录本的reads有超过该百分比是multireads，则不会报告这个转录本。默认为75%   The fraction a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).

--overlap-radius default: 50 Transfrags之间的距离少于该值，则将其连到一起。

Advanced Reference Annotation Based Transcript (RABT) Assembly Options:当你使用-g/--GTF-guide这个参数时，需要考虑的选项。

--3-overhang-tolerance    当决定一个拼接的转录本（这个转录本可能不是新的转录本）和一个参考转录本是否合并时，参考转录本的3‘端允许延伸的长度。默认600bp   The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (ie, the assembled transcript is not novel). The default is 600 bp.

--intron-overhang-tolerance   当决定一个拼接的转录本（这个转录本可能不是新的转录本）和一个参考转录本是否合并时，参考转录本的外显子允许延伸的长度。默认50bp   The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (ie, the assembled transcript is not novel). The default is 50 bp.

--no-faux-reads   This option disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lay within reference transcripts. All reference transcripts in the input annotation will also be included in the output.这一项将不能掩盖参考转录组中的假reads。当你只想在拼接中使用测序的reads而不想输出lay within reference transcripts的拼接的转录组。输入时注释的所有的参考转录组也将会输入到输出中。

Cufflinks输出结果：

cufflinks的输入文件是sam或bam格式。并且sam或bam格式的文件必须排好序。（The SAM file supplied to Cufflinks must be sorted by reference position.）Tophat的输出结果sam或bam已经排好了序。针对其他的未排序的sam或bam文件采用如下排序方式：
sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted

1. transcripts.gtf
该文件包含Cufflinks的组装结果isoforms。前7列为标准的GTF格式，最后一列为attributes。其每一列的意义：

列数 列的名称 例子 描述 
1 序列名 chrX 染色体或contig名 ; 2 来源 Cufflinks 产生该文件的程序名 ; 3 类型 exon 记录的类型，一般是transcript或exon ; 4 起始 1 1-base的值 ; 5 结束 1000 结束位置 ; 6 得分 1000 ; 7 链 + Cufflinks猜测isoform来自参考序列的那一条链， 一般是'+','-'或'.';   8 frame . Cufflinks不去预测起始或终止密码子框的位置 ; 9 attributes ... 详见下

每一个GTF记录包含如下attributes：

Attribute 例子 描述 
gene_id   CUFF.1    Cufflinks的gene id ;  transcript_id   CUFF.1.1 Cufflinks的转录子 id   ; FPKM 101.267 isoform水平上的丰度, Fragments Per Kilobase of exon model per Million mapped fragments ; frac 0.7647 保留着的一项，忽略即可，以后可能会取消这个;  conf_lo 0.07 isoform丰度的95%置信区间的下边界，即 下边界值 = FPKM * ( 1.0 - conf_lo ) ;  conf_hi 0.1102 isoform丰度的95%置信区间的上边界，即 上边界值 = FPKM * ( 1.0 + conf_hi ) ; cov 100.765 计算整个transcript上read的覆盖度;  full_read_support yes 当使用 RABT assembly 时，该选项报告所有的intr ons和exons是否完全被reads所覆盖

2. ispforms.fpkm_tracking
isoforms(可以理解为gene的各个外显子)的fpkm计算结果


3. genes.fpkm_tracking
gene的fpkm计算结果

posted on 2017-07-02 10:03 青萍，你好阅读(1609) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

一周一paper，一周一技术