Reproduce ENCODE/CSHL Long RNA-seq data visualization viewed in UCSC

Motivation

The ENCODE data comes out, and luckily they provide both .bam file and .bigwig file. Thus, it occurs to me that I want to give a try to reproduce the data visualization with tool: BEDtools and other related tools.

Result

I'll first upload the difference between my-version and official version:enter image description here

Top to Bottom:

  • Black: my-version-POSitive-strand.bigwig
  • Blue: Official-version-POSitive-strand.bigwig
  • Red: Official-version-REVerse-strand.bigwig
  • Grey: my-version-REVerse-strand.bigwig

From the image, we will find my-version-data and official-version-data roughly share the same peaks, however, my-version-peaks are somehow masked by certain uniform noises. And it drives me crazy.

Note that I know not all the bioinformatics works can be reproduces, but this issue dose not get involved with much algorithms, decisions, etc. Therefore, it's supposed to be reproducible, I think.

Data Set

ENCODE/CSHL long RNA-seq Data set can be found here:http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/ And here I use K562-chromatin-subcellular fraction (Rep4) to explore as an example:

Data Processing

BAM sort

samtools sort wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort

Genome Coverage

I refer to the standard manual of BEDtools, I'll use forward strand as example, and the reverse strand signal is generated in the same way.

genomeCoverageBed -bg -ibam wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqK562ChromatinTotalAlnRep4.bam.sort -g hg19.chromInfo -strand + >K562-Chromatin-POS-4.bedgraph

Note that I've used -strand flag to separate the two strands.

bedgraphtoBigWig

bedGraphToBigWig executive script available from UCSC exe list

bedGraphToBigWig K562-Chromatin-POS-4.bedgraph hg19.chromInfo K562-Chromatin-POS-4.bigwig

Upload to ftp and finally to UCSC genome browser.

Discussion

I was wondering which filtering step I've missed.

I've checked whether all the reads in the .bam file are unique mapped. As the reads are mapped to genome with a tool named, STAR.. According to the manual and common sense, the mapping quality in .sam file equaling 255 means unique mapped reads. Thus, all the reads in the .bam file are unique mapped after I've check the mapping quality.

Another gene difference

 

eh...hey guys, at first, -split flag comes to my mind when I used coverageBED, but based on the manual of BEDtools, I misunderstood the -split will omit all the intron reads (or maybe it is because my mother tongue is not English). Nevertheless, having added the -split flag, I managed to reproduce the data visualization of ENCODE data. Here is the image:

reproduce

The green one is the POSitive strand signal, and it's exactly the same as the official version.

The key is to add the -split option when one runs the genomeCoverageBed command.

Enjoy

 

posted @ 2012-10-06 21:27  Puriney  阅读(649)  评论(0编辑  收藏  举报