Reproduce ENCODE/CSHL Long RNA-seq data visualization[2]--coSI

% Boxplot coSI exons % Yun YAN % Oct 9, 2012

Motivation

Reproduce the coSI boxplot shown in Figure 2 of the original paper, using the ggplot2 graphics package.

Data Set

The raw data should first be formatted so that it is easy to work with in the R console.

Data Processing

Here I calculate the relative distance to the polyA site.

Relative Distance = Absolute Distance to polyA / Gene Length

Thus the Relative Distance ranges from 0 to 1.0.
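
As a minimal sketch of the formula above: the data frame exons and its columns abs_dist and gene_length are hypothetical names with made-up values, not part of the original data set.

```r
# Hypothetical exon table; column names and values are assumptions for illustration.
exons <- data.frame(
  abs_dist    = c(0, 500, 2500, 10000),     # absolute distance to the polyA site (bp)
  gene_length = c(1000, 1000, 10000, 10000) # length of the gene containing the exon (bp)
)

# Relative Distance = Absolute Distance to polyA / Gene Length
exons$position <- exons$abs_dist / exons$gene_length

# Sanity check: every relative distance must lie in [0, 1].
stopifnot(all(exons$position >= 0 & exons$position <= 1))
exons
```

An exon at the polyA site gets 0, and an exon a whole gene length away gets 1.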

Note that this does not really make biological sense.

Finally, the data will be formatted as follows:

> head(df)
  bin position     coSI Bin
1 2nd 0.833110 0.897727 2nd
2 2nd 0.837699 0.897727 2nd
3 2nd 0.837699 0.897727 2nd
4 2nd 0.837699 0.897727 2nd
5 2nd 0.843272 0.897727 2nd
6 1st 0.980673 0.897727 1st

 

Classifying Position

In the R console, enter the following commands. The labels go into a separate vector, so that position stays numeric for every comparison.

> position <- df$position
> bin <- character(length(position))   # one label per exon, filled in below
> bin[position<0.1] = "10th"
> bin[position>=0.1 & position<0.2] = "9th"
> bin[position>=0.2 & position<0.3] = "8th"
> bin[position>=0.3 & position<0.4] = "7th"
> bin[position>=0.4 & position<0.5] = "6th"
> bin[position>=0.5 & position<0.6] = "5th"
> bin[position>=0.6 & position<0.7] = "4th"
> bin[position>=0.7 & position<0.8] = "3rd"
> bin[position>=0.8 & position<0.9] = "2nd"
> bin[position>=0.9 & position<=1.0] = "1st"
> df <- cbind(bin=bin, df)   # keep the result, otherwise the new column is lost
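
The same binning could also be written with base R's cut(). This is only a sketch of an alternative, not what was used for the figure. The position values here are toy numbers (a few taken from head(df), the rest made up to hit the bin edges), and bin_labels is a name introduced for illustration.

```r
# Toy positions: some from head(df), some made up to test the bin edges.
position <- c(0.833110, 0.05, 0.1, 0.5, 0.843272, 0.980673, 1.0)

# Labels for [0,0.1), [0.1,0.2), ..., [0.9,1.0], i.e. nearest to polyA first.
bin_labels <- c("10th", "9th", "8th", "7th", "6th",
                "5th", "4th", "3rd", "2nd", "1st")

# right = FALSE makes the intervals closed on the left: [a, b).
# include.lowest = TRUE then also closes the last interval, giving [0.9, 1.0].
bin <- as.character(cut(position,
                        breaks = (0:10) / 10,
                        labels = bin_labels,
                        right = FALSE,
                        include.lowest = TRUE))
bin
```

Because position is never overwritten, it stays numeric and every comparison is a numeric one.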

 

Factoring bin

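Without converting bin to a factor with explicit levels, the x-axis labels will not be displayed in the right order: the bins are sorted as plain strings, so 10th is placed before 1st.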

df$Bin<- factor(df$bin, levels=c("1st","2nd","3rd","4th","5th","6th","7th","8th","9th","10th"), labels=c("1st","2nd","3rd","4th","5th","6th","7th","8th","9th","10th"))
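
To see what the explicit levels change, here is a small sketch on a made-up vector of bin labels. The names ordered_levels, default_factor and ordered_factor are only for illustration.

```r
# A few made-up bin labels, deliberately out of order.
bin <- c("2nd", "10th", "1st", "3rd")
ordered_levels <- c("1st", "2nd", "3rd", "4th", "5th",
                    "6th", "7th", "8th", "9th", "10th")

# Without levels, factor() sorts the unique labels as strings.
default_factor <- factor(bin)

# With levels, the order is exactly the one given.
ordered_factor <- factor(bin, levels = ordered_levels)

levels(default_factor)              # string order, typically with "10th" before "1st"
levels(droplevels(ordered_factor))  # the levels actually present, in the intended order
```

ggplot2 lays out a discrete x-axis in level order, which is why the factored column Bin gives the ordered plot.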

 

Boxplotting

Let's compare two plots. The first is the intended one, with the bins in order.

library(ggplot2)
ggplot(df, aes(x=Bin, y=coSI, fill=Bin)) + geom_boxplot()+theme_bw()+ xlab('Notice x-label order')+ggtitle('Exon bins by relative distance to polyA site')

 

 

And here is the unordered one, which uses the unfactored bin column.

ggplot(df, aes(x=bin, y=coSI, fill=bin)) + geom_boxplot()+theme_bw()+ xlab('Notice x-label order')+ggtitle('Exon bins by relative distance to polyA site')

  

Discussion

Factor

Without using factor, the x-axis labels will not be displayed in the right order.

Relative Distance

Relative distance is somewhat misleading. It treats the first exon of a short gene the same as the first exon of a long gene, even though their absolute distances to the polyA site are very different.
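
A quick worked example with made-up numbers (the data frame genes and its values are purely illustrative) shows the problem:

```r
# Two hypothetical genes; lengths and distances are invented for illustration.
genes <- data.frame(
  gene        = c("short", "long"),
  gene_length = c(2000, 200000),   # bp
  abs_dist    = c(1900, 190000)    # first exon's distance to the polyA site (bp)
)

# Same formula as in Data Processing.
genes$position <- genes$abs_dist / genes$gene_length
genes
```

Both first exons get a relative distance of 0.95 and land in the same bin (1st), although one is 1,900 bp from the polyA site and the other is 190,000 bp away.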

posted @ 2012-10-09 17:26  Puriney