Reproduce ENCODE/CSHL Long RNA-seq data visualization[2]--coSI
% Boxplot coSI exons
% Yun YAN
% Oct 9, 2012
Motivation
Reproduce the coSI boxplot shown in Figure 2 of the original paper, using the graphics package ggplot2.
Data Set
The raw data should first be formatted so that it is easy to work with in the R console.
Data Processing
Here I calculate the relative distance to the polyA site.
Relative Distance = Absolute Distance to polyA / Gene Length
Thus the Relative Distance ranges from 0 to 1.0.
Note that this does not actually make biological sense.
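The formula above can be sketched in R as follows. This is only a minimal illustration: the column names `abs_dist` and `gene_length` and the toy values are assumptions, not part of the original data set.

```r
# Toy input; 'abs_dist' and 'gene_length' are hypothetical column names.
exons <- data.frame(
  abs_dist    = c(0, 500, 1500, 2000),     # distance from exon to polyA site (bp)
  gene_length = c(2000, 2000, 2000, 2000)  # length of the host gene (bp)
)

# Relative Distance = Absolute Distance to polyA / Gene Length
exons$position <- exons$abs_dist / exons$gene_length

# Sanity check: every relative distance must lie in [0, 1]
stopifnot(all(exons$position >= 0 & exons$position <= 1))
```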
Finally, the data will be formatted as follows:
    > head(df)
      bin position     coSI Bin
    1 2nd 0.833110 0.897727 2nd
    2 2nd 0.837699 0.897727 2nd
    3 2nd 0.837699 0.897727 2nd
    4 2nd 0.837699 0.897727 2nd
    5 2nd 0.843272 0.897727 2nd
    6 1st 0.980673 0.897727 1st
Classifying Position
In the R console, enter the following commands:

    > pos <- df$position               # numeric relative distance
    > bin <- character(length(pos))    # separate label vector, so pos stays numeric
    > bin[pos < 0.1] <- "10th"
    > bin[pos >= 0.1 & pos < 0.2] <- "9th"
    > bin[pos >= 0.2 & pos < 0.3] <- "8th"
    > bin[pos >= 0.3 & pos < 0.4] <- "7th"
    > bin[pos >= 0.4 & pos < 0.5] <- "6th"
    > bin[pos >= 0.5 & pos < 0.6] <- "5th"
    > bin[pos >= 0.6 & pos < 0.7] <- "4th"
    > bin[pos >= 0.7 & pos < 0.8] <- "3rd"
    > bin[pos >= 0.8 & pos < 0.9] <- "2nd"
    > bin[pos >= 0.9 & pos <= 1.0] <- "1st"
    > df <- cbind(bin = bin, df)       # assign the result, otherwise it is only printed
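As an alternative to the chain of assignments, base R's `cut()` can do the same binning in a single call. The sketch below is not the method used in this post; it runs on a toy vector `pos` that stands in for `df$position`, and the variable names are assumptions.

```r
# Toy relative distances standing in for df$position.
pos <- c(0.05, 0.15, 0.85, 0.95, 1.0)

# Bin labels, from the interval [0, 0.1) up to [0.9, 1.0].
labs <- c("10th", "9th", "8th", "7th", "6th", "5th", "4th", "3rd", "2nd", "1st")

# right = FALSE gives left-closed intervals [a, b);
# include.lowest = TRUE closes the last interval, so 1.0 falls in "1st".
bin <- cut(pos, breaks = (0:10) / 10, labels = labs,
           right = FALSE, include.lowest = TRUE)

# Reverse the level order so that the bins run from "1st" to "10th".
bin <- factor(bin, levels = rev(labs))
print(bin)
```

Because `cut()` returns a factor, the level order can be set in the same step instead of being fixed afterwards.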
Factoring bin
Without converting bin to a factor with explicit levels, the x-axis labels will not be displayed in the proper order.
    df$Bin <- factor(df$bin,
                     levels=c("1st","2nd","3rd","4th","5th","6th","7th","8th","9th","10th"),
                     labels=c("1st","2nd","3rd","4th","5th","6th","7th","8th","9th","10th"))
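The reason the order goes wrong is that `factor()` sorts the unique strings by default, and "10th" sorts before "1st". A small sketch on a toy vector (the names `x`, `default_levels` and `ordered_levels` are illustrative only):

```r
x <- c("1st", "2nd", "3rd", "10th")

# Default: levels are the sorted unique strings, so "10th" comes before "1st".
default_levels <- levels(factor(x))

# Explicit levels keep the intended order.
ordered_levels <- levels(factor(x, levels = c("1st", "2nd", "3rd", "10th")))

print(default_levels)
print(ordered_levels)
```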
Boxplotting
Let's compare two versions. The first is the desired plot, with the bins in order.

    ggplot(df, aes(x=Bin, y=coSI, fill=Bin)) +
      geom_boxplot() +
      theme_bw() +
      xlab('Notice x-label order') +
      ggtitle('Exon bins by relative distance to polyA site')
And here is the unordered one.

    ggplot(df, aes(x=bin, y=coSI, fill=bin)) +
      geom_boxplot() +
      theme_bw() +
      xlab('Notice x-label order') +
      ggtitle('Exon bins by relative distance to polyA site')
Discussion
Factor
Without using factor, the x-axis labels will not be displayed in the proper order.
Relative Distance
Relative distance is somewhat misleading. It treats the first exon of a short gene the same as an exon of a long gene at the same relative position, even though their absolute distances to their respective polyA sites are very different.
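A quick numeric illustration of the problem. The gene lengths below are made-up values chosen only to show the effect:

```r
# Two hypothetical genes; the lengths are illustrative values only.
gene_length <- c(short = 2000, long = 200000)  # bp

# An exon at the same relative distance in both genes ...
rel_dist <- 0.95

# ... lies at very different absolute distances from the polyA site.
abs_dist <- rel_dist * gene_length
print(abs_dist)
```

Both exons end up in the "1st" bin, although one is about 1.9 kb from its polyA site and the other about 190 kb.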