Adding A Summary Row For Table
Introduction
In daily analysis we will face a simple chanllenge: how to add a summary row for our table?
# A tibble: 5 × 2 cut avg_pri <ord> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458.
In most of my experience, I just export them to Excel and manually add a summary row. This is not too bad, because after all I still have to export them to Excel for further communication, like preparing PowerPoint or deliver them with mail.
Do we have a more easy and robust solution with R?
How About add_row()?
As an example, we will use diamonds dataset. If you installed and import tidyverse package you can use it directly.
library(tidyverse) diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price))
# A tibble: 5 × 2 cut avg_pri <ord> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458.
A The first plain idea is using a function called add_row() because we want to add a row indeed. This function allows you to build tibble row by row, so that we can add a summary row as we want.
When you use add_row(), you are not able to access the original dataframe columns. Instead, you need to use dataset$columname
.
# use it separately cut_analysis <- diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price)) cut_analysis %>% add_row(cut = "All", avg_pri = mean(diamonds$price)) # or, just use it in our pipline data flow diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price)) %>% add_row(cut = "All", avg_pri = mean(diamonds$price)) # avg_pri = mean(price) is not working
# A tibble: 6 × 2 cut avg_pri <chr> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458. 6 All 3933.
This is a little step foreward compare to "Excel method" I daily used. But still it can be frustrating when the output columns are more than handful numbered.
For example:
# using add_row() can be frustrating when the output columns are more than handful diamonds %>% group_by(cut) %>% summarise(n = n(), avg_carat = mean(carat), sum_depth = sum(depth), sum_table = sum(table), avg_price = mean(price), sd_x = sd(x), sd_y = sd(y), sd_z = sd(z))
# A tibble: 5 × 9 cut n avg_carat sum_depth sum_table avg_price sd_x sd_y sd_z <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Fair 1610 1.05 103107. 95077. 4359. 0.964 0.956 0.652 2 Good 4906 0.849 305967 287956. 3929. 1.06 1.05 0.655 3 Very Good 12082 0.806 746888. 700226. 3982. 1.10 1.10 0.730 4 Premium 13791 0.892 844901. 810167. 4584. 1.19 1.26 0.731 5 Ideal 21551 0.703 1329899. 1205814. 3458. 1.06 1.07 0.658
How about bind_rows()?
Another idea is to use bind_rows(). It can add rows to dataframe as well.
Not like in add_row() we have to specify columns by ourselves. bind_row() gives us a chance to copy-paste our aggregation code.
# do it separately summary_row <- diamonds %>% summarise(avg_pri = mean(price)) cut_analysis %>% bind_rows(summary_row) # or, just use it as functional bind_rows( diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price)), diamonds %>% summarise(avg_pri = mean(price)) )
# A tibble: 6 × 2 cut avg_pri <ord> <dbl> 1 Fair 4359. 2 Good 3929. 3 Very Good 3982. 4 Premium 4584. 5 Ideal 3458. 6 NA 3933.
It is a great improvment becuase even when the summary columns are massive, we can use it by copying our summary code without much effort.
bind_rows( # grouped summary diamonds %>% group_by(cut) %>% summarise(n = n(), avg_carat = mean(carat), sum_depth = sum(depth), sum_table = sum(table), avg_price = mean(price), sd_x = sd(x), sd_y = sd(y), sd_z = sd(z)), # total summary # it is just copy-and-paste and delete group_by() clause diamonds %>% summarise(n = n(), avg_carat = mean(carat), sum_depth = sum(depth), sum_table = sum(table), avg_price = mean(price), sd_x = sd(x), sd_y = sd(y), sd_z = sd(z)), )
# A tibble: 6 × 9 cut n avg_carat sum_depth sum_table avg_price sd_x sd_y sd_z <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Fair 1610 1.05 103107. 95077. 4359. 0.964 0.956 0.652 2 Good 4906 0.849 305967 287956. 3929. 1.06 1.05 0.655 3 Very Good 12082 0.806 746888. 700226. 3982. 1.10 1.10 0.730 4 Premium 13791 0.892 844901. 810167. 4584. 1.19 1.26 0.731 5 Ideal 21551 0.703 1329899. 1205814. 3458. 1.06 1.07 0.658 6 NA 53940 0.798 3330763. 3099240. 3933. 1.12 1.14 0.706
One deeper layer problem
One more layer deeper problem is that, when we grouped more than just one variable, how can we add summary rows?
For example, how can we add summary rows to each cut group, or to each color group?
diamonds %>% group_by(cut, color) %>% summarise(avg_pri = mean(price))
# A tibble: 35 × 3 # Groups: cut [5] cut color avg_pri <ord> <ord> <dbl> 1 Fair D 4291. 2 Fair E 3682. 3 Fair F 3827. 4 Fair G 4239. 5 Fair H 5136. 6 Fair I 4685. 7 Fair J 4976. 8 Good D 3405. 9 Good E 3424. 10 Good F 3496. # … with 25 more rows
Using bind_rows() like above can solve this problem as well. We just need to choose which group are we want to summary.
For example, I want to add summary rows about variable cut:
bind_rows( diamonds %>% group_by(cut, color) %>% summarise(avg_pri = mean(price)), diamonds %>% group_by(cut) %>% summarise(avg_pri = mean(price)) )
Or, I just want to add summary rows about variabe color:
bind_rows( diamonds %>% group_by(cut, color) %>% summarise(avg_pri = mean(price)), diamonds %>% group_by(color) %>% summarise(avg_pri = mean(price)) )
What if I don't want to copy-and-paste my code?
Another more refreshing method is using bind_rows() but first with a transformation to our dataset.
We can bind the dataset to itself, but with the column we want to group renamed. This is making our data redundancy but the process of output will directly contains a total summary row.
bind_rows( diamonds, diamonds %>% mutate(cut = "All") ) %>% group_by(cut) %>% summarise(avg_pri = mean(price))
# A tibble: 6 × 2 cut avg_pri <chr> <dbl> 1 All 3933. 2 Fair 4359. 3 Good 3929. 4 Ideal 3458. 5 Premium 4584. 6 Very Good 3982.
There is a open source function written by camille
, making a lot of works underhood. You can use it directly in the pipline data flow. This is creator's web page:
https://github.com/camille-s/camiller/blob/main/R/bind_self.R
# using bind_self() function by camille diamonds %>% group_by(cut) %>% bind_self(group = "cut", new_value = "All") %>% summarise(avg_pri = mean(price)) # if you are facing more than just one variable is grouped, you # can still use bind_self diamonds %>% group_by(cut, color) %>% bind_self(group = "cut", new_value = "All") %>% summarise(avg_pri = mean(price))
I think this is a useful function towords this problem. However, still it can be risky, especially when your dataset is very large. Remember you are making double data redundancy and this may make your computer's memroy overflow.
Final thought
One solution is that we can just use two smaller tables to show a separated result, or we can use written words to describle a total result.
Regardless, knowing how to use bind_rows()/ bind_self()/ add_row() is like a tool always packed at our back pocket.