Using pivot_longer() and pivot_wider() in R

1. Introduction

In R language, we often use group_by() with summarise() to do summary. If we group_by() multiple columns, we will have a multi-index table as result. Normally this is not a problem, for example, we actually need this kind of multi-index table to draw some good ggplot() pictures. However, in some business situation, we will need a more "Excel style" result. That is, the table only has one index, and the other dimentions of summary will go to columns name, and all the values will fill into the table.

Additionally, there are also some industry, like medical science, they have a tradition to record data in wide table. But this kind of table is not good at analyse, so we need to change it first.

 

2. A Multi-index Table

First, let’s have a look at what is our daily multi-index table. As I mentioned, this is caused by group_by() function and we normally do need it in R.

# we are using classic dataset diamonds
library(tidyverse)
diamonds

# a multi-index result
diamonds %>%
  group_by(cut, color) %>%
  summarise(carat_mean = mean(carat)) 

Output:

   cut   color carat_mean
   <ord> <ord>      <dbl>
 1 Fair  D          0.920
 2 Fair  E          0.857
 3 Fair  F          0.905
 4 Fair  G          1.02 
 5 Fair  H          1.22 
 6 Fair  I          1.20 
 7 Fair  J          1.34 
 8 Good  D          0.745
 9 Good  E          0.745
10 Good  F          0.776
# … with 25 more rows

In R, multi-index table is welcomed, because it is good at next move, like ggplot().

# a multi-index result is good at ggplot()
diamonds %>%
  group_by(cut, color) %>%
  summarise(carat_mean = mean(carat)) %>%
  ggplot() +
  geom_col(aes(x = cut, y = carat_mean, fill = color))

Output:

 

3. How to pivot a long table into wide?

So now the problem is, for some reasons, we have to change a multi-index table into a more "Excel style" result. What we can do?

# using pivot_wider()
diamonds %>%
  group_by(cut, color) %>%
  summarise(carat_mean = mean(carat)) %>%
  pivot_wider(id_cols = cut, names_from = color, values_from = carat_mean)

Output:

  cut           D     E     F     G     H     I     J
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair      0.920 0.857 0.905 1.02  1.22  1.20   1.34
2 Good      0.745 0.745 0.776 0.851 0.915 1.06   1.10
3 Very Good 0.696 0.676 0.741 0.767 0.916 1.05   1.13
4 Premium   0.722 0.718 0.827 0.841 1.02  1.14   1.29
5 Ideal     0.566 0.578 0.656 0.701 0.800 0.913  1.06

We can use pivot_wider() function. This function has 3 important arguments:

1st, id_cols. This argument means which column or columns you still want to keep in your new result. Normally even we don’t want a multi-index table, but we will still want a least one index table.

2nd, names_from. This argument means which index you don’t want anymore, so pivot_wider() will melt this column and makes a wide table. Notice, we must guarantee the columns which we are going to melt have enumerable value, they can’t be a numeric value column. Otherwise the result will be ridiculous.

3rd, values_from. This argument means which column’s value we are using to fill into the new table.

After the calculation, all the result will be filled as we need. Some people may call it Excel style result.

A little more complex pivot_wider() problem is, you want melt more than one old index to build new column’s name. That is easy with c() function.

# a little more complex pivot
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarise(carat_mean = mean(carat)) 

# use c() function to multi select columns you want to melt
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarise(carat_mean = mean(carat)) %>%
  pivot_wider(id_cols = cut, names_from = c(color, clarity), values_from = carat_mean)

Output1, the original multi-index table:

   cut   color clarity carat_mean
   <ord> <ord> <ord>        <dbl>
 1 Fair  D     I1           1.88 
 2 Fair  D     SI2          1.02 
 3 Fair  D     SI1          0.914
 4 Fair  D     VS2          0.844
 5 Fair  D     VS1          0.63 
 6 Fair  D     VVS2         0.591
 7 Fair  D     VVS1         0.607
 8 Fair  D     IF           0.38 
 9 Fair  E     I1           0.969
10 Fair  E     SI2          1.02 
# … with 266 more rows

Output2, the pivoted wider table:

cut     D_I1 D_SI2 D_SI1 D_VS2 D_VS1 D_VVS2 D_VVS1  D_IF  E_I1 E_SI2 E_SI1 E_VS2 E_VS1 E_VVS2 E_VVS1  F_I1 F_SI2 F_SI1 F_VS2 F_VS1 F_VVS2 F_VVS1  F_IF  G_I1 G_SI2 G_SI1 G_VS2 G_VS1 G_VVS2
  <ord>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Fair    1.88 1.02  0.914 0.844 0.63   0.591  0.607 0.38  0.969 1.02  0.867 0.690 0.633  0.601  0.64  1.02  1.08  0.864 0.759 0.805  0.627  0.668 0.555  1.23 1.26  0.910 0.978 0.774  0.665
2 Good    1.04 0.858 0.701 0.702 0.663  0.481  0.491 0.787 1.33  0.883 0.724 0.739 0.681  0.560  0.418 0.976 1.00  0.768 0.752 0.625  0.608  0.466 0.533  1.17 1.09  0.884 0.816 0.779  0.626
3 Very …  0.95 0.932 0.708 0.634 0.583  0.466  0.475 0.803 1.07  0.930 0.723 0.664 0.610  0.427  0.400 1.21  0.951 0.796 0.742 0.688  0.571  0.494 0.607  1.12 1.03  0.786 0.810 0.701  0.651
4 Premi…  1.16 0.919 0.692 0.585 0.687  0.581  0.538 0.708 1.04  0.958 0.726 0.619 0.643  0.512  0.462 1.14  1.04  0.842 0.731 0.767  0.658  0.606 0.525  1.29 1.14  0.883 0.809 0.750  0.693
5 Ideal   0.96 0.750 0.595 0.499 0.534  0.545  0.460 0.616 1.04  0.874 0.671 0.521 0.504  0.484  0.427 1.11  0.932 0.770 0.632 0.644  0.577  0.476 0.411  1.17 0.976 0.760 0.770 0.717  0.646
# … with 27 more variables: G_VVS1 <dbl>, G_IF <dbl>, H_I1 <dbl>, H_SI2 <dbl>, H_SI1 <dbl>, H_VS2 <dbl>, H_VS1 <dbl>, H_VVS2 <dbl>, H_VVS1 <dbl>, I_I1 <dbl>, I_SI2 <dbl>, I_SI1 <dbl>,
#   I_VS2 <dbl>, I_VS1 <dbl>, I_VVS2 <dbl>, I_VVS1 <dbl>, J_I1 <dbl>, J_SI2 <dbl>, J_SI1 <dbl>, J_VS2 <dbl>, J_VS1 <dbl>, J_VVS2 <dbl>, J_VVS1 <dbl>, E_IF <dbl>, H_IF <dbl>, I_IF <dbl>,
#   J_IF <dbl>

   

4. How to pivot a wide table into long?

On the other hand, we may want to change a wide table into long as well. What can we do?

# we are using classic dataset who, a medical science table
who

# using pivot_longer()
who %>%
  pivot_longer(cols = starts_with("new"), names_to = "type", values_to = "cases") 

Output1, the original wide table:

country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 new_sp_m5564 new_sp_m65 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554
   <chr>       <chr> <chr> <int>       <int>        <int>        <int>        <int>        <int>        <int>      <int>       <int>        <int>        <int>        <int>        <int>
 1 Afghanistan AF    AFG    1980          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 2 Afghanistan AF    AFG    1981          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 3 Afghanistan AF    AFG    1982          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 4 Afghanistan AF    AFG    1983          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 5 Afghanistan AF    AFG    1984          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 6 Afghanistan AF    AFG    1985          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 7 Afghanistan AF    AFG    1986          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 8 Afghanistan AF    AFG    1987          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
 9 Afghanistan AF    AFG    1988          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
10 Afghanistan AF    AFG    1989          NA           NA           NA           NA           NA           NA         NA          NA           NA           NA           NA           NA
# … with 7,230 more rows, and 44 more variables: new_sp_f5564 <int>, new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
#   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
#   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
#   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
#   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
#   newrel_f65 <int>

Output2, the pivoted long table:

   country     iso2  iso3   year type         cases
   <chr>       <chr> <chr> <int> <chr>        <int>
 1 Afghanistan AF    AFG    1980 new_sp_m014     NA
 2 Afghanistan AF    AFG    1980 new_sp_m1524    NA
 3 Afghanistan AF    AFG    1980 new_sp_m2534    NA
 4 Afghanistan AF    AFG    1980 new_sp_m3544    NA
 5 Afghanistan AF    AFG    1980 new_sp_m4554    NA
 6 Afghanistan AF    AFG    1980 new_sp_m5564    NA
 7 Afghanistan AF    AFG    1980 new_sp_m65      NA
 8 Afghanistan AF    AFG    1980 new_sp_f014     NA
 9 Afghanistan AF    AFG    1980 new_sp_f1524    NA
10 Afghanistan AF    AFG    1980 new_sp_f2534    NA
# … with 405,430 more rows

We can use pivot_longer(). This function has 3 important arguments.

1st, cols. This argument means which columns names you want to melt or you don't want anymore. Sometimes, they can be up to 100 or more. We can use start_with()/ ends_with()/ ... function to multi select them.

2nd, names_to. This argument means, the melted columns name will go to which new column. We need to specify it by a string. That is the new column’s name.

3rd, values_to. This argument means all the values will go to a new column. We need to specify it by a string. That is the new column’s name.

  

5. Final thoughts

When I use Python to manipulate data and use matplotlib to draw pictures, some years ago, I find it very hard to use multi-index table. I have to change it to this non multi-index format. But in R, I believe it is designed intentionally, it is very welcome multi-index table and easy to use.

Form time to time in R, we may forget how to change it into a non multi-index table. Besides, many articles online about pivot topic are still back to "Stone Ages" that they are discussing base R functions or some other wired packages. 

 

 

 

posted @ 2021-12-04 16:19  DrVonGoosewing  阅读(65)  评论(0编辑  收藏  举报