How to use map() instead of an if_else() sandwich?
1. Why do I care?
library(tidyverse)
(tips <- read_csv("tips.csv"))

# A tibble: 244 × 7
   total_bill   tip sex    smoker day   time    size
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl>
 1      17.0   1.01 Female No     Sun   Dinner     2
 2      10.3   1.66 Male   No     Sun   Dinner     3
 3      21.0   3.5  Male   No     Sun   Dinner     3
 4      23.7   3.31 Male   No     Sun   Dinner     2
 5      24.6   3.61 Female No     Sun   Dinner     4
 6      25.3   4.71 Male   No     Sun   Dinner     4
 7       8.77  2    Male   No     Sun   Dinner     2
 8      26.9   3.12 Male   No     Sun   Dinner     4
 9      15.0   1.96 Male   No     Sun   Dinner     2
10      14.8   3.23 Male   No     Sun   Dinner     2
# … with 234 more rows
When the logic is simple, if_else() is convenient.
# a common workflow is...
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.20, "well paid", "under paid"))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tips_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 under paid
 3      21.0   3.5  Male   No     Sun   Dinner     3 under paid
 4      23.7   3.31 Male   No     Sun   Dinner     2 under paid
 5      24.6   3.61 Female No     Sun   Dinner     4 under paid
 6      25.3   4.71 Male   No     Sun   Dinner     4 under paid
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 under paid
 9      15.0   1.96 Male   No     Sun   Dinner     2 under paid
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
But the disadvantage of this method is that as the logic grows, the code turns into chaos. You end up with many layers of if_else() stacked inside each other, like a sandwich.
Needless to say, if the conditions involve more than just two columns, you will confuse yourself easily.
For example:
# more layers of if_else() are difficult to write and to read:
tips %>%
  mutate(tips_type = if_else(tip >= total_bill * 0.2, "well paid",
                     if_else(tip >= total_bill * 0.15, "fare paid",
                     if_else(tip >= total_bill * 0.1, "acceptable", "under paid"))))
# many layers of if_else() overlapping, like a sandwich

# needless to say, if we are using more columns than just two:
#tips %>%
#  mutate(tips_type = if_else((tip >= total_bill * 0.2) & (day %in% c("Sat", "Sun") & time == "Dinner"), "well paid", ...))
That is, we need a separate place to arrange our business logic and prepare our function, instead of building a huge if_else() sandwich. Luckily, R supports functional programming, and it has a convenient tool called map(), from the purrr package (part of the tidyverse).
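Before applying it to tips, here is a minimal sketch of what map() does on a toy vector. The vector tip_rates and the helper label_rate() are made up for this illustration and are not part of the tips data.

```r
library(purrr)  # also attached by library(tidyverse)

# hypothetical toy input: tip as a fraction of the bill
tip_rates <- c(0.06, 0.16, 0.23)

# hypothetical helper: the logic for ONE value, written with plain if/else
label_rate <- function(rate) {
  if (rate >= 0.2) "well paid" else "under paid"
}

# map_chr() calls label_rate() once per element and
# collects the results into a character vector
rate_labels <- map_chr(tip_rates, label_rate)
rate_labels
```

The function only ever sees one value at a time, so ordinary if/else works inside it, and the _chr suffix states that a character vector is expected back.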
2. Before map()
map() is no magic: conceptually it is just a wrapper around a for-loop. A for-loop is a good choice when many inputs go through the same process, which is exactly what we need. However, a for-loop is hard to write inside mutate(), and don't forget that our problem is creating a new column. So we use the wrapper of the for-loop instead: map().
# you can use a for-loop all the same, but that breaks your data-flow pipeline:
tips %>%
  filter(time == "Dinner") %>%
  mutate(tips_type = "oh wait a minute, I first write a for-loop to find the result")

# (joke)mY cOoL foR-lO0p
tips_type_result <- vector("character", nrow(tips))
for (i in seq_along(tips$tip)) {
  if (tips$tip[[i]] >= tips$total_bill[[i]] * 0.2) {
    tips_type_result[[i]] <- "well paid"
  } else {
    tips_type_result[[i]] <- "under paid"
  }
}

# (joke)lEt's gO bAck to mUtaTe
# note: the loop ran over ALL rows of tips, so after filter() the lengths
# no longer match and mutate() complains about the size of tips_type_result
tips %>%
  filter(time == "Dinner") %>%
  mutate(tips_type = tips_type_result) %>%
  summarise()...
# you will not want to do things like this. That's why we need map().
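To make the "wrapper of a for-loop" idea concrete, here is a rough sketch of what a map2_chr()-like helper could look like if written by hand. The names my_map2_chr() and judge() are hypothetical, and this is not how purrr actually implements map2_chr(); the point is only that the loop lives inside a function that returns a vector, which is what lets it sit inside mutate().

```r
# hypothetical hand-written stand-in for purrr::map2_chr(), for illustration only
my_map2_chr <- function(x, y, f) {
  stopifnot(length(x) == length(y))
  out <- vector("character", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]])  # apply f to the i-th pair
  }
  out
}

# hypothetical one-row-at-a-time logic
judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) "well paid" else "under paid"
}

# rows 1, 7 and 10 of tips as printed earlier: (tip, total_bill)
hand_rolled <- my_map2_chr(c(1.01, 2, 3.23), c(17.0, 8.77, 14.8), judge)
hand_rolled
```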
3. A workflow of using map()
First, we put the business logic into an ordinary function. After that, we can use this function with map() inside mutate(), which also keeps our data-flow pipeline intact for the next step.
Using map() makes our code more readable and more robust. If our business logic changes, we only change the independent function instead of rewriting the mutate() clause.
tip_type_judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) {
    return("well paid")
  } else if (tip >= total_bill * 0.15) {
    return("fare paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = map2_chr(tip, total_bill, tip_type_judge))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 fare paid
 3      21.0   3.5  Male   No     Sun   Dinner     3 fare paid
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 fare paid
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
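One practical side of keeping the logic in its own function is that it can be checked without any data frame at all. A minimal sketch, assuming the same thresholds as tip_type_judge() (the function is repeated here only so the block runs by itself); the tip and bill values in the checks are made up for illustration.

```r
# same logic as tip_type_judge(), repeated so this block is self-contained
tip_type_judge <- function(tip, total_bill) {
  if (tip >= total_bill * 0.2) {
    "well paid"
  } else if (tip >= total_bill * 0.15) {
    "fare paid"
  } else if (tip >= total_bill * 0.1) {
    "acceptable"
  } else {
    "under paid"
  }
}

# made-up inputs: one spot check per branch, all on a bill of 100
branch_checks <- c(
  tip_type_judge(25, 100),  # 25% of the bill
  tip_type_judge(17, 100),  # 17%
  tip_type_judge(12, 100),  # 12%
  tip_type_judge(5, 100)    # 5%
)
branch_checks
```

If the thresholds change later, only this function and its checks need to be touched; the mutate() call stays the same.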
If you have more than just 2 columns as input, you can use pmap_*(). It takes a list as input, and the list can hold as many columns as we need; an unnamed list is matched to the function's arguments by position. Note that I use pmap_chr() below. Use pmap_dbl() if your result is a double (floating-point number).
# if you have more than 2 columns as input, use pmap_*() instead
tip_type_judge_v2 <- function(tip, total_bill, day) {
  if ((tip >= total_bill * 0.2) && (day %in% c("Sun", "Sat"))) {
    return("well paid")
  } else if ((tip >= total_bill * 0.15) && !(day %in% c("Sun", "Sat"))) {
    return("well paid")
  } else if (tip >= total_bill * 0.1) {
    return("acceptable")
  } else {
    return("under paid")
  }
}

tips %>%
  mutate(tip_type = pmap_chr(list(tip, total_bill, day), tip_type_judge_v2))

# A tibble: 244 × 8
   total_bill   tip sex    smoker day   time    size tip_type
        <dbl> <dbl> <chr>  <chr>  <chr> <chr>  <dbl> <chr>
 1      17.0   1.01 Female No     Sun   Dinner     2 under paid
 2      10.3   1.66 Male   No     Sun   Dinner     3 acceptable
 3      21.0   3.5  Male   No     Sun   Dinner     3 acceptable
 4      23.7   3.31 Male   No     Sun   Dinner     2 acceptable
 5      24.6   3.61 Female No     Sun   Dinner     4 acceptable
 6      25.3   4.71 Male   No     Sun   Dinner     4 acceptable
 7       8.77  2    Male   No     Sun   Dinner     2 well paid
 8      26.9   3.12 Male   No     Sun   Dinner     4 acceptable
 9      15.0   1.96 Male   No     Sun   Dinner     2 acceptable
10      14.8   3.23 Male   No     Sun   Dinner     2 well paid
# … with 234 more rows
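For the pmap_dbl() case mentioned earlier, here is a small sketch. The function tip_gap_judge(), its weekend rule, and the three-row tibble tips_small are all invented for this example; tips_small is only a stand-in so the block runs without tips.csv. Passing a named list lets pmap_dbl() match columns to arguments by name rather than by position.

```r
library(dplyr)  # both are also attached by library(tidyverse)
library(purrr)

# made-up stand-in for tips, with round numbers for easy checking
tips_small <- tibble(
  tip        = c(2, 3, 5),
  total_bill = c(10, 20, 50),
  day        = c("Sun", "Thur", "Sat")
)

# hypothetical rule: how far the tip rate is from the target rate for that day
# (assumed targets: 20% on weekends, 15% otherwise)
tip_gap_judge <- function(tip, total_bill, day) {
  target <- if (day %in% c("Sat", "Sun")) 0.2 else 0.15
  tip / total_bill - target
}

# the names in the list must match the function's argument names
tips_gap <- tips_small %>%
  mutate(tip_gap = pmap_dbl(
    list(tip = tip, total_bill = total_bill, day = day),
    tip_gap_judge
  ))
tips_gap
```

With a named list, the order of the columns inside list() no longer matters, which helps once the function takes more than a handful of arguments.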