Predicting food preferences with sparklyr (machine learning)
This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.
The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines.
Apache Spark
Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup. Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training. Of course, Spark's power really shows when jobs are dispatched to an external cluster, but for demonstration purposes I am running everything on a local Spark instance.
MLlib
Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, as well as helper functions for data preprocessing. With sparklyr you can easily access MLlib: you can work with a range of machine learning algorithms, with functions for manipulating features and Spark DataFrames, and you can also run SQL queries. Because sparklyr implements a dplyr backend, it is especially convenient for handling data.
If you don’t have Spark installed locally, run:
library(sparklyr)
spark_install(version = "2.0.0")
Now we can connect to a local Spark instance:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
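To see the connection in action, we can copy a small R data frame into Spark and fit a model with MLlib. This is only a sketch using the built-in mtcars data (it is not part of the analysis below), and it assumes a sparklyr version recent enough to support the formula interface of ml_linear_regression:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.0.0")

# copy the built-in mtcars data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
src_tbls(sc)  # lists the tables known to this connection

# fit a simple linear regression with MLlib
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

The same dplyr verbs you would use on a local data frame (select, filter, mutate, ...) work on mtcars_tbl and are translated to Spark SQL behind the scenes.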
Preparations
Before I start with the analysis, I set up my custom ggplot2 theme and load the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots).
library(tidyr)
library(ggplot2)
library(ggrepel)
library(dplyr)
my_theme <- function(base_size = 12, base_family = "sans") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 14),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "aliceblue"),
      strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
      strip.text = element_text(face = "bold", size = 12, color = "black"),
      legend.position = "right",
      legend.justification = "top",
      panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
    )
}
The Data
Of course, the power of Spark lies in speeding up operations on large datasets. But because a large dataset is not very handy for a demonstration, I am working here with a small one: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup.
This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:
- 5: I love this country’s traditional cuisine. I think it’s one of the best in the world.
- 4: I like this country’s traditional cuisine. I think it’s considerably above average.
- 3: I’m OK with this country’s traditional cuisine. I think it’s about average.
- 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
- 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
- N/A: I’m unfamiliar with this country’s traditional cuisine.
Because I think that not being familiar with a country’s cuisine is in itself information, I recoded the NAs to 0.
library(fivethirtyeight)

# recode the character value "N/A" to a proper missing value
food_world_cup[food_world_cup == "N/A"] <- NA

# the country preference scores are in columns 9 to 48; set missing scores to 0
food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0

# convert gender and location to factors
food_world_cup$gender <- as.factor(food_world_cup$gender)
food_world_cup$location <- as.factor(food_world_cup$location)
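Assuming the recoding above has been run, a quick sanity check (my addition, not part of the original workflow) confirms that no NAs remain among the preference columns:

```r
# after recoding, columns 9 to 48 should contain no missing values
stopifnot(!anyNA(food_world_cup[, 9:48]))

# the observed scores should now be a subset of 0 through 5
sort(unique(unlist(food_world_cup[, 9:48])))
```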
The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/or location.
Before I do any machine learning, however, I want to get to know the data. First, I am calculating the percentages for each preference category and plotting them as pie charts faceted by country.
# calculating percentages per category and country
percentages <- food_world_cup %>%
  select(algeria:vietnam) %>%
  gather(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(Percent = round(n / sum(n) * 100, digits = 2))
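To make the gather/group_by/summarise pattern above easier to follow, here is the same pipeline on a tiny toy data frame (my own illustration; the column names italy and japan are made up). After summarise, the result is still grouped by x, so sum(n) in mutate is taken per country:

```r
library(dplyr)
library(tidyr)

# toy data: three respondents rating two cuisines
toy <- data.frame(italy = c(5, 5, 3), japan = c(4, 0, 4))

toy %>%
  gather(x, y) %>%
  group_by(x, y) %>%
  summarise(n = n()) %>%
  mutate(Percent = round(n / sum(n) * 100, digits = 2))
# x     y     n Percent
# italy 3     1   33.33
# italy 5     2   66.67
# japan 0     1   33.33
# japan 4     2   66.67
```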
# rename countries & plot
percentages %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = "", y = Percent, fill = y)) +
    geom_bar(width = 1, stat = "identity") +
    theme_minimal() +
    coord_polar("y", start = 0) +
    facet_wrap(~ x_2, ncol = 8) +
    scale_fill_brewer(palette = "Set3") +
    labs(fill =