Analyzing Public Health Data in R
Today’s post is by Thomas Yokota, an epidemiologist in Hawaii. I’ve been corresponding with Thomas via email and telephone for a while. I asked Thomas if he could write an introduction to how R, mapping and open data are used in the public health community. This is his reply.
Bonus: Download the code from this post!
What is public health?
Public health is a broad and sometimes overreaching field. Consequently, I am asked, “what IS public health?” more so than “what school did you go to (in Hawaii)?” I also work for a hospital system where I am often mistaken as someone knowledgeable of clinical care. With that said, I’d like to start off with a quote by C.E. Taylor:
“The focus of public health is the community. The ‘patient’ is a whole population unit. The shift in professional orientation which occurs as the unit of attention moves from the individual to the group must be clearly recognized and explicitly stated because it has led to many misunderstandings in the past.”
Once it’s clearly understood that public health is not brain surgery and – according to the CDC – more about zombie invasions, the next response I usually get is, “so what does public health do if it doesn’t cure diabetes?” Such a loaded question. In a nutshell, the four core functions of public health are as follows:
- Assessment
- Policy development
- Assurance
- Communication
In other words, public health professionals rely on studies and surveys, such as the Behavioral Risk Factors Surveillance System (BRFSS), to understand health trends. Consequently, campaigns and/or policies are developed to break trends when needed.
So why R?
R is a great asset for making powerful graphs and maps, and can complement current SAS/STATA/SPSS/etc. workflows. Proprietary statistical programs will always be public health’s frenemies in the area of research and complex survey analysis – it’s inevitable that we must sacrifice funding health fair incentives to pay for license fees – but let’s not write off R too soon. R is also desirable for pubic health data analysts (i.e., epidemiologists) who work in community organizations.
BRFSS
In laymen terms, many public health organizations and professionals cite the BRFSS when interested in health risk behaviors, health access and chronic disease prevalence. Introductions and examples have already been created on BRFSS for R, and have been featured on R-bloggers. Anthony Damico maintains a GitHub repository on downloading and storing complex surveys. I recommend his solution for those who are new to R and/or complex survey analysis and because there is a significant speed increase when working on analyzing the BRFSS alongside MonetDB. I also recommend Dr. Thomas Lumley’s book Complex Survey Analysis to understand complex surveys.
Analyzing BRFSS in R
In this example, I am taking a couple of shortcuts to import the data and to test out the srvyr package because… pipes. It also helps me to cut down some of the code and instructions entailed to setting up R and MonetDB – it’s also been done before. It is important to note, however, that Survey package is doing all the heavy lifting.
load dependencies and data
I use pacman package when handling all additional package loading and installs. I find it to be more convenient than library()
. I am also borrowing Anthony’s download script to pull the BRFSS from CDC before loading it into R.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
install.packages ( 'pacman' ) pacman:: p_load (RCurl, foreign, downloader, survey, srvyr, ggplot2, dplyr) source_url ( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Download%20Cache/download%20cache.R" , prompt=F, echo=F) # download ez-pz brought to you by anthony joseph damico [ajdamico@gmail.com] tf <- tempfile (); td <- tempdir () download_cached (xpt, tf, mode= 'wb' ) local.fn <- unzip (tf, exdir=td) brfss14 <- read.xport (local.fn) save (brfss14, file= "brfss14.rda" ) # load("brfss14.rda") state.id <- read.csv ( "https://raw.githubusercontent.com/tyokota/complexsurvey/master/brfss_state.csv" , stringsAsFactors=F) ageg65yr.id <- read.csv ( "https://raw.githubusercontent.com/tyokota/complexsurvey/master/ageg65yr.csv" , stringsAsFactors=F) |
create the survey object
After importing the data, I use the Survey package to utilize the BRFSS sampling design and weighting so that we can create estimates that represent the population. The raking weighting methodology should be understood before proceeding. In this example, I will look at BMI categories as inferred by the following variable and question: BMI5CAT-Computed body mass index categories. I am re-coding BMI5CAT so that I can create estimates for overweight and/or obese prevalence.
1
2
3
4
5
6
|
brfss.design14 <- brfss14 %>% as_survey_design (ids=X_PSU, weight=X_LLCPWT, nest= TRUE , strata=X_STSTR, variables= c (X_BMI5CAT, X_MRACE1, X_STATE)) brfss.design14 <- brfss.design14 %>% mutate (X_BMI5CAT2 = car:: recode (X_BMI5CAT, "c(3,4)=1; NA=NA; else=0" ), X_BMI5CAT= as.factor (X_BMI5CAT)) |
state-by-state comparison
Now that we have created the survey design object, we can use R to create estimates from the data. Before doing this, however, it is advised that the reader studies the BRFSS code book. Another good source to learning how to use the BRFSS is to study past reports and case studies. You can always verify your query against the CDC BRFSS Prevalence & Trends Tool.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
options (survey.lonely.psu = "certainty" ) brfss.design14.0 %>% filter (X_STATE==15) %>% group_by (X_STATE, X_BMI5CAT) %>% summarize (prevalence = survey_mean (na.rm=T, vartype = c ( "ci" )), N = survey_total (na.rm=T)) # matches CDC tool brfss.design14 <- brfss.design14 %>% mutate (X_BMI5CAT2= as.factor (X_BMI5CAT2)) BMI5CAT2.1 <- brfss.design14 %>% group_by (X_STATE, X_BMI5CAT2) %>% summarize (prevalence = survey_mean (na.rm=T), N = survey_total (na.rm=T)) %>% mutate (X_STATE = as.character (X_STATE)) %>% left_join (state.id, by= c ( "X_STATE" = "VALUE" )) |
Most commonly, BRFSS is used to make state-by-state comparisons. In the past (and sometimes currently), state-by-state comparisons were often ranked. This made for great click-bait news headlines. Most recently, choropleth maps have been used to visualize prevalence rates across the nation.
1
2
3
4
5
6
7
8
9
10
11
|
BMI5CAT2.1 %>% filter (X_BMI5CAT2==1) %>% ggplot ( aes (x= reorder (STATE, prevalence), y=prevalence)) + scale_y_continuous (labels=scales::percent) + geom_bar (stat= "identity" ) + ylab ( "YES (%)" ) + xlab ( "STATE" ) + ggtitle ( "Computed body mass index categories (overweight or obese) by state" ) + theme (axis.text.x = element_text (angle = 90, hjust = 1), panel.background = element_rect (fill = 'white' )) + guides (fill= FALSE ) |
1
2
3
4
5
6
7
|
BMI5CAT2.2 <- BMI5CAT2.1 %>% filter (X_BMI5CAT2==1) %>% select (region=STATE, value=prevalence) %>% mutate (region = tolower (region)) choroplethr:: state_choropleth (BMI5CAT2.2, title= "Computed body mass index categories (overweight or obese) choropleth map" , num_colors=9) |
I’m not a big fan of this map projection, and hope there will be an option in the future to change the projection.
deep dive Hawaii – stratify by race
Before we start dropping leaflets on healthy eating across the Bible Belt, why don’t we dig into the data a bit more. Let’s take my home state for example. Hawaii appears to have fewer obesity and overweight prevalence than most of the nation – California, drool at our sculpted natural beach bodies. Seems like there’s no issues here…
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
brfss14a <- brfss14 %>% filter (X_STATE==15) brfss.design14a <- brfss14a %>% as_survey_design (ids=X_PSU, weight=X_LLCPWT, nest= TRUE , strata=X_STSTR, variables= c (X_BMI5CAT, X_MRACE1, X_STATE, X_AGEG5YR)) brfss.design14a <- brfss.design14a %>% mutate (X_BMI5CAT2 = car:: recode (X_BMI5CAT, "c(3,4)=1; NA=NA; else=0" ), X_BMI5CAT2a=car:: recode (X_BMI5CAT2, "1='Obese/overweight'; NA=NA; else='Not Obese/overweight'" ), X_BMI5CAT2a= factor (X_BMI5CAT2a, levels= c ( 'Obese/overweight' , 'Not Obese/overweight' ), ordered= TRUE ), X_MRACE1a=car:: recode (X_MRACE1, "1='White'; 2='Black'; 3='AIAN'; 4='Asian'; 5='NHOPI'; NA=NA; c(6,7)='other/multiracial'; else=NA" ), X_AGEG5YR= as.factor (X_AGEG5YR)) BMI5CAT2.3 <- brfss.design14a %>% group_by (X_STATE, X_MRACE1a, X_BMI5CAT2a) %>% summarize (prevalence = survey_mean (na.rm=T), N = survey_total (na.rm=T), prevalence_ci = survey_mean (na.rm=T, vartype = c ( "ci" ))) %>% mutate (X_STATE = as.character (X_STATE)) %>% left_join (state.id, by= c ( "X_STATE" = "VALUE" )) BMI5CAT2.3 %>% ggplot ( aes (x = X_MRACE1a, y = prevalence, group = X_BMI5CAT2a)) + geom_bar (stat = "identity" , aes (fill = X_BMI5CAT2a), alpha = 0.3) + scale_y_continuous (labels=scales::percent) + ylab ( "Prevalence (%)" ) + xlab ( "Race" ) + ggtitle ( "Computed body mass index categories (overweight or obese) in Hawaii by race" ) + theme (axis.text.x = element_text (angle = 90, hjust = 1), panel.background = element_rect (fill = 'white' )) + guides (fill = guide_legend (title = "Categories" )) |
Stratifying by race shows us a different picture: NHOPI [ed note: NHOPI stands for (N)ative (H)awaiian & (O)ther (P)acific (I)slander] overweight/obese prevalence is at 80% (CI95%: 73-87%), easily topping all point estimates by state.
deep dive Hawaii – stratify by race and age group
Going one level deeper, we see that onset of overweight/obesity starts earlier for NHOPI and that prevalence across age groups is higher in comparison. The graph suggests that efforts to address overweight/obesity among NHOPI groups is needed in Hawaii; in addition, it also suggests that intervention at earlier stages would be beneficial for this group as well.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
BMI5CAT2.4 <- brfss.design14a %>% group_by (X_STATE, X_MRACE1a, X_AGEG5YR, X_BMI5CAT2a) %>% summarize (prevalence = survey_mean (na.rm=T), N = survey_total (na.rm=T), prevalence_ci = survey_mean (na.rm=T, vartype = c ( "ci" ))) %>% mutate (X_STATE = as.character (X_STATE)) %>% left_join (state.id, by= c ( "X_STATE" = "VALUE" )) %>% mutate (X_STATE = as.character (X_STATE)) %>% left_join (ageg65yr.id %>% rename (X_AGEG5YR2=X_AGEG5YR), by= c ( "X_AGEG5YR" = "VALUE" )) BMI5CAT2.4 %>% filter (X_MRACE1a== "NHOPI" | X_MRACE1a== "White" | X_MRACE1a== "Asian" , X_AGEG5YR2!= "Dont know/Refused/Missing" ) %>% ggplot ( aes (X_AGEG5YR2, prevalence)) + geom_bar (stat = "identity" , aes (fill = X_BMI5CAT2a), col = "gray" , alpha = 0.7) + facet_grid (~X_MRACE1a) + scale_y_continuous (labels=scales::percent) + ylab ( "Prevalence (%)" ) + xlab ( "Age Group" ) + ggtitle ( "Computed body mass index categories (overweight or obese) in Hawaii by race and age" ) + theme (axis.text.x = element_text (angle = 90, hjust = 1), panel.background = element_rect (fill = 'white' )) + guides (fill = guide_legend (title = "Categories" )) |
Conclusion
Relying on high-level estimates alone can sometimes mislead your agenda. For example, after stratifying by race, we saw that the shared benefits of living in Hawaii were not equally shared by all groups. Further stratifying by age shows that onset by race occurred earlier for NHOPIs. These cuts of the data can help to narrow down public health efforts and hopefully prevent the usual spray-and-pray approach to solving problems.
This example only scratches the surface; there are other methods that can be explored. Analyzing data sets for public health such as BRFSS is more than just getting stuck on statistical models and debate on R vs. SAS vs. STATA. While it’s important that we continue to leverage statistical methods, analyzing data for public health is more than being a data monkey and computer babysitter. In recent years, with the onslaught of articles promoting “big data” and “data is power”, I’ve experienced this elitism both in the workplace and in various communities. In my wiser years, I’ve realized that power doesn’t come from hogging the ability to analyze the data, but from contributing to the discussion as a team-player.
Bonus: Download the code from this post!