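Advanced statistical methods
Notes from an elective statistics course: Advanced statistical methods.
A tool-oriented course, light on theory and heavy on practice. After finishing it you can analyze data directly in R.
Course outline: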
- Introduction to R
- Regression model in R
- Applied regression I
- Applied regression II
- Applied regression III
- Conditional logistic regression and propensity score method
- Inverse probability weighting and meta analysis
- Instrumental variable analysis
Course assessment criteria:
- Appropriate analytic method
- Accurate numerical results
- Clear presentation of the results and choice of methods
- Interpretation of the results relevant to the public health context
1 Introduction to R
- Use R to perform basic algebraic operations
- Work with variables, vectors and matrices in R
- Produce clear and well formatted graphs in R
- Install and load R packages for specific needs
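Basic data operations
Basic operators: + - * / ^
Basic math functions: sqrt, exp, log, abs, round
Help: ?, ??
Basic data functions: rep, seq, length, sum, mean, sd, median, min, max, var, sort, order, which, summary, sample, runif
Matrix operations: %*%, solve, t, colSums, colMeans, dim, cbind, rbind
Logical operators: ! & |
Checks: is.na, is.factor
Type conversion: as.factor
Data reshaping: aggregate, the plyr package, melt, table, prop.table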
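A few of the functions listed above, shown on small made-up vectors. This is only a minimal sketch; all values and variable names are illustrative.

```r
# Vectors and simple summaries
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)             # number of elements
total <- sum(x)            # sum of all elements
avg <- mean(x)             # arithmetic mean
idx <- which(x > 15)       # positions of elements greater than 15

# Sequences and repetition
s <- seq(0, 1, by = 0.25)
r <- rep(c("a", "b"), times = 2)

# Matrices: bind columns, multiply, invert
A <- cbind(c(2, 0), c(1, 3))   # 2 x 2 matrix built column by column
A_inv <- solve(A)              # matrix inverse
I2 <- A %*% A_inv              # identity matrix, up to rounding error

# Missing values and factors
y <- c(1, NA, 3)
n_missing <- sum(is.na(y))     # count of missing values
g <- as.factor(c("low", "high", "low"))
tab <- table(g)                # counts per level
prop <- prop.table(tab)        # proportions per level
```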
Reading files
Writing files
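The notes do not say which functions the course used for file input and output. A common choice, assumed here, is write.csv and read.csv; the sketch below round-trips a small made-up data frame through a temporary file.

```r
# A small illustrative data frame (values are made up)
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Write to a temporary CSV file, without row names
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Read it back and inspect the structure
df2 <- read.csv(path)
str(df2)

# Clean up the temporary file
unlink(path)
```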
Plotting
Basic plotting
plot
pairs
hist
boxplot
points
lines
text
abline
polygon
legend
title
axis
par
windows
layout
dev.off
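A minimal sketch that combines several of the base graphics functions above. The data are simulated, and the figure is written to a temporary PDF (an assumption made so the example also runs without a screen device).

```r
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.8)

out <- tempfile(fileext = ".pdf")
pdf(out, width = 8, height = 4)      # open a PDF graphics device
par(mfrow = c(1, 2))                 # two panels side by side

# Panel 1: scatter plot with a reference line and a legend
plot(x, y, pch = 16, col = "grey40", xlab = "x", ylab = "y")
abline(a = 2, b = 0.5, col = "red", lwd = 2)
title(main = "Scatter with reference line")
legend("topleft", legend = c("data", "y = 2 + 0.5x"),
       col = c("grey40", "red"), pch = c(16, NA), lty = c(NA, 1), bty = "n")

# Panel 2: histogram of the response
hist(y, main = "Histogram of y", xlab = "y")

dev.off()                            # close the device and flush the file
```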
Advanced plotting
ggplot2
cowplot
2 Regression model in R
Generating random data:
runif - uniform distribution
rbinom
rnorm
sample
set.seed
cut
factor
relevel
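The functions above are typically combined to simulate a data set and prepare categorical predictors. A short sketch, with made-up parameters and variable names:

```r
set.seed(42)                          # make the random draws reproducible
n <- 200

u <- runif(n, min = 0, max = 1)       # uniform on [0, 1]
age <- round(rnorm(n, mean = 50, sd = 10))   # normal, rounded to whole years
treated <- rbinom(n, size = 1, prob = 0.3)   # binary indicator
grp <- sample(c("A", "B", "C"), size = n, replace = TRUE)  # random labels

# Turn a continuous variable into ordered categories
# (intervals are right-closed by default: (-Inf, 40], (40, 60], (60, Inf))
age_grp <- cut(age, breaks = c(-Inf, 40, 60, Inf),
               labels = c("<=40", "41-60", ">60"))

# Declare a factor, then change its reference level for regression
grp <- factor(grp, levels = c("A", "B", "C"))
grp_ref_b <- relevel(grp, ref = "B")
levels(grp_ref_b)                     # "B" is now the first (reference) level
```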
Linear regression
simple linear regression
multiple linear regression
interactions
lm
summary
Residuals - the difference between the actual observed response values and the values predicted by the model
Coefficients
[You must understand every statistic in the summary output and what it means]
CI
confint
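The pieces above (simple and multiple regression, interactions, summary, confint) fit together roughly as follows. The data are simulated, so the coefficients and variable names are purely illustrative.

```r
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y <- 1 + 2 * x1 - 1 * x2 + 0.5 * x1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

fit_simple <- lm(y ~ x1, data = dat)        # simple linear regression
fit_multi  <- lm(y ~ x1 + x2, data = dat)   # multiple linear regression
fit_inter  <- lm(y ~ x1 * x2, data = dat)   # main effects plus interaction

s <- summary(fit_inter)
s$coefficients     # estimate, standard error, t value, p-value per term
s$r.squared        # proportion of variance explained
s$sigma            # residual standard error

# 95% confidence intervals for the coefficients
ci <- confint(fit_inter, level = 0.95)

# Residuals are observed minus fitted values
res_check <- all.equal(unname(residuals(fit_inter)),
                       unname(dat$y - fitted(fit_inter)))
```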
QUICK GUIDE: INTERPRETING SIMPLE LINEAR MODEL OUTPUT IN R
3 Applied regression I
Use a model suited to the data at hand
- Apply Poisson and negative binomial regression models to count data
- Identify and apply suitable model to overdispersed data
count data
- Nonnegative
- positively skewed
- Variance tends to increase with mean
- Violates the homoscedasticity and normality assumptions
Generalized Linear Model (GLM)
maximum likelihood
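Regressing on 1 looks odd at first: summary(glm(deaths ~ 1, data=horse, family=poisson)) fits an intercept-only model, so exp(intercept) is the estimated mean count.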
Dispersion parameter for poisson family taken to be 1
Interpreting the glm summary output
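To see what the intercept-only Poisson model estimates, here is a small sketch. The course's horse data set is not reproduced in these notes, so the counts below are simulated from a Poisson(0.61) distribution as a stand-in, and horse_sim is a hypothetical name.

```r
set.seed(2024)
# Simulated stand-in for the course data (NOT the real horse data set)
horse_sim <- data.frame(deaths = rpois(200, lambda = 0.61))

# Intercept-only Poisson regression: log(mean) = beta0
fit0 <- glm(deaths ~ 1, data = horse_sim, family = poisson)

# exp(beta0) reproduces the sample mean of the counts
rate_hat <- exp(coef(fit0))[["(Intercept)"]]
sample_mean <- mean(horse_sim$deaths)

# For the Poisson family the dispersion parameter is fixed at 1,
# which is what the summary line "taken to be 1" refers to
disp <- summary(fit0)$dispersion
```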
Model checking
compare the observed event counts to data that we might have expected, under a Poisson(0.61) model
Formal model goodness-of-fit
residual deviance/df should not be too much bigger than 1
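Both checks above can be sketched in a few lines. Again the counts are simulated rather than taken from the course data, and the deviance-based test is only approximate for small counts.

```r
set.seed(2024)
deaths <- rpois(200, 0.61)            # simulated stand-in counts
fit0 <- glm(deaths ~ 1, family = poisson)
lambda_hat <- exp(coef(fit0))[[1]]

# Informal check: observed frequencies vs. those expected under Poisson(lambda_hat)
k <- 0:max(deaths)
observed <- as.vector(table(factor(deaths, levels = k)))
expected <- length(deaths) * dpois(k, lambda_hat)
check <- data.frame(k, observed, expected = round(expected, 1))
check

# Formal check: residual deviance relative to its degrees of freedom
ratio <- deviance(fit0) / df.residual(fit0)   # should not be much bigger than 1
p_gof <- pchisq(deviance(fit0), df.residual(fit0), lower.tail = FALSE)
```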
A Poisson model with covariates in R
summary(glm(deaths~corps, data=horse, family=poisson))
Incidence rate ratios (IRR) / relative risks
Poisson regression with offsets
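A hedged sketch of a Poisson model with an offset and its incidence rate ratios. The data are simulated with a true IRR of 1.5, the variable names (exposed, person_years, events) are hypothetical, and the intervals are Wald intervals from confint.default rather than profile-likelihood intervals.

```r
set.seed(7)
n <- 300
exposed <- rbinom(n, 1, 0.5)
person_years <- runif(n, 50, 500)               # follow-up time per unit
true_rate <- 0.02 * exp(log(1.5) * exposed)     # true IRR = 1.5
events <- rpois(n, true_rate * person_years)

# The offset log(person_years) makes the model describe rates, not raw counts:
# log(E[events]) = log(person_years) + b0 + b1 * exposed
fit <- glm(events ~ exposed + offset(log(person_years)), family = poisson)

# Exponentiated coefficients are incidence rate ratios
irr <- exp(coef(fit))
irr_ci <- exp(confint.default(fit))             # Wald 95% intervals
cbind(IRR = irr, irr_ci)
```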
Overdispersion - Negative Binomial model
the variance (823.475) is much larger than the mean (28.41)
summary(glm.nb(y~1, data=epilepsy))
Comparing models
A lower AIC indicates a ‘better’ model
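The overdispersion check and the AIC comparison might look like the sketch below. The epilepsy data are not included in these notes, so overdispersed counts are simulated with rnbinom; glm.nb comes from the MASS package.

```r
library(MASS)   # provides glm.nb

set.seed(11)
# Simulated overdispersed counts: variance = mu + mu^2 / size
y <- rnbinom(300, mu = 28, size = 1)
c(mean = mean(y), variance = var(y))   # variance far exceeds the mean

fit_pois <- glm(y ~ 1, family = poisson)   # assumes variance = mean
fit_nb   <- glm.nb(y ~ 1)                  # estimates extra dispersion (theta)

# Compare the two fits; the lower AIC indicates the better-supported model
AIC(fit_pois, fit_nb)
fit_nb$theta                               # estimated dispersion parameter
```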
4 Applied regression II
- Apply Poisson and negative binomial regression models to count data
- Identify and apply suitable model to overdispersed data
- Identify influential observations (how much the fit changes when a given point is removed)
- Perform model diagnostics
- Understand and deal with multicollinearity
hatvalues(mvc.r.lm)
sort(round(cooks.distance(mvc.r.lm),2), decreasing=T)
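The model object mvc.r.lm above comes from the course data, which are not shown here. The sketch below applies the same two functions to simulated data in which one observation is deliberately made influential; the 2 * mean(h) cutoff is a common rule of thumb, not something stated in the course notes.

```r
set.seed(5)
x <- c(rnorm(29), 6)            # observation 30 has an extreme x value
y <- 1 + 2 * x + rnorm(30)
y[30] <- y[30] - 8              # and it is pulled away from the true line

fit <- lm(y ~ x)

h <- hatvalues(fit)             # leverage of each observation
d <- cooks.distance(fit)        # influence of each observation on the fit
head(sort(round(d, 2), decreasing = TRUE))

# Rule of thumb (an assumption here): leverage above twice the average
high_lev <- which(h > 2 * mean(h))

# Refit without the suspicious point to see how much the coefficients move
fit_drop <- lm(y ~ x, subset = -30)
rbind(all = coef(fit), without_30 = coef(fit_drop))
```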
Model diagnostics
Estimation method and statistical tests are based on model assumptions
- potential violated assumptions
- extent of violation
- Acknowledge limitation
- alternative statistical model
Assumptions of linear regression model
- Linearity
- Homoscedasticity
- Normality of the errors
- Independence
Residual plot against fitted values
Q-Q Plot
P-P Plot
ACF plot
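The four diagnostic plots above can be produced with base R as sketched below. The data are simulated, the plots go to a temporary PDF, and the P-P plot is built by hand from ppoints and pnorm since base R has no dedicated function for it.

```r
set.seed(9)
x <- runif(100, 0, 10)
y <- 3 + 1.5 * x + rnorm(100)
fit <- lm(y ~ x)
res <- residuals(fit)

out <- tempfile(fileext = ".pdf")
pdf(out, width = 8, height = 8)
par(mfrow = c(2, 2))

# Linearity and homoscedasticity: residuals vs. fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: Q-Q plot of the residuals
qqnorm(res)
qqline(res)

# Normality: P-P plot (empirical vs. theoretical cumulative probabilities)
plot(ppoints(length(res)), pnorm(sort(res), mean = 0, sd = sd(res)),
     xlab = "Empirical probability", ylab = "Theoretical probability",
     main = "P-P plot")
abline(0, 1, lty = 2)

# Independence: autocorrelation function of the residuals
acf(res, main = "ACF of residuals")

dev.off()
```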
Multicollinearity
VIF
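The variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The course may well have used a package function such as car::vif; the sketch below computes it by hand on simulated data so that it needs no extra packages. vif_manual is a hypothetical helper name.

```r
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)    # nearly collinear with x1
x3 <- rnorm(n)                   # unrelated to the others
y <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(dat, c("x1", "x2", "x3"))
round(vifs, 1)    # x1 and x2 are inflated, x3 stays close to 1
```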
5 Applied regression III
- Identify and handle multicollinearity
- Account for confounding factors in regression model
- Assess potential effect modifiers in regression model
- Perform basic mediation analysis
6 Conditional logistic regression and propensity score method
- Fit conditional logistic regression model to data from case control study
- Understand the assumptions of the propensity score method
- Interpret results from propensity score method
7 Inverse probability weighting and meta analysis
- Appreciate the use of inverse probability weighting
- Apply inverse probability weighting for analysis of missing data
- Perform meta analysis to obtain overall estimate of an intervention effect from multiple studies
8 Instrumental variable analysis
- Estimate treatment effect using instrumental variable analysis for noncontrolled experiment
- Understand the assumptions of instrumental variable analysis
- Interpret results from instrumental variable analysis
Key concepts:
RR
OR and β (estimated coefficients): in logistic regression, OR = exp(β)
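A small worked example of these quantities, using a made-up 2 x 2 table (30 of 100 exposed and 15 of 100 unexposed subjects have the event). It also shows that the exponentiated logistic regression coefficient equals the odds ratio from the table.

```r
# Made-up 2 x 2 table: rows = exposure, columns = outcome
tab <- matrix(c(30, 70,
                15, 85),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome = c("event", "no_event")))

# Relative risk: ratio of event probabilities
risk <- tab[, "event"] / rowSums(tab)
rr <- risk[["exposed"]] / risk[["unexposed"]]

# Odds ratio: ratio of event odds (cross-product ratio)
or <- (tab["exposed", "event"] * tab["unexposed", "no_event"]) /
      (tab["exposed", "no_event"] * tab["unexposed", "event"])

# The same odds ratio from logistic regression: OR = exp(beta)
dat <- data.frame(exposed = c(1, 1, 0, 0),
                  event   = c(1, 0, 1, 0),
                  n       = c(30, 70, 15, 85))
fit <- glm(event ~ exposed, family = binomial, weights = n, data = dat)
or_glm <- exp(coef(fit))[["exposed"]]

c(RR = rr, OR = or, OR_from_glm = or_glm)
```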
Final exam
An investigator conducted a retrospective analysis on the association between statin therapy and psychological disorders, based on a database of medical records. The analysis adjusted for potential confounders such as age, sex, BMI and comorbidity.
Variable names
- Id
- Male
- Age
- Bmi
- comorbid.s, Charlson comorbidity index
- Statin, Statin users
- Psych, Psychological disorder
id  male  age  bmi   comorbid.s  statin  psych
1   0     54   20.9  1           0       0
2   0     42   19.1  0           0       0
3   1     46   23.9  1           1       0
4   1     58   23.5  0           0       1
5   1     43   28.7  1           1       0
6   1     46   26.6  0           1       0
Questions:
(A) Carry out a standard regression analysis to estimate the effect of statin therapy on psychological disorder, adjusting for sex, age, BMI and comorbidity. Present the odds ratios with 95% confidence intervals for the variables in a table. [10%] Note: use a standard regression model (logistic regression, since the outcome is binary and odds ratios are requested).
The investigator also decided to carry out a propensity score analysis. Note: for the propensity score analysis (PSA), refer to Assignment 2.
(B) Fit a propensity score model to predict statin use. You may consider main effects only (even when not all patient characteristics can be satisfactorily balanced). Present and interpret the model results. [8%]
(C) Based on your propensity score model, how well the patient characteristics were balanced across statin users and non-users with similar propensity scores? [6%]
(D) State the key assumptions of propensity score analysis and assess if they are satisfied. [6%]
(E) Do you think it is appropriate to use propensity score analysis in this setting? Briefly explain why. [4%]
(F) Estimate the effect of statin therapy (and the corresponding 95% CI) on psychological disorder and compare with the results in (A). [8%]
(G) Based on the results in (A) - (F), summarize and interpret the main findings from the analyses. [8%]
Approach to solving the problem:
1. Candidate models: standard linear regression; GLM (Poisson, NB); clogit, etc.
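One possible workflow for parts (A), (B), (C) and (F) is sketched below. Everything here is an assumption rather than the course's model answer: the data are simulated with the exam's variable names, the propensity score is used through stratification into quintiles (matching or weighting would be alternatives), and the intervals are Wald intervals.

```r
set.seed(2020)
n <- 2000
# Simulated data using the exam's variable names (NOT the real records)
male <- rbinom(n, 1, 0.5)
age <- round(rnorm(n, 50, 8))
bmi <- round(rnorm(n, 24, 3), 1)
comorbid.s <- rpois(n, 0.8)
statin <- rbinom(n, 1, plogis(-3 + 0.03 * age + 0.05 * bmi +
                              0.4 * comorbid.s + 0.2 * male))
psych <- rbinom(n, 1, plogis(-2 + 0.01 * age + 0.3 * comorbid.s + 0.3 * statin))
dat <- data.frame(id = seq_len(n), male, age, bmi, comorbid.s, statin, psych)

# (A) Adjusted logistic regression, odds ratios with Wald 95% CIs
fit_a <- glm(psych ~ statin + male + age + bmi + comorbid.s,
             family = binomial, data = dat)
or_table <- round(cbind(OR = exp(coef(fit_a)), exp(confint.default(fit_a))), 2)
or_table

# (B) Propensity score model: probability of statin use given covariates
ps_model <- glm(statin ~ male + age + bmi + comorbid.s,
                family = binomial, data = dat)
dat$ps <- fitted(ps_model)

# (C) Balance check: covariate means by statin use within PS quintiles
dat$ps_q <- cut(dat$ps, breaks = quantile(dat$ps, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE, labels = 1:5)
balance <- aggregate(cbind(male, age, bmi, comorbid.s) ~ statin + ps_q,
                     data = dat, FUN = mean)
balance

# (F) Statin effect adjusted for PS stratum, to compare with (A)
fit_f <- glm(psych ~ statin + ps_q, family = binomial, data = dat)
or_ps <- exp(c(OR = coef(fit_f)[["statin"]],
               confint.default(fit_f)["statin", ]))
or_ps
```

Comparing the statin row of or_table with or_ps shows whether the two adjustment strategies lead to similar conclusions.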