Summary of analysis for factors that influence IMDB ratings


Summary of analysis

Population

Films in the IMDB film database

Objective

This project uses a dataset from the IMDB film database, with 5 explanatory variables and the IMDB rating of each film, to investigate which of these factors affect whether a film's rating is above 7.

Variables

The films are rated from 0 to 10, and the related properties in the dataset are:

  • Year of release of the film in cinemas (discrete variable)
  • Length of the film in minutes (discrete variable)
  • Budget for the film's production, in units of one hundred thousand dollars (continuous variable)
  • Number of positive votes received from viewers (discrete variable)
  • Genre of the film (factor variable) including Action, Animation, Comedy, Documentary, Drama, Romance and Short

Missing data procedures

There are 131 films in this dataset without length information. That is about 5% of all observations, which is small relative to the sample size, so I chose to drop them.
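The dropping step can be sketched as follows. This is only a minimal illustration, assuming each film is stored as a dict and a missing length is recorded as None; the field names and the toy rows are hypothetical and are not the real data.

```python
def drop_missing(rows, field):
    """Return the rows where `field` is present, plus the fraction dropped."""
    kept = [r for r in rows if r.get(field) is not None]
    dropped_fraction = (len(rows) - len(kept)) / len(rows) if rows else 0.0
    return kept, dropped_fraction

# Hypothetical toy rows, only to show the mechanics.
films = [
    {"length": 95, "budget": 12.0},
    {"length": None, "budget": 8.5},
    {"length": 120, "budget": 20.1},
    {"length": 88, "budget": 5.0},
]

kept, frac = drop_missing(films, "length")
print(len(kept), "rows kept;", frac, "of the rows dropped")
```

Checking the dropped fraction before deleting rows is what justifies the decision: a small fraction, as here, costs little information.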

Models fitted and chosen

A logistic regression model is used for this analysis.

The response variable is whether a film's rating is above 7, and the five properties listed above are used as explanatory variables.
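To make the setup concrete, here is a small sketch of how the binary response and the logistic link fit together. The function names are illustrative and do not come from the original analysis.

```python
import math

def above_seven(rating):
    """Binary response: 1 if the rating is strictly above 7, else 0."""
    return 1 if rating > 7 else 0

def logistic(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

print(above_seven(7.0), above_seven(8.2))
print(logistic(0.0))
```

A linear predictor of 0 corresponds to odds of 1, that is, a probability of 0.5.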

Genre, which is a factor variable, works a little differently from the other 4 variables. Each genre is compared with a baseline genre, so the model only has a distinct coefficient for genres whose influence differs significantly from the baseline.

The significant variables are chosen first by looking at the significance tests for the coefficients. I used a significance level of 5% for all of the tests in this analysis; the level written on the poster is a mistake.

Then we considered AIC in a backward stepwise procedure, dropping any variable whose removal reduces the AIC of the model.
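One backward step can be sketched as below, using the standard definition AIC = 2k - 2 log L. The AIC values are hypothetical placeholders, except that the gap of 11.8 for year is the figure reported in this analysis; `best_drop` is a helper name made up for the illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood

def best_drop(full_aic, aic_without):
    """Return the variable whose removal gives the lowest AIC,
    or None if no removal improves on the full model."""
    name, value = min(aic_without.items(), key=lambda kv: kv[1])
    return name if value < full_aic else None

full_aic = 1000.0                    # hypothetical AIC of the full model
aic_without = {
    "votes": 998.0,                  # hypothetical: removal lowers the AIC
    "year": full_aic + 11.8,         # removal raises the AIC by 11.8
}
print(best_drop(full_aic, aic_without))
```

The step is repeated on the reduced model until no removal lowers the AIC any further.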

The coefficients of votes and of two genres, namely Animation and Romance, are not significant.

The model without votes has the smallest AIC. Dropping the variable year increases the AIC by only 11.8, which is much smaller than the effect of dropping any of the other variables.

So, for the second model we remove votes, and merge the genres Animation and Romance into the baseline genre, Action.
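Merging genres into the baseline amounts to a simple recoding of the factor levels. A minimal sketch, with constant and function names chosen only for this example:

```python
BASELINE = "Action"
MERGE_INTO_BASELINE = {"Animation", "Romance"}

def merge_genre(genre):
    """Recode Animation and Romance as the baseline genre; leave others alone."""
    return BASELINE if genre in MERGE_INTO_BASELINE else genre

genres = ["Action", "Animation", "Comedy", "Documentary", "Drama", "Romance", "Short"]
print(sorted(set(merge_genre(g) for g in genres)))
```

After recoding, the factor has five levels instead of seven, so the model estimates two fewer genre coefficients.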

We use a likelihood ratio chi-square test on the second model. Terms are added sequentially, from first to last, to check which ones do not significantly improve the model's fit.

The outcome is that year is not significant, so we remove year for the next model. We then use the likelihood ratio chi-square test again to check that model. All terms are significant, so that is our final model.
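For a term that adds a single parameter, such as year, the likelihood ratio test can be sketched with the standard library alone. This is an illustration under that one-degree-of-freedom assumption (a factor like genre adds several parameters and needs the general chi-square distribution); the log-likelihood values are hypothetical.

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for one extra parameter.

    The statistic is 2 * (logL_full - logL_reduced). Under the null it is
    chi-square with 1 degree of freedom, whose upper tail probability
    equals erfc(sqrt(stat / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, only to show the call.
stat, p = lr_test_1df(loglik_reduced=-101.0, loglik_full=-100.0)
print(stat, p, "significant at 5%" if p < 0.05 else "not significant at 5%")
```

A term is kept when its p-value falls below the 5% level used throughout this analysis.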

Finally, we use the Durbin-Watson test to check the independence of the residuals. The outcome is that there is no significant evidence that the residuals are not independent.
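The Durbin-Watson statistic itself is easy to compute from the residuals. The sketch below shows only the statistic, not the p-value of the test, and the residual sequences are toy inputs.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares.

    The value lies between 0 and 4; values near 2 suggest no first-order
    autocorrelation in the residuals.
    """
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # strongly alternating residuals
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))    # perfectly persistent residuals
```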

Interpretation

The final model is as presented.

About the interpretation, there is another mistake on my poster. The table shown is actually the mean probability of a rating above 7, and the mean odds, for the different genres when the length and the budget of the film are both 0, which is not possible in the real world. Perhaps the mean probability for films with mean length and mean budget would be better.
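The difference between the two versions of the table can be sketched as follows. The slopes are derived from the odds changes reported in this analysis (+5.5% per unit of length, -39.5% per unit of budget); the intercept, the genre effect and the mean values are hypothetical placeholders, since the fitted values are only shown on the poster.

```python
import math

def prob_above_seven(intercept, genre_effect, b_length, b_budget, length, budget):
    """Predicted probability from a logistic model with one genre term."""
    eta = intercept + genre_effect + b_length * length + b_budget * budget
    return 1.0 / (1.0 + math.exp(-eta))

B_LENGTH = math.log(1.055)       # +5.5% odds per unit of length
B_BUDGET = math.log(1 - 0.395)   # -39.5% odds per unit of budget
INTERCEPT = -1.0                 # hypothetical
GENRE_EFFECT = 0.0               # hypothetical: the baseline genre

# What the poster table shows: length and budget both set to 0.
at_zero = prob_above_seven(INTERCEPT, GENRE_EFFECT, B_LENGTH, B_BUDGET, 0, 0)
# The suggested alternative: evaluate at the sample means (hypothetical values).
at_mean = prob_above_seven(INTERCEPT, GENRE_EFFECT, B_LENGTH, B_BUDGET, 80, 12)
print(at_zero, at_mean)
```

At length 0 and budget 0 the prediction depends only on the intercept and the genre effect, which is why the poster table describes a film that cannot exist.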

From the table, it is clear that, on average, Drama, Action, Animation and Romance films have a fairly high probability of rating above 7. Comedies are less likely to have ratings above 7, while Documentaries and Short films have a relatively low probability of rating above 7.

As for the remaining variables, holding all other explanatory variables fixed, a one unit increase in length increases the odds of a rating above 7 by 5.5%, while a one unit increase in budget decreases the odds by 39.5%.

That means the longer the film, the higher the probability of rating above 7, and the lower the budget, the higher the probability of rating above 7.
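These percentages come from exponentiating the coefficients: a coefficient beta changes the odds by (exp(beta) - 1) × 100 percent per unit. A short sketch of the conversion in both directions, with helper names invented for this example:

```python
import math

def pct_change_in_odds(beta):
    """Percentage change in the odds for a one unit increase."""
    return (math.exp(beta) - 1.0) * 100.0

def beta_from_pct(pct):
    """Coefficient implied by a given percentage change in the odds."""
    return math.log(1.0 + pct / 100.0)

beta_length = beta_from_pct(5.5)     # positive coefficient
beta_budget = beta_from_pct(-39.5)   # negative coefficient
print(beta_length, beta_budget)

# The effect compounds multiplicatively: 10 extra units of length
# multiply the odds by 1.055 ** 10, not by 1 + 10 * 0.055.
print(1.055 ** 10)
```

Note that the change applies to the odds, not directly to the probability, so the same coefficient moves the probability by different amounts depending on where a film starts.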

Most of my references are notes from last semester, so I did not write a reference list. I should have found more formal references.

Further work

  • In this work I did not check interaction terms, so whether the effect of one variable changes with the values of the other variables remains to be explored.
  • I did not explore whether there is collinearity, as there are only 5 variables. This may need more exploration.
  • Other link functions, such as probit, could be tried for this analysis.
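As a pointer for the last item, the sketch below compares the inverse logit link used here with the inverse probit link, which is the standard normal CDF. It only shows the two link functions and does not refit the model.

```python
import math

def inverse_logit(eta):
    """Logistic CDF: the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def inverse_probit(eta):
    """Standard normal CDF: the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

for eta in (-2.0, 0.0, 2.0):
    print(eta, inverse_logit(eta), inverse_probit(eta))
```

Both links map a linear predictor of 0 to a probability of 0.5, but the coefficients are on different scales, so probit coefficients cannot be read as log-odds.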

Possible Questions

  • Why does a higher budget lead to a lower mean probability of rating above 7?
    I guess it is because a higher budget leads to more votes, as more people watch these films, and the variety among these votes is large.
    Take films with a lower budget: I think most of their viewers watch them because they are genuinely attracted to them, whereas a person might watch a film just because its budget is really high.
  • Why are Animation, Romance and Action similar in rating?
    I think most people like these kinds of films. They are three kinds of film that easily satisfy most people. There might be some common features among them, some inner connection we do not know about.
posted @ 2021-10-08 09:32  ZZN而已