

QUESTION

Predicting the Revenues of Movies Before They’re Released

IMDbot (bfoulon, xzhan199, __, __ )


 

Goal

Movie revenue matters a great deal to investors. Is it possible to predict a movie's revenue knowing only the cast and crew, in order to reduce investment risk? This project aims to predict a movie's revenue before it is released, based on IMDB data and Twitter sentiment analysis.

 

Data

Our data came from IMDB and Twitter. We used IMDB's dataset to get the title, year, genres, director, writer, actors, and runtime of movies released from 2009 to 2019. Since IMDB's dataset does not include revenue information, we used web scraping to collect it for each movie, and we removed movies that did not have gross revenue listed.

 

Tweets mentioning the director and writer of each movie were collected through Twitter's API via Python's Tweepy package. Sentiment analysis was then performed on those tweets using the TextBlob package. For each writer and director in our database, we recorded (1) the number of tweets and (2) the average sentiment score.
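The bookkeeping behind these two per-person features can be sketched as follows. The project used Tweepy to fetch tweets and TextBlob for polarity; since neither package can be assumed here, the tiny lexicon scorer below is a stand-in for TextBlob's polarity, and the tweets and word lists are invented for illustration.

```python
# Stand-in sentiment scoring for tweets about one director/writer.
# The real pipeline calls TextBlob(text).sentiment.polarity instead.
POSITIVE = {"great", "loved", "masterpiece"}
NEGATIVE = {"dull", "boring", "bad"}

def polarity(text):
    """Crude stand-in for TextBlob polarity; returns a value in [-1, 1]."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return max(-1.0, min(1.0, score / max(len(words), 1) * 5))

# invented example tweets mentioning one director
tweets_about_director = [
    "loved the trailer, looks great",
    "that last film was dull",
]

# the two features recorded per person: tweet count and average sentiment
n_tweets = len(tweets_about_director)
avg_sentiment = sum(polarity(t) for t in tweets_about_director) / n_tweets
print(n_tweets, round(avg_sentiment, 3))
```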

 

For preprocessing, we give each numerical column an extra binary column indicating whether the value is null. We divide string columns into category columns and simple string columns. Simple string columns are just tokenized into numbers. For each category column, we apply vectorization limited by a max_features parameter: values are first ordered by appearance frequency, and we then decide how many of the most frequent values to keep as binary indicator columns. For example, we treat genres as a category column: after analysing all genres, we keep the top 20 most frequent ones (here max_features = 20) and add a binary column for each, indicating whether the movie has that genre. Each movie thus gets 20 genre columns with values 1 or 0. For this preprocessing step, we had to explore which columns to use and which string columns should become category columns.
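The two preprocessing steps above (null-indicator columns and top-k category vectorizing) can be sketched in a few lines. The column names, sample rows, and max_features value here are illustrative, not the project's actual ones.

```python
# Sketch of the preprocessing: (1) null indicators for numeric columns,
# (2) binary columns for the top max_features values of a category column.
from collections import Counter

movies = [
    {"runtime": 120, "genres": ["Action", "Comedy"]},
    {"runtime": None, "genres": ["Drama"]},
    {"runtime": 95,  "genres": ["Comedy", "Drama"]},
]

# (1) each numeric column gets a binary "is null" companion column
for m in movies:
    m["runtime_is_null"] = 1 if m["runtime"] is None else 0
    if m["runtime"] is None:
        m["runtime"] = 0

# (2) order category values by frequency, keep only the top max_features
max_features = 2
counts = Counter(g for m in movies for g in m["genres"])
kept = [g for g, _ in counts.most_common(max_features)]
for m in movies:
    for g in kept:
        m[f"genre_{g}"] = 1 if g in m["genres"] else 0

print(kept, movies[0]["genre_Comedy"], movies[1]["runtime_is_null"])
```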

 

 

Model+Evaluation Setup

We compared several models, including Random Forest, Gradient Boosting, Decision Tree, and Ridge Regression. Random Forest and Gradient Boosting gave the best performance.

Random Forest consists of a large number of individual decision trees; each tree makes its own prediction, and the forest aggregates them (majority vote for classification, averaging for regression). Low correlation between trees is the key. The good prediction accuracy of Random Forest comes from the fact that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models: the trees protect each other from their individual errors. Even when some trees are wrong, others will be right, so as a group the trees tend to move in the correct direction.
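The committee effect described above can be demonstrated numerically without any real trees. In this toy sketch the "trees" are stand-ins, each predicting an invented true value plus independent noise; averaging the committee gives a much smaller error than a typical single member.

```python
# Averaging many weakly correlated noisy predictors beats any single one.
# The "trees" here are stand-ins: truth plus independent Gaussian noise.
import random

random.seed(0)
true_value = 100.0
n_trees = 200

# each "tree" makes an independently noisy prediction
tree_preds = [true_value + random.gauss(0, 20) for _ in range(n_trees)]

# typical error of a single tree vs. error of the averaged ensemble
single_error = sum(abs(p - true_value) for p in tree_preds) / n_trees
ensemble_error = abs(sum(tree_preds) / n_trees - true_value)

print(round(single_error, 2), round(ensemble_error, 2))
```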

Gradient Boosting builds the model in stages: at every step it computes the residuals of the current ensemble and fits the next weak learner to them, effectively giving more weight to the examples the model still gets wrong so that later learners focus on them.
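A minimal pure-Python sketch of this residual-fitting loop for regression with squared loss: each round fits a depth-1 "stump" to the current residuals and adds a damped copy of it to the ensemble. The data, stump learner, learning rate, and round count are all invented for illustration.

```python
# Toy gradient boosting for regression: each round fits a one-split
# stump to the residuals of the current ensemble prediction.

def fit_stump(x, residuals):
    """Find the threshold split minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, rounds=50, lr=0.1):
    pred = [sum(y) / len(y)] * len(y)   # F0: constant (mean) model
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, residuals)     # weak learner fit to residuals
        pred = [pi + lr * h(xi) for pi, xi in zip(pred, x)]
    return pred

# invented 1-D data with two revenue "levels"
x = [1, 2, 3, 4, 5, 6]
y = [3.0, 2.8, 3.1, 9.0, 9.2, 8.9]
pred = gradient_boost(x, y)
mse = sum((p - yi) ** 2 for p, yi in zip(pred, y)) / len(y)
print(round(mse, 4))
```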

We use k-fold cross-validation to split the data into training and test sets, which improves the consistency of our estimates.
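The k-fold split amounts to the following index bookkeeping; in practice a library helper such as scikit-learn's KFold does the same thing. This is a minimal hand-rolled version with made-up sizes.

```python
# Minimal k-fold cross-validation split: each of the k folds serves
# once as the test set while the remaining folds form the training set.
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs covering the data in k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds), folds[0][1])  # 5 folds; first test fold is [0, 1]
```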

Evaluation metrics: R-squared (correlation-based, our main reference), MSE (computed but not informative here, since the errors are huge and squaring makes them enormous), and MAE (the metric we mostly consider when discussing our error).

Our goal is predicting the revenue of unreleased movies. As we can see from our data, revenue ranges from 72 to 9.37E+08 (USA gross) and from almost zero (the lowest non-zero value in our data is 13) to 2.8E+09 (world gross); the spread between the lowest and highest values spans eight to nine orders of magnitude. In this case, MSE and MAE are huge and provide little information about how the model performs. By contrast, R-squared is a normalized version of MSE, which means it does not depend on the scale of the data. Thus, R-squared provides a good reference for evaluating the performance of our model.
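The scale effect described above is easy to see by computing the three metrics by hand on revenue-scale numbers (scikit-learn's `sklearn.metrics` offers equivalents). The toy true/predicted values below are invented: errors of around 1e7 give an astronomically large MSE even when R-squared is high.

```python
# Hand-rolled R-squared, MSE, and MAE on revenue-scale toy values.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # 1 minus normalized squared error

# invented revenues: absolute errors up to 5e7 still track the trend well
y_true = [1e6, 5e7, 3e8, 9e8]
y_pred = [2e6, 6e7, 2.5e8, 8.5e8]
print(r_squared(y_true, y_pred), mse(y_true, y_pred), mae(y_true, y_pred))
```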

Evaluation also includes plotting the predictions to examine trends, and dividing the data by feature values to see where the errors concentrate.

 

 

model_names                      categories   rsquare          mse           mae
Random_Forest                    wdcag         0.5537235057    3.19749E+15   14601515.35
Gradient_Boosting                wdcag         0.5516677244    3.19828E+15   15811718.01
K_Nearest_neighbors_ball_tree_5  wdcag        -0.1917098133    8.31445E+15   26877886.23
Decision_tree                    wadcg         0.1065421454    6.17054E+15   17768505.85
KernelRidge                      ag            0.3610958389    4.55574E+15   25288269.56

 

 

Results and Analysis


 

 

Claim #1: Using 30 as max_features for vectorizing gives the best balance between R-squared and mean absolute error.

Support for Claim #1: After fixing the category set to "wdcag", we tested different max_features values for vectorizing the category columns. The figure below shows that increasing max_features lowers the mean absolute error, but R-squared starts decreasing after 20. We therefore choose 30, which balances R-squared and MAE, and use it for the remaining claims.

 

Claim #2: Genres play an important role in movie revenues.

Support for Claim #2: We train the model with different combinations of categories while controlling the other variables. Comparing the results, we can see that category sets including genres (g in the graph) boost our prediction's R-squared value. Also, treating all of writers, directors, country, actors, and genres as category columns gives the best prediction result.

 

 

Claim #3: The prediction model works better for the adventure, sci-fi, action, animation, and comedy genres.

Support for Claim #3: To take a deeper look at how genre affects the predictions, we used the same settings to train separate models on the data restricted to each genre.

Our results show that for the adventure, action, comedy, animation, and sci-fi genres, our models perform better than for the others, suggesting that the predictions track real gross revenue more closely for these genres. However, it is important to note that some genres have small sample sizes, which may be why our model performs badly on them.

 

 

Claim #4: As the world gross increases, our model's absolute prediction error also increases.

Support for Claim #4: We plot the world gross in increasing order against the absolute error between the prediction and the real gross. Generally, as the gross gets bigger, our model tends to make bigger errors. However, the model still captures the basic trend across the revenue range, meaning we can usually predict correctly whether a movie will be high- or low-revenue.

 

Claim #5: If we train and test only for USA movies, our model gives better results.

Support for Claim #5: We trained and tested on each country's data separately. The USA is the only country that gives a higher R-squared value (0.626) than training on all movies. The US also has the largest number of movies from 2009 to 2019 in our records: 4490. Moreover, if we train and test on US data restricted to the adventure, sci-fi, action, animation, and comedy genres, the model achieves an even higher R-squared value (0.631).

 

 

The MAE numbers are also larger for the high-R-squared genres (except for comedy).

There also seems to be a threshold at ~1e9 after which the error increases, which is worth commenting on.

ANSWER

Analysis and Interpretation of Movie Revenue Prediction Results

In this project, the goal was to predict the revenue of movies before their release by leveraging data from IMDB and Twitter. The dataset included information such as movie titles, genres, director, writer, actors, and runtime, along with the movie’s revenue obtained through web scraping. The Twitter API was used to collect tweets mentioning the director and writer of each movie, and sentiment analysis was performed on those tweets. Various models, including Random Forest, Gradient Boosting, Decision Tree, and Ridge Regression, were employed for revenue prediction, and evaluation was done using metrics such as R-squared, mean squared error (MSE), and mean absolute error (MAE).

The results obtained from the models provide insights into the predictive capabilities of the selected features and models. Here are the key claims and observations drawn from the analysis:

Claim #1: Using 30 as max features for vectorizing gives the best balance between R-squared and mean absolute error (MAE).

Support for Claim #1: By testing different values for the maximum number of features used in vectorizing the category columns, it was observed that increasing the number of max features reduced MAE. However, R-squared started to decrease after 20 max features. Therefore, a value of 30 was chosen as it struck a balance between R-squared and MAE, resulting in the best performance for the model.

Claim #2: Genres play an important role in movie revenues.

Support for Claim #2: When comparing different combinations of categories while controlling other variables, it was found that including genres (g in the graph) significantly improved the prediction R-squared value. Treating writers, directors, country, actors, and genres as category columns provided the best prediction results.

Claim #3: Prediction model works better with adventure, sci-fi, action, animation, and comedy genres.

Support for Claim #3: Further analysis was conducted to examine the impact of different genres on the prediction models. The models were trained using different genre categories, and it was observed that the prediction models performed better for adventure, action, comedy, animation, and sci-fi genres compared to others. This suggests that the prediction results can establish a stronger correlation with the real gross revenue for movies belonging to these genres. However, it is important to note that some genres may have smaller sample sizes, which can affect the model’s performance.

Claim #4: As the world gross revenue increases, the model’s prediction absolute error also increases.

Support for Claim #4: To explore the relationship between world gross revenue and prediction errors, a plot was created with increasing world gross values on the x-axis and the absolute error between the predictions and real gross revenue on the y-axis. It was observed that, generally, as the gross revenue increased, the model tended to have larger errors. However, the model was still able to capture the basic trend for the movie range, indicating that it could successfully classify movies as high or low revenue.

Claim #5: Training and testing the model exclusively on USA movies yields better results.

Support for Claim #5: The model was trained and tested on data specific to each country, and it was found that training and testing on USA movies alone resulted in a higher R-squared value (0.626) compared to the overall dataset. Moreover, the USA had the largest number of movies in the dataset (4490 movies). Additionally, when training and testing were performed on USA movies belonging to the adventure, sci-fi, action, animation, and comedy genres, the model exhibited an even higher R-squared value (0.631).

Furthermore, it is worth noting that the MAE numbers were larger for the high R-squared predictions, except for the comedy genre. Additionally, there seems to be a threshold at around 1e9 (1 billion) in movie revenue, after which the prediction error tends to increase. This observation suggests that the model may struggle to accurately predict revenues for movies with exceptionally high gross amounts.

In conclusion, the analysis of movie revenue prediction based on IMDB data and Twitter sentiment analysis provided valuable insights into the factors influencing movie revenues. The results emphasized the importance of genres, with certain genres such as adventure, sci-fi, action, animation, and comedy showing a stronger correlation with revenue. Training and testing the model specifically on USA movies yielded better prediction results, indicating the influence of regional factors. However, it is essential to consider the limitations of the dataset and the potential impact of sample sizes on the model’s performance.

 
