
My Video Game Data Analysis
Introduction
In this post, I present my exploratory data analysis and descriptive models of video game sales data using machine learning techniques. I hope you enjoy it and get some good ideas about the video game industry :)
Find raw data here
Data Wrangling
The data contains very few observations after 2016, too few to build a reliable prediction model, and the data prior to 2011 is outdated and may not help describe the current video game industry. This led me to restrict the analysis to the seventh-generation home consoles (PS3, Xbox 360, Wii) plus the 3DS handheld, and to the years 2011 through 2016. After filtering, I obtained 3538 observations. I created dummy variables in Excel for the logistic and linear regression models.
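A minimal sketch of how the same filtering and dummy-variable step could look in pandas. The post did the dummy variables in Excel, so this is an alternative, not the original workflow. The column names (`Platform`, `Year`, `Genre`, `Publisher`, `Global_Sales`), the platform codes, and the rows are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the raw file; column names, platform codes, and values are assumptions.
raw = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "Platform": ["PS3", "X360", "PS2", "3DS", "Wii", "PS4"],
    "Year": [2012, 2011, 2005, 2014, 2017, 2015],
    "Genre": ["Shooter", "Action", "Sports", "Role-Playing", "Sports", "Action"],
    "Publisher": ["Electronic Arts", "Activision", "Electronic Arts",
                  "Nintendo", "Nintendo", "Ubisoft"],
    "Global_Sales": [1.2, 0.8, 0.3, 0.5, 0.1, 0.9],
})

# Keep only the four platforms and the 2011-2016 window (both ends inclusive).
platforms = ["PS3", "X360", "Wii", "3DS"]
mask = raw["Platform"].isin(platforms) & raw["Year"].between(2011, 2016)
games = raw.loc[mask].reset_index(drop=True)

# One 0/1 column per category value, e.g. Platform_PS3, Genre_Shooter.
dummies = pd.get_dummies(games[["Platform", "Genre", "Publisher"]], dtype=int)
model_data = pd.concat([games, dummies], axis=1)

print(model_data.filter(like="Platform_"))
```

Doing this step in code rather than a spreadsheet keeps the filter and the encoding reproducible if the raw file changes.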

Video Game Data
Exploratory Data Analysis (EDA)
Python libraries for EDA: Pandas, NumPy, and Seaborn
EDA: Game Sales by Genre (2011 - 2015)
Total Game Sales by Genre in Global (2011 - 2015)
Total game sales declined significantly between 2011 and 2012 and continued to drop through 2015. Sales fell in every genre over this period except Sports.

Total Game Sales by Genre in North America (2011 - 2015)
While Action and Platform game sales declined significantly, sales in the other genres stayed relatively flat over the years.

Total Game Sales by Genre in Japan (2011 - 2015)
In the Japanese game market, Role-Playing has historically been the most popular genre. However, recent sales data show that it has been less popular since 2014, with Action games taking its place.
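The genre-by-year totals behind charts like these could be computed with a pivot table. This is only a sketch: the column names (`Year`, `Genre`, `Global_Sales`, `JP_Sales`) and the numbers are hypothetical, and `sales_by_genre` is a helper name introduced here.

```python
import pandas as pd

# Made-up rows standing in for the filtered data set.
games = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012, 2012],
    "Genre": ["Action", "Sports", "Action", "Sports", "Action"],
    "Global_Sales": [1.0, 0.5, 0.6, 0.7, 0.2],
    "JP_Sales": [0.1, 0.0, 0.2, 0.1, 0.3],
})

def sales_by_genre(df, sales_col):
    """Sum one sales column per (year, genre); years are rows, genres are columns."""
    return df.pivot_table(index="Year", columns="Genre", values=sales_col,
                          aggfunc="sum", fill_value=0)

global_by_genre = sales_by_genre(games, "Global_Sales")
japan_by_genre = sales_by_genre(games, "JP_Sales")
print(global_by_genre)
```

A wide table like this can be passed to Seaborn's `lineplot(data=...)`, which draws one line per genre column.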

EDA: Game Sales by Publisher
Electronic Arts and Activision dominate the global market, followed by Ubisoft and Nintendo. In Japan, by contrast, Nintendo nearly monopolizes the video game market.
Number of Total Game Sales by Publisher in Global

Number of Total Game Sales by Publisher in North America

Number of Total Game Sales by Publisher in Japan

EDA: Game Sales by Console
In Japan, games published for the 3DS record the highest sales. The Xbox 360 is the most popular platform in North America, where it leads in sales, while the PlayStation 3 has the highest game sales in the global market.
Number of Total Game Sales by Console in Global

Number of Total Game Sales by Console in North America

Number of Total Game Sales by Console in Japan

Video Game Market Descriptive Model: Logistic & Linear Regression (OLS)
Python libraries for Machine learning: Scikit-Learn
Two descriptive methods, logistic regression and ordinary least squares (OLS) regression, are used to describe the market.
For the logistic models, a binary dummy variable marks the games whose cumulative sales exceed the mean (Global sales > 0.523, NA sales > 0.222, JP sales > 0.064).
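As a sketch of that labeling step, assuming a hypothetical `Global_Sales` column and made-up sales figures, the above-mean flag could be built like this. The second flag uses the global mean reported in the post (0.523) as a fixed threshold.

```python
import pandas as pd

# Made-up sales figures; only the thresholding logic matters here.
games = pd.DataFrame({"Global_Sales": [0.10, 0.40, 0.60, 2.00]})

# Flag games that sold more than the mean of this sample.
threshold = games["Global_Sales"].mean()
games["High_Global"] = (games["Global_Sales"] > threshold).astype(int)

# Alternatively, apply the global mean reported in the post as a fixed cutoff.
games["Above_Reported_Mean"] = (games["Global_Sales"] > 0.523).astype(int)

print(games)
```

Because sales are heavily skewed, a mean threshold usually puts far fewer games in the "high" class than in the "low" class, which is worth keeping in mind when reading the logistic results.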
R-squared: An evaluation metric for regression models that typically ranges from 0 to 1 (for logistic regression, a pseudo R-squared is used). The closer R-squared is to 1, the better the predictions fit the data.
AIC & BIC: Akaike Information Criterion and Bayesian Information Criterion. Lower AIC and BIC indicate a better fit.
AUC: Area Under the ROC (Receiver Operating Characteristic) Curve. A higher AUC indicates that the model is better at separating the two classes. Typically, 0.7 to 0.8 is considered acceptable.
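To make these metrics concrete, here is a small pure-Python sketch of their standard definitions: R-squared as 1 - SS_res / SS_tot, AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, and AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The function names and sample numbers are illustrative and not taken from the post's code.

```python
import math

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot for a linear regression fit."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(r_squared([1, 2, 3], [1, 2, 3]))
print(aic(-10.0, 4), bic(-10.0, 4, 100))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Both AIC and BIC penalize extra parameters, with BIC penalizing them more heavily once the sample has more than about 7 observations (ln n > 2).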
Global Sales
Based on the exploratory data analysis, the EA, Shooter, and PS3 dummy variables are included in the global market model. The R-squared of the OLS model is 0.18, which means the model explains only 18% of the variance in global sales. It also has a very high error rate.
In the logistic model, the very high AIC and BIC scores and the very low R-squared indicate that the model does not robustly describe the games with above-mean sales.
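To illustrate what an OLS fit on three dummy variables involves, here is a sketch on made-up rows. It uses NumPy's least-squares solver rather than the Scikit-Learn estimator mentioned in the post, so it is a stand-in and not the original code. The dummy columns and sales values are invented.

```python
import numpy as np

# Made-up rows; columns are the EA, Shooter, and PS3 dummies (1 = yes, 0 = no).
X = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
], dtype=float)
y = np.array([2.1, 0.9, 1.0, 0.6, 0.2, 1.8])  # made-up global sales

# Prepend a column of ones so the model has an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients minimizing the sum of squared residuals.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred = design @ coef

# R-squared: share of the variance in y explained by the fitted values.
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("intercept and coefficients:", coef)
print("R-squared:", r2)
```

Each coefficient reads as the average difference in sales associated with that dummy being 1, holding the other two fixed; the intercept is the fitted value for a game with all three dummies at 0.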
OLS (Linear Regression) Model Result
Mean Absolute Error: 0.5029
Mean Squared Error: 0.9291
Root Mean Squared Error: 0.9639
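For reference, the three error metrics can be computed as follows. This is a sketch with made-up predictions; the only number taken from the post is the reported MSE of 0.9291, whose square root matches the reported RMSE.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

# Made-up actual and predicted sales.
mae, mse, rmse = regression_errors([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(mae, mse, rmse)

# RMSE is the square root of MSE, so the reported figures can be cross-checked.
reported_rmse = math.sqrt(0.9291)
print(round(reported_rmse, 4))
```

Since RMSE is in the same units as the sales figures, it is the easiest of the three to compare against the mean sales thresholds listed earlier.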
Logistic Model Result
AIC: 2496.6201
BIC: 2519.8779
Model Evaluation
AUC score: 0.6262

NA Sales
The NA sales model includes the Shooter, Xbox 360, and EA variables. The NA models show results very similar to the global sales models, with a lower error (RMSE of approximately 0.48).
OLS (Linear Regression) Model Result
Mean Absolute Error: 0.2292
Mean Squared Error: 0.2265
Root Mean Squared Error: 0.4760
Logistic Model Result
AIC: 2461.6410
BIC: 2484.8986
Model Evaluation
AUC score: 0.6262

JP Sales
Independent variables for 3DS, RPG, and Nintendo are incorporated into the model for the JP market.
The models perform better here: the OLS model has a higher R-squared of 0.271, although the error is not much improved compared to the models for the NA and global markets. Meanwhile, the logit model for the JP market has lower AIC and BIC scores as well as a higher AUC. This suggests that the JP sales models explain their market better than the models for North American and global game sales.
OLS (Linear Regression) Model Result
Mean Absolute Error: 0.2292
Mean Squared Error: 0.2265
Root Mean Squared Error: 0.4750
Logistic Regression Model Result
AIC: 2191.728
BIC: 2214.986
Model Evaluation
AUC score: 0.6832

Conclusion and Future Directions
The objective of this study was to determine the features shared among best-selling games between 2011 and 2015 (the era of the PS3, Xbox 360, Wii, and 3DS).
The exploratory data analysis revealed several patterns in game sales over the years. For example, in North America, EA and Activision are the two most popular game publishers, and Action and Shooter games sell much more than other genres. Also, the Xbox 360 is slightly more popular in North America, while PS3 games sell better globally. The Japanese market is unique in that Nintendo nearly monopolizes it, Role-Playing games lead, and the 3DS is the most popular console. However, because the raw data set includes a limited number of observations and variables, the models created in this study are not robust enough for prediction, although the models for Japanese game sales perform slightly better than those for other regions.
The models in this study include only publisher, year, console, and genre. The data set lacks many features that may be associated with game sales, such as studio and promotion, which may have led to the large errors in the models. With more information, the models' prediction accuracy would likely improve notably.
Get in Touch
Visit my GitHub for the technical index and more info!