My Video Game Data Analysis

Introduction

In this post, I present my exploratory data analysis and descriptive models of video game sales data, built with machine learning techniques. I hope you enjoy it and come away with some useful ideas about the video game industry :)

Find raw data here

Data Wrangling

The data set contains very few observations after 2016, too few to build a reliable prediction model, and data from before 2011 is outdated and may not describe the current video game industry. This led me to restrict the analysis to four platforms, the PS3, Xbox 360, and Wii (seventh-generation home consoles) and the 3DS handheld, and to release years between 2011 and 2016. After filtering, 3538 observations remained. I then created dummy variables in Excel for building the logistic and linear regression models.
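The filtering and dummy-variable steps described above could also be done in pandas instead of Excel. The sketch below is only an illustration: the column names (Name, Platform, Year, Genre, Global_Sales), the platform codes such as X360, and the five sample rows are assumptions, not values taken from the raw data file.

```python
import pandas as pd

# Tiny made-up stand-in for the raw data (column names are assumptions).
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Platform": ["PS3", "X360", "PS2", "3DS", "Wii"],
    "Year": [2011, 2013, 2005, 2016, 2017],
    "Genre": ["Shooter", "Action", "Sports", "Role-Playing", "Action"],
    "Global_Sales": [1.20, 0.80, 0.40, 0.30, 0.10],
})

# Platform codes assumed for PS3, Xbox 360, Wii, and 3DS.
PLATFORMS = ["PS3", "X360", "Wii", "3DS"]


def filter_games(df, first_year=2011, last_year=2016):
    """Keep rows on the chosen platforms released in the year range (inclusive)."""
    mask = df["Platform"].isin(PLATFORMS) & df["Year"].between(first_year, last_year)
    return df.loc[mask].reset_index(drop=True)


games = filter_games(raw)

# One 0/1 column per platform and per genre, e.g. Platform_PS3, Genre_Shooter.
dummies = pd.get_dummies(games[["Platform", "Genre"]], dtype=int)
model_data = pd.concat([games, dummies], axis=1)
```

Row C is dropped because of its platform and row E because of its year, so three rows remain in this toy example.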

Video Game Data

Exploratory Data Analysis (EDA)

Python libraries for EDA: Pandas, NumPy, and Seaborn

EDA: Game Sales by Genre (2011 - 2015)

Total Global Game Sales by Genre (2011 - 2015)

Total game sales declined significantly between 2011 and 2012 and continued to drop through 2015. Sales fell in every genre except Sports.

1_edited.png
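Yearly totals per genre, like those charted above, can be computed with a pandas pivot table. This is a minimal sketch on made-up rows; the column names Year, Genre, and Global_Sales are assumptions about the data layout.

```python
import pandas as pd

# Made-up sample rows; the real data has one row per game.
games = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2012, 2012],
    "Genre": ["Action", "Action", "Sports", "Action", "Sports"],
    "Global_Sales": [1.0, 0.5, 0.3, 0.8, 0.4],
})


def sales_by_genre(df, sales_col="Global_Sales"):
    """Sum sales per year (rows) and genre (columns)."""
    return df.pivot_table(
        index="Year",
        columns="Genre",
        values=sales_col,
        aggfunc="sum",
        fill_value=0,
    )


totals = sales_by_genre(games)
```

Passing a regional column (for example a hypothetical NA_Sales or JP_Sales) as sales_col would give the regional versions, and the resulting table can be handed to a plotting call such as Seaborn's lineplot.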

Total Game Sales by Genre in North America (2011 - 2015)

While Action and Platform game sales declined significantly, sales in the other genres stayed relatively flat over the years.

2.png

Total Game Sales by Genre in Japan (2011 - 2015)

In the Japanese game market, Role-Playing has long been the most popular genre. However, the recent sales data show that it is less popular than it was before 2014, and Action games are taking its place.

3.png

EDA: Game Sales by Publisher

Electronic Arts and Activision dominate the global market, followed by Ubisoft and Nintendo. In Japan, in contrast, Nintendo nearly monopolizes the video game market.

Number of Total Game Sales by Publisher in the Global Market

4.png

Number of Total Game Sales by Publisher in North America

5.png

Number of Total Game Sales by Publisher in Japan

6.png

EDA: Game Sales by Console

In Japan, games released on the 3DS record the highest sales. The Xbox 360 is the top-selling platform in North America, while the PlayStation 3 has the highest game sales in the global market.

Number of Total Game Sales by Console in the Global Market

7.png

Number of Total Game Sales by Console in North America

8.png

Number of Total Game Sales by Console in Japan

9.png

Video Game Market Descriptive Model: Logistic & Linear Regression (OLS)

Python libraries for Machine learning: Scikit-Learn

Two descriptive methods, logistic regression and ordinary least squares (OLS) regression, are used to describe the market.
For the logistic models, a binary dummy variable marks the games whose cumulative sales are higher than the mean (Global sales > 0.523, NA sales > 0.222, JP sales > 0.064).
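The above-mean dummy variable used as the logistic target can be sketched in pandas as follows. The column name Global_Sales and the four sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up sales figures; the mean of these four values is 0.5.
sales = pd.DataFrame({"Global_Sales": [0.1, 0.2, 0.3, 1.4]})


def above_mean_flag(series):
    """Return 1 where a value exceeds the series mean, otherwise 0."""
    return (series > series.mean()).astype(int)


sales["High_Global"] = above_mean_flag(sales["Global_Sales"])
```

In this toy sample only the last game is flagged. When sales are skewed by a few big hits, most games fall below the mean, so the two classes can end up unbalanced.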

R-squared: An evaluation metric for regression models, usually ranging from 0 to 1. The closer R-squared is to 1, the better the predictions fit the data. For logistic regression, the analogous figure is a pseudo R-squared, which is not a share of variance explained.

AIC & BIC: Akaike Information Criterion and Bayesian Information Criterion. Lower AIC and BIC values indicate a better fit.
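For reference, both criteria are computed from the model's log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of fitted parameters and n the number of observations. The sketch below uses hypothetical inputs, not values from the models in this post.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical values for illustration only.
log_lik, k, n = -1000.0, 4, 2476
gap = bic(log_lik, k, n) - aic(log_lik, k)  # equals k * (ln(n) - 2)
```

The gap between BIC and AIC depends only on k and n, not on the fit itself. With the hypothetical k = 4 and n = 2476 it is about 23.26, close to the gap between the AIC and BIC figures reported in this post, although that may or may not reflect how the models were actually fit.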

AUC: Area Under the ROC (Receiver Operating Characteristic) Curve. A higher AUC score indicates that the model is better at separating the two classes. A score of 0.5 is no better than chance, and 0.7 to 0.8 is typically considered acceptable.
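One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing every positive/negative pair; it is an illustration on made-up scores, not the routine used for the results in this post.

```python
def auc_score(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities: 3 of 4 pairs are ordered correctly.
example = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise loop is quadratic, which is fine for a small illustration; library implementations use a sort-based approach instead.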

Global Sales

Based on the exploratory data analysis, the EA, Shooter, and PS3 dummy variables are included in the global market model. The R-squared value of the OLS model is 0.18, which means that the model explains only 18% of the variance in global sales. Its error is also large: the root mean squared error of 0.96 is nearly twice the mean global sales of 0.523.

In the logistic model, the very high AIC and BIC scores and the very low R-squared indicate that the model does not describe games with above-mean sales well.
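To make the OLS setup concrete, the sketch below fits a linear model on three 0/1 dummy columns with NumPy's least-squares solver and computes R-squared. The six rows and the coefficients used to generate the sales figures are made up, and the post does not say which routine produced its results, so this only illustrates the technique.

```python
import numpy as np


def fit_ols(features, target):
    """Fit OLS with an intercept; return (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    predictions = X @ coefs
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return coefs, 1.0 - ss_res / ss_tot


# Made-up dummy columns in the order EA, Shooter, PS3.
dummies = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
], dtype=float)

# Made-up sales that follow the dummies exactly, so the fit is perfect.
sales = 0.2 + 0.5 * dummies[:, 0] + 0.3 * dummies[:, 1] + 0.1 * dummies[:, 2]

coefs, r_squared = fit_ols(dummies, sales)
```

Because the toy sales are an exact linear function of the dummies, R-squared comes out at 1. On real sales data, most of the variation is not captured by three dummy variables, which is what an R-squared of 0.18 reflects.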

OLS (Linear Regression) Model Result

Mean Absolute Error: 0.5029
Mean Squared Error: 0.9291
Root Mean Squared Error: 0.9639
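The three error figures are related: the root mean squared error is the square root of the mean squared error, and both penalize large misses more heavily than the mean absolute error does. A minimal sketch on made-up actual and predicted values:

```python
import math


def error_metrics(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Made-up values: the errors are 0, -1, and 2.
mae, mse, rmse = error_metrics([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```

The same relationship holds for the figures reported here: the square root of 0.9291 is about 0.9639.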

Screenshot (175).png

Logistic Model Result

AIC: 2496.6201
BIC: 2519.8779

Screenshot (177).png

Model Evaluation

AUC score: 0.6262

g1.png

NA Sales

The NA Sales model includes the Shooter, Xbox 360, and EA variables. The NA models show results very similar to the global sales models, with a lower root mean squared error of approximately 0.48.

OLS (Linear Regression) Model Result

Mean Absolute Error: 0.2292 
Mean Squared Error: 0.2265
Root Mean Squared Error: 0.4760

Screenshot (182).png

Logistic Model Result

AIC: 2461.6410
BIC: 2484.8986

Screenshot (189).png

Model Evaluation

AUC score: 0.6262

na1.png

JP Sales

The 3DS, RPG, and Nintendo variables are used as independent variables in the model for the JP market.
The models perform better: the OLS model has a higher R-squared value of 0.271, although the error is not much improved compared with the models for the NA and global markets. Meanwhile, the logit model for the JP market has lower AIC and BIC scores as well as a higher AUC value. This suggests that the JP Sales models explain their market better than the models for North American and global game sales.

OLS (Linear Regression) Model Result

Mean Absolute Error: 0.2292
Mean Squared Error: 0.2265
Root Mean Squared Error: 0.4750

Screenshot (184).png

Logistic Model Result

AIC: 2191.728
BIC: 2214.986

Screenshot (187).png

Model Evaluation

AUC score: 0.6832

jp.png

Conclusion and Future Directions

The objective of this study is to identify the features shared by high-selling games released between 2011 and 2015 (the era of the PS3, Xbox 360, Wii, and 3DS).
The exploratory data analysis revealed several patterns in game sales over the years. For example, in North America, EA and Activision are the two most popular game publishers, and Action and Shooter games sell far more than other genres. Also, the Xbox 360 is slightly more popular in North America, while PS3 games sell better globally. The Japanese market is unique: Nintendo nearly monopolizes it with Role-Playing games, and the 3DS is the most popular console. However, because the raw data set includes a limited number of observations and variables, the models created in this study are not robust enough for prediction, although the models for Japanese game sales perform slightly better than the models for other regions.
The models in this study include only publisher, year, console, and genre. The data set lacks many features that may be associated with game sales, such as studio and promotion, which may have led to the large errors in the models. With more information, the models' prediction accuracy would likely improve notably.

Get in Touch

Visit my GitHub for the technical index and more info:

https://github.com/kilee722/IBM_Capstone

©2020 by Video Game Analysis. Proudly created with Wix.com
