Git Repository for Project Group 4
Link to dataset: https://www.kaggle.com/datasets/thedevastator/discovering-hidden-trends-in-global-video-games
Link to data: https://sbcharitybucket.s3.us-west-2.amazonaws.com/Video_Games.csv
- Clean dataset
- Dataset with categorical variables encoded
- Dataset with Genre and Rating one-hot encoded
- Dataset with Genre, Rating, Publisher, Platform one-hot encoded
- Dataset for EU Action Games
- Dataset for JP Roleplay Games
- Dataset for NA Action Games
- Dataset for Top 10 Publishers
- Preliminary Data Analysis
- Notebook for producing clean dataset
- Notebook for producing dataset with categorical variables encoded
- Notebook for producing dataset with Genre and Rating one-hot encoded
- Notebook for producing dataset for Action Games for EU
- Notebook for producing dataset for Roleplay Games for JP
- Notebook for producing dataset for Action Games for NA
- Notebook for producing dataset for Top 10 Publishers
- Database files including ERD
- Data visualizations
- Model Predictions Notebook
- Modeling using R
- Presentation
- Dashboard
- Can we predict global revenues within the first year of a new game's release? Discover the key features for revenue predictions.
- Is there a direct connection between score, rating, & sales?
- Our dataset holds all the video games launched from 1980 to 2020. Columns include game features such as genre, launch year, publisher, sales by market, and critic and user scores, which will allow us to understand the gaming industry and popular games over the years.
- We selected this data because we are interested in the gaming industry and want to find out which factors make games stand out from more than 15k competitors, based on their revenue and scores.
- Python
- SQL
- Tableau
- R
- PgAdmin
The dataset has been divided into three tables: Games, Sales and Ratings. ERD: https://github.com/bbinvt/project-group-4/blob/main/Database/ERD.PNG
- Drop null values
- Drop columns: Critic Count, User Count, Developer, Rating (Alternatively try filling in null values with averages of the column)
- Normalize critic score and revenue
- Weighted average of revenues per year for direct comparisons between games, i.e. 80% of revenue comes within the first year
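The first three cleaning steps above could look roughly like the sketch below. It uses only the Python standard library and works on rows held as dictionaries. The column names (`Critic_Count`, `Global_Sales`, and so on), the helper names `clean` and `min_max_normalize`, and the choice of min-max scaling are assumptions made for illustration; the notebooks in this repository may do this differently. Columns are dropped before rows so that a missing value in a dropped column does not remove the whole row.

```python
# Assumed column names, based on the names used elsewhere in this README.
DROP_COLUMNS = {"Critic_Count", "User_Count", "Developer", "Rating"}
NORMALIZE_COLUMNS = ["Critic_Score", "Global_Sales"]


def clean(rows):
    """Drop unwanted columns, then drop rows that still have missing values."""
    kept = []
    for row in rows:
        slim = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
        if all(v is not None and v != "" for v in slim.values()):
            kept.append(slim)
    return kept


def min_max_normalize(rows, column):
    """Rescale one numeric column in place to the range [0, 1]."""
    values = [float(r[column]) for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    for r, v in zip(rows, values):
        # A constant column has no spread, so map it to 0.0.
        r[column] = (v - lo) / span if span else 0.0
    return rows


# Made-up sample rows, not taken from the real dataset.
sample = [
    {"Name": "A", "Critic_Score": 80, "Global_Sales": 10.0,
     "Critic_Count": 5, "User_Count": 9, "Developer": "X", "Rating": "E"},
    {"Name": "B", "Critic_Score": None, "Global_Sales": 4.0,
     "Critic_Count": 3, "User_Count": 2, "Developer": "Y", "Rating": "T"},
    {"Name": "C", "Critic_Score": 60, "Global_Sales": 2.0,
     "Critic_Count": 7, "User_Count": 1, "Developer": None, "Rating": "M"},
    {"Name": "D", "Critic_Score": 70, "Global_Sales": 6.0,
     "Critic_Count": 4, "User_Count": 8, "Developer": "Z", "Rating": "E"},
]

cleaned = clean(sample)
for col in NORMALIZE_COLUMNS:
    min_max_normalize(cleaned, col)
```

Row B is removed because its critic score is missing, while row C survives because its only missing value sits in a dropped column.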
We are designing a model to predict video game revenues by geography and, ultimately, globally. First, we examine the statistical relationships among the dataset's features to determine their relevance. Then we employ a linear regression model to predict video game revenues. We will examine different methods to improve the accuracy of our model: altering label encoding, dropping variables, and changing how data points are grouped.
- Best selling genre by market
- User/critic score by games/genres
- Sales by genre by region
Through R, we saw that the statistically important features are as follows:
- Years_on_Market
- Critic_Score
- Genre
- User_Score
- Publisher
- Rating
- Dataset overview (where we started)
- Why we selected this topic
- Bestselling game globally and its platform
- Top user score game globally and its platform
- Top critic score game globally and its platform
- Top genre by each market
- Can we predict global revenues within the first year of a new game's release? Discover the key features for revenue predictions.
- Is there a direct connection between score, rating, & sales?
- Tools we use
- Models we use
- How we improved the model
- Prediction results
- Accuracy score
- Visuals from both Python and Tableau
Linear regression models were used to predict global sales. They performed quite poorly in general: the R² value, i.e. the share of variance that could be explained by the features, varied from 6.3% to 17.1% depending on the features and the type of encoding and binning used.
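To make the R² figure concrete, here is a minimal sketch of a one-feature least-squares fit and the R² calculation, written in plain Python. It is not the code from our notebooks; the function names `fit_line` and `r_squared` and the sample numbers are invented for illustration, and the real models use several features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, predictions):
    """Share of the variance in ys that the predictions explain."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up values, not taken from the real dataset.
critic_scores = [60, 70, 80, 90]
global_sales = [1.0, 2.5, 2.0, 4.5]

slope, intercept = fit_line(critic_scores, global_sales)
predictions = [slope * x + intercept for x in critic_scores]
r2 = r_squared(global_sales, predictions)
print(f"R squared: {r2:.3f}")
```

An R² of 1 would mean the line passes through every point, while a value near 0 means the model does little better than always predicting the mean.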
Using XGBoost to predict global sales has so far provided far better results. The best result so far is an R² value of 0.402, i.e. 40.2% of the variance can be explained by the features. For that model, the features used to predict global sales were Years_On_Market, Critic_Score, User_Score, Genre, Rating, Publisher, and Platform (with the last 4 features being one-hot encoded).
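For readers unfamiliar with one-hot encoding, the sketch below shows the idea in plain Python: each categorical column is replaced by one 0/1 column per category. The helper name `one_hot_encode` and the sample rows are made up for illustration. A library helper such as `pandas.get_dummies` does the same job, and this sketch does not claim to match what the notebooks actually use.

```python
def one_hot_encode(rows, columns):
    """Replace each categorical column with one 0/1 column per category."""
    # Collect the sorted set of categories seen in each column.
    categories = {c: sorted({r[c] for r in rows}) for c in columns}
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        out = {k: v for k, v in row.items() if k not in columns}
        for c in columns:
            for cat in categories[c]:
                out[f"{c}_{cat}"] = 1 if row[c] == cat else 0
        encoded.append(out)
    return encoded


# Made-up sample rows, not taken from the real dataset.
games = [
    {"Critic_Score": 80, "Genre": "Action", "Rating": "E"},
    {"Critic_Score": 65, "Genre": "Sports", "Rating": "T"},
    {"Critic_Score": 72, "Genre": "Action", "Rating": "T"},
]

encoded_games = one_hot_encode(games, ["Genre", "Rating"])
```

Columns with many categories, such as Publisher, produce many new columns under this scheme, which is one reason to try different encodings and groupings.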
By creating models to predict sales for specific regions and genres of games, we were able to create high-performing models. The best of these was the model used to predict sales of North American action games (R² score of 82.2%).