This project aims to predict misinformation in football transfer news using various machine learning algorithms, particularly ensemble learning models. The main techniques used in this project include:
- Collecting data from the BBC gossip column and verifying it using transfermarkt data and API-Football data
- Structuring the data using GPT-3
- Extracting additional features from the data using NLP and fuzzy matching (see the first sketch after this list)
- Encoding the data using one-hot encoding and multi-label binarization techniques
- Correcting the class imbalance in the dataset using random oversampling
- Handling missing data by dropping rows or imputing selected columns with the mean
- Visualizing the data using bar charts, tables, and box plots
- Training and evaluating ensemble learning models (Random Forest, XGBoost, and AdaBoost) and identifying the most important features (see the second sketch after this list)
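For illustration, the snippet below shows the kind of fuzzy name matching this relies on, using the thefuzz library. The names, threshold, and variables are hypothetical examples and are not taken from the project's code.

```python
# Illustrative only: matching a rumoured player name against a list of
# verified names (e.g. from transfermarkt) with thefuzz.
from thefuzz import fuzz, process

rumoured_name = "Kylian Mbappe"
verified_names = ["Kylian Mbappé", "Erling Haaland", "Jude Bellingham"]

# Best match and its similarity score (0-100)
best_match, score = process.extractOne(
    rumoured_name, verified_names, scorer=fuzz.token_sort_ratio
)
if score >= 90:  # the threshold here is an assumption
    print(f"Matched '{rumoured_name}' to '{best_match}' (score {score})")
```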
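Similarly, the following sketch shows how the encoding, oversampling, and model-training steps fit together, using pandas, scikit-learn, and imbalanced-learn on a hypothetical toy dataset. The column names and data are assumptions for illustration, not the project's actual features.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler

# Hypothetical toy data: each row is one transfer rumour
df = pd.DataFrame({
    "market_value": [80e6, 5e6, 40e6, 1e6],
    "age": [23, 31, 27, 19],
    "position": ["FW", "DF", "MF", "FW"],
    "clubs_mentioned": [["Real Madrid"], ["Arsenal", "Chelsea"], ["Liverpool"], ["Milan"]],
    "is_true": [1, 0, 0, 0],  # imbalanced target
})

# One-hot encode the single-label categorical column
onehot = pd.get_dummies(df[["position"]])

# Multi-label binarize the column that can hold several values per row
mlb = MultiLabelBinarizer()
clubs = pd.DataFrame(mlb.fit_transform(df["clubs_mentioned"]), columns=mlb.classes_)

X = pd.concat([df[["market_value", "age"]], onehot, clubs], axis=1)
y = df["is_true"]

# Correct the class imbalance with random oversampling
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)

# Train a Random Forest and inspect its feature importances
model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))
```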
The target audience for this project includes researchers, developers, football clubs, and anyone with a general interest in football or in methods to detect misinformation. The results show that the Random Forest model achieved the highest accuracy, 0.8639, after applying random oversampling. The most important features identified by the Random Forest model include the market value of a player, the time to the start/end of the transfer window, and the age of a player. Detailed results can be found in the results directory.
The main folders and files for this project are organized as follows:
- data/: Contains all the project data in CSV format.
- results/: Stores images of model evaluations and dataset analyses.
- src/: Contains the Python scripts for the project:
  - data_collection/: Scripts for collecting data (an illustrative scraping sketch follows this list).
    - bbc_transfer_rumours_scraper.py: Scrapes BBC gossip column links.
    - football_api.py: Collects player data from API-Football.
    - transfer_news_scraper.py: Extracts transfer news from the gossip column links.
    - transfermarkt_scraper.py: Scrapes transfermarkt data for verifying the transfer news.
  - data_collecting.py: Runs all data collection scripts.
  - data_preprocessing.py: Preprocesses the dataset (updates missing data, removes irrelevant data, encodes data).
  - data_structuring.py: Structures the raw data using GPT-3.
  - data_wrangling.py: Identifies true/false transfer news by cross-referencing the collected datasets.
  - model_training.py: Trains and evaluates the ensemble learning models and identifies the important features.
  - pipeline.py: Sets up the data pipeline, which may require user input.
  - utils.py: Shared functions used across the project.
  - visualization_and_analysis.py: Visualizes the dataset and analyzes the results.
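As a rough orientation to what the data_collection scripts do, here is a minimal scraping sketch using requests and beautifulsoup4. The URL and the link filter are placeholder assumptions, not the selectors used by the project's scrapers.

```python
# Illustrative only: fetch a gossip-column page and collect article links.
import requests
from bs4 import BeautifulSoup

GOSSIP_URL = "https://www.bbc.com/sport/football/gossip"  # placeholder URL

response = requests.get(GOSSIP_URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Keep links that look like football articles (placeholder filter)
links = {
    a["href"]
    for a in soup.find_all("a", href=True)
    if "/sport/football/" in a["href"]
}
print(sorted(links))
```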
To set up the project environment, please follow these steps:
- Ensure you have Python 3.6+ installed. You can download it from https://www.python.org/downloads/.
- Install the required libraries and dependencies. The main libraries used in this project are:
  - numpy
  - pandas
  - scikit-learn
  - xgboost
  - matplotlib
  - seaborn
  - thefuzz
  - locationtagger
  - google-api-python-client
  - imbalanced-learn
  - beautifulsoup4
  - requests
  - anaconda3
  - jupyter notebook
- It is recommended to use Anaconda to manage the project dependencies:
  - Create an Anaconda environment:
    conda create -n <name of env> python=<python version>
  - Activate the environment:
    conda activate <name of env>
- Install the required libraries within the Anaconda environment by running the following command in the command line:
    pip install -r requirements.txt
- Run Jupyter Notebook:
    jupyter notebook
  This will open a new browser window containing the project's Python scripts and notebooks.
To use the project, you have two main options:
- Run individual scripts: execute any script in the src directory (except utils.py) either by running it with python from the command line, or by opening the corresponding .ipynb notebook, clicking the cell containing the "main" function, and pressing the play button in the Jupyter Notebook UI.
- Run the entire data pipeline: execute the src/pipeline.py file, or open pipeline.ipynb, click the cell containing the "main" function, and press the play button in the Jupyter Notebook UI.
The data pipeline offers various options for running the different steps of the project. You can choose to run all steps at once, run steps interactively, or run only a single step. The available steps are:
- Collect data
- Structure data
- Preprocess data
- Wrangle data
- Train and evaluate models
- Visualize and analyze data
The pipeline script lets you select an option from a simple menu: you can run all steps at once, run the steps interactively (you will be prompted before each step), or run only a single step by entering its number. The pipeline then executes the selected steps and displays the results.
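For illustration, a menu-driven pipeline of this kind could be wired up along the following lines. The step functions below are hypothetical placeholders; the project's actual src/pipeline.py imports its own modules instead.

```python
# Hypothetical sketch of a menu-driven pipeline; placeholder steps only.
STEPS = [
    ("Collect data", lambda: print("collecting data...")),
    ("Structure data", lambda: print("structuring data...")),
    ("Preprocess data", lambda: print("preprocessing data...")),
    ("Wrangle data", lambda: print("wrangling data...")),
    ("Train and evaluate models", lambda: print("training models...")),
    ("Visualize and analyze data", lambda: print("visualizing data...")),
]

def main():
    print("1) Run all steps  2) Run steps interactively  3) Run a single step")
    choice = input("Select an option: ").strip()
    if choice == "1":
        for _, step in STEPS:
            step()
    elif choice == "2":
        for name, step in STEPS:
            if input(f"Run '{name}'? [y/n] ").lower().startswith("y"):
                step()
    elif choice == "3":
        for i, (name, _) in enumerate(STEPS, start=1):
            print(f"{i}) {name}")
        STEPS[int(input("Step number: ")) - 1][1]()

if __name__ == "__main__":
    main()
```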
The project uses the following APIs:
- OpenAI GPT-3: https://platform.openai.com/
- API-Football: https://www.api-football.com/
- Google Custom Search: https://developers.google.com/custom-search
Documentation on how to obtain API keys for these services can be found at the following links:
- https://platform.openai.com/docs/introduction - Use the key to set the environment variable OPENAI_API_KEY
- https://www.api-football.com/documentation-v3#section/Authentication - Use the key to set the environment variable FOOTBALL_API_KEY
- https://developers.google.com/custom-search/v1/overview - Use the key to set the environment variable GOOGLE_API_KEY and the search engine ID to set the environment variable CX_ID
Keys can be set in the .env file in the project's root directory; an example .env file is provided there.
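For example, the keys could then be read at runtime as sketched below. The use of python-dotenv here is an assumption (it is a common way to load a .env file and is not listed among the project's dependencies); the variable names match those listed above.

```python
# Assumes python-dotenv is installed; it loads key=value pairs from .env
# into the process environment so they can be read with os.getenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

keys = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
    "FOOTBALL_API_KEY": os.getenv("FOOTBALL_API_KEY"),
    "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),
    "CX_ID": os.getenv("CX_ID"),
}

missing = [name for name, value in keys.items() if not value]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```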