The project consists in a Wikipedia scraper that retrieves data about TV series.
This piece of code is particularly useful for Machine Learning purposes, as in generating datasets to train certain ML models regarding TV series.
Currently, the script aims to scrap a list of TV series' episode plots and titles and output them in a file, or as a list of seasons where each season contains all of its episodes. The script implementation, as well as a basic how-to-use, are better explained in section Usage / Examples.
In the future, I plan to implement more features which are better described in section Future Works.
The script uses a progress bar known as tqdm
, in order to provide feedback of elapsed time to the user, and the library pandas
for data handling.
If any Python module is not installed on your current machine, simply install it via the following terminal command:
Using pip
:
pip install pandas
pip install tqdm
Using apt
:
sudo apt install python3-pandas
sudo apt install python3-tqdm
Simply copy & paste the functions or import the .py file and use accordingly.
Some pre-processing has already been implemented in the generated output such as the removal of Wikipedia text formatting, although any more pre-processing can be freely implemented as well as any other edit according to the License.
Currently, a list of Wikipedia pages needs to be given in input, in order to generate the desired output.
For example:
TV series | Input | Wikipedia URL title |
---|---|---|
How I Met Your Mother |
How I Met Your Mother | How_I_Met_Your_Mother |
Superstore |
Superstore (TV series) | Superstore_(TV_series) |
As shown, some TV series titles have the suffix (TV Series)
according to their title in the Wikipedia pages and their URL.
In this showcase I demonstrate how to scrape episodes' plot and title of the TV series How I Met Your Mother
:
tv_series_name = "How_I_Met_Your_Mother"
wiki_episodes_list = [ ]
wiki_season_list = get_wiki_seasons_list(tv_series_name)
for season_number, season in tqdm(enumerate(wiki_season_list, 1), desc="Scraping", total=len(wiki_season_list)):
wiki_episodes_list.append(get_episodes_data(season, season_number))
generate_output_file(tv_series_name, wiki_episodes_list, "csv")
generate_output_file(tv_series_name, wiki_episodes_list, "xlsx")
The output will be either a CSV file or a XLSX (Excel) file with the following structure:
Field | Description |
---|---|
season |
Season number of the TV series |
title |
Title of the episode |
plot |
Plot of the episode |
Each function gives the following output:
get_tvseries_season_episodes_number
: returns a List of TV series' number of seasons and episodes (*)get_tvseries_genres
: returns a List of TV series' genres (*)get_wiki_seasons_list
: returns a List of TV series' seasons that will be given as input to the functionget_episodes_data
for the next stepget_episodes_data
: returns a List of Dicts, where each Dict contains the data of a single episode for a (single) given seasongenerate_output_file
: generates a file containing freshly scraped data for a single TV series
(*): PLEASE NOTE: this function is not incorporated in the output files. Use the function as you wish.
The scraper might not fully work due to Wikipedia not always offering every episode's plot for a given TV series.
As a result of this, the generated data may have "holes" of missing episodes (or entire seasons) due to this unavailability.
Always check for a TV series completeness on its Wikipedia page(s) before extracting data.
As for the project improvement, I plan to add the following features:
- Add different output types such as CSV
- Add more categories to parse (such as n. of seasons, n. of episodes, genre, etc.)
- Find a way to search TV series without needing to input the exact Wikipedia name