Wikipedia TV Series Scraper

This project is a Wikipedia scraper that retrieves data about TV series.
It is particularly useful for Machine Learning purposes, such as generating datasets to train ML models on TV series.
Currently, the script scrapes the episode plots and titles of a TV series and writes them to a file, or returns them as a list of seasons where each season contains all of its episodes. The implementation and a basic how-to are explained in the section Usage / Examples.
In the future, I plan to implement more features, which are described in the section Future Works.

Requirements

The script uses the tqdm progress bar to give the user feedback on progress and elapsed time, and the pandas library for data handling.
If either Python module is not installed on your machine, install it with one of the following sets of terminal commands:

Using pip:

pip install pandas
pip install tqdm

Using apt:

sudo apt install python3-pandas
sudo apt install python3-tqdm
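
Before running the scraper, it can help to confirm that both modules are importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of the script.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The two third-party modules listed in this section.
    missing = missing_modules(["pandas", "tqdm"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If the check reports a missing module, install it with one of the commands above.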

Usage / Examples

Copy and paste the functions, or import the .py file, and use them as needed.
Some pre-processing is already applied to the generated output, such as the removal of Wikipedia text formatting. Further pre-processing, or any other edit, can be freely added in accordance with the License.
Currently, the Wikipedia page title(s) of the TV series to scrape must be given as input in order to generate the desired output.

For example:

TV series	Input	Wikipedia URL title
`How I Met Your Mother`	How I Met Your Mother	How_I_Met_Your_Mother
`Superstore`	Superstore (TV series)	Superstore_(TV_series)

As shown, some TV series titles carry the suffix (TV series), matching their Wikipedia page title and URL.
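
The mapping from the Input column to the Wikipedia URL title column in the table above amounts to replacing spaces with underscores. A minimal sketch of that conversion follows; the helper name `to_wiki_url_title` is hypothetical and is not one of the script's functions.

```python
def to_wiki_url_title(page_title):
    """Convert a Wikipedia page title to the form used in its URL.

    Only the space-to-underscore rule visible in the table is applied;
    other characters are left untouched.
    """
    return page_title.strip().replace(" ", "_")


print(to_wiki_url_title("How I Met Your Mother"))
print(to_wiki_url_title("Superstore (TV series)"))
```

The suffix still has to be known in advance: the helper does not decide whether a page title needs (TV series).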

Usage Showcase

In this showcase I demonstrate how to scrape the plot and title of every episode of the TV series How I Met Your Mother:

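# Assumes the script file WikipediaEpisodeExtraction.py is importable from the working directory.
from tqdm import tqdm
from WikipediaEpisodeExtraction import get_wiki_seasons_list, get_episodes_data, generate_output_file

tv_series_name = "How_I_Met_Your_Mother"
wiki_episodes_list = []

# Each entry describes one season and is passed to get_episodes_data below.
wiki_season_list = get_wiki_seasons_list(tv_series_name)

for season_number, season in tqdm(enumerate(wiki_season_list, 1), desc="Scraping", total=len(wiki_season_list)):
    wiki_episodes_list.append(get_episodes_data(season, season_number))

# Write the scraped data once per requested format.
generate_output_file(tv_series_name, wiki_episodes_list, "csv")
generate_output_file(tv_series_name, wiki_episodes_list, "xlsx")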

The output is a CSV file or an XLSX (Excel) file, depending on the format passed to generate_output_file, with the following structure:

Field	Description
`season`	Season number of the TV series
`title`	Title of the episode
`plot`	Plot of the episode
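
To illustrate how a file with this structure can be consumed, here is a minimal sketch that groups the rows of a CSV by season using only the standard library. It assumes the file has a header row with exactly the three fields listed above; the sample content is placeholder text, not real scraped data, and `episodes_by_season` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Placeholder content standing in for a generated CSV file
# (assumed header: season,title,plot).
SAMPLE_CSV = """season,title,plot
1,Example title A,Example plot A
1,Example title B,Example plot B
2,Example title C,Example plot C
"""


def episodes_by_season(csv_file):
    """Group the rows of an output CSV by season number."""
    seasons = defaultdict(list)
    for row in csv.DictReader(csv_file):
        seasons[int(row["season"])].append({"title": row["title"], "plot": row["plot"]})
    return dict(seasons)


grouped = episodes_by_season(io.StringIO(SAMPLE_CSV))
for season, episodes in sorted(grouped.items()):
    print(f"Season {season}: {len(episodes)} episode(s)")
```

With a real output file, pass an open file object instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.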

Functions Return

Each function returns the following:

get_tvseries_season_episodes_number: returns a List with the number of seasons and episodes of the TV series (*)
get_tvseries_genres: returns a List of the TV series' genres (*)
get_wiki_seasons_list: returns a List of the TV series' seasons, to be passed to get_episodes_data in the next step
get_episodes_data: returns a List of Dicts, where each Dict contains the data of a single episode of the given season
generate_output_file: generates a file containing the freshly scraped data for a single TV series

(*) PLEASE NOTE: the output of this function is not included in the output files. Use the function as you wish.
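
Since get_episodes_data returns one List of Dicts per season, collecting its results season by season yields a list of lists. The sketch below shows one way to flatten that into a single list of episodes. The Dict keys (`season`, `title`, `plot`) are an assumption based on the output fields, the data is placeholder text, and `flatten_seasons` is a hypothetical helper.

```python
from itertools import chain


def flatten_seasons(seasons):
    """Flatten a list of seasons (each a list of episode dicts) into one list."""
    return list(chain.from_iterable(seasons))


# Placeholder data in the assumed shape: one inner list per season.
wiki_episodes_list = [
    [{"season": 1, "title": "Example title A", "plot": "Example plot A"}],
    [
        {"season": 2, "title": "Example title B", "plot": "Example plot B"},
        {"season": 2, "title": "Example title C", "plot": "Example plot C"},
    ],
]

all_episodes = flatten_seasons(wiki_episodes_list)
print(len(all_episodes))
```

A flat list of this kind is convenient when every episode should be treated as one dataset row regardless of its season.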

Known Issues

The scraper may return incomplete data, because Wikipedia does not always provide every episode's plot for a given TV series.
As a result, the generated data may have "holes" where episodes (or entire seasons) are missing.
Always check the completeness of a TV series on its Wikipedia page(s) before extracting data.
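
Some of these holes can also be spotted in the scraped data itself. The following is a minimal sketch, assuming the data is a list of seasons where each season is a list of Dicts with the keys `title` and `plot`; these keys and the helper name `find_gaps` are assumptions, not part of the script.

```python
def find_gaps(seasons):
    """Report seasons with no episodes and episodes whose plot is empty.

    Season numbers are taken from the position in the list, starting at 1.
    """
    empty_seasons = []
    missing_plots = []
    for season_number, episodes in enumerate(seasons, 1):
        if not episodes:
            empty_seasons.append(season_number)
            continue
        for episode in episodes:
            # Treat None, "" and whitespace-only plots as missing.
            if not (episode.get("plot") or "").strip():
                missing_plots.append((season_number, episode.get("title")))
    return empty_seasons, missing_plots


# Placeholder data: season 2 is empty, two episodes lack a plot.
sample = [
    [{"title": "A", "plot": "Example plot"}, {"title": "B", "plot": ""}],
    [],
    [{"title": "C", "plot": None}],
]
empty_seasons, missing_plots = find_gaps(sample)
print("Empty seasons:", empty_seasons)
print("Episodes without plot:", missing_plots)
```

This only detects gaps that are visible in the scraped data. An episode that is absent from the Wikipedia page altogether leaves no trace, so the manual check recommended above is still needed.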

Future Works

To improve the project, I plan to add the following features:

Add different output types such as CSV
Add more categories to parse (such as number of seasons, number of episodes, genre, etc.)
Find a way to search TV series without needing to input the exact Wikipedia name

Repository: federicobass/wiki-tvseries-scraper
