In this Webscraping Project Jupyter notebook, we scrape the Wikipedia pages of Disney movies to build a Disney Movies dataset. From each movie's infobox we collect fields such as `Title`, `Directed by`, `Produced by`, `Written by`, `Narrated by`, `Music by`, `Cinematography`, `Edited by`, `Production company`, `Distributed by`, `Release date`, `Running time`, `Country`, and `Language`. We also use the OMDb API to add `imdb`, `metascore`, and `rotten_tomatoes` ratings. The data is saved as JSON and CSV files, with intermediate results checkpointed using Python's Pickle library.
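Below is a minimal sketch of the infobox scraping used in Tasks 1 and 2 (listed next). It assumes the standard `infobox vevent` table markup on Wikipedia movie pages; the `get_info_box` helper and its parsing details are illustrative, not the notebook's exact code.

```python
import requests
from bs4 import BeautifulSoup

def get_info_box(url):
    """Scrape a Wikipedia movie infobox into a dict (illustrative sketch)."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Wikipedia movie infoboxes typically use the "infobox vevent" class
    info_box = soup.find("table", class_="infobox vevent")
    movie_info = {}
    for row in info_box.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            movie_info[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return movie_info

# Task 1 applies this to the Toy Story 3 page
toy_story_3 = get_info_box("https://en.wikipedia.org/wiki/Toy_Story_3")
print(toy_story_3.get("Directed by"))
```

Task 2 simply repeats the same call for every movie linked from Wikipedia's list of Disney films, collecting the results in a list of dictionaries.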
- Task 1: Scrape info box from Toy Story 3 Wiki page and save in python dictionary.
- Task 2: Scrape info box for all Disney movies and save in list of python dictionaries.
- Task 3: Clean the data! (see the cleaning sketch after this list)
- Strip out all references ([1], [2], etc)
- Split up long strings
- Convert 'Running time' field to integer
- Convert 'Budget' and 'Box office' fields to floats
- Convert dates to datetime objects
- Save data using Pickle
- Task 4: Attach IMDb, Rotten Tomatoes, and Metascore ratings to the dataset using the OMDb API (see the OMDb sketch after this list).
- Task 5: Save final dataset as JSON and CSV files.
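A minimal sketch of the Task 3 cleaning steps is shown below, assuming a `movie_info_list` of infobox dictionaries produced in Task 2 (a one-movie example stands in for it here). The helper names, regular expressions, and file names are illustrative assumptions, and the "split up long strings" step is omitted for brevity.

```python
import re
import pickle
from datetime import datetime

# movie_info_list stands in for the Task 2 output (list of infobox dicts);
# the values below are just a one-movie example.
movie_info_list = [{
    "Title": "Toy Story 3",
    "Running time": "103 minutes[2]",
    "Budget": "$200 million[3]",
    "Box office": "$1.067 billion[4]",
    "Release date": "June 18, 2010[1]",
}]

def strip_references(text):
    """Remove citation markers such as [1], [2] from a scraped value."""
    return re.sub(r"\[\d+\]", "", text).strip()

def minutes_to_integer(running_time):
    """Convert a 'Running time' string like '103 minutes' to an int."""
    match = re.search(r"\d+", running_time or "")
    return int(match.group()) if match else None

def money_to_float(money):
    """Convert a 'Budget' or 'Box office' string like '$200 million' to a float."""
    match = re.search(r"[\d.]+", (money or "").replace(",", ""))
    if not match:
        return None
    value = float(match.group())
    if "million" in money.lower():
        value *= 1_000_000
    elif "billion" in money.lower():
        value *= 1_000_000_000
    return value

def date_to_datetime(date_str):
    """Parse a release date such as 'June 18, 2010' into a datetime object."""
    try:
        return datetime.strptime(strip_references(date_str), "%B %d, %Y")
    except (ValueError, TypeError):
        return None

for movie in movie_info_list:
    # Strip [n] reference markers from every string field
    for key, value in movie.items():
        if isinstance(value, str):
            movie[key] = strip_references(value)
    # Add converted numeric and datetime versions of selected fields
    movie["Running time (int)"] = minutes_to_integer(movie.get("Running time", ""))
    movie["Budget (float)"] = money_to_float(movie.get("Budget", ""))
    movie["Box office (float)"] = money_to_float(movie.get("Box office", ""))
    movie["Release date (datetime)"] = date_to_datetime(movie.get("Release date", ""))

# Checkpoint the cleaned data with Pickle so later tasks can skip re-scraping
with open("disney_movie_data_cleaned.pickle", "wb") as f:
    pickle.dump(movie_info_list, f)
```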
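For Tasks 4 and 5, the sketch below reloads the pickled data, queries the OMDb API, and writes the final JSON and CSV files. `OMDB_API_KEY` and the file names are placeholders, and the parsed fields follow OMDb's documented response keys (`imdbRating`, `Metascore`, and the `Ratings` list).

```python
import json
import os
import pickle

import pandas as pd
import requests

OMDB_API_KEY = os.environ["OMDB_API_KEY"]  # placeholder: supply your own OMDb key

def get_omdb_info(title):
    """Query the OMDb API by title and return its JSON response."""
    params = {"apikey": OMDB_API_KEY, "t": title}
    return requests.get("https://www.omdbapi.com/", params=params).json()

def get_rotten_tomatoes_score(omdb_info):
    """Pull the Rotten Tomatoes value from OMDb's 'Ratings' list, if present."""
    for rating in omdb_info.get("Ratings", []):
        if rating.get("Source") == "Rotten Tomatoes":
            return rating.get("Value")
    return None

# Reload the cleaned data checkpointed in the previous sketch
with open("disney_movie_data_cleaned.pickle", "rb") as f:
    movie_info_list = pickle.load(f)

# Task 4: attach imdb, metascore, and rotten_tomatoes to each movie
for movie in movie_info_list:
    omdb_info = get_omdb_info(movie.get("Title", ""))
    movie["imdb"] = omdb_info.get("imdbRating")
    movie["metascore"] = omdb_info.get("Metascore")
    movie["rotten_tomatoes"] = get_rotten_tomatoes_score(omdb_info)

# Task 5: save the final dataset as JSON and CSV
with open("disney_movie_data_final.json", "w", encoding="utf-8") as f:
    # default=str serializes the datetime objects added during cleaning
    json.dump(movie_info_list, f, ensure_ascii=False, indent=2, default=str)

pd.DataFrame(movie_info_list).to_csv("disney_movie_data_final.csv", index=False)
```

Because OMDb matches on title, a few movies may need manual title fixes or the year parameter (`y`) to disambiguate remakes.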
Tools and libraries used:
- Jupyter Notebook
- Beautiful Soup
- Requests
- Pickle
- Pandas
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project (click on `Fork` in the top-right corner)
- Create your Feature Branch (`git checkout -b feature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature`)
- Open a Pull Request
Author: Sinjoy Saha