Web-Scraper

Requirements

Python modules:

BeautifulSoup, requests, wordpress-xmlrpc 2.3

BeautifulSoup

pip install beautifulsoup4

requests

pip install requests

wordpress-xmlrpc 2.3

pip install python-wordpress-xmlrpc==2.3

Overview

This project demonstrates web scraping in Python with the BeautifulSoup and Requests modules.
You can scrape the content of any desired page into a .txt or .csv file on your system.
It supports real-time monitoring: it keeps checking for new content that needs to be scraped and posted on the website.
The project also uses wordpress-xmlrpc 2.3, a Python library for interfacing with a WordPress blog's XML-RPC API.
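
As a rough illustration of the scrape-and-monitor idea described above, here is a minimal sketch. It is not the code from the project's scripts: the function names, the `page_0001.txt` file naming, and the one-hour polling interval are all assumptions made for this example, and `list_urls` stands for whatever function you write to collect candidate links.

```python
import time
from pathlib import Path


def fetch_page_text(url):
    """Download a page and return its visible text, one block per line."""
    # Third-party imports are kept here so the helpers below also work offline.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def save_text(text, out_dir, name):
    """Write text to out_dir/name.txt and return the path of the new file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def scrape_new(urls, seen, out_dir, fetch=fetch_page_text):
    """Scrape every URL that is not in `seen` yet; return the files written."""
    written = []
    for url in urls:
        if url in seen:
            continue  # already scraped in an earlier pass
        seen.add(url)
        name = f"page_{len(seen):04d}"  # hypothetical naming scheme
        written.append(save_text(fetch(url), out_dir, name))
    return written


def monitor(list_urls, out_dir, interval_seconds=3600):
    """Poll forever: call list_urls(), scrape anything new, then sleep."""
    seen = set()  # kept in memory only, so it is lost on restart
    while True:
        scrape_new(list_urls(), seen, Path(out_dir))
        time.sleep(interval_seconds)
```

Passing `fetch` as a parameter makes it possible to try `scrape_new` with a stand-in function before pointing it at a live site. A longer-running setup would probably want to persist `seen` to disk between runs.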

Work under the hood

The two scripts WebScraperMonitor.py (real-time monitoring) and Webscraper_NoMonitor.py (no real-time monitoring) scrape data from government sites and save it on your system as .txt files.
The script ImportingToWordpress.py iterates through the scraped text files on your system and publishes a new post on the website for each file.
The script Web-Scraper.py scrapes the content of each new link available on the government-owned website directly into a new post on the website. It reads the suffix of the link to scrape and the suffix of the heading for each post from two text files.
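
The step of turning scraped text files into posts could look roughly like the sketch below. It is an illustration under assumptions, not the contents of ImportingToWordpress.py: the function names are made up for this example, and using the file name (without extension) as the post title is a guess. The `Client`, `WordPressPost` and `NewPost` names come from the python-wordpress-xmlrpc package.

```python
from pathlib import Path


def load_posts(folder):
    """Yield (title, content) for every .txt file in folder, sorted by name."""
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the file name without its extension serves as the title.
        yield path.stem, path.read_text(encoding="utf-8")


def publish_folder(folder, xmlrpc_url, username, password):
    """Create one published post per text file; return the new post IDs."""
    # Imported here so load_posts() stays usable without the package installed.
    from wordpress_xmlrpc import Client, WordPressPost
    from wordpress_xmlrpc.methods.posts import NewPost

    client = Client(xmlrpc_url, username, password)
    post_ids = []
    for title, content in load_posts(folder):
        post = WordPressPost()
        post.title = title
        post.content = content
        post.post_status = "publish"
        post_ids.append(client.call(NewPost(post)))
    return post_ids
```

A call might look like `publish_folder("scraped", "https://example.com/xmlrpc.php", "user", "password")`, where every argument is a placeholder; the XML-RPC endpoint of a WordPress site is normally its `xmlrpc.php` file.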

Additional Features

To automate the Web-Scraper, I made a batch file that runs the Web-Scraper.py script or Web-Scraper.exe (which can be built with PyInstaller).

Creating `.exe` application for Automation

Open command line and type:

>pip install pyinstaller
>pyinstaller Web-Scraper.py

Then schedule the batch file using Task Scheduler (on Windows) or a cron job (on Linux).
Web-Scraper will then run at the chosen time and day and scrape any new data to the website.

About The Project

This project was part of my internship. You can see the data scraped from the government-owned website judis.nic.in on my employer's website, LegalWiki.in.

NOTE

The source from which the data is scraped is currently unavailable, possibly because it has moved to a new address. This is one of the links.
