python-web-crawler-and-scraper

Overview

This Python-Selenium project crawls and scrapes data from different web pages and uploads the resulting local data files to an FTP server.

Motivation

Some websites cannot be crawled with WordPress plugins when JavaScript is disabled in the browser (any browser). This was the motivation for writing a Python-Selenium script to crawl the webpages.

Technical Aspects

The script works as follows:

Decide which webpages to crawl
Write XPaths that reach the product URLs
List all the webpages and their associated XPaths in a .csv file
Read the .csv file into the Python-Selenium script
Check whether the output directory already exists. The script runs daily via Windows Task Scheduler and is expected to fetch fresh webpages each day, so an existing directory is deleted and created anew
Get the source code of the category page and save it as an .html file
Collect all the product page URLs, build corresponding URLs that use the FTP server's domain name, and save them in an .html file
Create the individual product pages and save the source code of each one
Once all the product pages of a category have been crawled, check that the directory exists locally and whether the same directory exists on the FTP server. If it does, delete it there, then create a new directory and upload all the files to it
The uploaded files are then used by a WordPress plugin to scrape the data
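
The directory-reset, page-saving, and URL-rewriting steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `reset_directory`, `to_ftp_url`, and `save_html`, the example host names, and the rule of naming each saved page after the last segment of its product URL are all assumptions.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse


def reset_directory(path):
    """Delete `path` if it exists, then create it empty (fresh pages each day)."""
    directory = Path(path)
    if directory.exists():
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    return directory


def to_ftp_url(product_url, ftp_base_url, category):
    """Map a product URL to the address its saved copy would have on the FTP host.

    Assumption: the saved file is named after the last path segment of the
    product URL, with an .html extension.
    """
    segments = [s for s in urlparse(product_url).path.split("/") if s]
    name = segments[-1] if segments else "index"
    if not name.endswith(".html"):
        name += ".html"
    return f"{ftp_base_url.rstrip('/')}/{category}/{name}"


def save_html(directory, file_name, source):
    """Write page source to `<directory>/<file_name>` and return the path."""
    target = Path(directory) / file_name
    target.write_text(source, encoding="utf-8")
    return target
```

With these helpers, a product URL such as `https://shop.example.com/p/blue-widget/` in a category called `specials` would map to `https://ftp.example.com/pages/specials/blue-widget.html` when the hypothetical base URL is `https://ftp.example.com/pages`.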

To achieve this, Selenium Chrome Webdriver and FTP Utilities have been used.
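
As a rough sketch of how Selenium's Chrome WebDriver can cover the "save source code" and "collect product URLs" steps: the function names below are hypothetical, and `open_chrome` assumes Selenium 4 with a local Chrome installation. `driver.get`, `driver.page_source`, `driver.find_elements`, and `get_attribute` are standard WebDriver calls; the URL and XPath in the usage comment are invented.

```python
def open_chrome():
    """Start a Chrome session (needs `pip install selenium` and a local Chrome)."""
    # Imported lazily so the helper below can be loaded without Selenium installed.
    from selenium import webdriver
    return webdriver.Chrome()


def crawl_category(driver, category_url, product_xpath):
    """Load a category page; return its source and the product links found by XPath."""
    driver.get(category_url)
    source = driver.page_source
    # "xpath" is the locator strategy string that selenium's By.XPATH stands for.
    links = driver.find_elements("xpath", product_xpath)
    urls = []
    for link in links:
        href = link.get_attribute("href")
        if href and href not in urls:  # skip empty hrefs and duplicates
            urls.append(href)
    return source, urls


# Usage (hypothetical URL and XPath):
# driver = open_chrome()
# try:
#     source, urls = crawl_category(
#         driver, "https://shop.example.com/specials", "//div[@class='product']//a"
#     )
# finally:
#     driver.quit()
```

Passing the driver in as a parameter keeps one browser session open across all categories and makes the function easy to exercise with a stand-in object.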

Let's see the installation of Python and its modules/libraries using pip

Python and Selenium installation
Available Python modules and libraries can be installed with the pip command. For example, type the command below in the command prompt and press Enter to install Selenium in your virtual environment.

pip install selenium
To write XPaths, you can refer to an XPath cheatsheet.
To run this project, navigate to the folder containing main.py and run the command below:

python main.py

Alternatively, if you are comfortable with PyCharm, open the project directory in PyCharm, set the path to main.py as the Script Path in the run configuration, and run the project by clicking the small green play arrow.
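
For the XPath-writing step mentioned above, a small worked example may help. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, purely to show how an expression selects product links; the project itself evaluates XPaths through Selenium. The markup, class names, and URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet standing in for a category page.
PAGE = """
<html>
  <body>
    <div class="product"><a href="/p/blue-widget">Blue widget</a></div>
    <div class="product"><a href="/p/red-widget">Red widget</a></div>
    <div class="banner"><a href="/sale">Sale</a></div>
  </body>
</html>
"""


def product_links(markup, xpath=".//div[@class='product']/a"):
    """Return the href of every element matched by `xpath`."""
    root = ET.fromstring(markup)
    return [anchor.get("href") for anchor in root.findall(xpath)]


print(product_links(PAGE))  # ['/p/blue-widget', '/p/red-widget']
```

The banner link is skipped because its parent `div` does not have the `product` class. In Selenium the same expression would usually be written from the document root, for example `//div[@class='product']/a`.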

Files used to run the code and directories/files created after its execution are:

CSV file
The CSV file webpages_to_crawl.csv lists the webpages to crawl and the XPaths associated with them.
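
As a minimal sketch of reading such a file with the standard `csv` module, assuming hypothetical column names `category`, `url`, and `xpath` (the real file's columns may differ) and an invented sample row:

```python
import csv
import io


def read_targets(csv_file):
    """Yield one dict per CSV row, skipping rows that have no URL."""
    for row in csv.DictReader(csv_file):
        if row.get("url"):
            yield row


# Hypothetical contents; in the project this would come from
# open("webpages_to_crawl.csv", newline="", encoding="utf-8").
SAMPLE = io.StringIO(
    "category,url,xpath\n"
    "specials,https://shop.example.com/specials,//div[@class='product']/a\n"
)

for target in read_targets(SAMPLE):
    print(target["category"], target["url"], target["xpath"])
```

Keeping the XPath next to each URL in the file means a new page can be added to the daily crawl without touching the script.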
Directories and files uploaded to FTP:

Directories named after the category on the server

Specials page

Individual product pages
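
The upload of a category directory can be sketched with the standard library's `ftplib`, used here as a stand-in since the project's own FTP utility may be a different package. The function names, host, credentials, and paths are illustrative, and the sketch assumes a category directory holds only flat files.

```python
from ftplib import FTP, error_perm
from pathlib import Path


def list_names(ftp):
    """List the current remote directory; some servers answer 550 when it is empty."""
    try:
        return ftp.nlst()
    except error_perm:
        return []


def replace_remote_directory(ftp, local_dir, remote_dir):
    """Recreate `remote_dir` on the server and upload every file in `local_dir`.

    Assumes `remote_dir` sits in the current remote directory and contains
    only files (no subdirectories).
    """
    if remote_dir in list_names(ftp):
        ftp.cwd(remote_dir)
        for name in list_names(ftp):
            ftp.delete(name)
        ftp.cwd("..")
        ftp.rmd(remote_dir)
    ftp.mkd(remote_dir)
    ftp.cwd(remote_dir)
    uploaded = []
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                ftp.storbinary(f"STOR {path.name}", handle)
            uploaded.append(path.name)
    ftp.cwd("..")
    return uploaded


# Usage (hypothetical host, credentials, and paths):
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     replace_remote_directory(ftp, "pages/specials", "specials")
```

Deleting the remote directory before uploading mirrors the local reset, so the server never serves a mix of yesterday's and today's pages.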

durvaavachat/python-web-crawler-and-scraper