This Python Selenium project crawls and scrapes data from different web pages and uploads the resulting local files to an FTP server.
Some websites cannot be crawled with WordPress plugins when JavaScript is disabled in the browser (any browser). That was the motivation for writing a Python Selenium script to crawl the web pages.
The script works as follows:
- Decide the webpages to crawl
- Write XPath expressions that reach the product URLs
- List all the web pages and their associated XPaths in a .csv file
- Read the .csv file into the Python Selenium script
- Check whether the output directory already exists. The script runs daily via Windows Task Scheduler and is expected to fetch fresh pages each day, so if the directory is present, delete it and create a new one
- Get the source code of the category page and save it as an .html file
- Collect all the product page URLs, build the corresponding URLs under the FTP server's domain name, and save them in an .html file
- Create an individual file for each product page and save that page's source code in it
- Once all the product pages of a category have been crawled, check that the directory exists locally, then check whether the same directory exists on the FTP server. If it does, the script deletes it, creates a new directory, and uploads all the files to the FTP server
- These files are then used by a WordPress plugin to scrape the data
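The local steps in the list above (reading the .csv file, resetting the daily directory, and building FTP-side URLs) can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the `url` and `xpath` column headers, and the file-naming scheme are all assumptions.

```python
import csv
import shutil
from pathlib import Path
from urllib.parse import urlsplit


def read_targets(csv_path):
    """Return (page_url, xpath) pairs; the 'url' and 'xpath' headers are assumed."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [(row["url"], row["xpath"]) for row in csv.DictReader(handle)]


def reset_directory(path):
    """Delete the directory if it exists, then create it empty for today's run."""
    directory = Path(path)
    if directory.exists():
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    return directory


def to_ftp_url(product_url, ftp_base_url):
    """Map a product URL to an .html file name under the FTP server's base URL.

    The naming scheme (path segments joined with underscores) is hypothetical.
    """
    slug = urlsplit(product_url).path.strip("/").replace("/", "_") or "index"
    return f"{ftp_base_url.rstrip('/')}/{slug}.html"
```

With these helpers, a daily run would call `reset_directory` once per category before saving any pages, so that stale files from the previous day never get uploaded.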
To achieve this, Selenium Chrome Webdriver and FTP Utilities have been used.
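A minimal sketch of how the two pieces might fit together is shown below. It makes several assumptions: the source does not name a specific FTP package, so the standard-library `ftplib` is used here instead, and the function names, the `category.html` file name, and the flat (no subfolder) directory layout are hypothetical. FTP servers also differ in what `nlst()` returns, so the cleanup loop may need adjusting for a particular server.

```python
import ftplib
from pathlib import Path


def make_chrome_driver():
    """Start Chrome through Selenium (imported lazily so the helpers below work without it)."""
    from selenium import webdriver
    return webdriver.Chrome()


def crawl_category(driver, category_url, xpath, out_dir):
    """Save the category page source and return the product URLs matched by xpath."""
    driver.get(category_url)
    (Path(out_dir) / "category.html").write_text(driver.page_source, encoding="utf-8")
    # "xpath" is the locator strategy string that Selenium's By.XPATH stands for.
    links = driver.find_elements("xpath", xpath)
    hrefs = (link.get_attribute("href") for link in links)
    return [href for href in hrefs if href]


def upload_directory(ftp, local_dir, remote_dir):
    """Recreate remote_dir on the server and upload every file from local_dir."""
    start = ftp.pwd()
    try:
        ftp.cwd(remote_dir)  # succeeds only if the directory already exists
    except ftplib.error_perm:
        pass
    else:
        # Assumes nlst() returns bare file names for the current directory.
        for name in ftp.nlst():
            if name not in (".", ".."):
                ftp.delete(name)
        ftp.cwd(start)
        ftp.rmd(remote_dir)
    ftp.mkd(remote_dir)
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                ftp.storbinary(f"STOR {remote_dir}/{path.name}", handle)
```

In use, `ftp` would be a logged-in `ftplib.FTP` instance and `driver` the object returned by `make_chrome_driver()`. Passing both in as parameters keeps the crawl and upload logic easy to exercise without a browser or a live server.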
-
To install the required Python modules and libraries, use pip. For example, run the command below in a command prompt to install Selenium in your virtual environment:
pip install selenium
-
To write XPaths, you can refer to this XPath cheatsheet
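As a small illustration of what such an XPath looks like, the sketch below selects product links from a made-up category page. The markup and class names are hypothetical, and the standard-library `xml.etree.ElementTree` is used only so the example runs without a browser; it supports a limited XPath subset and needs well-formed markup. In Selenium the equivalent expression would be `//div[@class='product']/a`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed category page markup used only for this illustration.
PAGE = """
<html><body>
  <div class="product"><a href="/products/red-shoe">Red shoe</a></div>
  <div class="product"><a href="/products/blue-hat">Blue hat</a></div>
  <div class="footer"><a href="/about">About</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)
# Select <a> elements that are direct children of <div class="product">.
product_links = root.findall(".//div[@class='product']/a")
print([a.get("href") for a in product_links])
# prints ['/products/red-shoe', '/products/blue-hat']
```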
-
To run this project, change to the folder containing main.py and run the command below:
python main.py
OR, if you are comfortable with PyCharm, open the project directory in PyCharm, create a run configuration with the path to main.py as the Script path, and run the project by clicking the green play arrow.