Scraping scripts for various service-related data sources, backed by a Dockerized Selenium infrastructure.
Inside the root folder of the application, run:
$ docker-compose up -d
This will set up Selenium Grid along with, currently, one node each for Firefox and Chrome.
For automatic usage of Selenium, Scrapy calls are wrapped by the crawl entrypoint, used as follows:
$ docker-compose run google-reviews crawl --company_name="aspria berlin ku'damm"
or
$ docker-compose run google-reviews crawl --company_name="Granvalora Limburg"
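The entrypoint itself is not shown in this README; purely as an illustration, a minimal crawl wrapper could forward --company_name style flags to a Scrapy spider like this (the spider name google_reviews and the flag parsing are assumptions, not taken from the repo):

# Hypothetical crawl entrypoint: forwards CLI flags such as
# --company_name="..." to a Scrapy spider as -a key=value arguments.
# The spider name "google_reviews" is an assumption, not taken from the repo.
import sys
from scrapy.cmdline import execute

def main():
    spider_args = []
    for arg in sys.argv[1:]:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            spider_args += ["-a", f"{key}={value}"]
    execute(["scrapy", "crawl", "google_reviews", *spider_args])

if __name__ == "__main__":
    main()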
Note: a current limitation of this script is that it only scrolls through the first page of reviews; I still need to improve the code to scroll further.
One feature I have implemented: if there are multiple companies with the same name, the script automatically takes the first company and gets its reviews, as sketched below.
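A rough sketch of what that first-match selection and the current single-page scrolling could look like against the Selenium Grid started by docker-compose (all CSS selectors and the search URL below are hypothetical placeholders, not the spider's actual ones):

# Hypothetical Selenium snippet run against the Grid from docker-compose.
# All selectors and the search URL are placeholders, not the spider's real ones.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",  # default Selenium Grid address
    options=webdriver.ChromeOptions(),
)
driver.get("https://www.google.com/maps/search/aspria+berlin+ku'damm")

# If several companies share the name, take the first result (placeholder selector).
results = driver.find_elements(By.CSS_SELECTOR, "div.result-item")
if results:
    results[0].click()

# Current behaviour: one scroll of the reviews pane, loading only the first page.
pane = driver.find_element(By.CSS_SELECTOR, "div.reviews-pane")
driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", pane)
time.sleep(2)  # crude wait for the next batch of reviews to render
driver.quit()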
You can see the scraped reviews in ArangoDB, which is accessible at http://localhost:8529/. The username and password can be found in the env file inside the compose directory, and the database is google-reviews.
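If you prefer to inspect the data programmatically rather than through the web UI, something like the following should work with the python-arango client (the collection name reviews and the credentials are assumptions; take the real values from the env file):

# Hypothetical read of the scraped reviews via python-arango.
# Credentials and the collection name "reviews" are assumptions; use the
# real values from the env file in the compose directory.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("google-reviews", username="root", password="<from env file>")

cursor = db.aql.execute("FOR r IN reviews LIMIT 5 RETURN r")
for review in cursor:
    print(review)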
For testing purposes, I loaded the Selenium driver, opened the reviews of "aspria berlin ku'damm", manually scrolled down to trigger several review requests, captured the content of those requests, and stored it in files, which you can find in the files directory, inside the input folder.
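This README does not state how those request bodies were captured; one plausible way to record them during manual scrolling is the selenium-wire wrapper, sketched here only as an illustration (selenium-wire and the URL filter are assumptions, not confirmed by this repo):

# Illustrative only: selenium-wire records the network requests a page makes,
# so the review responses could be dumped to files/input after manual scrolling.
# Using selenium-wire here is an assumption about how the fixtures were made.
from pathlib import Path
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://www.google.com/maps/search/aspria+berlin+ku'damm")
input("Scroll the reviews manually, then press Enter...")

out_dir = Path("files/input")
out_dir.mkdir(parents=True, exist_ok=True)
for i, request in enumerate(driver.requests):
    if request.response and "review" in request.url:  # placeholder filter
        (out_dir / f"reviews_{i}.txt").write_bytes(request.response.body)
driver.quit()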
You can perform the ETL task as follows (a sketch of the pipeline follows this list):
- Extract: extract the raw review content.
- Transform: transform the raw review content into review data classes.
- Load: load the transformed data as CSV into the files output folder.
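As a rough illustration of that pipeline (the Review fields and the raw JSON layout below are assumptions; the actual data classes live in the repo):

# Hypothetical ETL sketch: field names and the raw JSON layout are assumptions.
import csv
import json
from dataclasses import astuple, dataclass, fields
from pathlib import Path

@dataclass
class Review:
    author: str
    rating: float
    text: str

def extract(input_dir: Path) -> list[dict]:
    # Extract: read the raw review content captured under files/input.
    return [json.loads(p.read_text()) for p in sorted(input_dir.glob("*.json"))]

def transform(raw_reviews: list[dict]) -> list[Review]:
    # Transform: map each raw dict onto the Review data class.
    return [Review(r["author"], float(r["rating"]), r["text"]) for r in raw_reviews]

def load(reviews: list[Review], output_file: Path) -> None:
    # Load: write the transformed reviews as CSV under files/output.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([fld.name for fld in fields(Review)])
        writer.writerows(astuple(r) for r in reviews)

load(transform(extract(Path("files/input"))), Path("files/output/reviews.csv"))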
To try this out, you can simply use the following commands:
python -m virtualenv .venv
source .venv/bin/activate
pip install notebook
cd notebooks
jupyter notebook
This will launch Jupyter Notebook, where you can test the ETL pipeline.