README.md (4 additions, 10 deletions)
@@ -110,20 +110,14 @@ Note: The scrapers live in an independent environment not neccessarily in the sa
 # enter the password when prompted. It can be any password that you wish to use.
 # It is used for login to the admin website.
 ```
-- Start up the webserver so we can create a user for the scraper.
+- Start up the webserver
 ```bash
 python3 manage.py runserver
 ```
-- Visit localhost:8000/admin and follow the UI to add a new user named "scraper", set the password to whatever you would like but make note of it.
-
-- In a new terminal tab, create a token for the scraper user using the following command
-```bash
-python3 manage.py drf_create_token scraper
-```
-Finally, the database is ready to go! We are now ready to run the server:
-
 Navigate in your browser to `http://127.0.0.1:8000/admin`. Log in with the new admin user you just created. Click on Agencys and you should see a list of
-agencies.
+agencies created with the ``fill_agency_objects`` command.
+
+To setup the scraper, read [the scraper README](scrapers/README.rst).
 
 ## Code formatting
 GovLens enforces code style using [Black](https://github.com/psf/black) and pep8 rules using [Flake8](http://flake8.pycqa.org/en/latest/).
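With this change, the manual steps of adding a "scraper" user in the admin UI and running ``drf_create_token`` are dropped in favor of the ``create_scraper_user`` management command referenced in the scraper README below. The command's source is not part of this diff, so the following is only a sketch of what such a command typically looks like in a Django project that uses DRF token authentication; the username ``scraper`` and the output format are assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of a create_scraper_user management command.
# Not the repository's actual code; it only illustrates the flow that
# replaces the removed "add a scraper user + drf_create_token" steps.
from django.contrib.auth import get_user_model
from django.core.management.base import BaseCommand
from rest_framework.authtoken.models import Token


class Command(BaseCommand):
    help = "Create the 'scraper' service user and print its API token."

    def handle(self, *args, **options):
        user, _ = get_user_model().objects.get_or_create(username="scraper")
        token, _ = Token.objects.get_or_create(user=user)
        # The printed value is what the scraper README says to paste into .env.
        self.stdout.write(f"GOVLENS_API_TOKEN: {token.key}")
```

In a standard Django layout such a command would live at ``<app>/management/commands/create_scraper_user.py`` and be invoked as ``python3 manage.py create_scraper_user``, matching the step described in the scraper README.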
scrapers/README.rst (26 additions, 16 deletions)
@@ -27,28 +27,38 @@ Directory Structure
 ├── security_scraper.py - scrapes for HTTPS & privacy policy
 └── social_scraper.py - scrapes for phone number, email, address, social media
 
-Requirements
-============
+Quick Start
+===========
+
+Configuration
+~~~~~~~~~~~~~
+
+There are a few required environmental variables. The easiest way to set them in development is to create a file called `.env` in the root directory of this repository (don't commit this file). The file (named `.env`) should contain the following text::
+
+    MOBILE_FRIENDLY_ENDPOINT="https://searchconsole.googleapis.com/v1/urlTestingTools/mobileFriendlyTest:run" # from what i have tested, very hard to automate
+
+To get the ``GOOGLE_API_TOKEN``, you need to visit the following page: https://developers.google.com/speed/docs/insights/v5/get-started
+
+To get the ``GOVLENS_API_TOKEN``, run ``python3 manage.py create_scraper_user``. Copy the token from the command output and paste it into the ``.env`` file.
+
+Execution
+~~~~~~~~~
 
-Google Lighthouse API Key
-~~~~~~~~~~~~~~~~~~~~~~~~~
-Get the API key for accessing lighthouse from here: https://developers.google.com/speed/docs/insights/v5/get-started (click on the button get key)
+Once you have created the `.env` file as mentioned above, run the scraper::
 
-Put that key in GOOGLE_API_KEY environment variable.
+    # run the following from the root directory of the repository
+    python3 -m scrapers.scrape_handler
 
-Running the Scrapers
-====================
-``scrape_handler.py`` is the entry point for scraping.
-When we run from our local machine, we get the list of agencies and start scraping them.
-But when deployed to AWS, the scraper is invoked by the schedule and ``scrape_handler.scrape_data()`` is the method hooked up to the lambda.
+Design
+======
 
-Local
-~~~~~
-If running from local, the following command should run the scraper::
+The scraper is intended to be used both locally and on AWS Lambda.
 
-python scraper.py
+The ``scrapers`` directory in the root of this repository is the top-level Python package for this project. This means that any absolute imports should begin with ``scrapers.MODULE_NAME_HERE``.
 
-Make sure to set the environment variable to your local endpoint.
+``scrapers/scrape_handler.py`` is the main Python module invoked. On AWS Lambda, the method ``scrape_handler.scrape_data()`` is imported and called directly.
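Most of the new ``.env`` block is collapsed in this diff view, so only the ``MOBILE_FRIENDLY_ENDPOINT`` line above is recoverable. Purely as an illustration of the variables the new text names, with placeholder values and no claim to be the repository's authoritative list, the file would contain entries along these lines:

```
# Illustrative .env sketch: names come from the text above, values are placeholders.
GOOGLE_API_TOKEN="<key from the PageSpeed Insights get-started page>"
GOVLENS_API_TOKEN="<token printed by python3 manage.py create_scraper_user>"
MOBILE_FRIENDLY_ENDPOINT="https://searchconsole.googleapis.com/v1/urlTestingTools/mobileFriendlyTest:run"
```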
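The new Design section states that ``scrapers/scrape_handler.py`` is run as a module locally and that AWS Lambda imports and calls ``scrape_data()`` directly. A minimal structural sketch of that dual entry point, assuming only what the text says; the ``event``/``context`` parameters and the function body are illustrative assumptions, not the repository's actual code:

```python
# Structural sketch of scrapers/scrape_handler.py (illustration only).
import os


def scrape_data(event=None, context=None):
    """Entry point that a scheduled AWS Lambda would import and call.

    The event/context parameters mirror the usual Lambda handler signature;
    whether the real function accepts them is an assumption.
    """
    api_token = os.environ.get("GOVLENS_API_TOKEN")  # supplied via .env locally
    # ... fetch the list of agencies and dispatch the individual scrapers here ...
    return {"token_configured": bool(api_token)}


if __name__ == "__main__":
    # Local path: run `python3 -m scrapers.scrape_handler` from the repository root
    # so absolute imports of the form `scrapers.<module>` resolve correctly.
    scrape_data()
```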