Commit cf792b8

Merge branch 'master' into update-requirements
2 parents a7f6aa9 + b33a8e1 commit cf792b8

File tree: 13 files changed, +208 −88 lines changed
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+name: Lint and Test
+
+on:
+  push:
+    branches:
+      - update-requirements
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v1
+    - name: Set up Python 3.7
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.7
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -r requirements.txt
+        pip install -r dev_requirements.txt
+    - name: Lint with flake8
+      run: |
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --ignore E501,W503,E203 --show-source --statistics
+    - name: Lint with Black
+      run: |
+        black .
+    - name: Test with django
+      run: |
+        python manage.py test

README.md

Lines changed: 4 additions & 10 deletions
@@ -110,20 +110,14 @@ Note: The scrapers live in an independent environment not neccessarily in the sa
 # enter the password when prompted. It can be any password that you wish to use.
 # It is used for login to the admin website.
 ```
-- Start up the webserver so we can create a user for the scraper.
+- Start up the webserver
 ```bash
 python3 manage.py runserver
 ```
-- Visit localhost:8000/admin and follow the UI to add a new user named "scraper", set the password to whatever you would like but make note of it.
-
-- In a new terminal tab, create a token for the scraper user using the following command
-```bash
-python3 manage.py drf_create_token scraper
-```
-Finally, the database is ready to go! We are now ready to run the server:
-
 Navigate in your browser to `http://127.0.0.1:8000/admin`. Log in with the new admin user you just created. Click on Agencys and you should see a list of
-agencies.
+agencies created with the ``fill_agency_objects`` command.
+
+To setup the scraper, read [the scraper README](scrapers/README.rst).

 ## Code formatting
 GovLens enforces code style using [Black](https://github.com/psf/black) and pep8 rules using [Flake8](http://flake8.pycqa.org/en/latest/).
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+"""Idempotent management command to create the scraper user with a DRF token
+"""
+from django.core.management.base import BaseCommand
+from django.contrib.auth.models import User
+from rest_framework.authtoken.models import Token
+
+SCRAPER_USERNAME = "scraper"
+
+
+class Command(BaseCommand):
+    help = "Get or create a scraper user with a Django REST Framework token"
+
+    def add_arguments(self, parser):
+        pass
+
+    def handle(self, *args, **options):
+        user, created = User.objects.get_or_create(username=SCRAPER_USERNAME)
+        user.save()
+
+        if created:
+            self.stdout.write(f"Created new user with username {SCRAPER_USERNAME}")
+        else:
+            self.stdout.write(f"User {SCRAPER_USERNAME} already exists.")
+
+        token, created = Token.objects.get_or_create(user=user)
+        self.stdout.write(f"The token for the user {SCRAPER_USERNAME} is {token}")

dev_requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 black
 flake8
+coloredlogs==10.0

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ simplejson==3.16.0
 sqlparse==0.3.0
 urllib3==1.24.2
 apscheduler==3.6.0
+python-dotenv==0.11.0

scrapers/README.rst

Lines changed: 26 additions & 16 deletions
@@ -27,28 +27,38 @@ Directory Structure
 ├── security_scraper.py - scrapes for HTTPS & privacy policy
    └── social_scraper.py - scrapes for phone number, email, address, social media

-Requirements
-============
+Quick Start
+===========
+
+Configuration
+~~~~~~~~~~~~~
+
+There are a few required environmental variables. The easiest way to set them in development is to create a file called `.env` in the root directory of this repository (don't commit this file). The file (named `.env`) should contain the following text::
+
+    GOVLENS_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+    GOVLENS_API_ENDPOINT=http://127.0.0.1:8000/api/agencies/
+    GOOGLE_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXX
+
+To get the ``GOOGLE_API_TOKEN``, you need to visit the following page: https://developers.google.com/speed/docs/insights/v5/get-started
+
+To get the ``GOVLENS_API_TOKEN``, run ``python3 manage.py create_scraper_user``. Copy the token from the command output and paste it into the ``.env`` file.
+
+Execution
+~~~~~~~~~

-Google Lighthouse API Key
-~~~~~~~~~~~~~~~~~~~~~~~~~
-Get the API key for accessing lighthouse from here: https://developers.google.com/speed/docs/insights/v5/get-started (click on the button get key)
+Once you have created the `.env` file as mentioned above, run the scraper::

-Put that key in GOOGLE_API_KEY environment variable.
+    # run the following from the root directory of the repository
+    python3 -m scrapers.scrape_handler

-Running the Scrapers
-====================
-``scrape_handler.py`` is the entry point for scraping.
-When we run from our local machine, we get the list of agencies and start scraping them.
-But when deployed to AWS, the scraper is invoked by the schedule and ``scrape_handler.scrape_data()`` is the method hooked up to the lambda.
+Design
+======

-Local
-~~~~~
-If running from local, the following command should run the scraper::
+The scraper is intended to be used both locally and on AWS Lambda.

-    python scraper.py
+The ``scrapers`` directory in the root of this repository is the top-level Python package for this project. This means that any absolute imports should begin with ``scrapers.MODULE_NAME_HERE``.

-Make sure to set the environment variable to your local endpoint.
+``scrapers/scrape_handler.py`` is the main Python module invoked. On AWS Lambda, the method ``scrape_handler.scrape_data()`` is imported and called directly.

 AWS Lambda
 ~~~~~~~~~~
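
The scraper modules further down import a `settings` object (`from . import settings`) that is not shown in this excerpt, and `python-dotenv==0.11.0` was added to requirements.txt above. A minimal sketch of what such a module might look like, assuming it lives at `scrapers/settings.py` and simply exposes the `.env` values; the attribute names mirror how the other modules use it, while the file name and defaults are assumptions:

```python
# Hypothetical scrapers/settings.py -- the real module is not visible in this
# view, so everything here is an assumption based on how it is used elsewhere.
import os

from dotenv import load_dotenv  # provided by the python-dotenv requirement

# Load variables from a .env file in the repository root into os.environ.
load_dotenv()

GOVLENS_API_ENDPOINT = os.environ.get(
    "GOVLENS_API_ENDPOINT", "http://127.0.0.1:8000/api/agencies/"
)
GOVLENS_API_TOKEN = os.environ.get("GOVLENS_API_TOKEN", "")
# The .env example above names the key GOOGLE_API_TOKEN, while the code reads
# settings.GOOGLE_API_KEY, so this sketch accepts either spelling.
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GOOGLE_API_TOKEN", "")
```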

scrapers/__init__.py

Whitespace-only changes.

scrapers/agency_api_service.py

Lines changed: 15 additions & 10 deletions
@@ -1,25 +1,30 @@
-import os
+import logging

 import requests

+from . import settings
+
+logger = logging.getLogger(__name__)
+

 class AgencyApiService:
     def __init__(self):
-        # If environment variable is set, we use the corresponding api(usually local). otherwise govlens api
-        if os.environ.get("govlens_api", None) is None:
-            self.base_url = (
-                "http://govlens.us-east-2.elasticbeanstalk.com/api/agencies/"
-            )
-        else:
-            self.base_url = os.environ["govlens_api"]
+        self.base_url = settings.GOVLENS_API_ENDPOINT

     def get_all_agencies(self):
         try:
             all_agency_list = self._get(self.base_url)
             return all_agency_list
         except Exception as ex:
-            print(f"Error while retrieving all the agency information: {str(ex)}")
+            logger.error(ex, "Error while retrieving all the agency information")

     def _get(self, url):
-        response = requests.get(url, headers={"Content-type": "application/json"})
+        response = requests.get(
+            url,
+            headers={
+                "Content-type": "application/json",
+                "Authorization": "Token {}".format(settings.GOVLENS_API_TOKEN),
+            },
+        )
+        response.raise_for_status()
         return response.json()
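
For orientation, a hedged usage sketch of the class above. The class and method names come from the diff, the agency field names ("id", "name", "website") come from process_agency_info.py further down, and the logging setup is an assumption:

```python
# Illustrative only; this snippet does not appear in the commit.
import logging

from scrapers.agency_api_service import AgencyApiService

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# get_all_agencies() returns None when the request fails (the exception is
# logged and swallowed), hence the `or []`.
agencies = AgencyApiService().get_all_agencies() or []
for agency in agencies:
    logger.info(
        "Agency %s (%s): %s", agency["id"], agency["name"], agency.get("website")
    )
```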

scrapers/lighthouse.py

Lines changed: 6 additions & 4 deletions
@@ -1,7 +1,7 @@
-from scrapers.base_api_client import ApiClient
+from .scrapers.base_api_client import ApiClient
+from . import settings


-GOOGLE_API_KEY = "" # os.environ['GOOGLE_API_KEY']
 PAGE_INSIGHTS_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
 MOBILE_FRIENDLY_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlTestingTools/mobileFriendlyTest:run" # from what i have tested, very hard to automate

@@ -15,7 +15,7 @@


 class PageInsightsClient(ApiClient):
-    def __init__(self, api_uri=PAGE_INSIGHTS_ENDPOINT, api_key=GOOGLE_API_KEY):
+    def __init__(self, api_uri=PAGE_INSIGHTS_ENDPOINT, api_key=settings.GOOGLE_API_KEY):
         ApiClient.__init__(self, api_uri, api_key)

     def get_page_insights(self, url, category):

@@ -24,7 +24,9 @@ def get_page_insights(self, url, category):


 class GoogleMobileFriendlyClient(ApiClient):
-    def __init__(self, api_uri=MOBILE_FRIENDLY_ENDPOINT, api_key=GOOGLE_API_KEY):
+    def __init__(
+        self, api_uri=MOBILE_FRIENDLY_ENDPOINT, api_key=settings.GOOGLE_API_KEY
+    ):
         self.urls = []
         self.results = []
         ApiClient.__init__(self, api_uri, api_key)
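
A short hedged sketch of how the client above might be called. PageInsightsClient and get_page_insights(url, category) are taken from the diff; the example URL, the category value, and the shape of the return value are assumptions (the method body is not shown here):

```python
# Illustrative only; this snippet does not appear in the commit.
from scrapers.lighthouse import PageInsightsClient

# api_key defaults to settings.GOOGLE_API_KEY per the constructor above.
client = PageInsightsClient()

# "accessibility" is one of the report categories the PageSpeed Insights v5
# API accepts; the exact values this project passes are not shown in the diff.
report = client.get_page_insights("https://example.gov", "accessibility")
print(report)
```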

scrapers/process_agency_info.py

Lines changed: 13 additions & 17 deletions
@@ -1,10 +1,12 @@
-import os
 import requests
 import logging
-from scrapers.social_scraper import SocialScraper
-from scrapers.security_scraper import SecurityScraper
-from scrapers.accessibility_scraper import AccessibilityScraper
-from agency_dataaccessor import AgencyDataAccessor
+from .scrapers.social_scraper import SocialScraper
+from .scrapers.security_scraper import SecurityScraper
+from .scrapers.accessibility_scraper import AccessibilityScraper
+from .agency_dataaccessor import AgencyDataAccessor
+from . import settings
+
+logger = logging.getLogger(__name__)


 class AgencyInfo:

@@ -24,15 +26,12 @@ def process_agency_info(self):
             # HTTP Get on agency url
             agency_url = self.agency.get("website", None)
             if agency_url is None or agency_url == "":
-                print(
-                    f"Website url is not available for {self.agency['id']}, name: {self.agency['name']}"
-                )
-                logging.error(
+                logger.error(
                     f"Website url is not available for {self.agency['id']}, name: {self.agency['name']}"
                 )
                 self.agency_dataaccessor.update_agency_info(self.agency)
                 return
-            print(f"Scraping the website {agency_url}")
+            logger.info(f"Scraping the website {agency_url}")
             page = requests.get(agency_url, timeout=30)
             # Initialize scrapers
             socialScraper = SocialScraper(page, agency_url)

@@ -45,7 +44,7 @@ def process_agency_info(self):
             # Figure out the google_api_key and then fix the below buckets
             for bucket in self.buckets:
                 if bucket == "security_and_privacy":
-                    if os.environ.get("GOOGLE_API_KEY", None) is not None:
+                    if settings.GOOGLE_API_KEY:
                         profile_info[
                             bucket
                         ] = securityScraper.get_security_privacy_info()

@@ -56,7 +55,7 @@
                         social_media_info, contact_info
                     )
                 elif bucket == "website_accessibility":
-                    if os.environ.get("GOOGLE_API_KEY", None) is not None:
+                    if settings.GOOGLE_API_KEY:
                         profile_info[
                             bucket
                         ] = accessibilityScraper.get_website_accessibility_info()

@@ -71,9 +70,6 @@
             self.agency_dataaccessor.enrich_agency_info_with_scrape_info(agency_details)
             return agency_details
         except Exception as ex:
-            logging.error(
-                f"An error occurred while processing the agency information: {str(ex)}"
-            )
-            print(
-                f"An error occurred while processing the agency information: {str(ex)}"
+            logger.error(
+                ex, "An error occurred while processing the agency information"
             )
