Commit b33a8e1 (merge commit; parents 2b968f9 and b885aea)

12 files changed: +175 additions, -88 deletions

README.md

Lines changed: 4 additions & 10 deletions
@@ -110,20 +110,14 @@ Note: The scrapers live in an independent environment not necessarily in the sa
 # enter the password when prompted. It can be any password that you wish to use.
 # It is used for login to the admin website.
 ```
-- Start up the webserver so we can create a user for the scraper.
+- Start up the webserver
 ```bash
 python3 manage.py runserver
 ```
-- Visit localhost:8000/admin and follow the UI to add a new user named "scraper", set the password to whatever you would like but make note of it.
-
-- In a new terminal tab, create a token for the scraper user using the following command
-```bash
-python3 manage.py drf_create_token scraper
-```
-Finally, the database is ready to go! We are now ready to run the server:
-
 Navigate in your browser to `http://127.0.0.1:8000/admin`. Log in with the new admin user you just created. Click on Agencys and you should see a list of
-agencies.
+agencies created with the ``fill_agency_objects`` command.
+
+To set up the scraper, read [the scraper README](scrapers/README.rst).
 
 ## Code formatting
 GovLens enforces code style using [Black](https://github.com/psf/black) and pep8 rules using [Flake8](http://flake8.pycqa.org/en/latest/).
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+"""Idempotent management command to create the scraper user with a DRF token
+"""
+from django.core.management.base import BaseCommand
+from django.contrib.auth.models import User
+from rest_framework.authtoken.models import Token
+
+SCRAPER_USERNAME = "scraper"
+
+
+class Command(BaseCommand):
+    help = "Get or create a scraper user with a Django REST Framework token"
+
+    def add_arguments(self, parser):
+        pass
+
+    def handle(self, *args, **options):
+        user, created = User.objects.get_or_create(username=SCRAPER_USERNAME)
+        user.save()
+
+        if created:
+            self.stdout.write(f"Created new user with username {SCRAPER_USERNAME}")
+        else:
+            self.stdout.write(f"User {SCRAPER_USERNAME} already exists.")
+
+        token, created = Token.objects.get_or_create(user=user)
+        self.stdout.write(f"The token for the user {SCRAPER_USERNAME} is {token}")
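The command above is safe to run repeatedly because both the user and the token go through `get_or_create`. The idempotency can be illustrated outside Django; below is a minimal sketch where a plain dict stands in for the ORM (the helper names are illustrative, not the project's API):

```python
# Sketch of the get-or-create idempotency used by the management command,
# with in-memory dicts standing in for Django's ORM. Illustrative only.
import secrets

users = {}   # username -> user record
tokens = {}  # username -> token string

def get_or_create_user(username):
    """Return (user, created); never creates a duplicate."""
    if username in users:
        return users[username], False
    users[username] = {"username": username}
    return users[username], True

def get_or_create_token(username):
    """Return the existing token, or mint one on the first call."""
    if username not in tokens:
        tokens[username] = secrets.token_hex(20)
    return tokens[username]

user, created = get_or_create_user("scraper")
first_token = get_or_create_token("scraper")
# Running it again changes nothing: same user, same token.
user2, created2 = get_or_create_user("scraper")
```

This is why the README can tell contributors to run the command at any time: a second invocation simply reports that the user exists and prints the same token.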

dev_requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 black
 flake8
+coloredlogs==10.0

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ simplejson==3.16.0
 sqlparse==0.3.0
 urllib3==1.24.2
 apscheduler==3.6.0
+python-dotenv==0.11.0

scrapers/README.rst

Lines changed: 26 additions & 16 deletions
@@ -27,28 +27,38 @@ Directory Structure
 ├── security_scraper.py - scrapes for HTTPS & privacy policy
    └── social_scraper.py - scrapes for phone number, email, address, social media
 
-Requirements
-============
+Quick Start
+===========
+
+Configuration
+~~~~~~~~~~~~~
+
+There are a few required environment variables. The easiest way to set them in development is to create a file named ``.env`` in the root directory of this repository (don't commit this file) containing the following text::
+
+    GOVLENS_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+    GOVLENS_API_ENDPOINT=http://127.0.0.1:8000/api/agencies/
+    GOOGLE_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXX
+
+To get the ``GOOGLE_API_TOKEN``, visit https://developers.google.com/speed/docs/insights/v5/get-started
+
+To get the ``GOVLENS_API_TOKEN``, run ``python3 manage.py create_scraper_user``. Copy the token from the command output and paste it into the ``.env`` file.
+
+Execution
+~~~~~~~~~
 
-Google Lighthouse API Key
-~~~~~~~~~~~~~~~~~~~~~~~~~
-Get the API key for accessing lighthouse from here: https://developers.google.com/speed/docs/insights/v5/get-started (click on the button get key)
+Once you have created the ``.env`` file as described above, run the scraper::
 
-Put that key in GOOGLE_API_KEY environment variable.
+    # run the following from the root directory of the repository
+    python3 -m scrapers.scrape_handler
 
-Running the Scrapers
-====================
-``scrape_handler.py`` is the entry point for scraping.
-When we run from our local machine, we get the list of agencies and start scraping them.
-But when deployed to AWS, the scraper is invoked by the schedule and ``scrape_handler.scrape_data()`` is the method hooked up to the lambda.
+Design
+======
 
-Local
-~~~~~
-If running from local, the following command should run the scraper::
+The scraper is intended to be used both locally and on AWS Lambda.
 
-    python scraper.py
+The ``scrapers`` directory in the root of this repository is the top-level Python package for this project. This means that any absolute imports should begin with ``scrapers.MODULE_NAME_HERE``.
 
-Make sure to set the environment variable to your local endpoint.
+``scrapers/scrape_handler.py`` is the main Python module invoked. On AWS Lambda, the method ``scrape_handler.scrape_data()`` is imported and called directly.
 
 AWS Lambda
 ~~~~~~~~~~
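The `.env` file added by this commit uses the simple `KEY=value` format that python-dotenv (newly added to requirements.txt) reads. Purely to illustrate that format, here is a minimal hand-rolled parser; it is not the project's code, which presumably delegates to python-dotenv:

```python
# Minimal illustration of the KEY=value format used by the .env file above.
# The project itself loads this via python-dotenv; this parser exists only
# to show what the file contains.
def parse_dotenv(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

example = """\
GOVLENS_API_TOKEN=XXXXXXXX
GOVLENS_API_ENDPOINT=http://127.0.0.1:8000/api/agencies/
GOOGLE_API_TOKEN=XXXX
"""
env = parse_dotenv(example)
```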

scrapers/__init__.py

Whitespace-only changes.

scrapers/agency_api_service.py

Lines changed: 15 additions & 10 deletions
@@ -1,25 +1,30 @@
-import os
+import logging
 
 import requests
 
+from . import settings
+
+logger = logging.getLogger(__name__)
+
 
 class AgencyApiService:
     def __init__(self):
-        # If environment variable is set, we use the corresponding api (usually local); otherwise the govlens api
-        if os.environ.get("govlens_api", None) is None:
-            self.base_url = (
-                "http://govlens.us-east-2.elasticbeanstalk.com/api/agencies/"
-            )
-        else:
-            self.base_url = os.environ["govlens_api"]
+        self.base_url = settings.GOVLENS_API_ENDPOINT
 
     def get_all_agencies(self):
         try:
             all_agency_list = self._get(self.base_url)
             return all_agency_list
         except Exception as ex:
-            print(f"Error while retrieving all the agency information: {str(ex)}")
+            logger.error("Error while retrieving all the agency information: %s", ex)
 
     def _get(self, url):
-        response = requests.get(url, headers={"Content-type": "application/json"})
+        response = requests.get(
+            url,
+            headers={
+                "Content-type": "application/json",
+                "Authorization": "Token {}".format(settings.GOVLENS_API_TOKEN),
+            },
+        )
+        response.raise_for_status()
        return response.json()
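The new `Authorization` header follows DRF's TokenAuthentication scheme, where the key is sent as `Token <key>`. A tiny sketch of the header construction, factored into a helper for clarity (`build_headers` is my name, not the project's):

```python
# Sketch: the request headers built inside AgencyApiService._get.
# build_headers is an illustrative helper, not part of the project.
def build_headers(api_token):
    return {
        "Content-type": "application/json",
        "Authorization": "Token {}".format(api_token),
    }

headers = build_headers("abc123")
```

Note the scheme word is literally `Token` (DRF's default), not `Bearer`; sending the wrong scheme yields a 401, which the new `raise_for_status()` call now surfaces as an exception instead of a silent JSON decode failure.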

scrapers/lighthouse.py

Lines changed: 6 additions & 4 deletions
@@ -1,7 +1,7 @@
-from scrapers.base_api_client import ApiClient
+from .scrapers.base_api_client import ApiClient
+from . import settings
 
 
-GOOGLE_API_KEY = ""  # os.environ['GOOGLE_API_KEY']
 PAGE_INSIGHTS_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
 MOBILE_FRIENDLY_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlTestingTools/mobileFriendlyTest:run"  # from what i have tested, very hard to automate
 
@@ -15,7 +15,7 @@
 
 
 class PageInsightsClient(ApiClient):
-    def __init__(self, api_uri=PAGE_INSIGHTS_ENDPOINT, api_key=GOOGLE_API_KEY):
+    def __init__(self, api_uri=PAGE_INSIGHTS_ENDPOINT, api_key=settings.GOOGLE_API_KEY):
         ApiClient.__init__(self, api_uri, api_key)
 
     def get_page_insights(self, url, category):
@@ -24,7 +24,9 @@ def get_page_insights(self, url, category):
 
 
 class GoogleMobileFriendlyClient(ApiClient):
-    def __init__(self, api_uri=MOBILE_FRIENDLY_ENDPOINT, api_key=GOOGLE_API_KEY):
+    def __init__(
+        self, api_uri=MOBILE_FRIENDLY_ENDPOINT, api_key=settings.GOOGLE_API_KEY
+    ):
         self.urls = []
         self.results = []
         ApiClient.__init__(self, api_uri, api_key)
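The PageSpeed Insights v5 endpoint referenced above takes the target page, the audit category, and the API key as query parameters. A sketch of assembling such a request URL with stdlib `urlencode` (parameter assembly only, no network call; `build_insights_url` is an illustrative helper, not the project's `ApiClient`):

```python
# Sketch: assembling a PageSpeed Insights v5 request URL, with the API key
# now sourced from settings.GOOGLE_API_KEY. No network call is made here.
from urllib.parse import urlencode

PAGE_INSIGHTS_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def build_insights_url(page_url, category, api_key):
    query = urlencode({"url": page_url, "category": category, "key": api_key})
    return f"{PAGE_INSIGHTS_ENDPOINT}?{query}"

url = build_insights_url("https://example.gov", "ACCESSIBILITY", "KEY")
```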

scrapers/process_agency_info.py

Lines changed: 13 additions & 17 deletions
@@ -1,10 +1,12 @@
-import os
 import requests
 import logging
-from scrapers.social_scraper import SocialScraper
-from scrapers.security_scraper import SecurityScraper
-from scrapers.accessibility_scraper import AccessibilityScraper
-from agency_dataaccessor import AgencyDataAccessor
+from .scrapers.social_scraper import SocialScraper
+from .scrapers.security_scraper import SecurityScraper
+from .scrapers.accessibility_scraper import AccessibilityScraper
+from .agency_dataaccessor import AgencyDataAccessor
+from . import settings
+
+logger = logging.getLogger(__name__)
 
 
 class AgencyInfo:
@@ -24,15 +26,12 @@ def process_agency_info(self):
         # HTTP Get on agency url
         agency_url = self.agency.get("website", None)
         if agency_url is None or agency_url == "":
-            print(
-                f"Website url is not available for {self.agency['id']}, name: {self.agency['name']}"
-            )
-            logging.error(
+            logger.error(
                 f"Website url is not available for {self.agency['id']}, name: {self.agency['name']}"
             )
             self.agency_dataaccessor.update_agency_info(self.agency)
             return
-        print(f"Scraping the website {agency_url}")
+        logger.info(f"Scraping the website {agency_url}")
         page = requests.get(agency_url, timeout=30)
         # Initialize scrapers
         socialScraper = SocialScraper(page, agency_url)
@@ -45,7 +44,7 @@ def process_agency_info(self):
         # Figure out the google_api_key and then fix the below buckets
         for bucket in self.buckets:
             if bucket == "security_and_privacy":
-                if os.environ.get("GOOGLE_API_KEY", None) is not None:
+                if settings.GOOGLE_API_KEY:
                     profile_info[
                         bucket
                     ] = securityScraper.get_security_privacy_info()
@@ -56,7 +55,7 @@ def process_agency_info(self):
                         social_media_info, contact_info
                     )
             elif bucket == "website_accessibility":
-                if os.environ.get("GOOGLE_API_KEY", None) is not None:
+                if settings.GOOGLE_API_KEY:
                     profile_info[
                         bucket
                     ] = accessibilityScraper.get_website_accessibility_info()
@@ -71,9 +70,6 @@ def process_agency_info(self):
             self.agency_dataaccessor.enrich_agency_info_with_scrape_info(agency_details)
             return agency_details
         except Exception as ex:
-            logging.error(
-                f"An error occurred while processing the agency information: {str(ex)}"
-            )
-            print(
-                f"An error occurred while processing the agency information: {str(ex)}"
+            logger.error(
+                "An error occurred while processing the agency information: %s", ex
             )
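Worth noting when reading the logging changes in this file: stdlib `logging` treats the first positional argument as the format string, so the message should come first and the exception should be interpolated via a `%s` placeholder (passing the exception object first would log the exception's text and drop the message). A self-contained sketch of that pattern (the logger name `demo` is illustrative):

```python
# Demonstrates the stdlib logging call pattern used after this commit:
# the first argument is the format string; extra args are interpolated
# lazily via %-style placeholders.
import io
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.ERROR)
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

try:
    raise ValueError("boom")
except Exception as ex:
    # Correct form: message first, exception interpolated as an argument.
    logger.error("An error occurred while processing the agency information: %s", ex)

output = stream.getvalue()
```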

scrapers/scrape_handler.py

Lines changed: 9 additions & 14 deletions
@@ -1,35 +1,30 @@
-import os
 import logging
 from .process_agency_info import AgencyInfo
 from .agency_api_service import AgencyApiService
 
+from . import settings
+
+settings.setup_logging()
+
+logger = logging.getLogger(__name__)
+
 
 # method invoked by lambda
 def scrape_data(event, context=None):
     agencies = event["agencies"]
     if event.get("agencies", None) is None or len(agencies) <= 0:
-        print("No Agency information was passed to scrape")
+        logger.warning("No Agency information was passed to scrape")
         return
 
     for agency in agencies:
         agency_instance = AgencyInfo(agency)
         agency_instance.process_agency_info()
 
 
-# if running from local, we get the list of agencies and scrape one by one.
 if __name__ == "__main__":
-    # If running from local, set the environment variable to your local
-    logging.basicConfig(
-        filename="Scraper_Errors.log",
-        level=logging.ERROR,
-        format="%(asctime)s %(message)s",
-    )
-    os.environ[
-        "govlens_api"
-    ] = "http://govlens.us-east-2.elasticbeanstalk.com/api/agencies/"
-    os.environ["GOOGLE_API_KEY"] = ""
+
     agency_api_service = AgencyApiService()
     agencies = agency_api_service.get_all_agencies()
     event = {"agencies": agencies}
     scrape_data(event)
-    print("SCRAPED")
+    logger.info("Finished scraping")
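One caveat in the unchanged guard above: `agencies = event["agencies"]` executes before the `event.get("agencies", None) is None` check, so an event with no `agencies` key raises `KeyError` before the guard can fire. A defensive variant of the guard logic (my sketch, not the committed code):

```python
# Sketch of a defensive rewrite of the scrape_data guard. The committed
# code indexes event["agencies"] before the None-check, so a missing key
# raises KeyError first; using .get avoids that. Illustrative only.
def should_scrape(event):
    agencies = event.get("agencies") or []
    return len(agencies) > 0

ok_full = should_scrape({"agencies": [{"id": 1}]})
ok_empty = should_scrape({"agencies": []})
ok_missing = should_scrape({})  # no KeyError
```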
