Initial commit
SimmonsRitchie authored Aug 10, 2024
0 parents commit bb6050a
Showing 25 changed files with 1,747 additions and 0 deletions.
10 changes: 10 additions & 0 deletions .deploy.sh
@@ -0,0 +1,10 @@
#!/bin/bash
pipenv run scrapy list | xargs -I {} pipenv run scrapy crawl {} -s LOG_ENABLED=False &

# Output to the screen every 9 minutes to prevent a travis timeout
# https://stackoverflow.com/a/40800348
export PID=$!
while [[ `ps -p $PID | tail -n +2` ]]; do
  echo 'Deploying'
  sleep 540
done
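
The keep-alive loop in `.deploy.sh` can be sketched as a reusable function. This is a minimal sketch under stated assumptions, not the repository's script: `run_with_heartbeat` is a hypothetical helper name, it checks liveness with `kill -0` instead of parsing `ps` output, and it adds a `wait` so the background command's exit status is returned, which the original script does not do. The 1-second interval and the `sleep 1` stand-in for the crawl are for demonstration only.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): run a command in the
# background, print a heartbeat while it is alive, then return its status.
run_with_heartbeat() {
  local interval="$1"
  shift
  "$@" &
  local pid=$!
  # kill -0 sends no signal; it only reports whether the process exists.
  while kill -0 "$pid" 2>/dev/null; do
    echo 'Deploying'
    sleep "$interval"
  done
  # wait returns the exit status of the finished background job.
  wait "$pid"
}

# Demo: a short sleep stands in for the long-running crawl.
run_with_heartbeat 1 sleep 1
echo "exit status: $?"
```

In the real script the interval would stay at 540 seconds. Returning the job's status is a design choice: it would let a CI step fail when the crawl fails, whereas the loop in `.deploy.sh` always ends successfully.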
9 changes: 9 additions & 0 deletions .flake8
@@ -0,0 +1,9 @@
[flake8]
ignore = E203,E741,W503,W504
exclude =
    .git,
    .venv,
    venv,
    */__pycache__/*,
    tests/files/*,
max-line-length = 88
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
tests/files/* linguist-generated
28 changes: 28 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,28 @@
## What's this PR do?
<!-- eg. This PR updates the scraper for Cleveland City Council because of changes to how they display their meeting schedule. -->

## Why are we doing this?
<!-- eg. The website's layout was recently updated, causing our existing scraper to fail. This change ensures our scraper remains functional and continues to provide timely updates on council meetings. -->

## Steps to manually test
<!-- Text here is not always necessary but it is generally recommended in order to aid a reviewer.
eg.
1. Ensure the project is installed:
```
pipenv sync --dev
```
2. Activate the virtual env and enter the pipenv shell:
```
pipenv shell
```
3. Run the spider:
```
scrapy crawl <spider-name> -O test_output.csv
```
4. Monitor the output and ensure no errors are raised.
5. Inspect `test_output.csv` to ensure the data looks valid.
-->

## Are there any smells or added technical debt to note?
<!-- eg. The new scraping logic includes a more complex parsing routine, which might be less efficient. Future optimization or a more robust parsing strategy may be needed if the website's layout continues to evolve. -->
48 changes: 48 additions & 0 deletions .github/workflows/archive.yml
@@ -0,0 +1,48 @@
name: Archive

on:
  schedule:
    - cron: "7 11 * * *"
  workflow_dispatch:

env:
  CI: true
  PYTHON_VERSION: 3.9
  PIPENV_VENV_IN_PROJECT: true
  SCRAPY_SETTINGS_MODULE: city_scrapers.settings.archive
  AUTOTHROTTLE_MAX_DELAY: 30.0
  AUTOTHROTTLE_START_DELAY: 1.5
  AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ env.PYTHON_VERSION }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install Pipenv
        run: pip install --user pipenv

      - name: Cache Python dependencies
        uses: actions/cache@v4
        with:
          path: .venv
          key: ${{ env.PYTHON_VERSION }}-${{ hashFiles('**/Pipfile.lock') }}
          restore-keys: |
            ${{ env.PYTHON_VERSION }}-
            pip-
      - name: Install dependencies
        run: pipenv sync
        env:
          PIPENV_DEFAULT_PYTHON_VERSION: ${{ env.PYTHON_VERSION }}

      - name: Run scrapers
        run: |
          export PYTHONPATH=$(pwd):$PYTHONPATH
          ./.deploy.sh
71 changes: 71 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,71 @@
name: CI

on: [push, pull_request]

env:
  CI: true
  PIPENV_VENV_IN_PROJECT: true
  AUTOTHROTTLE_MAX_DELAY: 30.0
  AUTOTHROTTLE_START_DELAY: 1.5
  AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0

jobs:
  check:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.9]

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Pipenv
        run: pip install --user pipenv

      - name: Cache Python dependencies
        uses: actions/cache@v4
        with:
          path: .venv
          key: pip-${{ matrix.python-version }}-${{ hashFiles('**/Pipfile.lock') }}
          restore-keys: |
            pip-${{ matrix.python-version }}-
            pip-
      - name: Install dependencies
        run: pipenv sync --dev
        env:
          PIPENV_DEFAULT_PYTHON_VERSION: ${{ matrix.python-version }}

      - name: Check imports with isort
        run: |
          pipenv run isort . --check-only
      - name: Check style with black
        run: |
          pipenv run black . --check
      - name: Lint with flake8
        run: |
          pipenv run flake8 .
      - name: Test with pytest
        # Ignores exit code 5 (no tests collected)
        run: |
          pipenv run pytest || [ $? -eq 5 ]
      - name: Validate output with scrapy
        if: github.event_name == 'pull_request'
        run: |
          git checkout ${{ github.base_ref }}
          git checkout $(git show-ref | grep pull | awk '{ print $2 }')
          git diff-index --name-only --diff-filter=d $(git merge-base HEAD ${{ github.base_ref }}) | \
          grep -Pio '(?<=/spiders/).*(?=\.py)' | \
          xargs pipenv run scrapy validate
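
The `Validate output with scrapy` step turns a list of changed file paths into spider names with a Perl-style lookbehind and lookahead. The sketch below isolates that one stage so it can be tried without a repository. The paths in `changed_paths` and the function name `spider_names` are hypothetical, `echo` stands in for `pipenv run`, and the `-r` flag on `xargs` is an addition that is not in the workflow. It assumes GNU `grep` (for `-P`) and GNU `xargs` (for `-r`).

```shell
#!/bin/bash
# Hypothetical changed paths, standing in for `git diff-index --name-only`.
changed_paths='city_scrapers/spiders/example_board.py
tests/test_example_board.py
README.md
city_scrapers/spiders/example_council.py'

# Same pattern as the workflow: print only the text between "/spiders/"
# and ".py", so paths outside a spiders/ directory produce no output.
spider_names() {
  grep -Pio '(?<=/spiders/).*(?=\.py)'
}

printf '%s\n' "$changed_paths" | spider_names

# With GNU xargs, -r skips the command when no spider files changed.
printf '%s\n' "$changed_paths" | spider_names | xargs -r echo scrapy validate
```

Without `-r`, GNU `xargs` still runs its command once when the input is empty, so a pull request that touches no spider files would invoke `scrapy validate` with no arguments.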
73 changes: 73 additions & 0 deletions .github/workflows/cron.yml
@@ -0,0 +1,73 @@
name: Cron

on:
  schedule:
    # Set any time that you'd like scrapers to run (in UTC)
    - cron: "1 6 * * *"
  workflow_dispatch:

env:
  CI: true
  PYTHON_VERSION: 3.9
  PIPENV_VENV_IN_PROJECT: true
  SCRAPY_SETTINGS_MODULE: city_scrapers.settings.prod
  WAYBACK_ENABLED: true
  AUTOTHROTTLE_MAX_DELAY: 30.0
  AUTOTHROTTLE_START_DELAY: 1.5
  AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0
  # Add secrets for the platform you're using and uncomment here
  # AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  # AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  # S3_BUCKET: ${{ secrets.S3_BUCKET }}
  # AZURE_ACCOUNT_KEY: ${{ secrets.AZURE_ACCOUNT_KEY }}
  # AZURE_ACCOUNT_NAME: ${{ secrets.AZURE_ACCOUNT_NAME }}
  # AZURE_CONTAINER: ${{ secrets.AZURE_CONTAINER }}
  # AZURE_STATUS_CONTAINER: ${{ secrets.AZURE_STATUS_CONTAINER }}
  # GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
  # GCS_BUCKET: ${{ secrets.GCS_BUCKET }}
  # Setup Sentry, add the DSN to secrets and uncomment here
  # SENTRY_DSN: ${{ secrets.SENTRY_DSN }}

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ env.PYTHON_VERSION }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install Pipenv
        run: pip install --user pipenv

      - name: Cache Python dependencies
        uses: actions/cache@v4
        with:
          path: .venv
          key: ${{ env.PYTHON_VERSION }}-${{ hashFiles('**/Pipfile.lock') }}
          restore-keys: |
            ${{ env.PYTHON_VERSION }}-
            pip-
      - name: Install dependencies
        run: pipenv sync
        env:
          PIPENV_DEFAULT_PYTHON_VERSION: ${{ env.PYTHON_VERSION }}

      - name: Run scrapers
        run: |
          export PYTHONPATH=$(pwd):$PYTHONPATH
          ./.deploy.sh
      - name: Combine output feeds
        run: |
          export PYTHONPATH=$(pwd):$PYTHONPATH
          pipenv run scrapy combinefeeds -s LOG_ENABLED=False
      - name: Prevent workflow deactivation
        uses: gautamkrishnar/keepalive-workflow@v1
        with:
          committer_username: "citybureau-bot"
          committer_email: "documenters@citybureau.org"
136 changes: 136 additions & 0 deletions .gitignore
@@ -0,0 +1,136 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
docs/_site/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
.pytest_cache
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy
tutorial/

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# IDEs and editors
/.idea
.vscode
.project
.classpath
.c9/
*.launch
.settings/
*.sublime-workspace

# virtualenv
.venv
venv/
ENV/
documenters-aggregator/
city-scrapers/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# cyrus
json_conversions.md
.citybureau/
city_scrapers/caps.py

# legistar cache
_cache/

# src dir from git commit packages in requirements.txt
src/

# OS files
.DS_Store

# validation logs
logs/*.log
travis/*.json

# output files: local gitignore added to city_scrapers/local_outputs/