refactor pipeline
ceroberoz committed Aug 23, 2024
1 parent 8335a98 commit 2c0a6aa
Showing 6 changed files with 94 additions and 29 deletions.
56 changes: 29 additions & 27 deletions .github/workflows/scrape.yml
@@ -1,15 +1,16 @@
-name: Scrape and Upload to Google Sheets
+name: Adjust Column Widths in Google Sheets

 on:
   push:
     branches: [master]
   pull_request:
     branches: [master]
-  schedule:
-    - cron: "0 0 * * *"
+  # Commenting out the schedule for testing purposes
+  # schedule:
+  # - cron: "0 0 * * *"

 jobs:
-  scrape-and-upload:
+  adjust-column-widths:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
@@ -25,31 +26,32 @@ jobs:
           python -m pip install --upgrade pip
           pip install -r requirements.txt pandas google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
-      - name: Install Playwright browsers
-        run: playwright install --with-deps chromium firefox webkit

-      - name: Run scraping
-        env:
-          PYTHONUNBUFFERED: 1
-        run: |
-          chmod +x ./scrape.sh
-          ./scrape.sh
-      - name: Upload to Google Sheets
+      # Commenting out non-relevant steps
+      # - name: Install Playwright browsers
+      # run: playwright install --with-deps chromium firefox webkit

+      # - name: Run scraping
+      # env:
+      # PYTHONUNBUFFERED: 1
+      # run: |
+      # chmod +x ./pipeline/scrape.sh
+      # ./pipeline/scrape.sh

+      # - name: Upload to Google Sheets
+      # env:
+      # GCP_JSON: ${{ secrets.GCP_JSON }}
+      # GOOGLE_SHEETS_ID: ${{ secrets.GOOGLE_SHEETS_ID }}
+      # PYTHONUNBUFFERED: 1
+      # run: python pipeline/upload_to_sheets.py

+      - name: Adjust Column Widths
         env:
           GCP_JSON: ${{ secrets.GCP_JSON }}
           GOOGLE_SHEETS_ID: ${{ secrets.GOOGLE_SHEETS_ID }}
           PYTHONUNBUFFERED: 1
-        run: python upload_to_sheets.py

-      - name: Archive production artifacts
-        uses: actions/upload-artifact@v3
-        with:
-          name: csv-files
-          path: public/*.csv
-          retention-days: 5
+        run: python pipeline/adjust_column_widths.py

-      - name: Cleanup
-        if: always()
-        run: |
-          rm -rf output public
+      # - name: Cleanup
+      # if: always()
+      # run: |
+      # rm -rf output
2 changes: 1 addition & 1 deletion QUICKSTART.md
@@ -5,7 +5,7 @@ This guide provides instructions for setting up and running the id-jobs project
 ## Prerequisites

 - Git
-- Python 3.15+
+- Python 3.12+

 ## Setup

2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

 [![Scrape and Upload to Google Sheets](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml/badge.svg)](https://github.com/ceroberoz/id-jobs/actions/workflows/scrape.yml)
 [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
-[![Python 3.15+](https://img.shields.io/badge/python-3.15+-blue.svg)](https://www.python.org/downloads/)
+[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
 ![Made with Scrapy](https://img.shields.io/badge/Made%20with-Scrapy-green.svg)
 ![Made with Playwright](https://img.shields.io/badge/Made%20with-Playwright-orange.svg)

63 changes: 63 additions & 0 deletions pipeline/adjust_column_widths.py
@@ -0,0 +1,63 @@
+import os
+import json
+from google.oauth2 import service_account
+from googleapiclient.discovery import build
+
+def get_env_var(var_name):
+    value = os.environ.get(var_name)
+    if not value:
+        raise ValueError(f"{var_name} environment variable is not set or is empty")
+    return value
+
+def setup_credentials():
+    gcp_json = get_env_var('GCP_JSON')
+    creds_dict = json.loads(gcp_json)
+    return service_account.Credentials.from_service_account_info(
+        creds_dict, scopes=["https://www.googleapis.com/auth/spreadsheets"])
+
+def adjust_column_widths(spreadsheet_id):
+    creds = setup_credentials()
+    service = build("sheets", "v4", credentials=creds)
+
+    column_widths = {
+        'job_title': 684,
+        'job_location': 255,
+        'job_department': 548,
+        'job_url': 662,
+        'first_seen': 130,
+        'base_salary': 304,
+        'job_type': 112,
+        'job_level': 64,
+        'job_apply_end_date': 225,
+        'last_seen': 170,
+        'is_active': 63,
+        'company': 467,
+        'company_url': 646,
+        'job_board': 72,
+        'job_board_url': 213
+    }
+
+    requests = []
+    for index, (column_name, width) in enumerate(column_widths.items()):
+        requests.append({
+            "updateDimensionProperties": {
+                "range": {
+                    "sheetId": 0,
+                    "dimension": "COLUMNS",
+                    "startIndex": index,
+                    "endIndex": index + 1
+                },
+                "properties": {
+                    "pixelSize": width
+                },
+                "fields": "pixelSize"
+            }
+        })
+
+    body = {"requests": requests}
+    service.spreadsheets().batchUpdate(spreadsheetId=spreadsheet_id, body=body).execute()
+    print("Column widths adjusted successfully.")
+
+if __name__ == "__main__":
+    spreadsheet_id = get_env_var('GOOGLE_SHEETS_ID')
+    adjust_column_widths(spreadsheet_id)
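
Note on pipeline/adjust_column_widths.py: the request-building loop above could be separated from the API call so it can be checked without credentials. The following is only a minimal sketch of that idea, not part of this commit. The function name `build_width_requests` and its `sheet_id` parameter are hypothetical; the request shape is copied from the file above, and it assumes, as the commit does, that the dict order matches the column order in the sheet.

```python
from typing import Dict, List


def build_width_requests(column_widths: Dict[str, int], sheet_id: int = 0) -> List[dict]:
    """Build one updateDimensionProperties request per column, in dict order.

    Hypothetical helper: mirrors the loop in adjust_column_widths(), but
    performs no network calls, so it can be tested offline.
    """
    requests = []
    for index, width in enumerate(column_widths.values()):
        requests.append({
            "updateDimensionProperties": {
                "range": {
                    "sheetId": sheet_id,
                    "dimension": "COLUMNS",
                    "startIndex": index,    # zero-based, inclusive
                    "endIndex": index + 1,  # exclusive
                },
                "properties": {"pixelSize": width},
                "fields": "pixelSize",
            }
        })
    return requests


if __name__ == "__main__":
    # Two of the widths from the commit, used here only as sample input.
    sample = {"job_title": 684, "job_location": 255}
    for request in build_width_requests(sample):
        print(request["updateDimensionProperties"]["range"])
```

The returned list would then be passed as `{"requests": ...}` to the same `batchUpdate` call the script already makes; taking `sheet_id` as a parameter avoids hard-coding the first tab, should the target sheet ever change.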
File renamed without changes.
File renamed without changes.