Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🏗 Build Spider: Cincinnati Civil Service Commission #10

Merged
merged 5 commits into from
Dec 23, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 83 additions & 0 deletions city_scrapers/spiders/cinoh_Civil_Service.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
import datetime

import scrapy
from city_scrapers_core.constants import COMMISSION
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider
from dateutil.parser import parse


class CinohCivilServiceSpider(CityScrapersSpider):
name = "cinoh_Civil_Service"
agency = "Cincinnati Civil Service Commission"
timezone = "America/New_York"
committee_id = "A9HCN931D6BA"
custom_settings = {
"ROBOTSTXT_OBEY": False,
}

# original URL: https://go.boarddocs.com/oh/csc/Board.nsf/vpublic?open
# clicking on meetings tab takes you to meetings index and uses API
# we scrape API instead via POST request and ignore robots file
def start_requests(self):
url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
form_data = {"current_committee_id": self.committee_id}
# send the POST request and use parse method when response is returned
yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
Comment on lines +23 to +27
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for API requests

The request implementation should handle potential API failures gracefully. Consider adding error handling and retries.

     def start_requests(self):
         url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
         form_data = {"current_committee_id": self.committee_id}
-        yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
+        yield scrapy.FormRequest(
+            url,
+            formdata=form_data,
+            callback=self.parse,
+            errback=self.errback_httpbin,
+            dont_filter=True,
+            meta={'dont_retry': False, 'max_retry_times': 3}
+        )
+
+    def errback_httpbin(self, failure):
+        self.logger.error(f"Request failed: {failure.value}")

Committable suggestion skipped: line range outside the PR's diff.


def parse(self, response):
"""
Parse JSON response.
"""

year = str(datetime.datetime.today().year)

data = response.json()

for item in data:
numb = item.get("numberdate")

# skip iteration if meeting is not for current year
if numb is None or year not in numb:
cruznunez marked this conversation as resolved.
Show resolved Hide resolved
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve data validation and error handling

The JSON parsing lacks validation for required fields and could fail silently if the response format changes.

-        data = response.json()
+        try:
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError("Expected JSON array in response")
+        except (ValueError, KeyError) as e:
+            self.logger.error(f"Failed to parse JSON response: {e}")
+            return

         for item in data:
-            numb = item.get("numberdate")
+            try:
+                numb = item["numberdate"]
+                name = item["name"]
+                unique = item["unique"]
+            except KeyError as e:
+                self.logger.warning(f"Missing required field in item: {e}")
+                continue

Committable suggestion skipped: line range outside the PR's diff.


# if we are in current year then parse meeting
meeting = Meeting(
title=item["name"],
description="",
classification=COMMISSION,
start=parse(numb),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure proper timezone handling in date parsing

The date parsing should explicitly handle timezone information to prevent any ambiguity.

-                start=parse(numb),
+                start=parse(numb).replace(tzinfo=None),  # Strip any existing TZ info
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
start=parse(numb),
start=parse(numb).replace(tzinfo=None), # Strip any existing TZ info

end=None,
all_day=False,
time_notes="",
location=self._parse_location(item),
links=self._parse_links(item),
source=self._parse_source(response),
)

meeting["status"] = self._get_status(meeting)
meeting["id"] = self._get_id(meeting)

yield meeting

def _parse_location(self, item):
"""Generate location."""
return {
"name": "Cincinnati Civil Service Commission",
"address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
}

def _parse_links(self, item):
"""Generate links."""
href = (
f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
f"open&id={item['unique']}&current_committee_id={self.committee_id}"
)
return [{"title": "Agenda", "href": href}]
Comment on lines +74 to +80
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add URL validation for generated links

The link generation should validate the URL components and handle potential errors.

     def _parse_links(self, item):
         """Generate links."""
+        if not item.get('unique'):
+            self.logger.warning("Missing unique ID for agenda link")
+            return []
+
         href = (
             f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
             f"open&id={item['unique']}&current_committee_id={self.committee_id}"
         )
+        
+        # Validate URL length and characters
+        if len(href) > 2048:  # Standard URL length limit
+            self.logger.warning("Generated URL exceeds length limit")
+            return []
+
         return [{"title": "Agenda", "href": href}]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _parse_links(self, item):
"""Generate links."""
href = (
f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
f"open&id={item['unique']}&current_committee_id={self.committee_id}"
)
return [{"title": "Agenda", "href": href}]
def _parse_links(self, item):
"""Generate links."""
if not item.get('unique'):
self.logger.warning("Missing unique ID for agenda link")
return []
href = (
f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
f"open&id={item['unique']}&current_committee_id={self.committee_id}"
)
# Validate URL length and characters
if len(href) > 2048: # Standard URL length limit
self.logger.warning("Generated URL exceeds length limit")
return []
return [{"title": "Agenda", "href": href}]


def _parse_source(self, response):
"""
Generate source. Instead of returning API URL
we return the more user-friendly web page we can see this data from.
"""
return "https://go.boarddocs.com/oh/csc/Board.nsf/vpublic?open#tab-meetings"
Loading