🏗 Build Spider: Cincinnati Civil Service Commission #10
Conversation
Walkthrough: The changes introduce a new spider, CinohCivilServiceSpider, which scrapes meeting data for the Cincinnati Civil Service Commission from its BoardDocs feed, along with a test suite and fixture data covering the spider's parsing logic.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (4)
tests/test_cinoh_Civil_Service.py (2)
25-27: Add context for the expected count.

The assertion `len(parsed_items) == 21` lacks context. Consider adding a comment explaining why exactly 21 items are expected, or make this test more dynamic based on the actual test data.

 def test_count():
+    # Expecting 21 items as per the test data JSON file
+    # This represents all meetings for the current year
     assert len(parsed_items) == 21
29-31: Consider making date assertions more maintainable.

Instead of hardcoding dates, consider deriving them from the frozen time to make tests more maintainable.

+from datetime import timedelta

 def test_title():
-    assert parsed_items[0]["title"] == "November 7, 2024 Civil Service Commission"
+    expected_date = (datetime(2024, 11, 6) + timedelta(days=1)).strftime("%B %-d, %Y")
+    assert parsed_items[0]["title"] == f"{expected_date} Civil Service Commission"

 def test_start():
-    assert parsed_items[0]["start"] == datetime(2024, 11, 7, 0, 0)
+    assert parsed_items[0]["start"] == datetime(2024, 11, 6) + timedelta(days=1)

Also applies to: 37-39
city_scrapers/spiders/cinoh_Civil_Service.py (2)
15-17: Document rationale for ignoring robots.txt.

While ignoring robots.txt might be necessary for this scraper, it's important to document why this exception is being made and ensure it complies with the site's terms of service.

Add a comment explaining the rationale:

 custom_settings = {
+    # BoardDocs API requires bypassing robots.txt to access meeting data
+    # This has been verified as acceptable use per <reference/justification>
     "ROBOTSTXT_OBEY": False,
 }
63-68: Move hardcoded location to configuration.

Consider moving the hardcoded location details to a configuration file for better maintainability.

Create a configuration file (e.g., config.py):

LOCATIONS = {
    "cinoh_Civil_Service": {
        "name": "Cincinnati Civil Service Commission",
        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
    }
}

Then update the method:

 def _parse_location(self, item):
-    return {
-        "name": "Cincinnati Civil Service Commission",
-        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
-    }
+    from .config import LOCATIONS
+    return LOCATIONS[self.name]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- city_scrapers/spiders/cinoh_Civil_Service.py (1 hunks)
- tests/files/cinoh_Civil_Service.json (1 hunks)
- tests/test_cinoh_Civil_Service.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/files/cinoh_Civil_Service.json
🔇 Additional comments (2)
tests/test_cinoh_Civil_Service.py (2)
1-9: LGTM! Well-organized imports.
The imports are properly organized and include all necessary dependencies for testing.
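For context, those imports plausibly follow the standard city-scrapers test boilerplate sketched below (a reconstruction based on what the tests reference — freeze_time, file_response, COMMISSION — not a verbatim copy of the file):

from datetime import datetime
from os.path import dirname, join

import pytest
from city_scrapers_core.constants import COMMISSION
from city_scrapers_core.utils import file_response
from freezegun import freeze_time

from city_scrapers.spiders.cinoh_Civil_Service import CinohCivilServiceSpider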
11-22: Verify test data file existence.

The test relies on a JSON file at tests/files/cinoh_Civil_Service.json. Let's verify its presence and content.
✅ Verification successful
Test data file is properly set up and valid
The verification confirms that:
- The test data file exists at the expected location
- The JSON structure is valid
- The file contains 219 records, which is a reasonable size for test data
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and basic structure of the test data file

# Test 1: Check if the JSON file exists
if [ -f "tests/files/cinoh_Civil_Service.json" ]; then
    echo "✓ Test data file exists"
    # Test 2: Validate JSON structure
    jq . "tests/files/cinoh_Civil_Service.json" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "✓ JSON is valid"
        # Test 3: Show basic structure
        jq 'length' "tests/files/cinoh_Civil_Service.json"
    else
        echo "✗ Invalid JSON format"
    fi
else
    echo "✗ Test data file not found"
fi
Length of output: 262
tests/test_cinoh_Civil_Service.py (Outdated)
def test_count():
    assert len(parsed_items) == 21


def test_title():
    assert parsed_items[0]["title"] == "November 7, 2024 Civil Service Commission"


def test_description():
    assert parsed_items[0]["description"] == ""


def test_start():
    assert parsed_items[0]["start"] == datetime(2024, 11, 7, 0, 0)


def test_end():
    assert parsed_items[0]["end"] is None


def test_time_notes():
    assert parsed_items[0]["time_notes"] == ""


def test_id():
    assert (
        parsed_items[0]["id"]
        == "cinoh_Civil_Service/202411070000/x/november_7_2024_civil_service_commission"
    )


def test_status():
    assert parsed_items[0]["status"] == "tentative"


def test_location():
    assert parsed_items[0]["location"] == {
        "name": "Cincinnati Civil Service Commission",
        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
    }


def test_source():
    assert (
        parsed_items[0]["source"]
        == "https://go.boarddocs.com/oh/csc/Board.nsf/vpublic?open#tab-meetings"
    )


def test_links():
    assert parsed_items[0]["links"] == [
        {
            "title": "Agenda",
            "href": "https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?open&id=CZQLFH5631AD&current_committee_id=A9HCN931D6BA",  # noqa
        }
    ]


def test_classification():
    assert parsed_items[0]["classification"] == COMMISSION


@pytest.mark.parametrize("item", parsed_items)
def test_all_day(item):
    assert item["all_day"] is False
🛠️ Refactor suggestion
Consider adding edge case tests.
The test suite thoroughly covers the happy path but could benefit from additional tests for:
- Empty or malformed JSON responses
- Meetings with missing required fields
- Meetings outside the current year
Would you like me to provide examples of these additional test cases?
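As a sketch, such tests might look like the following (the _fake_response helper and the module-level spider instance are illustrative assumptions, not part of the current suite):

import json

from scrapy.http import TextResponse


def _fake_response(body):
    # Wrap an arbitrary JSON string in a minimal Scrapy response
    return TextResponse(
        url="https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList",
        body=body,
        encoding="utf-8",
    )


def test_empty_response_yields_no_items():
    # An empty JSON array should produce no meetings
    assert list(spider.parse(_fake_response("[]"))) == []


def test_missing_numberdate_is_skipped():
    # Items without "numberdate" are skipped by the spider's year filter
    body = json.dumps([{"name": "Broken meeting", "unique": "ABC123"}])
    assert list(spider.parse(_fake_response(body))) == []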
title=item["name"], | ||
description="", | ||
classification=COMMISSION, | ||
start=parse(numb), |
🛠️ Refactor suggestion
Ensure proper timezone handling in date parsing
The date parsing should explicitly handle timezone information to prevent any ambiguity.
- start=parse(numb),
+ start=parse(numb).replace(tzinfo=None), # Strip any existing TZ info
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
start=parse(numb).replace(tzinfo=None),  # Strip any existing TZ info
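For reference, a small illustration of the dateutil behavior that motivates this suggestion (the date strings are made-up examples, not actual API payloads):

from dateutil.parser import parse

# A bare date string parses to a naive datetime...
naive = parse("2024-11-07")                 # datetime(2024, 11, 7, 0, 0)

# ...while a string carrying a UTC offset yields an aware one
aware = parse("2024-11-07T00:00:00-05:00")  # tzinfo=tzoffset(None, -18000)

# Stripping tzinfo, as suggested above, makes the two comparable
assert aware.replace(tzinfo=None) == naive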
    def start_requests(self):
        url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
        form_data = {"current_committee_id": self.committee_id}
        # send the POST request and use parse method when response is returned
        yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
🛠️ Refactor suggestion
Add error handling for API requests
The request implementation should handle potential API failures gracefully. Consider adding error handling and retries.
 def start_requests(self):
     url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
     form_data = {"current_committee_id": self.committee_id}
-    yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
+    yield scrapy.FormRequest(
+        url,
+        formdata=form_data,
+        callback=self.parse,
+        errback=self.errback_httpbin,
+        dont_filter=True,
+        meta={'dont_retry': False, 'max_retry_times': 3}
+    )
+
+def errback_httpbin(self, failure):
+    self.logger.error(f"Request failed: {failure.value}")
Committable suggestion skipped: line range outside the PR's diff.
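As an alternative to a hand-rolled errback, Scrapy's built-in RetryMiddleware can be tuned through settings; a minimal sketch using standard Scrapy settings (the values shown are illustrative, not project requirements):

custom_settings = {
    "ROBOTSTXT_OBEY": False,
    # RetryMiddleware is enabled by default; these settings tune it
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,  # retry each failed request up to 3 times
    "RETRY_HTTP_CODES": [500, 502, 503, 504, 522, 524, 408, 429],
}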
        data = response.json()

        for item in data:
            numb = item.get("numberdate")

            # skip iteration if meeting is not for current year
            if numb is None or year not in numb:
                continue
Improve data validation and error handling
The JSON parsing lacks validation for required fields and could fail silently if the response format changes.
-data = response.json()
+try:
+    data = response.json()
+    if not isinstance(data, list):
+        raise ValueError("Expected JSON array in response")
+except (ValueError, KeyError) as e:
+    self.logger.error(f"Failed to parse JSON response: {e}")
+    return

 for item in data:
-    numb = item.get("numberdate")
+    try:
+        numb = item["numberdate"]
+        name = item["name"]
+        unique = item["unique"]
+    except KeyError as e:
+        self.logger.warning(f"Missing required field in item: {e}")
+        continue
Committable suggestion skipped: line range outside the PR's diff.
    def _parse_links(self, item):
        """Generate links."""
        href = (
            f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
            f"open&id={item['unique']}&current_committee_id={self.committee_id}"
        )
        return [{"title": "Agenda", "href": href}]
🛠️ Refactor suggestion
Add URL validation for generated links
The link generation should validate the URL components and handle potential errors.
 def _parse_links(self, item):
     """Generate links."""
+    if not item.get('unique'):
+        self.logger.warning("Missing unique ID for agenda link")
+        return []
+
     href = (
         f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
         f"open&id={item['unique']}&current_committee_id={self.committee_id}"
     )
+
+    # Validate URL length and characters
+    if len(href) > 2048:  # Standard URL length limit
+        self.logger.warning("Generated URL exceeds length limit")
+        return []
+
     return [{"title": "Agenda", "href": href}]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def _parse_links(self, item):
        """Generate links."""
        if not item.get('unique'):
            self.logger.warning("Missing unique ID for agenda link")
            return []

        href = (
            f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
            f"open&id={item['unique']}&current_committee_id={self.committee_id}"
        )

        # Validate URL length and characters
        if len(href) > 2048:  # Standard URL length limit
            self.logger.warning("Generated URL exceeds length limit")
            return []

        return [{"title": "Agenda", "href": href}]
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (2)
city_scrapers/spiders/cinoh_Civil_Service.py (2)
23-27: ⚠️ Potential issue — Add error handling for API requests

The current implementation lacks error handling for potential API failures. Add error handling and retries:

 def start_requests(self):
     url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
     form_data = {"current_committee_id": self.committee_id}
-    yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
+    yield scrapy.FormRequest(
+        url,
+        formdata=form_data,
+        callback=self.parse,
+        errback=self.errback_httpbin,
+        dont_filter=True,
+        meta={'dont_retry': False, 'max_retry_times': 3}
+    )
+
+def errback_httpbin(self, failure):
+    self.logger.error(f"Request failed: {failure.value}")
76-82: ⚠️ Potential issue — Validate link generation parameters

The link generation should validate the required parameters. Add validation:

 def _parse_links(self, item):
     """Generate links."""
+    if not item.get('unique'):
+        self.logger.warning("Missing unique ID for agenda link")
+        return []
+
     href = (
         f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
         f"open&id={item['unique']}&current_committee_id={self.committee_id}"
     )
+
+    # Validate URL components
+    if len(href) > 2048:  # Standard URL length limit
+        self.logger.warning("Generated URL exceeds length limit")
+        return []
+
     return [{"title": "Agenda", "href": href}]
🧹 Nitpick comments (2)
city_scrapers/spiders/cinoh_Civil_Service.py (2)
16-18: Consider the implications of ignoring robots.txt

While disabling ROBOTSTXT_OBEY might be necessary for this scraper, it's important to:
- Document why this is necessary
- Ensure we're not overwhelming the server with requests
- Consider implementing rate limiting

Consider adding rate limiting settings:

 custom_settings = {
     "ROBOTSTXT_OBEY": False,
+    "DOWNLOAD_DELAY": 1,  # Add 1 second delay between requests
+    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # Limit concurrent requests
 }
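Scrapy's AutoThrottle extension is another way to get adaptive rate limiting; a sketch using standard Scrapy settings (the delay values are illustrative):

custom_settings = {
    "ROBOTSTXT_OBEY": False,
    # AutoThrottle adjusts request delays based on observed server latency
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1,
    "AUTOTHROTTLE_MAX_DELAY": 10,
}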
69-74: Consider making location configurable

The hardcoded location might not handle special meetings or venue changes. Consider making it configurable:

+default_location = {
+    "name": "Cincinnati Civil Service Commission",
+    "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
+}
+
 def _parse_location(self, item):
     """Generate location."""
+    # Check if meeting has a special location
+    special_location = item.get("location")
+    if special_location:
+        return {
+            "name": special_location.get("name", self.default_location["name"]),
+            "address": special_location.get("address", self.default_location["address"]),
+        }
-    return {
-        "name": "Cincinnati Civil Service Commission",
-        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
-    }
+    return self.default_location
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- city_scrapers/spiders/cinoh_Civil_Service.py (1 hunks)
- tests/test_cinoh_Civil_Service.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_cinoh_Civil_Service.py
🔇 Additional comments (1)
city_scrapers/spiders/cinoh_Civil_Service.py (1)
34-36: Add validation for JSON response

The JSON parsing lacks validation and error handling. Add proper validation:
+try:
     data = response.json()
+    if not isinstance(data, list):
+        raise ValueError("Expected JSON array in response")
+except (ValueError, json.JSONDecodeError) as e:
+    self.logger.error(f"Failed to parse JSON response: {e}")
+    return
Likely invalid or redundant comment.
            meeting_date = parse(numb)
            if meeting_date < lower_limit:
                continue
Improve date comparison logic
When comparing dates, ensure both dates are timezone-aware or naive to prevent unexpected behavior.
- meeting_date = parse(numb)
+ meeting_date = parse(numb).replace(tzinfo=None)
- if meeting_date < lower_limit:
+ if meeting_date < lower_limit.replace(tzinfo=None):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
meeting_date = parse(numb).replace(tzinfo=None)
if meeting_date < lower_limit.replace(tzinfo=None):
    continue
LGTM
What's this PR do?
This PR adds a scraper for the Cincinnati Civil Service Commission. The scraper fetches an API endpoint via a POST request; the site's robots.txt file has to be ignored for this to work.
Why are we doing this?
Scraper requested from spreadsheet.
Steps to manually test
Run the spider and review test_output.csv to ensure the data looks valid (a sketch for querying the endpoint directly follows).
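For a quick manual check of the endpoint itself, something like the sketch below works (uses the requests library; the committee ID is copied from the test fixture and may not match production):

import requests

resp = requests.post(
    "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList",
    data={"current_committee_id": "A9HCN931D6BA"},
    timeout=30,
)
resp.raise_for_status()
meetings = resp.json()
print(f"{len(meetings)} meetings returned")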
Are there any smells or added technical debt to note?
No
Summary by CodeRabbit
New Features
- Introduced the CinohCivilServiceSpider for scraping meeting data from the Cincinnati Civil Service Commission.

Tests
- Added tests for the CinohCivilServiceSpider, validating parsing logic and ensuring data accuracy through various assertions.