
🏗 Build Spider: Cincinnati Civil Service Commission #10

Merged

merged 5 commits into City-Bureau:main on Dec 23, 2024

Conversation

@cruznunez cruznunez commented Nov 8, 2024

What's this PR do?

This PR adds a scraper for the Cincinnati Civil Service Commission. The scraper fetches meeting data from an API endpoint via a POST request; to do so, we have to ignore the site's robots.txt file. A trimmed sketch of the entry point is below.
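A minimal sketch of the approach (names match the actual file; the real class extends the project's base spider and also builds city_scrapers Meeting items from the response):

import scrapy


class CinohCivilServiceSpider(scrapy.Spider):
    name = "cinoh_Civil_Service"
    committee_id = "A9HCN931D6BA"
    # BoardDocs disallows crawlers via robots.txt, so the spider opts out
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
        # POST the committee id; the endpoint responds with a JSON array of meetings
        form_data = {"current_committee_id": self.committee_id}
        yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)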

Why are we doing this?

Scraper requested via the spreadsheet.

Steps to manually test

  1. Ensure the project is installed:
     pipenv sync --dev
  2. Activate the virtual env by entering the pipenv shell:
     pipenv shell
  3. Run the spider:
     scrapy crawl cinoh_Civil_Service -O test_output.csv
  4. Monitor the output and ensure no errors are raised.
  5. Inspect test_output.csv to ensure the data looks valid.
  6. Ensure all tests pass:
     pytest

Are there any smells or added technical debt to note?

No

Summary by CodeRabbit

  • New Features

    • Introduced the CinohCivilServiceSpider for scraping meeting data from the Cincinnati Civil Service Commission.
    • Added a structured JSON file containing comprehensive meeting records for easy integration.
  • Tests

    • Implemented a new test suite for the CinohCivilServiceSpider, validating parsing logic and ensuring data accuracy through various assertions.

coderabbitai bot commented Nov 8, 2024

Warning

Rate limit exceeded

@cruznunez has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 17 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 548441c and afb8a21.

📒 Files selected for processing (1)
  • city_scrapers/spiders/cinoh_Civil_Service.py (1 hunks)

Walkthrough

The changes introduce a new spider, CinohCivilServiceSpider, designed to scrape meeting data from the Cincinnati Civil Service Commission. It includes a JSON file with structured meeting records and a test suite to validate the spider's functionality. The spider performs a POST request to retrieve meeting data, processes it to create Meeting objects, and ensures only relevant meetings are included based on the defined criteria. The test suite checks the parsing logic against expected data to ensure accuracy.
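To make the filtering concrete, here is a standalone sketch of the skip logic (the helper name filter_current_meetings is hypothetical; in the spider these checks live inline in the parse method, and the lower_limit value here is an example):

from datetime import datetime

from dateutil.parser import parse


def filter_current_meetings(data, lower_limit):
    """Yield records whose numberdate is in the target year and on/after lower_limit."""
    year = str(lower_limit.year)
    for item in data:
        numb = item.get("numberdate")
        # skip records without a date or outside the target year
        if numb is None or year not in numb:
            continue
        if parse(numb) < lower_limit:
            continue
        yield item


# Example: keep only meetings on or after Jan 1, 2024
records = [{"numberdate": "2024-11-07"}, {"numberdate": "2023-05-01"}]
print(list(filter_current_meetings(records, datetime(2024, 1, 1))))
# -> [{'numberdate': '2024-11-07'}]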

Changes

  • city_scrapers/spiders/cinoh_Civil_Service.py: Added the CinohCivilServiceSpider class to scrape meeting data, including methods for requests and parsing. Configured to ignore robots.txt and handle JSON responses, filtering out old meetings.
  • tests/files/cinoh_Civil_Service.json: Updated the JSON file to include structured meeting records with fields: unique, name, current, preliveoak, numberdate, and unid.
  • tests/test_cinoh_Civil_Service.py: Introduced a test suite for CinohCivilServiceSpider using pytest, validating the parsing logic with assertions against the updated JSON response. Includes a frozen-time context for consistent datetime values.

Poem

🐰 In the city where meetings convene,
A spider now dances, swift and keen.
With data so fresh, it hops through the year,
Gathering tales of the meetings we cheer!
JSON records, all neat and bright,
Validated with tests, oh what a sight! 🌟



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (4)
tests/test_cinoh_Civil_Service.py (2)

25-27: Add context for the expected count.

The assertion len(parsed_items) == 21 lacks context. Consider adding a comment explaining why exactly 21 items are expected, or make this test more dynamic based on the actual test data.

 def test_count():
+    # Expecting 21 items as per the test data JSON file
+    # This represents all meetings for the current year
     assert len(parsed_items) == 21
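For the dynamic variant, one possible sketch (hedged: it assumes the module-level parsed_items fixture from the existing test setup, the 2024 freeze date, and that the spider keeps current-year meetings only):

import json

# Derive the expected count from the fixture instead of hardcoding 21
with open("tests/files/cinoh_Civil_Service.json") as f:
    fixture = json.load(f)

FROZEN_YEAR = "2024"  # matches the frozen time used by the tests
expected_count = sum(
    1 for record in fixture
    if record.get("numberdate") and FROZEN_YEAR in record["numberdate"]
)


def test_count():
    assert len(parsed_items) == expected_count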

29-31: Consider making date assertions more maintainable.

Instead of hardcoding dates, consider deriving them from the frozen time to make tests more maintainable.

+from datetime import timedelta

 def test_title():
-    assert parsed_items[0]["title"] == "November 7, 2024 Civil Service Commission"
+    expected_date = (datetime(2024, 11, 6) + timedelta(days=1)).strftime("%B %-d, %Y")
+    assert parsed_items[0]["title"] == f"{expected_date} Civil Service Commission"

 def test_start():
-    assert parsed_items[0]["start"] == datetime(2024, 11, 7, 0, 0)
+    assert parsed_items[0]["start"] == datetime(2024, 11, 6) + timedelta(days=1)

Also applies to: 37-39

city_scrapers/spiders/cinoh_Civil_Service.py (2)

15-17: Document rationale for ignoring robots.txt

While ignoring robots.txt might be necessary for this scraper, it's important to document why this exception is being made and ensure it complies with the site's terms of service.

Add a comment explaining the rationale:

     custom_settings = {
+        # BoardDocs API requires bypassing robots.txt to access meeting data
+        # This has been verified as acceptable use per <reference/justification>
         "ROBOTSTXT_OBEY": False,
     }

63-68: Move hardcoded location to configuration

Consider moving the hardcoded location details to a configuration file for better maintainability.

Create a configuration file (e.g., config.py):

LOCATIONS = {
    "cinoh_Civil_Service": {
        "name": "Cincinnati Civil Service Commission",
        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
    }
}

Then update the method:

     def _parse_location(self, item):
-        return {
-            "name": "Cincinnati Civil Service Commission",
-            "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
-        }
+        from .config import LOCATIONS
+        return LOCATIONS[self.name]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 975f98e and 5ba89df.

📒 Files selected for processing (3)
  • city_scrapers/spiders/cinoh_Civil_Service.py (1 hunks)
  • tests/files/cinoh_Civil_Service.json (1 hunks)
  • tests/test_cinoh_Civil_Service.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/files/cinoh_Civil_Service.json
🔇 Additional comments (2)
tests/test_cinoh_Civil_Service.py (2)

1-9: LGTM! Well-organized imports.

The imports are properly organized and include all necessary dependencies for testing.


11-22: Verify test data file existence.

The test relies on a JSON file at tests/files/cinoh_Civil_Service.json. Let's verify its presence and content.

✅ Verification successful

Test data file is properly set up and valid

The verification confirms that:

  • The test data file exists at the expected location
  • The JSON structure is valid
  • The file contains 219 records, which is a reasonable size for test data
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and basic structure of the test data file

# Test 1: Check if the JSON file exists
if [ -f "tests/files/cinoh_Civil_Service.json" ]; then
    echo "✓ Test data file exists"
    # Test 2: Validate JSON structure
    jq . "tests/files/cinoh_Civil_Service.json" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        echo "✓ JSON is valid"
        # Test 3: Show basic structure
        jq 'length' "tests/files/cinoh_Civil_Service.json"
    else
        echo "✗ Invalid JSON format"
    fi
else
    echo "✗ Test data file not found"
fi

Length of output: 262

Comment on lines 24 to 89

def test_count():
    assert len(parsed_items) == 21


def test_title():
    assert parsed_items[0]["title"] == "November 7, 2024 Civil Service Commission"


def test_description():
    assert parsed_items[0]["description"] == ""


def test_start():
    assert parsed_items[0]["start"] == datetime(2024, 11, 7, 0, 0)


def test_end():
    assert parsed_items[0]["end"] is None


def test_time_notes():
    assert parsed_items[0]["time_notes"] == ""


def test_id():
    assert (
        parsed_items[0]["id"]
        == "cinoh_Civil_Service/202411070000/x/november_7_2024_civil_service_commission"
    )


def test_status():
    assert parsed_items[0]["status"] == "tentative"


def test_location():
    assert parsed_items[0]["location"] == {
        "name": "Cincinnati Civil Service Commission",
        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
    }


def test_source():
    assert (
        parsed_items[0]["source"]
        == "https://go.boarddocs.com/oh/csc/Board.nsf/vpublic?open#tab-meetings"
    )


def test_links():
    assert parsed_items[0]["links"] == [
        {
            "title": "Agenda",
            "href": "https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?open&id=CZQLFH5631AD&current_committee_id=A9HCN931D6BA",  # noqa
        }
    ]


def test_classification():
    assert parsed_items[0]["classification"] == COMMISSION


@pytest.mark.parametrize("item", parsed_items)
def test_all_day(item):
    assert item["all_day"] is False

🛠️ Refactor suggestion

Consider adding edge case tests.

The test suite thoroughly covers the happy path but could benefit from additional tests for:

  • Empty or malformed JSON responses
  • Meetings with missing required fields
  • Meetings outside the current year

Would you like me to provide examples of these additional test cases?
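For example, possible shapes for those tests (a sketch only; spider is assumed to be a module-level CinohCivilServiceSpider instance from the existing test setup, and the fake response stubs just the .json() call the parser uses):

from unittest.mock import MagicMock


def _fake_response(payload):
    # Minimal stand-in for a scrapy Response exposing .json()
    response = MagicMock()
    response.json.return_value = payload
    return response


def test_empty_response_yields_nothing():
    assert list(spider.parse(_fake_response([]))) == []


def test_missing_numberdate_is_skipped():
    items = list(spider.parse(_fake_response([{"name": "No date", "unique": "X1"}])))
    assert items == []


def test_meeting_outside_current_year_is_skipped():
    payload = [{"name": "Old meeting", "unique": "X2", "numberdate": "2001-01-01"}]
    assert list(spider.parse(_fake_response(payload))) == []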

city_scrapers/spiders/cinoh_Civil_Service.py: outdated review comment (resolved)
                title=item["name"],
                description="",
                classification=COMMISSION,
                start=parse(numb),

🛠️ Refactor suggestion

Ensure proper timezone handling in date parsing

The date parsing should explicitly handle timezone information to prevent any ambiguity.

-                start=parse(numb),
+                start=parse(numb).replace(tzinfo=None),  # Strip any existing TZ info
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-                start=parse(numb),
+                start=parse(numb).replace(tzinfo=None),  # Strip any existing TZ info

Comment on lines +22 to +26
    def start_requests(self):
        url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
        form_data = {"current_committee_id": self.committee_id}
        # send the POST request and use parse method when response is returned
        yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)

🛠️ Refactor suggestion

Add error handling for API requests

The request implementation should handle potential API failures gracefully. Consider adding error handling and retries.

     def start_requests(self):
         url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
         form_data = {"current_committee_id": self.committee_id}
-        yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
+        yield scrapy.FormRequest(
+            url,
+            formdata=form_data,
+            callback=self.parse,
+            errback=self.errback_httpbin,
+            dont_filter=True,
+            meta={'dont_retry': False, 'max_retry_times': 3}
+        )
+
+    def errback_httpbin(self, failure):
+        self.logger.error(f"Request failed: {failure.value}")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 35 to 42
        data = response.json()

        for item in data:
            numb = item.get("numberdate")

            # skip iteration if meeting is not for current year
            if numb is None or year not in numb:
                continue

⚠️ Potential issue

Improve data validation and error handling

The JSON parsing lacks validation for required fields and could fail silently if the response format changes.

-        data = response.json()
+        try:
+            data = response.json()
+            if not isinstance(data, list):
+                raise ValueError("Expected JSON array in response")
+        except (ValueError, KeyError) as e:
+            self.logger.error(f"Failed to parse JSON response: {e}")
+            return

         for item in data:
-            numb = item.get("numberdate")
+            try:
+                numb = item["numberdate"]
+                name = item["name"]
+                unique = item["unique"]
+            except KeyError as e:
+                self.logger.warning(f"Missing required field in item: {e}")
+                continue

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +70 to +76
    def _parse_links(self, item):
        """Generate links."""
        href = (
            f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
            f"open&id={item['unique']}&current_committee_id={self.committee_id}"
        )
        return [{"title": "Agenda", "href": href}]

🛠️ Refactor suggestion

Add URL validation for generated links

The link generation should validate the URL components and handle potential errors.

     def _parse_links(self, item):
         """Generate links."""
+        if not item.get('unique'):
+            self.logger.warning("Missing unique ID for agenda link")
+            return []
+
         href = (
             f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
             f"open&id={item['unique']}&current_committee_id={self.committee_id}"
         )
+        
+        # Validate URL length and characters
+        if len(href) > 2048:  # Standard URL length limit
+            self.logger.warning("Generated URL exceeds length limit")
+            return []
+
         return [{"title": "Agenda", "href": href}]
📝 Committable suggestion


Suggested change

-    def _parse_links(self, item):
-        """Generate links."""
-        href = (
-            f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
-            f"open&id={item['unique']}&current_committee_id={self.committee_id}"
-        )
-        return [{"title": "Agenda", "href": href}]
+    def _parse_links(self, item):
+        """Generate links."""
+        if not item.get('unique'):
+            self.logger.warning("Missing unique ID for agenda link")
+            return []
+
+        href = (
+            f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
+            f"open&id={item['unique']}&current_committee_id={self.committee_id}"
+        )
+
+        # Validate URL length and characters
+        if len(href) > 2048:  # Standard URL length limit
+            self.logger.warning("Generated URL exceeds length limit")
+            return []
+
+        return [{"title": "Agenda", "href": href}]

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
city_scrapers/spiders/cinoh_Civil_Service.py (2)

23-27: ⚠️ Potential issue

Add error handling for API requests

The current implementation lacks error handling for potential API failures.

Add error handling and retries:

 def start_requests(self):
     url = "https://go.boarddocs.com/oh/csc/Board.nsf/BD-GetMeetingsList"
     form_data = {"current_committee_id": self.committee_id}
-    yield scrapy.FormRequest(url, formdata=form_data, callback=self.parse)
+    yield scrapy.FormRequest(
+        url,
+        formdata=form_data,
+        callback=self.parse,
+        errback=self.errback_httpbin,
+        dont_filter=True,
+        meta={'dont_retry': False, 'max_retry_times': 3}
+    )
+
+def errback_httpbin(self, failure):
+    self.logger.error(f"Request failed: {failure.value}")

76-82: ⚠️ Potential issue

Validate link generation parameters

The link generation should validate the required parameters.

Add validation:

     def _parse_links(self, item):
         """Generate links."""
+        if not item.get('unique'):
+            self.logger.warning("Missing unique ID for agenda link")
+            return []
+
         href = (
             f"https://go.boarddocs.com/oh/csc/Board.nsf/Download-AgendaDetailed?"
             f"open&id={item['unique']}&current_committee_id={self.committee_id}"
         )
+        
+        # Validate URL components
+        if len(href) > 2048:  # Standard URL length limit
+            self.logger.warning("Generated URL exceeds length limit")
+            return []
+
         return [{"title": "Agenda", "href": href}]
🧹 Nitpick comments (2)
city_scrapers/spiders/cinoh_Civil_Service.py (2)

16-18: Consider the implications of ignoring robots.txt

While disabling ROBOTSTXT_OBEY might be necessary for this scraper, it's important to:

  1. Document why this is necessary
  2. Ensure we're not overwhelming the server with requests
  3. Consider implementing rate limiting

Consider adding rate limiting settings:

 custom_settings = {
     "ROBOTSTXT_OBEY": False,
+    "DOWNLOAD_DELAY": 1,  # Add 1 second delay between requests
+    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # Limit concurrent requests
 }

69-74: Consider making location configurable

The hardcoded location might not handle special meetings or venue changes.

Consider making it configurable:

+    default_location = {
+        "name": "Cincinnati Civil Service Commission",
+        "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
+    }
+
     def _parse_location(self, item):
         """Generate location."""
+        # Check if meeting has a special location
+        special_location = item.get("location")
+        if special_location:
+            return {
+                "name": special_location.get("name", self.default_location["name"]),
+                "address": special_location.get("address", self.default_location["address"]),
+            }
-        return {
-            "name": "Cincinnati Civil Service Commission",
-            "address": "805 Central Ave, Suite 200, Cincinnati, OH 45202",
-        }
+        return self.default_location
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cebebe and 548441c.

📒 Files selected for processing (2)
  • city_scrapers/spiders/cinoh_Civil_Service.py (1 hunks)
  • tests/test_cinoh_Civil_Service.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_cinoh_Civil_Service.py
🔇 Additional comments (1)
city_scrapers/spiders/cinoh_Civil_Service.py (1)

34-36: ⚠️ Potential issue

Add validation for JSON response

The JSON parsing lacks validation and error handling.

Add proper validation:

+    try:
         data = response.json()
+        if not isinstance(data, list):
+            raise ValueError("Expected JSON array in response")
+    except (ValueError, json.JSONDecodeError) as e:
+        self.logger.error(f"Failed to parse JSON response: {e}")
+        return

Likely invalid or redundant comment.

Comment on lines +46 to +48
            meeting_date = parse(numb)
            if meeting_date < lower_limit:
                continue

⚠️ Potential issue

Improve date comparison logic

When comparing dates, ensure both dates are timezone-aware or naive to prevent unexpected behavior.

-    meeting_date = parse(numb)
+    meeting_date = parse(numb).replace(tzinfo=None)
-    if meeting_date < lower_limit:
+    if meeting_date < lower_limit.replace(tzinfo=None):
📝 Committable suggestion


Suggested change
-            meeting_date = parse(numb)
-            if meeting_date < lower_limit:
-                continue
+            meeting_date = parse(numb).replace(tzinfo=None)
+            if meeting_date < lower_limit.replace(tzinfo=None):
+                continue

@cruznunez cruznunez requested a review from lamle-ea December 20, 2024 14:53
@lamle-ea lamle-ea left a comment

LGTM

@lamle-ea lamle-ea merged commit 7ca4020 into City-Bureau:main Dec 23, 2024
2 checks passed