
🏗️ Build spider: Atlantic City #9

Merged
merged 5 commits into City-Bureau:main on Dec 23, 2024

Conversation

@msrezaie (Contributor) commented Dec 6, 2024

What's this PR do?

Adds a spider to scrape meeting information for Atlantic City. The scraper uses the city's API endpoints to fetch the meeting data, as sketched below.
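In practice the flow has two hops: one request lists meetings from GetCalendarMeetings, then one request per meeting fetches details from GetMeeting. A minimal sketch of that chain, assuming the detail endpoint takes a meetingID query parameter (an assumption; see the spider for the real code):

```python
# Sketch only: a method on the spider class. The "meetingID" query
# parameter name is an assumption, not confirmed by this PR.
import scrapy

def parse(self, response):
    # GetCalendarMeetings returns a JSON array of meeting entries
    for item in response.json():
        detail_url = (
            "https://www.acnj.gov/api/data/GetMeeting"
            f"?meetingID={item['id']}"
        )
        yield scrapy.Request(
            detail_url,
            callback=self.parse_meeting,
            cb_kwargs={"item": item},  # parse_meeting(response, item=...)
        )
```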

Why are we doing this?

Scraper requested from spreadsheet.

Steps to manually test

  1. Ensure the project is installed:

     pipenv sync --dev

  2. Activate the virtual env and enter the pipenv shell:

     pipenv shell

  3. Run the spider:

     scrapy crawl atconj_Atlantic_City -O test_output.csv

  4. Monitor the output and ensure no errors are raised.
  5. Inspect test_output.csv to ensure the data looks valid.
  6. Ensure all tests pass:

     pytest

Are there any smells or added technical debt to note?

No

Summary by CodeRabbit

  • New Features

    • Introduced a web scraper for Atlantic City meetings, enabling users to access meeting data from a dedicated API.
    • Added JSON files containing meeting entries and detailed meeting information for enhanced data management.
  • Tests

    • Implemented a suite of unit tests to validate the functionality of the Atlantic City scraper, ensuring accurate data parsing and integrity.

@coderabbitai (bot) commented Dec 6, 2024

Walkthrough

The changes introduce a web scraper for Atlantic City meetings, implemented in the atconj_Atlantic_City.py file. This scraper, defined by the AtlanticCitySpider class, fetches meeting data from specified API endpoints, processes it, and constructs meeting objects with relevant details. Additionally, two JSON files are added to provide sample meeting data and detailed meeting information. A new test suite is created to validate the functionality of the scraper, ensuring it accurately parses and processes the meeting data.

Changes

| File | Change Summary |
| --- | --- |
| city_scrapers/spiders/atconj_Atlantic_City.py | Added AtlanticCitySpider class with methods for scraping meeting data and processing responses. |
| tests/files/atconj_Atlantic_City.json | New JSON file containing a series of meeting entries with various properties. |
| tests/files/atconj_Atlantic_City_meeting_detail.json | New JSON file with detailed information about a specific meeting. |
| tests/test_atconj_Atlantic_City.py | New test suite for AtlanticCitySpider to validate parsing and processing of meeting data. |
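For flavor, a minimal sketch of how such a suite can load a fixture and exercise a parser method; the specific assertion and fixture fields here are assumptions, not the suite's actual tests:

```python
import json
from os.path import dirname, join

from city_scrapers.spiders.atconj_Atlantic_City import AtlanticCitySpider

spider = AtlanticCitySpider()

# Load the detail fixture added in this PR
with open(
    join(dirname(__file__), "files", "atconj_Atlantic_City_meeting_detail.json")
) as f:
    meeting_detail = json.load(f)

def test_location_keys():
    # _parse_location is expected to return a dict with name/address keys
    # (a standard Meeting location shape; assumption for this sketch)
    location = spider._parse_location(meeting_detail)
    assert set(location) >= {"name", "address"}
```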

Poem

🐇 In the city where meetings convene,
A spider now weaves through the screen.
With data it gathers, both near and far,
Atlantic City shines like a star.
JSONs and tests, all in a row,
Hopping along, watch the changes flow! 🌟


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fca950 and 68a85bc.

📒 Files selected for processing (1)
  • city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • city_scrapers/spiders/atconj_Atlantic_City.py


@msrezaie requested review from LienDang and lamle-ea on December 6, 2024 at 17:42
@coderabbitai (bot) left a comment

Actionable comments posted: 5

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 85c2a84 and d61e302.

📒 Files selected for processing (4)
  • city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
  • tests/files/atconj_Atlantic_City.json (1 hunks)
  • tests/files/atconj_Atlantic_City_meeting_detail.json (1 hunks)
  • tests/test_atconj_Atlantic_City.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/files/atconj_Atlantic_City_meeting_detail.json
🔇 Additional comments (1)
tests/files/atconj_Atlantic_City.json (1)

1-842: No issues found with the test data file

The JSON structure appears correct and aligns with the expected data format for testing.

Resolved (outdated) review threads: city_scrapers/spiders/atconj_Atlantic_City.py (4), tests/test_atconj_Atlantic_City.py (1)
@coderabbitai (bot) left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d61e302 and a41e14c.

📒 Files selected for processing (2)
  • city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
  • tests/test_atconj_Atlantic_City.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_atconj_Atlantic_City.py
🔇 Additional comments (7)
city_scrapers/spiders/atconj_Atlantic_City.py (7)

1-23: LGTM! Imports and class setup look good.

The imports are comprehensive and appropriate for the spider's functionality. The class definition follows the project's naming conventions and includes the necessary configuration.


46-46: Fix the misspelling of 'calender_source' to 'calendar_source'

The variable name contains a spelling error.


48-63: LGTM! Request handling is well implemented.

The request methods follow Scrapy best practices and properly handle data flow between requests.


85-92: Adjust logic in _parse_classification method
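The review doesn't spell out the fix, but spiders in this project typically map a meeting-type string onto the constants from city_scrapers_core. A hedged sketch of that pattern (the "meetingTypeName" field is an assumption, not confirmed by this PR):

```python
# Hypothetical keyword-based classification; field name is an assumption.
from city_scrapers_core.constants import (
    BOARD, CITY_COUNCIL, COMMISSION, NOT_CLASSIFIED,
)

def _parse_classification(self, meeting_detail):
    name = (meeting_detail.get("meetingTypeName") or "").lower()
    if "council" in name:
        return CITY_COUNCIL
    if "board" in name:
        return BOARD
    if "commission" in name:
        return COMMISSION
    return NOT_CLASSIFIED
```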


93-120: LGTM! Location and links parsing is well implemented.

The methods handle edge cases appropriately and provide good defaults when needed.


122-127: Ensure timezone-aware datetime comparisons in _get_status
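A minimal sketch of a timezone-aware comparison, assuming the spider's timezone is America/New_York and that meeting_detail exposes a cancellation flag named "isCancelled" (both assumptions):

```python
from datetime import datetime

from dateutil import tz
from city_scrapers_core.constants import CANCELLED, PASSED, TENTATIVE

EASTERN = tz.gettz("America/New_York")

def _get_status(self, meeting_detail, start):
    # Attach a timezone to naive start datetimes before comparing
    if start.tzinfo is None:
        start = start.replace(tzinfo=EASTERN)
    if meeting_detail.get("isCancelled"):  # assumed flag name
        return CANCELLED
    return PASSED if start < datetime.now(EASTERN) else TENTATIVE
```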


24-26: Verify the necessity of disabling robots.txt

While the setting might be necessary for accessing the API endpoints, it's important to ensure this doesn't violate the site's terms of service.

✅ Verification successful

Disabling robots.txt is appropriate for this API-based scraper

The spider uses API endpoints (/api/data/GetCalendarMeetings and /api/data/GetMeeting) to fetch meeting data, not web scraping of HTML pages. The robots.txt from acnj.gov doesn't restrict API access, and APIs typically manage access through other means (rate limiting, authentication). The comprehensive docstring clearly explains this API-based approach.

  • The spider accesses two API endpoints:
    • GetCalendarMeetings: Lists all meetings
    • GetMeeting: Retrieves detailed meeting information
  • The calendar data is also available through a user-friendly interface at https://www.acnj.gov/calendar
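For reference, Scrapy lets a single spider opt out of robots.txt via custom_settings, which is the pattern at issue here. A sketch only; the agency and timezone values are assumptions, not the PR's exact code:

```python
from city_scrapers_core.spiders import CityScrapersSpider

class AtlanticCitySpider(CityScrapersSpider):
    name = "atconj_Atlantic_City"
    agency = "Atlantic City"          # assumed value
    timezone = "America/New_York"     # assumed value
    # Applies only to this spider, not project-wide
    custom_settings = {"ROBOTSTXT_OBEY": False}
```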
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```bash
#!/bin/bash
# Description: Check if robots.txt exists and its contents
curl -s https://www.acnj.gov/robots.txt

# Check if the API endpoints are specifically mentioned in robots.txt
curl -s https://www.acnj.gov/robots.txt | grep -i "api/data"
```

Length of output: 282

Script:

```bash
#!/bin/bash
# Check the spider implementation to understand API endpoint usage
rg -l "api" city_scrapers/spiders/atconj_Atlantic_City.py

# Look for any comments or documentation about the API usage
rg -A 5 "class AtconjAtlanticCitySpider" city_scrapers/spiders/atconj_Atlantic_City.py
```

Length of output: 192

Script:

```bash
#!/bin/bash
# Get the full content of the spider file to understand the API usage and context
cat city_scrapers/spiders/atconj_Atlantic_City.py

# Check if similar robots.txt settings are used in other spiders
rg "ROBOTSTXT_OBEY.*False" city_scrapers/spiders/
```

Length of output: 4382

Excerpt from city_scrapers/spiders/atconj_Atlantic_City.py (end of the class docstring and the line under review):

```python
    field of the meeting since it is more user friendly to navigate
    than the api endpoints.
    """
    meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
```

🛠️ Refactor suggestion

Consider parameterizing the date range in meetings_url

The URL contains hardcoded dates (2024-2025) which will need manual updates in the future. Consider making these dates dynamic based on the current date.

```diff
-    meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
+    @property
+    def meetings_url(self):
+        start_date = datetime.now().strftime("%m%%2F01%%2F%Y")
+        end_date = (datetime.now().replace(year=datetime.now().year + 1)).strftime("%m%%2F30%%2F%Y")
+        return f"https://www.acnj.gov/api/data/GetCalendarMeetings?end={end_date}+12:00+am&meetingTypeID=all&start={start_date}+12:00+am"
```

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 64 to 83:

```python
    def parse_meeting(self, response, item):
        meeting_detail = json.loads(response.text)

        meeting = Meeting(
            title=item["title"],
            description="",
            classification=self._parse_classification(meeting_detail),
            start=parse(item["start"]),
            end=None,
            all_day=item["allDay"],
            time_notes="",
            location=self._parse_location(meeting_detail),
            links=self._parse_links(meeting_detail),
            source=self.calender_source,
        )

        meeting["status"] = self._get_status(meeting_detail)
        meeting["id"] = int(item["id"])

        yield meeting
```

🛠️ Refactor suggestion

Add error handling for API responses

The parsing assumes the API response will always contain the expected fields. Consider adding error handling for missing or malformed data.

```diff
     def parse_meeting(self, response, item):
-        meeting_detail = json.loads(response.text)
+        try:
+            meeting_detail = json.loads(response.text)
+            if not isinstance(meeting_detail, dict):
+                raise ValueError("Expected dictionary response")
+        except (json.JSONDecodeError, ValueError) as e:
+            self.logger.error(f"Failed to parse meeting detail: {e}")
+            return

         meeting = Meeting(
-            title=item["title"],
+            title=item.get("title", ""),
             description="",
             classification=self._parse_classification(meeting_detail),
-            start=parse(item["start"]),
+            start=parse(item.get("start")) if item.get("start") else None,
             end=None,
-            all_day=item["allDay"],
+            all_day=item.get("allDay", False),
             time_notes="",
             location=self._parse_location(meeting_detail),
             links=self._parse_links(meeting_detail),
             source=self.calender_source,
         )
```

Committable suggestion skipped: line range outside the PR's diff.

@cruznunez left a comment

Besides the issue with the hardcoded date, we can also remove the json dependency.
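Presumably this means switching from json.loads(response.text) to Scrapy's built-in TextResponse.json() (available since Scrapy 2.2), along these lines:

```python
def parse_meeting(self, response, item):
    # response.json() parses the JSON body directly; no `import json` needed
    meeting_detail = response.json()
    ...
```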

Resolved (outdated) review thread: city_scrapers/spiders/atconj_Atlantic_City.py
Commit pushed: made the start and end parameters of the API URL dynamic
@cruznunez left a comment

Nice work with relativedelta. That's pretty cool. One more change with the json dependency and we should be good.
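A sketch of how the dynamic date window might be built with dateutil.relativedelta; the six-month span and the exact string assembly are assumptions based on the original hardcoded URL, not the merged code:

```python
from datetime import datetime
from urllib.parse import quote_plus

from dateutil.relativedelta import relativedelta

now = datetime.now()
# URL-encode the date portion ("/" -> "%2F") to match the API's format
start = quote_plus(now.strftime("%m/%d/%Y")) + "+12:00+am"
end = quote_plus((now + relativedelta(months=6)).strftime("%m/%d/%Y")) + "+12:00+am"

meetings_url = (
    "https://www.acnj.gov/api/data/GetCalendarMeetings"
    f"?end={end}&meetingTypeID=all&start={start}"
)
```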

Resolved (outdated) review thread: city_scrapers/spiders/atconj_Atlantic_City.py
@cruznunez left a comment

LGTM

@msrezaie requested a review from lamle-ea on December 18, 2024 at 18:18
@lamle-ea (Contributor) left a comment

LGTM

@lamle-ea merged commit 121a82b into City-Bureau:main on Dec 23, 2024
2 checks passed