🏗️ Build spider: Atlantic City #9
Conversation
Walkthrough: The changes introduce a web scraper for Atlantic City meetings, implemented in city_scrapers/spiders/atconj_Atlantic_City.py, along with test fixtures and a test suite.
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
- tests/files/atconj_Atlantic_City.json (1 hunks)
- tests/files/atconj_Atlantic_City_meeting_detail.json (1 hunks)
- tests/test_atconj_Atlantic_City.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/files/atconj_Atlantic_City_meeting_detail.json
🔇 Additional comments (1)
tests/files/atconj_Atlantic_City.json (1)
Lines 1-842: No issues found with the test data file.
The JSON structure appears correct and aligns with the expected data format for testing.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
- tests/test_atconj_Atlantic_City.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_atconj_Atlantic_City.py
🔇 Additional comments (7)
city_scrapers/spiders/atconj_Atlantic_City.py (7)
Lines 1-23: LGTM! Imports and class setup look good.
The imports are comprehensive and appropriate for the spider's functionality. The class definition follows the project's naming conventions and includes the necessary configuration.
Line 46: Fix the misspelling of 'calender_source' to 'calendar_source'.
The variable name contains a spelling error.
Lines 48-63: LGTM! Request handling is well implemented.
The request methods follow Scrapy best practices and properly handle data flow between requests.
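To make that flow concrete, here is a minimal sketch of the chaining pattern described above, using Scrapy's cb_kwargs; the GetMeeting "id" query parameter and the calendar item's field names are assumptions for illustration, not confirmed by the diff:

import scrapy

def parse(self, response):
    # For each calendar entry, request the detail endpoint and forward the
    # entry so parse_meeting(response, item) receives it via cb_kwargs.
    for item in response.json():
        yield scrapy.Request(
            f"https://www.acnj.gov/api/data/GetMeeting?id={item['id']}",
            callback=self.parse_meeting,
            cb_kwargs={"item": item},
        )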
Lines 85-92: Adjust logic in the _parse_classification method.
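A hedged sketch of one way to structure that logic, assuming the detail payload carries a meeting-type name (the "MeetingTypeName" field is hypothetical) and using the classification constants from city_scrapers_core:

from city_scrapers_core.constants import BOARD, CITY_COUNCIL, COMMISSION, NOT_CLASSIFIED

def _parse_classification(self, meeting_detail):
    # Match on keywords in the meeting-type name; fall back to
    # NOT_CLASSIFIED rather than guessing when no keyword applies.
    name = (meeting_detail.get("MeetingTypeName") or "").lower()
    if "council" in name:
        return CITY_COUNCIL
    if "board" in name:
        return BOARD
    if "commission" in name:
        return COMMISSION
    return NOT_CLASSIFIED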
Lines 93-120: LGTM! Location and links parsing is well implemented.
The methods handle edge cases appropriately and provide good defaults when needed.
Lines 122-127: Ensure timezone-aware datetime comparisons in _get_status.
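A minimal sketch of what a timezone-aware comparison could look like, assuming the spider defines the usual timezone attribute (e.g. "America/New_York") and that the status constants come from city_scrapers_core; the signature here is illustrative and differs from the PR's:

from datetime import datetime
from zoneinfo import ZoneInfo

from city_scrapers_core.constants import PASSED, TENTATIVE

def _get_status(self, start):
    # Localize a naive start time before comparing, so naive and aware
    # datetimes are never mixed in the comparison.
    tz = ZoneInfo(self.timezone)
    if start.tzinfo is None:
        start = start.replace(tzinfo=tz)
    return PASSED if start < datetime.now(tz) else TENTATIVE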
Lines 24-26: Verify the necessity of disabling robots.txt.
While the setting might be necessary for accessing the API endpoints, it's important to ensure this doesn't violate the site's terms of service.
✅ Verification successful
Disabling robots.txt is appropriate for this API-based scraper.
The spider uses API endpoints (/api/data/GetCalendarMeetings and /api/data/GetMeeting) to fetch meeting data, not web scraping of HTML pages. The robots.txt from acnj.gov doesn't restrict API access, and APIs typically manage access through other means (rate limiting, authentication). The comprehensive docstring clearly explains this API-based approach.
- The spider accesses two API endpoints: GetCalendarMeetings (lists all meetings) and GetMeeting (retrieves detailed meeting information).
- The calendar data is also available through a user-friendly interface at https://www.acnj.gov/calendar
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if robots.txt exists and its contents
curl -s https://www.acnj.gov/robots.txt
# Check if the API endpoints are specifically mentioned in robots.txt
curl -s https://www.acnj.gov/robots.txt | grep -i "api/data"
Length of output: 282
Script:
#!/bin/bash
# Check the spider implementation to understand API endpoint usage
rg -l "api" city_scrapers/spiders/atconj_Atlantic_City.py
# Look for any comments or documentation about the API usage
rg -A 5 "class AtconjAtlanticCitySpider" city_scrapers/spiders/atconj_Atlantic_City.py
Length of output: 192
Script:
#!/bin/bash
# Get the full content of the spider file to understand the API usage and context
cat city_scrapers/spiders/atconj_Atlantic_City.py
# Check if similar robots.txt settings are used in other spiders
rg "ROBOTSTXT_OBEY.*False" city_scrapers/spiders/
Length of output: 4382
    field of the meeting since it is more user friendly to navigate
    than the api endpoints.
    """
    meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
🛠️ Refactor suggestion
Consider parameterizing the date range in meetings_url
The URL contains hardcoded dates (2024-2025) which will need manual updates in the future. Consider making these dates dynamic based on the current date.
- meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
+ @property
+ def meetings_url(self):
+     start_date = datetime.now().strftime("%m%%2F01%%2F%Y")
+     end_date = datetime.now().replace(year=datetime.now().year + 1).strftime("%m%%2F30%%2F%Y")
+     return f"https://www.acnj.gov/api/data/GetCalendarMeetings?end={end_date}+12:00+am&meetingTypeID=all&start={start_date}+12:00+am"
Committable suggestion skipped: line range outside the PR's diff.
def parse_meeting(self, response, item):
    meeting_detail = json.loads(response.text)

    meeting = Meeting(
        title=item["title"],
        description="",
        classification=self._parse_classification(meeting_detail),
        start=parse(item["start"]),
        end=None,
        all_day=item["allDay"],
        time_notes="",
        location=self._parse_location(meeting_detail),
        links=self._parse_links(meeting_detail),
        source=self.calender_source,
    )

    meeting["status"] = self._get_status(meeting_detail)
    meeting["id"] = int(item["id"])

    yield meeting
🛠️ Refactor suggestion
Add error handling for API responses
The parsing assumes the API response will always contain the expected fields. Consider adding error handling for missing or malformed data.
  def parse_meeting(self, response, item):
-     meeting_detail = json.loads(response.text)
+     try:
+         meeting_detail = json.loads(response.text)
+         if not isinstance(meeting_detail, dict):
+             raise ValueError("Expected dictionary response")
+     except (json.JSONDecodeError, ValueError) as e:
+         self.logger.error(f"Failed to parse meeting detail: {e}")
+         return
      meeting = Meeting(
-         title=item["title"],
+         title=item.get("title", ""),
          description="",
          classification=self._parse_classification(meeting_detail),
-         start=parse(item["start"]),
+         start=parse(item.get("start")) if item.get("start") else None,
          end=None,
-         all_day=item["allDay"],
+         all_day=item.get("allDay", False),
          time_notes="",
          location=self._parse_location(meeting_detail),
          links=self._parse_links(meeting_detail),
          source=self.calender_source,
      )
Committable suggestion skipped: line range outside the PR's diff.
- fixed typo
Besides the issue with the hardcoded date, we can also remove the json dependency.
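For reference, a minimal sketch of that change: Scrapy responses expose a built-in .json() helper (Scrapy 2.2+), so the stdlib import can be dropped entirely:

def parse_meeting(self, response, item):
    # response.json() parses the body directly, replacing
    # json.loads(response.text) and the `import json` line.
    meeting_detail = response.json()
    ...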
- made the start and end parameters of the API URL dynamic
Nice work with relativedelta. That's pretty cool. One more change with the json dependency and we should be good.
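A sketch of what the dynamic window might look like with dateutil's relativedelta; the one-year span, the helper name, and the exact parameter encoding are assumptions, since the PR's final code isn't shown in this thread:

from datetime import datetime
from urllib.parse import urlencode

from dateutil.relativedelta import relativedelta

def build_meetings_url():
    # Roll the window forward from "now" instead of hardcoding 2024-2025.
    start = datetime.now().replace(day=1)
    end = start + relativedelta(years=1)
    params = {
        "end": end.strftime("%m/%d/%Y 12:00 am"),
        "meetingTypeID": "all",
        "start": start.strftime("%m/%d/%Y 12:00 am"),
    }
    # urlencode escapes the slashes, spaces, and colons; the endpoint
    # decodes these the same way as the original hand-encoded URL.
    return "https://www.acnj.gov/api/data/GetCalendarMeetings?" + urlencode(params)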
LGTM
LGTM
What's this PR do?
Adds a spider to scrape meeting information of Atlantic City. The scraper uses API endpoints to fetch the meeting data.
Why are we doing this?
Scraper requested from spreadsheet.
Steps to manually test
Run the spider and open test_output.csv to ensure the data looks valid.
Are there any smells or added technical debt to note?
No