🏗️ Build spider: Atlantic City #9
Conversation
Walkthrough: The changes introduce a web scraper for Atlantic City meetings, implemented in city_scrapers/spiders/atconj_Atlantic_City.py, along with test fixtures and a test suite.
Actionable comments posted: 5
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
- tests/files/atconj_Atlantic_City.json (1 hunks)
- tests/files/atconj_Atlantic_City_meeting_detail.json (1 hunks)
- tests/test_atconj_Atlantic_City.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/files/atconj_Atlantic_City_meeting_detail.json
🔇 Additional comments (1)
tests/files/atconj_Atlantic_City.json (1)
Lines 1-842: No issues found with the test data file.
The JSON structure appears correct and aligns with the expected data format for testing.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- city_scrapers/spiders/atconj_Atlantic_City.py (1 hunks)
- tests/test_atconj_Atlantic_City.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_atconj_Atlantic_City.py
🔇 Additional comments (7)
city_scrapers/spiders/atconj_Atlantic_City.py (7)
Lines 1-23: LGTM! Imports and class setup look good.
The imports are comprehensive and appropriate for the spider's functionality. The class definition follows the project's naming conventions and includes the necessary configuration.
Line 46: Fix the misspelling of 'calender_source' to 'calendar_source'.
The variable name contains a spelling error.
Lines 48-63: LGTM! Request handling is well implemented.
The request methods follow Scrapy best practices and properly handle data flow between requests.
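To make that flow concrete, here is a minimal sketch of the chaining pattern described above, using Scrapy's cb_kwargs; the GetMeeting "id" query parameter and the calendar item's field names are assumptions for illustration, not confirmed by the diff:

import scrapy

def parse(self, response):
    # For each calendar entry, request the detail endpoint and forward the
    # entry so parse_meeting(response, item) receives it via cb_kwargs.
    for item in response.json():
        yield scrapy.Request(
            f"https://www.acnj.gov/api/data/GetMeeting?id={item['id']}",
            callback=self.parse_meeting,
            cb_kwargs={"item": item},
        )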
Lines 85-92: Adjust logic in the _parse_classification method.
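A hedged sketch of one way to structure that logic, assuming the detail payload carries a meeting-type name (the "MeetingTypeName" field is hypothetical) and using the classification constants from city_scrapers_core:

from city_scrapers_core.constants import BOARD, CITY_COUNCIL, COMMISSION, NOT_CLASSIFIED

def _parse_classification(self, meeting_detail):
    # Match on keywords in the meeting-type name; fall back to
    # NOT_CLASSIFIED rather than guessing when no keyword applies.
    name = (meeting_detail.get("MeetingTypeName") or "").lower()
    if "council" in name:
        return CITY_COUNCIL
    if "board" in name:
        return BOARD
    if "commission" in name:
        return COMMISSION
    return NOT_CLASSIFIED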
Lines 93-120: LGTM! Location and links parsing is well implemented.
The methods handle edge cases appropriately and provide good defaults when needed.
Lines 122-127: Ensure timezone-aware datetime comparisons in _get_status.
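A minimal sketch of what a timezone-aware comparison could look like, assuming the spider defines the usual timezone attribute (e.g. "America/New_York") and that the status constants come from city_scrapers_core; the signature here is illustrative and differs from the PR's:

from datetime import datetime
from zoneinfo import ZoneInfo

from city_scrapers_core.constants import PASSED, TENTATIVE

def _get_status(self, start):
    # Localize a naive start time before comparing, so naive and aware
    # datetimes are never mixed in the comparison.
    tz = ZoneInfo(self.timezone)
    if start.tzinfo is None:
        start = start.replace(tzinfo=tz)
    return PASSED if start < datetime.now(tz) else TENTATIVE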
Lines 24-26: Verify the necessity of disabling robots.txt.
While the setting might be necessary for accessing the API endpoints, it's important to ensure this doesn't violate the site's terms of service.
✅ Verification successful
Disabling robots.txt is appropriate for this API-based scraper.
The spider uses API endpoints (/api/data/GetCalendarMeetings and /api/data/GetMeeting) to fetch meeting data, not web scraping of HTML pages. The robots.txt from acnj.gov doesn't restrict API access, and APIs typically manage access through other means (rate limiting, authentication). The comprehensive docstring clearly explains this API-based approach.
- The spider accesses two API endpoints: GetCalendarMeetings (lists all meetings) and GetMeeting (retrieves detailed meeting information).
- The calendar data is also available through a user-friendly interface at https://www.acnj.gov/calendar
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if robots.txt exists and its contents
curl -s https://www.acnj.gov/robots.txt
# Check if the API endpoints are specifically mentioned in robots.txt
curl -s https://www.acnj.gov/robots.txt | grep -i "api/data"
Length of output: 282
Script:
#!/bin/bash
# Check the spider implementation to understand API endpoint usage
rg -l "api" city_scrapers/spiders/atconj_Atlantic_City.py
# Look for any comments or documentation about the API usage
rg -A 5 "class AtconjAtlanticCitySpider" city_scrapers/spiders/atconj_Atlantic_City.py
Length of output: 192
Script:
#!/bin/bash
# Get the full content of the spider file to understand the API usage and context
cat city_scrapers/spiders/atconj_Atlantic_City.py
# Check if similar robots.txt settings are used in other spiders
rg "ROBOTSTXT_OBEY.*False" city_scrapers/spiders/
Length of output: 4382
    field of the meeting since it is more user friendly to navigate
    than the api endpoints.
    """
    meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
🛠️ Refactor suggestion
Consider parameterizing the date range in meetings_url
The URL contains hardcoded dates (2024-2025) which will need manual updates in the future. Consider making these dates dynamic based on the current date.
- meetings_url = "https://www.acnj.gov/api/data/GetCalendarMeetings?end=06%2F30%2F2025+12:00+am&meetingTypeID=all&start=06%2F01%2F2024+12:00+am"  # noqa
+ @property
+ def meetings_url(self):
+     start_date = datetime.now().strftime("%m%%2F01%%2F%Y")
+     end_date = datetime.now().replace(year=datetime.now().year + 1).strftime("%m%%2F30%%2F%Y")
+     return f"https://www.acnj.gov/api/data/GetCalendarMeetings?end={end_date}+12:00+am&meetingTypeID=all&start={start_date}+12:00+am"
Committable suggestion skipped: line range outside the PR's diff.
def parse_meeting(self, response, item):
    meeting_detail = json.loads(response.text)

    meeting = Meeting(
        title=item["title"],
        description="",
        classification=self._parse_classification(meeting_detail),
        start=parse(item["start"]),
        end=None,
        all_day=item["allDay"],
        time_notes="",
        location=self._parse_location(meeting_detail),
        links=self._parse_links(meeting_detail),
        source=self.calender_source,
    )

    meeting["status"] = self._get_status(meeting_detail)
    meeting["id"] = int(item["id"])

    yield meeting
🛠️ Refactor suggestion
Add error handling for API responses
The parsing assumes the API response will always contain the expected fields. Consider adding error handling for missing or malformed data.
  def parse_meeting(self, response, item):
-     meeting_detail = json.loads(response.text)
+     try:
+         meeting_detail = json.loads(response.text)
+         if not isinstance(meeting_detail, dict):
+             raise ValueError("Expected dictionary response")
+     except (json.JSONDecodeError, ValueError) as e:
+         self.logger.error(f"Failed to parse meeting detail: {e}")
+         return
      meeting = Meeting(
-         title=item["title"],
+         title=item.get("title", ""),
          description="",
          classification=self._parse_classification(meeting_detail),
-         start=parse(item["start"]),
+         start=parse(item.get("start")) if item.get("start") else None,
          end=None,
-         all_day=item["allDay"],
+         all_day=item.get("allDay", False),
          time_notes="",
          location=self._parse_location(meeting_detail),
          links=self._parse_links(meeting_detail),
          source=self.calender_source,
      )
Committable suggestion skipped: line range outside the PR's diff.
- fixed typo
Besides the issue with the hardcoded date, we can also remove the json dependency.
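For reference, a minimal sketch of that change: Scrapy responses expose a built-in .json() helper (Scrapy 2.2+), so the stdlib import can be dropped entirely:

def parse_meeting(self, response, item):
    # response.json() parses the body directly, replacing
    # json.loads(response.text) and the `import json` line.
    meeting_detail = response.json()
    ...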
- made the start and end parameters of the API URL dynamic
Nice work with relativedelta. That's pretty cool. One more change with the json dependency and we should be good.
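A sketch of what the dynamic window might look like with dateutil's relativedelta; the one-year span, the helper name, and the exact parameter encoding are assumptions, since the PR's final code isn't shown in this thread:

from datetime import datetime
from urllib.parse import urlencode

from dateutil.relativedelta import relativedelta

def build_meetings_url():
    # Roll the window forward from "now" instead of hardcoding 2024-2025.
    start = datetime.now().replace(day=1)
    end = start + relativedelta(years=1)
    params = {
        "end": end.strftime("%m/%d/%Y 12:00 am"),
        "meetingTypeID": "all",
        "start": start.strftime("%m/%d/%Y 12:00 am"),
    }
    # urlencode escapes the slashes, spaces, and colons; the endpoint
    # decodes these the same way as the original hand-encoded URL.
    return "https://www.acnj.gov/api/data/GetCalendarMeetings?" + urlencode(params)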
LGTM
LGTM
What's this PR do?
Adds a spider to scrape meeting information of Atlantic City. The scraper uses API endpoints to fetch the meeting data.
Why are we doing this?
Scraper requested from spreadsheet.
Steps to manually test
Run the spider and open test_output.csv to ensure the data looks valid.
Are there any smells or added technical debt to note?
No