
Commit c124239

Update README and setup.py to reflect package name change and version bump to 1.0.8; enhance installation and usage instructions.
1 parent: 44ce05e

2 files changed (README.md, setup.py): +35 −4 lines changed


README.md

Lines changed: 32 additions & 1 deletion

````diff
@@ -2,7 +2,7 @@
 
 ## Description
 
-**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
+**Excel API Web Scraper** is a Python-based project/package that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
 
 ### Highlights
 
@@ -65,6 +65,35 @@ security_manager = SecurityManager(skip_windows_scan=False)
 
 ---
 
+## Package
+
+[![PyPI version](https://badge.fury.io/py/nyc_infohub_excel_api_access.svg)](https://pypi.org/project/nyc_infohub_excel_api_access/)
+
+**Version: 1.0.8**
+
+A Python package for scraping and downloading Excel datasets from NYC InfoHub using Selenium, httpx, asyncio, and virus/MIME validation.
+
+---
+
+## 📦 Installation
+
+```bash
+pip install nyc_infohub_excel_api_access
+```
+
+---
+
+## 🚀 Usage
+
+Run from the command line:
+```bash
+nyc-infohub-scraper
+```
+
+Installing this package provides the CLI tool `nyc-infohub-scraper`, which launches the scraper pipeline from the terminal with a single command.
+
+---
+
 ## Requirements
 
 ### System Requirements
@@ -320,6 +349,8 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
 - **Connection Pooling**: Addressed by a persistent `httpx.AsyncClient`.
 - **Redundant Downloads**: Prevented by storing file hashes and only updating on changes.
 - **Virus Scan Overhead**: In-memory scanning might add overhead, but ensures security.
+- **Virus Scan Failures**: If ClamAV is unavailable or fails (e.g., due to socket errors or size limits), the scraper falls back to MIME-type validation for Excel files instead of discarding them.
+- **Fallback Traceability**: All skipped or MIME-only approved files are logged in `quarantine.log` with timestamp, reason, MIME type, and file size for audit and debugging.
 - **Size Limit Errors**: If you see “INSTREAM: Size limit reached” warnings, increase `StreamMaxLength` in `clamd.conf`.
 - **Windows Skipping**: If you can’t run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.
 
````
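For readers skimming the commit: the two new troubleshooting bullets describe a scan-then-fallback flow. Below is a minimal sketch of that flow, assuming the `clamd` and `python-magic` libraries; the helper name `scan_or_validate` and the exact log format are hypothetical and not part of this commit — only the fallback-to-MIME behavior and the `quarantine.log` destination come from the README text above.

```python
# Minimal sketch, NOT the project's actual SecurityManager: the helper name,
# the clamd/python-magic usage, and the log format are assumptions; only the
# fallback idea and the quarantine.log destination come from the README.
import io
import logging

import clamd   # pip install clamd  (requires a running ClamAV daemon)
import magic   # pip install python-magic

EXCEL_MIME_TYPES = {
    "application/vnd.ms-excel",                                           # .xls
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
}

# Audit trail for skipped scans and MIME-only approvals.
quarantine_log = logging.getLogger("quarantine")
_handler = logging.FileHandler("quarantine.log")
_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
quarantine_log.addHandler(_handler)
quarantine_log.setLevel(logging.INFO)


def scan_or_validate(data: bytes, filename: str) -> bool:
    """Return True if the downloaded payload is safe to keep.

    Tries an in-memory ClamAV scan first; on any clamd failure (socket
    errors, INSTREAM size limits, ...) falls back to MIME-type validation
    and records the decision in quarantine.log.
    """
    try:
        status, signature = clamd.ClamdUnixSocket().instream(io.BytesIO(data))["stream"]
        if status == "FOUND":
            quarantine_log.info("%s rejected by ClamAV: %s", filename, signature)
            return False
        return True
    except Exception as exc:  # clamd unreachable or the scan itself failed
        mime = magic.from_buffer(data, mime=True)
        accepted = mime in EXCEL_MIME_TYPES
        quarantine_log.info(
            "%s MIME-only check (reason: %s): mime=%s size=%d accepted=%s",
            filename, exc, mime, len(data), accepted,
        )
        return accepted
```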

setup.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -1,8 +1,8 @@
 from setuptools import setup, find_packages
 
 setup(
-    name="excel_api_access",
-    version="1.0.6",
+    name="nyc_infohub_excel_api_access",
+    version="1.0.8",
     author="Dylan Picart",
     author_email="dylanpicart@mail.adelphi.edu",
     description="A Python scraper for downloading Excel datasets from NYC InfoHub.",
@@ -34,5 +34,5 @@
         "Operating System :: OS Independent",
     ],
     python_requires='>=3.8',
-    entry_points={'console_scripts': ['nyc-excel-scraper = excel_scraper:run_scraper']}
+    entry_points={'console_scripts': ['nyc-infohub-scraper = excel_scraper:run_scraper']}
 )
```
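The changed `entry_points` line is what renames the CLI command: pip generates an executable called `nyc-infohub-scraper` that imports the `excel_scraper` module and calls its `run_scraper()` function. The stub below only illustrates that contract; the real module's internals are not shown in this commit, so the body of `run_scraper` here is purely hypothetical.

```python
# excel_scraper.py -- illustrative stub of the entry-point contract only;
# the real module's contents are not part of this commit.

def run_scraper() -> None:
    """Called by the generated `nyc-infohub-scraper` console script."""
    # A real implementation would launch the pipeline here: Selenium
    # discovery, async downloads via httpx, then virus/MIME validation.
    print("Starting NYC InfoHub Excel scraper...")


if __name__ == "__main__":
    # Lets `python excel_scraper.py` behave like the installed CLI command.
    run_scraper()
```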
