 
 ## Description
 
-**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
+**Excel API Web Scraper** is a Python-based project, now also distributed as a package, that automates scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
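+
+As a quick, hedged illustration of that skip option: `SecurityManager` and its `skip_windows_scan` flag appear in the snippet at README line 65 below, but the import path here is a guess, and the platform check is only one possible way to set the flag.
+
+```python
+import platform
+
+# Assumed import path -- the README snippet only shows the class name.
+from excel_scraper import SecurityManager
+
+# Skip ClamAV scanning automatically when running on Windows,
+# where the ClamAV daemon is often unavailable.
+security_manager = SecurityManager(
+    skip_windows_scan=(platform.system() == "Windows")
+)
+```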
 
 ### Highlights
 
@@ -65,6 +65,35 @@ security_manager = SecurityManager(skip_windows_scan=False)
 
 ---
 
+## Package
+
+[PyPI: nyc_infohub_excel_api_access](https://pypi.org/project/nyc_infohub_excel_api_access/)
+
+**Version: 1.0.8**
+
+A Python package is available on PyPI for scraping and downloading Excel datasets from NYC InfoHub, built on Selenium, httpx, asyncio, and virus/MIME validation.
+
+---
+
+## 📦 Installation
+
+```bash
+pip install nyc_infohub_excel_api_access
+```
+
+---
+
+## 🚀 Usage
+
+Run from the command line:
+
+```bash
+nyc-infohub-scraper
+```
+
+Installing this package gives you access to the `nyc-infohub-scraper` CLI tool, which launches the scraper pipeline from the terminal with a single command.
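+
+As a minimal, hypothetical sketch, the CLI can also be driven from a Python script, e.g. inside a scheduled job; only the `nyc-infohub-scraper` command itself comes from this README, the rest is standard library:
+
+```python
+import subprocess
+import sys
+
+# Run the installed console script and capture its output.
+result = subprocess.run(["nyc-infohub-scraper"], capture_output=True, text=True)
+
+if result.returncode != 0:
+    # Surface the scraper's error output, then propagate the failure.
+    print(result.stderr, file=sys.stderr)
+    sys.exit(result.returncode)
+
+print(result.stdout)
+```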
+
+---
+
 ## Requirements
 
 ### System Requirements
@@ -320,6 +349,8 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
 - **Connection Pooling**: Addressed by a persistent `httpx.AsyncClient`.
 - **Redundant Downloads**: Prevented by storing file hashes and only updating on changes.
 - **Virus Scan Overhead**: In-memory scanning might add overhead, but ensures security.
+- **Virus Scan Failures**: If ClamAV is unavailable or fails (e.g., due to socket errors or size limits), the scraper falls back to MIME-type validation for Excel files instead of discarding them; see the sketch after this list.
+- **Fallback Traceability**: All skipped or MIME-only approved files are logged in `quarantine.log` with timestamp, reason, MIME type, and file size for audit and debugging.
 - **Size Limit Errors**: If you see “INSTREAM: Size limit reached” warnings, increase `StreamMaxLength` in `clamd.conf`.
 - **Windows Skipping**: If you can’t run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.
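+
+A minimal sketch of this fallback flow, not the project's actual implementation: it assumes the third-party `clamd` and `python-magic` packages, and the logged fields mirror the `quarantine.log` bullet above (timestamp, reason, MIME type, file size):
+
+```python
+import datetime
+import io
+
+import clamd   # assumed dependency: pip install clamd
+import magic   # assumed dependency: pip install python-magic
+
+EXCEL_MIME_TYPES = {
+    "application/vnd.ms-excel",                                           # .xls
+    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
+}
+
+
+def validate_download(data: bytes, name: str) -> bool:
+    """Scan with ClamAV; fall back to MIME validation if the scan fails."""
+    try:
+        status = clamd.ClamdUnixSocket().instream(io.BytesIO(data))["stream"][0]
+        if status == "OK":
+            return True
+        if status == "FOUND":
+            return False  # genuinely flagged as malware: reject, no fallback
+        reason = f"clamav_error:{status}"
+    except Exception as exc:  # daemon missing, socket error, size limit hit...
+        reason = f"clamav_unavailable:{exc}"
+
+    # Fallback: accept only content that looks like an Excel workbook,
+    # and record the decision for auditing.
+    mime = magic.from_buffer(data, mime=True)
+    accepted = mime in EXCEL_MIME_TYPES
+    with open("quarantine.log", "a") as log:
+        log.write(
+            f"{datetime.datetime.now().isoformat()} file={name} "
+            f"reason={reason} mime={mime} size={len(data)} accepted={accepted}\n"
+        )
+    return accepted
+```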
325 | 356 |
|
|