 
 ## Description
 
-**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
+**Excel API Web Scraper** is a Python-based project, now also distributed as a package, that automates scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in **security checks** (virus scanning and MIME-type validation) for downloaded Excel files, **with an option to skip antivirus scans on Windows** if ClamAV isn’t readily available.
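+
+As a quick, hedged illustration of that skip option: `SecurityManager` and its `skip_windows_scan` flag appear in the snippet at README line 65 below, but the import path here is a guess, and the platform check is only one possible way to set the flag.
+
+```python
+import platform
+
+# Assumed import path -- the README snippet only shows the class name.
+from excel_scraper import SecurityManager
+
+# Skip ClamAV scanning automatically when running on Windows,
+# where the ClamAV daemon is often unavailable.
+security_manager = SecurityManager(
+    skip_windows_scan=(platform.system() == "Windows")
+)
+```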
 
 ### Highlights
 
@@ -65,6 +65,35 @@ security_manager = SecurityManager(skip_windows_scan=False)
 
 ---
 
+## Package
+
+[PyPI: nyc_infohub_excel_api_access](https://pypi.org/project/nyc_infohub_excel_api_access/)
+
+**Version: 1.0.8**
+
+A Python package is available on PyPI for scraping and downloading Excel datasets from NYC InfoHub, built on Selenium, httpx, asyncio, and virus/MIME validation.
+
+---
+
+## 📦 Installation
+
+```bash
+pip install nyc_infohub_excel_api_access
+```
+
+---
+
+## 🚀 Usage
+
+Run from the command line:
+
+```bash
+nyc-infohub-scraper
+```
+
+Installing this package gives you access to the `nyc-infohub-scraper` CLI tool, which launches the scraper pipeline from the terminal with a single command.
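+
+As a minimal, hypothetical sketch, the CLI can also be driven from a Python script, e.g. inside a scheduled job; only the `nyc-infohub-scraper` command itself comes from this README, the rest is standard library:
+
+```python
+import subprocess
+import sys
+
+# Run the installed console script and capture its output.
+result = subprocess.run(["nyc-infohub-scraper"], capture_output=True, text=True)
+
+if result.returncode != 0:
+    # Surface the scraper's error output, then propagate the failure.
+    print(result.stderr, file=sys.stderr)
+    sys.exit(result.returncode)
+
+print(result.stdout)
+```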
+
+---
+
 ## Requirements
 
 ### System Requirements
@@ -320,6 +349,8 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
 - **Connection Pooling**: Addressed by a persistent `httpx.AsyncClient`.
 - **Redundant Downloads**: Prevented by storing file hashes and only updating on changes.
 - **Virus Scan Overhead**: In-memory scanning might add overhead, but ensures security.
+- **Virus Scan Failures**: If ClamAV is unavailable or fails (e.g., due to socket errors or size limits), the scraper falls back to MIME-type validation for Excel files instead of discarding them; see the sketch after this list.
+- **Fallback Traceability**: All skipped or MIME-only approved files are logged in `quarantine.log` with timestamp, reason, MIME type, and file size for audit and debugging.
 - **Size Limit Errors**: If you see “INSTREAM: Size limit reached” warnings, increase `StreamMaxLength` in `clamd.conf`.
 - **Windows Skipping**: If you can’t run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.
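+
+A minimal sketch of this fallback flow, not the project's actual implementation: it assumes the third-party `clamd` and `python-magic` packages, and the logged fields mirror the `quarantine.log` bullet above (timestamp, reason, MIME type, file size):
+
+```python
+import datetime
+import io
+
+import clamd   # assumed dependency: pip install clamd
+import magic   # assumed dependency: pip install python-magic
+
+EXCEL_MIME_TYPES = {
+    "application/vnd.ms-excel",                                           # .xls
+    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
+}
+
+
+def validate_download(data: bytes, name: str) -> bool:
+    """Scan with ClamAV; fall back to MIME validation if the scan fails."""
+    try:
+        status = clamd.ClamdUnixSocket().instream(io.BytesIO(data))["stream"][0]
+        if status == "OK":
+            return True
+        if status == "FOUND":
+            return False  # genuinely flagged as malware: reject, no fallback
+        reason = f"clamav_error:{status}"
+    except Exception as exc:  # daemon missing, socket error, size limit hit...
+        reason = f"clamav_unavailable:{exc}"
+
+    # Fallback: accept only content that looks like an Excel workbook,
+    # and record the decision for auditing.
+    mime = magic.from_buffer(data, mime=True)
+    accepted = mime in EXCEL_MIME_TYPES
+    with open("quarantine.log", "a") as log:
+        log.write(
+            f"{datetime.datetime.now().isoformat()} file={name} "
+            f"reason={reason} mime={mime} size={len(data)} accepted={accepted}\n"
+        )
+    return accepted
+```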
325 | 356 |
|
|