Merge pull request #40 from milistu/scraper

Scraper Update
milistu · May 16, 2024 · a37770d
2 parents 786e9eb + 88d54eb
Showing 4 changed files with 167 additions and 275 deletions.
diff --git a/scraper/README.md b/scraper/README.md
@@ -0,0 +1,27 @@
+# Amazon Laws Scraper
+
+This script scrapes law articles from a list of URLs and saves them as JSON files.
+
+## Usage
+
+To run the script, use the following command:
+
+```bash
+python scraper/scraper.py --file scraper/urls.txt --output-dir laws_test
+```
+
+## Arguments
+- `--url`: A single URL to scrape.
+- `--file`: Path to a text file containing URLs separated by newlines.
+- `--output-dir`: Directory to save the JSON files (default: `scraper/laws`).
+
+## Example
+To scrape law articles from the list of URLs in `scraper/urls.txt` and save the output in the `scraper/laws` directory:
+
+```bash
+python scraper/scraper.py --file scraper/urls.txt --output-dir scraper/laws
+```
+> ⚠️ _**Note**: Ensure you are in the root directory of the project before running the script._
+
+## Output
+The output JSON files will be saved in the specified output directory, with each file named after the corresponding URL's stem.
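Review note: the README above describes three arguments and a naming rule based on the URL's stem. The following is a minimal sketch of how that interface could be wired up with `argparse` and `pathlib`. It is not taken from `scraper/scraper.py`; the helper names `build_parser`, `collect_urls`, and `output_path` are hypothetical, and reading "stem" as the last path segment of the URL without its extension is an assumption.

```python
import argparse
from pathlib import Path
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three arguments listed in the README (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Scrape law articles to JSON files.")
    parser.add_argument("--url", help="A single URL to scrape.")
    parser.add_argument("--file", type=Path, help="Text file with one URL per line.")
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("scraper/laws"),  # default stated in the README
        help="Directory to save the JSON files.",
    )
    return parser


def collect_urls(args: argparse.Namespace) -> list:
    """Gather URLs from --url and/or --file, skipping blank lines."""
    urls = [args.url] if args.url else []
    if args.file:
        lines = args.file.read_text(encoding="utf-8").splitlines()
        urls += [line.strip() for line in lines if line.strip()]
    return urls


def output_path(url: str, output_dir: Path) -> Path:
    """Name the JSON file after the URL's stem (assumed: last path segment, no extension)."""
    stem = Path(urlparse(url).path).stem
    return output_dir / f"{stem}.json"
```

For example, `build_parser().parse_args(["--file", "scraper/urls.txt"])` would leave `output_dir` at `scraper/laws`, and each URL returned by `collect_urls` could then be passed to `output_path` to decide where its JSON is written.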
diff --git a/scraper/scraper-dev.ipynb b/scraper/scraper-dev.ipynb