This is a Go-based scraper for finding and downloading CPNS-related PDF files from government websites. The program uses Google search queries to locate the PDFs and allows the user to choose whether or not to download them.
- Scrapes PDF files from `.go.id`, `.kab.go.id`, and `.prov.go.id` domains.
- Optionally downloads the PDF files.
- Saves the scraped data (domain and PDF link) in a `results.csv` file.
- Customizable number of Google search result pages to scrape.
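As a rough sketch of the search step described above, the Google query for CPNS PDFs on `.go.id` domains might be built along these lines. The query wording, the `start` paging parameter, and the helper name `buildSearchURL` are assumptions for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL assembles a Google search URL for CPNS PDFs on .go.id
// domains. The query string and the use of the "start" parameter for
// paging are assumptions for illustration.
func buildSearchURL(page int) string {
	q := url.Values{}
	q.Set("q", `CPNS filetype:pdf site:go.id`)
	q.Set("start", fmt.Sprintf("%d", (page-1)*10)) // Google pages results in steps of 10
	return "https://www.google.com/search?" + q.Encode()
}

func main() {
	// Print the URLs that would be fetched for the first three result pages.
	for page := 1; page <= 3; page++ {
		fmt.Println(buildSearchURL(page))
	}
}
```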
- Go installed on your machine
- Internet access to perform the web scraping
- Clone this repository: `git clone https://github.com/baguswijaksono/cpns-scrap.git`
- Navigate to the project directory: `cd cpns-scrap`
- Build the project: `go build`
You can run the program using various flags to customize its behavior.
| Flag | Description | Example |
|------|-------------|---------|
| `-p` | The number of Google search result pages to scrape. Default is 1. | `-p 5` |
| `-d` | Enable downloading of PDFs. This is a boolean flag. | `-d` |
| `-k` | Include scraping from kabupaten (`.kab.go.id`) and provincial (`.prov.go.id`) domains. This is a boolean flag. | `-k` |
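Declaring the flags from the table above with Go's standard `flag` package might look roughly like this; the variable names and usage strings are assumptions, not taken from the project's source.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names mirror the table above; variable names are illustrative.
	pages := flag.Int("p", 1, "number of Google search result pages to scrape")
	download := flag.Bool("d", false, "download the PDFs that are found")
	includeKabProv := flag.Bool("k", false, "also scrape .kab.go.id and .prov.go.id domains")
	flag.Parse()

	fmt.Printf("pages=%d download=%v kab/prov=%v\n", *pages, *download, *includeKabProv)
}
```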
- Scrape 5 pages without downloading PDFs: `go run main.go -p 5`
- Scrape 5 pages and download the PDFs: `go run main.go -p 5 -d`
- Scrape 5 pages, download PDFs, and include kabupaten/provinsi domains: `go run main.go -p 5 -d -k`
The scraper will generate a `results.csv` file in the root of the project. This file contains two columns:

- `Domain`: The domain from which the PDF file was found.
- `File PDF`: The direct URL to the PDF file.
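Writing the two-column `results.csv` described above with Go's standard `encoding/csv` package might look roughly like this; the helper name `writeResults` and the example row are assumptions for illustration.

```go
package main

import (
	"encoding/csv"
	"os"
)

// writeResults writes the scraped (Domain, File PDF) pairs to results.csv.
// The function name and exact header strings are illustrative guesses;
// the real program may structure this differently.
func writeResults(rows [][]string) error {
	f, err := os.Create("results.csv")
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	// Header row matching the two columns described above.
	if err := w.Write([]string{"Domain", "File PDF"}); err != nil {
		return err
	}
	return w.WriteAll(rows)
}

func main() {
	// Hypothetical example row: the domain the PDF was found on and its URL.
	_ = writeResults([][]string{
		{"example.go.id", "https://example.go.id/pengumuman/cpns.pdf"},
	})
}
```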
If the `-d` flag is used, the PDFs will be downloaded to a directory called `downloads`. This directory is created automatically by the program.
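A sketch of the download step under the same assumptions: the directory name `downloads` matches the description above, while the helper name `downloadPDF` and the file-naming scheme (file name taken from the URL path) are illustrative guesses.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path"
	"path/filepath"
)

// downloadPDF saves one PDF into the downloads/ directory, creating the
// directory if it does not exist yet.
func downloadPDF(pdfURL string) error {
	if err := os.MkdirAll("downloads", 0o755); err != nil {
		return err
	}

	resp, err := http.Get(pdfURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s for %s", resp.Status, pdfURL)
	}

	// Name the local file after the last path segment of the URL.
	out, err := os.Create(filepath.Join("downloads", path.Base(pdfURL)))
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Hypothetical usage; a real run would pass URLs scraped from Google.
	if err := downloadPDF("https://example.go.id/pengumuman/cpns.pdf"); err != nil {
		fmt.Println("download failed:", err)
	}
}
```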
- User-Agent: The scraper uses a custom `User-Agent` header to mimic a browser request.
- Rate Limiting: Be mindful of Google’s rate limits when running the scraper with high page counts.
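A minimal sketch of attaching a custom `User-Agent` header to an outgoing request; the header value shown is an example browser string, not necessarily the one the scraper actually sends.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Build the request manually so a custom User-Agent can be attached.
	req, err := http.NewRequest(http.MethodGet, "https://www.google.com/search?q=CPNS+filetype:pdf+site:go.id", nil)
	if err != nil {
		panic(err)
	}
	// Example browser-like User-Agent; the real value may differ.
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```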
This project is licensed under the MIT License.