This is a Go-based scraper for finding and downloading CPNS-related PDF files from government websites. The program uses Google search queries to locate the PDFs and allows the user to choose whether or not to download them.
- Scrapes PDF files from `.go.id`, `.kab.go.id`, and `.prov.go.id` domains.
- Optionally downloads the PDF files.
- Saves the scraped data (domain and PDF link) in a `results.csv` file.
- Customizable number of Google search result pages to scrape.
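As a rough sketch of the search step described above, the Google query for CPNS PDFs on `.go.id` domains might be built along these lines. The query wording, the `start` paging parameter, and the helper name `buildSearchURL` are assumptions for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL assembles a Google search URL for CPNS PDFs on .go.id
// domains. The query string and the use of the "start" parameter for
// paging are assumptions for illustration.
func buildSearchURL(page int) string {
	q := url.Values{}
	q.Set("q", `CPNS filetype:pdf site:go.id`)
	q.Set("start", fmt.Sprintf("%d", (page-1)*10)) // Google pages results in steps of 10
	return "https://www.google.com/search?" + q.Encode()
}

func main() {
	// Print the URLs that would be fetched for the first three result pages.
	for page := 1; page <= 3; page++ {
		fmt.Println(buildSearchURL(page))
	}
}
```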
- Go installed on your machine
- Internet access to perform the web scraping
- Clone this repository: `git clone https://github.com/baguswijaksono/cpns-scrap.git`
- Navigate to the project directory: `cd cpns-scrap`
- Build the project: `go build`
You can run the program using various flags to customize its behavior.
| Flag | Description | Example |
|------|-------------|---------|
| `-p` | The number of Google search result pages to scrape. Default is 1. | `-p 5` |
| `-d` | Enable downloading of PDFs. This is a boolean flag. | `-d` |
| `-k` | Include scraping from kabupaten (`.kab.go.id`) and provincial (`.prov.go.id`) domains. This is a boolean flag. | `-k` |
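Declaring the flags from the table above with Go's standard `flag` package might look roughly like this; the variable names and usage strings are assumptions, not taken from the project's source.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names mirror the table above; variable names are illustrative.
	pages := flag.Int("p", 1, "number of Google search result pages to scrape")
	download := flag.Bool("d", false, "download the PDFs that are found")
	includeKabProv := flag.Bool("k", false, "also scrape .kab.go.id and .prov.go.id domains")
	flag.Parse()

	fmt.Printf("pages=%d download=%v kab/prov=%v\n", *pages, *download, *includeKabProv)
}
```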
- Scrape 5 pages without downloading PDFs: `go run main.go -p 5`
- Scrape 5 pages and download the PDFs: `go run main.go -p 5 -d`
- Scrape 5 pages, download PDFs, and include kabupaten/provinsi domains: `go run main.go -p 5 -d -k`
The scraper will generate a `results.csv` file in the root of the project. This file contains two columns:

- `Domain`: The domain from which the PDF file was found.
- `File PDF`: The direct URL to the PDF file.
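Writing the two-column `results.csv` described above with Go's standard `encoding/csv` package might look roughly like this; the helper name `writeResults` and the example row are assumptions for illustration.

```go
package main

import (
	"encoding/csv"
	"os"
)

// writeResults writes the scraped (Domain, File PDF) pairs to results.csv.
// The function name and exact header strings are illustrative guesses;
// the real program may structure this differently.
func writeResults(rows [][]string) error {
	f, err := os.Create("results.csv")
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	// Header row matching the two columns described above.
	if err := w.Write([]string{"Domain", "File PDF"}); err != nil {
		return err
	}
	return w.WriteAll(rows)
}

func main() {
	// Hypothetical example row: the domain the PDF was found on and its URL.
	_ = writeResults([][]string{
		{"example.go.id", "https://example.go.id/pengumuman/cpns.pdf"},
	})
}
```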
If the `-d` flag is used, the PDFs will be downloaded to a directory called `downloads`. This directory is created automatically by the program.
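A sketch of the download step under the same assumptions: the directory name `downloads` matches the description above, while the helper name `downloadPDF` and the file-naming scheme (file name taken from the URL path) are illustrative guesses.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path"
	"path/filepath"
)

// downloadPDF saves one PDF into the downloads/ directory, creating the
// directory if it does not exist yet.
func downloadPDF(pdfURL string) error {
	if err := os.MkdirAll("downloads", 0o755); err != nil {
		return err
	}

	resp, err := http.Get(pdfURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s for %s", resp.Status, pdfURL)
	}

	// Name the local file after the last path segment of the URL.
	out, err := os.Create(filepath.Join("downloads", path.Base(pdfURL)))
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Hypothetical usage; a real run would pass URLs scraped from Google.
	if err := downloadPDF("https://example.go.id/pengumuman/cpns.pdf"); err != nil {
		fmt.Println("download failed:", err)
	}
}
```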
- User-Agent: The scraper uses a custom `User-Agent` header to mimic a browser request.
- Rate Limiting: Be mindful of Google’s rate limits when running the scraper with high page counts.
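A minimal sketch of attaching a custom `User-Agent` header to an outgoing request; the header value shown is an example browser string, not necessarily the one the scraper actually sends.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Build the request manually so a custom User-Agent can be attached.
	req, err := http.NewRequest(http.MethodGet, "https://www.google.com/search?q=CPNS+filetype:pdf+site:go.id", nil)
	if err != nil {
		panic(err)
	}
	// Example browser-like User-Agent; the real value may differ.
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```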
This project is licensed under the MIT License.