A fun microservice for scraping stuff from websites for Hellomouse Apps
- Download webpages as HTML (with assets like CSS, videos, images, etc. embedded as base64; see the sketch below this list), PDF, or WEBP (screenshot)
- Special handling for certain websites; currently we have:
  - Twitter / X: Tweets are downloaded as HTML + attached media (images, videos)
  - Reddit: Posts and comments are downloaded with any attached assets
  - SoundCloud: Songs are downloaded with metadata (HTML + audio)
  - Newgrounds: Songs are downloaded with metadata (HTML + audio)
  - Imgur: Albums and galleries are downloaded with all images and metadata (HTML + images / videos)
  - YouTube: Videos are downloaded
  - Pixiv: Albums are downloaded
  - Bilibili: Videos are downloaded
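For context, the asset-embedding step amounts to replacing each asset URL with a base64 `data:` URI. Below is a minimal sketch of that technique, not the service's actual code; it assumes Node 18+ (for global `fetch`) and the `cheerio` package, neither of which this README confirms:

```js
import * as cheerio from 'cheerio';

// Sketch: inline every <img> on a page as a base64 data: URI.
// (Hypothetical helper; the real service also handles CSS, video, etc.)
async function inlineImages(html, baseUrl) {
    const $ = cheerio.load(html);
    for (const img of $('img').toArray()) {
        const src = $(img).attr('src');
        if (!src) continue;
        const res = await fetch(new URL(src, baseUrl)); // resolve relative URLs
        const type = res.headers.get('content-type') || 'application/octet-stream';
        const data = Buffer.from(await res.arrayBuffer()).toString('base64');
        $(img).attr('src', `data:${type};base64,${data}`);
    }
    return $.html();
}
```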
Install dependencies:

```sh
npm install
```
Set up the config. You will need a PostgreSQL database running as well as the hellomouse-apps-api server (run that server first to generate the required tables).
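One possible way to create that database and user (a hypothetical one-time setup for a local PostgreSQL install; the names mirror the example config below, so adjust them to your own values):

```sh
createuser --pwprompt hellomouse_board
createdb --owner=hellomouse_board hellomouse_board
```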
There is an example config in the root directory. Copy it and rename it to `config.js`. Here are the properties:
```js
export const dbUser = 'hellomouse_board'; // PostgreSQL user
export const dbIp = '127.0.0.1';          // PostgreSQL server location
export const dbPort = 5433;               // PostgreSQL server port
export const dbPassword = 'my password';  // PostgreSQL server password
export const dbName = 'hellomouse_board'; // PostgreSQL server DB name
export const fileDir = './saves';         // Root path for stored files; in general, web files are saved under <fileDir>/site_downloads/file.ext
```
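To sanity-check these values, a quick connectivity test like the one below can help. This is only a sketch and assumes the `pg` package, which this README does not list as a dependency:

```js
import pg from 'pg';
import { dbUser, dbIp, dbPort, dbPassword, dbName } from './config.js';

// Connect with the config values, then disconnect immediately.
const client = new pg.Client({
    user: dbUser,
    host: dbIp,
    port: dbPort,
    password: dbPassword,
    database: dbName
});
await client.connect();
console.log(`Connected to ${dbName} at ${dbIp}:${dbPort}`);
await client.end();
```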
To set up yt-dlp (optional), you can place your browser cookies in `secret/yt-cookies.txt` for use in downloading YouTube videos, and in `secret/bilibili-cookies.txt` for downloading Bilibili videos.
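yt-dlp reads cookie files in the Netscape cookies.txt format, so these files presumably use it too (an assumption; the README does not say how they are consumed). A hypothetical, tab-separated excerpt of `secret/yt-cookies.txt` with placeholder values:

```
# Netscape HTTP Cookie File
# domain	subdomains	path	secure	expiry	name	value
.youtube.com	TRUE	/	TRUE	1767225600	SID	<your-cookie-value>
```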
To set up Pixiv cookies (optional, for bypassing rate limiting and age restrictions), export your browser cookies as a JS array of objects (like `[{ name: ... }]`) and put the result in `secret/pixiv-cookies.txt`.
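A hypothetical excerpt of `secret/pixiv-cookies.txt`; the README only shows the `name` field, so the other fields here are assumptions based on common browser cookie exports:

```js
[
    { name: 'PHPSESSID', value: '<your-session-id>', domain: '.pixiv.net', path: '/' }
]
```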
Run the server:

```sh
node index.js
```