p2pspider - DHT Spider

A daemon that crawls the BitTorrent DHT network and an Express web application that provides a searchable database of magnet links with real-time updates through WebSockets.

Intro

DHT Spider can index over 1 million magnets per 24 hours on modest hardware (2GB of RAM and a connection of around 2MB/s). It is resource-intensive and will use whatever CPU and RAM are available; usage can be limited via the ecosystem.json file. With 2GB of RAM, the recommended setup is 8 instances of the daemon and 2 of the webserver, each limited to 175MB.

Getting Started

# Install dependencies
npm install

# Set up configuration
cp .env.sample .env
# Edit .env file as needed

# Run the application
npm start  # Start the unified application (both crawler and web interface)

# Alternatively, use PM2 for process management
npm install -g pm2
npm run start:pm2  # Uses the ecosystem.json file
pm2 monit

Configuration

You will need to have port 6881 (or your configured port) open to the internet for the DHT crawler to function properly.

The application can be configured through the .env file:

# Database and server configuration
REDIS_URI=redis://127.0.0.1:6379
MONGO_URI=mongodb://127.0.0.1/magnetdb
SITE_HOSTNAME=http://127.0.0.1:8080
SITE_NAME=DHT Spider
SITE_PORT=8080

# Database options: "mongodb" or "sqlite"
DB_TYPE=sqlite

# Redis options: "true" or "false"
USE_REDIS=false

# SQLite database file location (only used if DB_TYPE=sqlite)
SQLITE_PATH=./data/magnet.db

# Elasticsearch options: "true" or "false"
USE_ELASTICSEARCH=false

# Elasticsearch connection
ELASTICSEARCH_NODE=http://localhost:9200
ELASTICSEARCH_INDEX=magnets

# Component control options: "true" or "false"
RUN_DAEMON=true
RUN_WEBSERVER=true

You can also fine-tune the crawler performance in the daemon.js file:

const p2p = P2PSpider({
    nodesMaxSize: 250,
    maxConnections: 500,
    timeout: 1000
});

It's not recommended to change the nodesMaxSize or maxConnections, but adjusting the timeout may increase indexing speed. Higher timeout values may require more RAM; the maximum recommended value is 5000ms.
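As a rough illustration of the tuning advice above, the timeout could be read from an environment variable and clamped to the recommended ceiling. This is only a sketch: `resolveTimeout` and the `CRAWLER_TIMEOUT_MS` variable are hypothetical names that the project does not define; only the 1000ms default and the 5000ms maximum come from this README.

```javascript
// Hypothetical helper, not part of p2pspider.
const DEFAULT_TIMEOUT_MS = 1000; // value shown in the daemon.js snippet above
const MAX_TIMEOUT_MS = 5000;     // maximum recommended in this README

function resolveTimeout(raw) {
    const parsed = Number.parseInt(raw, 10);
    // Fall back to the default when the value is missing or not a positive number.
    if (!Number.isFinite(parsed) || parsed <= 0) {
        return DEFAULT_TIMEOUT_MS;
    }
    // Never exceed the recommended maximum.
    return Math.min(parsed, MAX_TIMEOUT_MS);
}

// CRAWLER_TIMEOUT_MS is an assumed variable name; the project does not read it.
const timeout = resolveTimeout(process.env.CRAWLER_TIMEOUT_MS);
console.log(`Using crawler timeout: ${timeout}ms`);
```

The resulting value could then be passed as the `timeout` option in place of the hard-coded number.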

Component Control

DHT Spider now allows you to run the daemon and webserver components independently:

RUN_DAEMON: Set to "true" to run the P2P Spider daemon, or "false" to disable it
RUN_WEBSERVER: Set to "true" to run the web server, or "false" to disable it

This flexibility allows you to:

Run only the daemon for dedicated crawling
Run only the webserver for serving existing data
Run both components together (default behavior)

Example usage:

# Run both components (default)
node app.js

# Run only the daemon
RUN_WEBSERVER=false node app.js

# Run only the webserver
RUN_DAEMON=false node app.js
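For readers wondering how such flags might be interpreted, here is a minimal sketch. It assumes that an unset variable means "enabled", which matches the default behavior described above, but `isEnabled` and `selectComponents` are hypothetical helpers and the real app.js may handle this differently.

```javascript
// Hypothetical sketch of flag handling; the real app.js may differ.
function isEnabled(value, defaultValue = true) {
    // An unset or empty variable falls back to the default (both components run).
    if (value === undefined || value === '') {
        return defaultValue;
    }
    return String(value).trim().toLowerCase() === 'true';
}

function selectComponents(env) {
    return {
        daemon: isEnabled(env.RUN_DAEMON),
        webserver: isEnabled(env.RUN_WEBSERVER),
    };
}

const components = selectComponents(process.env);
if (!components.daemon && !components.webserver) {
    console.warn('Both RUN_DAEMON and RUN_WEBSERVER are disabled; nothing to start.');
}
console.log(components);
```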

Database and Redis Configuration

DHT Spider supports both MongoDB and SQLite as database options, and Redis usage can be toggled on/off:

DB_TYPE: Choose between "mongodb" or "sqlite" as your database
USE_REDIS: Set to "true" to use Redis for caching recent infohashes, or "false" to disable Redis
SQLITE_PATH: Path where the SQLite database file will be created (only used when DB_TYPE=sqlite)

SQLite is ideal for smaller deployments with reduced dependencies, while MongoDB is better for large-scale operations. Redis provides caching to prevent duplicate processing of recently seen infohashes.
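The idea behind caching recently seen infohashes can be sketched with an in-memory map. This is a stand-in for illustration only, not the project's Redis code: the `RecentInfohashCache` class, its method names, and the expiry window are all assumptions.

```javascript
// In-memory stand-in for the Redis cache described above. Assumption: the
// real project keeps recent infohashes in Redis with some form of expiry.
class RecentInfohashCache {
    constructor(ttlMs = 60000) {
        this.ttlMs = ttlMs;
        this.seen = new Map(); // infohash -> expiry timestamp in ms
    }

    // Returns true the first time an infohash is seen within the TTL window,
    // false if it was already recorded and has not yet expired.
    markIfNew(infohash, now = Date.now()) {
        const expiry = this.seen.get(infohash);
        if (expiry !== undefined && expiry > now) {
            return false;
        }
        this.seen.set(infohash, now + this.ttlMs);
        return true;
    }

    // Drop expired entries so the map does not grow without bound.
    prune(now = Date.now()) {
        for (const [infohash, expiry] of this.seen) {
            if (expiry <= now) {
                this.seen.delete(infohash);
            }
        }
    }
}

const cache = new RecentInfohashCache(1000);
const hash = 'aabbccddeeff00112233445566778899aabbccdd'; // made-up example infohash
console.log(cache.markIfNew(hash, 0));    // true: first sighting
console.log(cache.markIfNew(hash, 500));  // false: still inside the 1000ms window
console.log(cache.markIfNew(hash, 1500)); // true: the earlier entry has expired
```

Only infohashes for which `markIfNew` returns true would go on to the database, which is how a cache of this kind reduces duplicate work.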

Elasticsearch Configuration

DHT Spider now includes Elasticsearch integration for powerful full-text search capabilities:

USE_ELASTICSEARCH: Set to "true" to enable Elasticsearch integration
ELASTICSEARCH_NODE: URL of your Elasticsearch server (default: http://localhost:9200)
ELASTICSEARCH_INDEX: Name of the Elasticsearch index to use (default: magnets)

To bulk index existing data into Elasticsearch, run:

node utils/bulkIndexToElasticsearch.js

Elasticsearch provides significantly improved search performance and relevance, especially for large datasets. When enabled, search queries will use Elasticsearch instead of database queries.
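To give a sense of what a full-text search request could look like, the sketch below builds a standard Elasticsearch `match` query with pagination. The `name` field and the `buildSearchRequest` helper are assumptions; the project's actual index mapping and query may differ, and how the object is handed to a client depends on the client version in use.

```javascript
// Hypothetical request builder; field names are assumed, not taken from the project.
function buildSearchRequest(term, page = 0, pageSize = 10, index = 'magnets') {
    return {
        index,                 // default mirrors ELASTICSEARCH_INDEX above
        from: page * pageSize, // offset of the first hit to return
        size: pageSize,        // number of hits per page
        query: {
            match: {
                // "name" is an assumed field holding the torrent name;
                // operator "and" requires every search word to match.
                name: { query: term, operator: 'and' },
            },
        },
    };
}

console.log(JSON.stringify(buildSearchRequest('ubuntu iso', 2), null, 2));
```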

Features

Real-time DHT network crawling and magnet link indexing
WebSocket-based live updates on the web interface
Searchable database of discovered magnet links
Statistics page with database information
Support for both MongoDB and SQLite databases
Elasticsearch integration for powerful full-text search
Redis caching for improved performance
Responsive web interface with modern design

Protocols

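bep_0005 (DHT Protocol), bep_0003 (The BitTorrent Protocol Specification), bep_0010 (Extension Protocol), bep_0009 (Extension for Peers to Send Metadata Files)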

Notes

Cluster mode does not work on Windows. On Linux and other UNIX-like operating systems, multiple instances can listen on the same UDP port, which is not possible on Windows due to operating system limitations.
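A launcher script could guard against this limitation by checking the platform before deciding how many daemon instances to start. The sketch below is only an illustration; `daemonInstanceCount` is a hypothetical helper and is not part of the project.

```javascript
// Hypothetical helper: decide how many daemon instances to start.
function daemonInstanceCount(platform, requested) {
    // On Windows, multiple instances cannot listen on the same UDP port
    // (see the note above), so fall back to a single instance.
    if (platform === 'win32') {
        return 1;
    }
    // Elsewhere, honor the request but never go below one instance.
    return Math.max(1, Math.floor(requested));
}

console.log(daemonInstanceCount(process.platform, 8));
```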

Notice

Please don't share the data DHT Spider crawls on the internet, because it sometimes discovers sensitive, copyrighted, or adult material.

Performance Optimization

To maximize performance, DHT Spider now includes several optimizations:

1. Redis Caching

Enable Redis by setting USE_REDIS=true in your .env file to significantly reduce database load:

# Redis options: "true" or "false"
USE_REDIS=true

2. Production Mode

Run the application in production mode for better performance:

npm run start:prod   # For the web server
npm run daemon:prod  # For the DHT crawler

# Or with PM2 (recommended for production)
pm2 start ecosystem.json

3. Optimized PM2 Configuration

The included ecosystem.json is configured for optimal performance:

Web server runs in cluster mode with multiple instances
DHT crawler runs in a single instance to avoid duplicate crawling
Memory limits prevent excessive resource usage

4. WebSocket Optimizations

The WebSocket server includes:

Message batching to reduce overhead
Client connection health monitoring
Throttled broadcasts to prevent excessive updates
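Message batching and throttling can be illustrated with a small queue that flushes at most once per interval. This is a sketch under stated assumptions, not the project's WebSocket code: `BatchedBroadcaster` is a hypothetical class, and `send` stands in for whatever function delivers a string to all connected clients.

```javascript
// Hypothetical batching broadcaster, for illustration only.
class BatchedBroadcaster {
    constructor(send, intervalMs = 1000, maxBatch = 100) {
        this.send = send;             // function that delivers one string payload
        this.intervalMs = intervalMs; // minimum delay between scheduled flushes
        this.maxBatch = maxBatch;     // upper bound on messages per payload
        this.queue = [];
        this.timer = null;
    }

    // Queue a message; the first queued message schedules a single flush.
    push(message) {
        this.queue.push(message);
        if (this.timer === null) {
            this.timer = setTimeout(() => this.flush(), this.intervalMs);
        }
    }

    // Send up to maxBatch queued messages as one JSON array.
    flush() {
        if (this.timer !== null) {
            clearTimeout(this.timer);
            this.timer = null;
        }
        if (this.queue.length === 0) {
            return;
        }
        const batch = this.queue.splice(0, this.maxBatch);
        this.send(JSON.stringify(batch));
        // If messages remain, schedule another flush instead of sending immediately.
        if (this.queue.length > 0) {
            this.timer = setTimeout(() => this.flush(), this.intervalMs);
        }
    }
}

const sent = [];
const broadcaster = new BatchedBroadcaster((payload) => sent.push(payload), 50);
broadcaster.push({ name: 'example-1' });
broadcaster.push({ name: 'example-2' });
broadcaster.flush(); // flush right away instead of waiting for the timer
console.log(sent);
```

Both queued messages leave as one payload, so clients receive a single update instead of two.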

5. Elasticsearch Search Optimization

When dealing with large datasets, enable Elasticsearch for improved search performance:

# Elasticsearch options: "true" or "false"
USE_ELASTICSEARCH=true

Monitoring Performance

Monitor system resources during operation:

pm2 monit

If the application is still slow:

Increase server resources (RAM/CPU)
Use a CDN for static assets
Consider using a dedicated Redis server
Consider using a dedicated Elasticsearch cluster
Scale horizontally with a load balancer

License

MIT
