Tensorlink

Peer-to-peer AI Inference & Distributed Execution with PyTorch



What is Tensorlink?

Tensorlink is a Python library and decentralized compute platform for running PyTorch and Hugging Face models across peer-to-peer networks. It enables you to run, train, and serve large models securely across distributed hardware without relying on centralized cloud inference providers.

Key Features

  • Native PyTorch & REST API Access – Use models directly in Python or via HTTP endpoints
  • Run Large Models Without Local VRAM – Execute models that exceed your GPU capacity
  • Remote Access to Your Own Hardware – Securely host and access models on your devices via API
  • Plug-and-Play Distributed Execution – Automatic model sharding across multiple GPUs
  • Training & Inference Support – Train models with distributed optimizers or run inference across the network
  • Streaming Generation – Token-by-token streaming for real-time responses
  • Privacy Controls – Route queries exclusively to your own hardware for private usage
  • Earn Rewards for Idle Compute – Contribute GPU resources to the network and get compensated

Early Access: Tensorlink is under active development. APIs and internals may evolve.
Join our Discord for updates, support, and roadmap discussions. Learn more in the Litepaper.

Quick Start

Option 1: Distributed Models in Python

Installation

pip install tensorlink

Requirements: Python 3.10+, PyTorch 2.3+, UNIX/macOS (Windows: use WSL)

Basic Usage

Execute Hugging Face models distributed across the network.

from tensorlink.ml import DistributedModel

MODEL_NAME = "Qwen/Qwen3-8B"

# Connect to a pre-trained model on the network
model = DistributedModel(
    model=MODEL_NAME,
    training=True,
    device="cuda"
)
optimizer = model.create_optimizer(lr=0.001)
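
With training=True, the returned model and optimizer are meant to slot into an ordinary PyTorch loop. Continuing the snippet above, the sketch below is illustrative only: it assumes the distributed wrapper accepts Hugging Face-style tokenized inputs and returns outputs with a .loss attribute. See Examples for the exact interface.

from transformers import AutoTokenizer

# Illustrative training step (assumed interface): tokenize locally, then
# forward/backward through the distributed model as with any nn.Module.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
batch = tokenizer("Hello, distributed world!", return_tensors="pt").to("cuda")

outputs = model(**batch, labels=batch["input_ids"])  # assumes HF-style outputs
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()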

See Examples for streaming generation, distributed training, custom models, and network configurations.

Option 2: Accessing Models via HTTP

Access models via HTTP on the public network, or configure your own hardware for private API access. Tensorlink exposes OpenAI-style endpoints for distributed inference:

import requests

response = requests.post(
    "http://smartnodes.ddns.net/tensorlink-api/v1/generate",
    json={
        "hf_name": "Qwen/Qwen2.5-7B-Instruct",
        "message": "Explain quantum computing in one sentence.",
        "max_new_tokens": 50,
        "stream": False,
    }
)

print(response.json())
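
Streaming works against the same endpoint by setting stream to true. A minimal sketch that simply prints the raw server-sent-event lines as they arrive (the exact chunk payload is described in the API Reference below):

import requests

with requests.post(
    "http://smartnodes.ddns.net/tensorlink-api/v1/generate",
    json={
        "hf_name": "Qwen/Qwen2.5-7B-Instruct",
        "message": "Explain quantum computing in one sentence.",
        "max_new_tokens": 50,
        "stream": True,
    },
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            # Raw SSE lines, e.g. 'data: {...}'
            print(line.decode())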

See Examples for streaming, chat completions, and the full API reference.

Option 3: Run a Node

Run Tensorlink nodes to host models, shard workloads across GPUs, and expose them via Python and HTTP APIs. Nodes can act as workers (run models), validators (route requests + expose API), or both. This allows you to build private clusters, public compute providers, or local development environments.

  1. Download the latest tensorlink-node from Releases
  2. Edit config.json to configure your nodes.
  3. Run: ./run-node.sh

By default, the config is set for running a public worker node. Your GPU will process network workloads and earn rewards via the networking layer (Smartnodes). See Examples for different device and network configurations.


Configuration Reference

Your config.json controls networking, rewards, and model execution behavior. By default, it is configured to run a public worker node.

Node

| Field | Type | Description |
| --- | --- | --- |
| type | str | Node type (worker, validator, or both): validators accept job and API requests, workers run models |
| mode | str | Network type (public, private, or local): public (earn rewards), private (your own devices), local (testing) |
| endpoint | bool | Enables the REST API server on this node (validator role) |
| endpoint_url | str | Address the API binds to; use 0.0.0.0 to expose it on the LAN |
| endpoint_port | int | Port for the HTTP API (default: 64747) |
| priority_nodes | List[List[str, int]] | Trusted bootstrap peers to connect to first (e.g., [["192.168.2.42", 38751]]) |
| logging | int | Console logging level (e.g., DEBUG, INFO, WARNING) |

ML

| Field | Type | Description |
| --- | --- | --- |
| trusted | bool | Allows execution of custom user-supplied models |
| max_vram_gb | int | Limits VRAM usage per node to prevent overload |

Crypto

| Field | Type | Description |
| --- | --- | --- |
| address | str | Wallet address used for identity and rewards |
| mining | bool | Contribute GPU compute to the public network for rewards |
| mining_script | str | Path to the mining / GPU workload executable |
| seed_validators | List[List[str, int, str]] | Seed validator nodes used to bootstrap connections to the network |
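
As an illustration only, the fields above could be assembled into a config.json along the following lines. The grouping into node / ml / crypto sections mirrors the tables and is an assumption; the template shipped with the tensorlink-node release is the authoritative schema.

import json

# Hypothetical private-cluster configuration; the section layout is assumed
# from the tables above, not taken from the shipped template.
config = {
    "node": {
        "type": "both",              # worker + validator on one machine
        "mode": "private",           # restrict jobs to your own devices
        "endpoint": True,            # enable the REST API server
        "endpoint_url": "0.0.0.0",   # expose the API on the LAN
        "endpoint_port": 64747,
        "priority_nodes": [["192.168.2.42", 38751]],
        "logging": "INFO",           # or a numeric level, per the table above
    },
    "ml": {
        "trusted": True,             # allow custom user-supplied models
        "max_vram_gb": 16,           # cap VRAM usage on this node
    },
    "crypto": {
        "address": "",               # wallet address (needed for public rewards)
        "mining": False,
    },
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)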

For common configuration recipes and examples, see Examples: Node Configuration


API Reference

Tensorlink exposes OpenAI-compatible HTTP endpoints for distributed inference.

Endpoints

  • POST /v1/generate – Simple text generation
  • POST /v1/chat/completions – OpenAI-compatible chat interface
  • POST /request-model – Preload models across the network

/v1/generate

Simple generation endpoint with flexible output formats.

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hf_name | string | required | Hugging Face model identifier |
| message | string | required | Input text to generate from |
| prompt | string | null | Alternative to message |
| model_type | string | "auto" | Model architecture hint |
| max_length | int | 2048 | Maximum total sequence length |
| max_new_tokens | int | 2048 | Maximum tokens to generate |
| temperature | float | 0.7 | Sampling temperature (0.01-2.0) |
| do_sample | bool | true | Enable sampling vs. greedy decoding |
| num_beams | int | 1 | Beam search width |
| stream | bool | false | Enable streaming responses |
| input_format | string | "raw" | "chat" or "raw" |
| output_format | string | "simple" | "simple", "openai", or "raw" |
| history | array | null | Chat history for multi-turn conversations |
| is_chat_completion | bool | false | Whether to format output as a chat completion |

Note on output filtering: in the streaming path (_generate_streaming), chat-style output filtering is enabled when is_chat_completion is set:

should_filter = request.is_chat_completion

For finer control, filtering can also be keyed on the input and output formats:

should_filter = (
    request.is_chat_completion
    or (request.input_format == "chat" and request.output_format == "openai")
)

Example: Basic Generation

import requests

r = requests.post(
    "http://localhost:64747/v1/generate",
    json={
        "hf_name": "Qwen/Qwen2.5-7B-Instruct",
        "message": "Explain quantum computing in one sentence.",
        "max_new_tokens": 64,
        "temperature": 0.7,
        "stream": False,
    }
)

print(r.json()["generated_text"])

Example: Chat Format with History

import requests

r = requests.post(
    "http://localhost:64747/v1/generate",
    json={
        "hf_name": "Qwen/Qwen2.5-7B-Instruct",
        "message": "What about entanglement?",
        "input_format": "chat",
        "output_format": "openai",
        "history": [
            {"role": "user", "content": "Explain quantum computing."},
            {"role": "assistant", "content": "Quantum computing uses..."}
        ],
        "max_new_tokens": 128,
    }
)

print(r.json())

/v1/chat/completions

OpenAI-compatible chat completions endpoint with full streaming support.

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | required | Hugging Face model identifier |
| messages | array | required | Array of chat messages |
| temperature | float | 0.7 | Sampling temperature (0.01-2.0) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| n | int | 1 | Number of completions to generate |
| stream | bool | false | Enable SSE streaming |
| stop | string/array | null | Stop sequences |
| max_tokens | int | 1024 | Maximum tokens to generate |
| presence_penalty | float | 0.0 | Presence penalty (-2.0 to 2.0) |
| frequency_penalty | float | 0.0 | Frequency penalty (-2.0 to 2.0) |
| user | string | null | User identifier for tracking |

Message Format

{
    "role": "system" | "user" | "assistant",
    "content": "message text"
}

Example: Non-Streaming

import requests

r = requests.post(
    "http://localhost:64747/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 128,
        "temperature": 0.7,
    }
)

response = r.json()
print(response["choices"][0]["message"]["content"])

Example: Streaming

import json
import requests

r = requests.post(
    "http://localhost:64747/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Explain quantum computing."}
        ],
        "max_tokens": 128,
        "stream": True
    },
    stream=True,
)

for line in r.iter_lines():
    if not line:
        continue
    decoded = line.decode()
    if not decoded.startswith("data: "):
        continue
    data = decoded[len("data: "):]  # strip the SSE "data: " prefix
    if data == "[DONE]":
        break  # end of stream
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)

Response Format (Non-Streaming)

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing harnesses quantum mechanics..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 50,
    "total_tokens": 70
  }
}

Response Format (Streaming)

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"delta":{"content":"Quantum"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"delta":{"content":" computing"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
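
Since the endpoint mirrors the OpenAI chat completions schema, an OpenAI-compatible client can typically be pointed at a node directly. Below is a hedged sketch using the official openai Python package; the base_url and placeholder api_key are assumptions (the node may not enforce authentication).

from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at a local Tensorlink validator node.
client = OpenAI(base_url="http://localhost:64747/v1", api_key="unused")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)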

/request-model

Preload a model across the distributed network before making generation requests.

Request Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| hf_name | string | Hugging Face model identifier |

Example

import requests

r = requests.post(
    "http://localhost:64747/request-model",
    json={"hf_name": "Qwen/Qwen2.5-7B-Instruct"}
)

print(r.json())
# {"status": "success", "message": "Model loading initiated"}

Notes

Tensorlink is designed to support any Hugging Face model; however, errors may still occur with certain models. Please report any bugs via Issues.

  • Temperature: Values below 0.01 automatically disable sampling to prevent numerical instability
  • Streaming: Both endpoints support Server-Sent Events (SSE) streaming via stream: true
  • Token IDs: Automatically handles missing pad/eos tokens with safe fallbacks
  • Format Control: Use input_format="chat" and output_format="openai" for seamless integration

For complete examples, error handling, and advanced usage, see Examples: HTTP API


Learn More

  • 📚 Documentation – Full API reference and guides
  • 🎯 Examples – Comprehensive usage patterns and recipes
  • 💬 Discord Community – Get help and connect with developers
  • 🎮 Live Demo – Try localhostGPT powered by Tensorlink
  • 📘 Litepaper – Technical overview and architecture

Contributing

We welcome contributions! Open an issue or pull request, or join us on Discord, to get involved.

Tensorlink is released under the MIT License.
