SQL across anything: query, join, and aggregate over local files, databases, S3, and vector stores — or turn any SQL query into a parameterized HTTP API with zero application code. Written in Rust, powered by Apache DataFusion.
Skardi runs federated SQL across files, databases, object stores, and vector stores — and turns any query into a parameterized REST API with no application code.
- skardi-cli — Run SQL queries locally against files, object stores, databases, and datalake formats. Ideal for local agents like OpenClaw that need structured data access without a running server.
- skardi-server — Define SQL queries in YAML and serve them as parameterized HTTP APIs. Connect to multiple data sources, run federated queries, and expose results as REST endpoints.
⚠️ Warning: This software is in BETA. It may still contain bugs and unexpected behavior. Use caution with production data and ensure you have backups. Feel free to contact us if you would like a POC of the product.
- CLI for local agents & queries — Run SQL against local files, remote object stores (S3, GCS, Azure), databases, and datalake formats — ideal for local AI agents like OpenClaw
- Declarative pipelines — Define SQL queries in YAML, get REST APIs automatically
- Automatic parameter inference — Request parameters, types, and response schemas are inferred from your SQL
- Multi-source federation — JOIN across CSV, Parquet, PostgreSQL, MySQL, SQLite, MongoDB, Redis, Iceberg, and Lance in a single query
- Full CRUD — SELECT, INSERT, UPDATE, and DELETE operations on supported databases
- Vector search — Native KNN similarity search via Lance integration
- Full-text search — BM25-scored full-text search via Lance inverted indexes
- S3 support — Read CSV, Parquet, and Lance files directly from S3
- Docker ready — Ship as a container with your config files mounted at runtime
- ONNX inference — Run ONNX model predictions inline in SQL via the onnx_predict UDF (requires --features onnx)
- Installation
- Quick Start
- Architecture
- Skardi CLI
- Skardi Server
- Supported Data Sources
- ONNX Model Inference
- Federated Queries
- Observability
- Docker
- Building from Source
Pre-built Docker images are published to GitHub Container Registry on every release.
# Default image
docker pull ghcr.io/skardilabs/skardi/skardi-server:latest
# With ONNX inference support
docker pull ghcr.io/skardilabs/skardi/skardi-server-onnx:latest
# Pull a specific version
docker pull ghcr.io/skardilabs/skardi/skardi-server:0.1.0
docker pull ghcr.io/skardilabs/skardi/skardi-server-onnx:0.1.0
Download the latest CLI binary for your platform:
curl -fSL "https://github.com/SkardiLabs/skardi/releases/latest/download/skardi-$(uname -m | sed 's/arm64/aarch64/')-$(uname -s | sed 's/Linux/unknown-linux-gnu/' | sed 's/Darwin/apple-darwin/').tar.gz" | tar xz
sudo mv skardi /usr/local/bin/
Or download manually from the Releases page. Available targets:
| Platform | Target |
|---|---|
| Linux x86_64 | skardi-x86_64-unknown-linux-gnu.tar.gz |
| Linux ARM64 | skardi-aarch64-unknown-linux-gnu.tar.gz |
| macOS ARM64 (Apple Silicon) | skardi-aarch64-apple-darwin.tar.gz |
Note: macOS Intel (x86_64) binaries are not provided. Apple no longer produces Intel-based Macs. You can build from source if needed.
# Build
cargo build --release
# --- Skardi CLI ---
# Query local files directly
skardi query --sql "SELECT * FROM './data/products.csv' LIMIT 10"
# Query remote files
skardi query --sql "SELECT * FROM 's3://mybucket/events.parquet' LIMIT 10"
# --- Skardi Server ---
# Start the server with a context and pipeline
cargo run --bin skardi-server -- \
--ctx demo/ctx.yaml \
--pipeline demo/pipeline.yaml \
--port 8080
# Execute the pipeline
curl -X POST http://localhost:8080/product-search-demo/execute \
-H "Content-Type: application/json" \
-d '{"brand": null, "max_price": 100.0, "color": null, "limit": 5}'Skardi has two main components:
- skardi-cli (skardi) — A command-line tool for running SQL queries against local files, remote object stores, databases, and datalake formats without starting a server. Perfect for powering local AI agents like OpenClaw with structured data access.
- skardi-server — An HTTP server that loads data sources from a context file, registers SQL pipelines, and serves them as REST endpoints.
Both components use Apache DataFusion as the query engine, which enables federated queries across heterogeneous data sources.
The CLI lets you run SQL queries against local files, remote object stores, databases, and datalake formats — no server required. It's a great fit for local AI agents like OpenClaw that need to query structured data on the fly.
cargo install --path crates/cli
# Query files directly by path (no context file needed)
skardi query --sql "SELECT * FROM './data/products.csv' LIMIT 10"
skardi query --sql "SELECT * FROM 's3://mybucket/events.parquet' LIMIT 10"
skardi query --sql "SELECT * FROM './embeddings.lance' LIMIT 5"
# Query with a context file (for databases, named tables, etc.)
skardi query --ctx ./ctx.yaml --sql "SELECT * FROM products LIMIT 10"
# SQL from file
skardi query --ctx ./ctx.yaml --file query.sql
# Show table schemas
skardi query --ctx ./ctx.yaml --schema --all
skardi query --ctx ./ctx.yaml --schema -t products
Supported sources:
| Category | Types |
|---|---|
| Local files | CSV, Parquet, JSON/NDJSON, Lance |
| Remote stores | S3, GCS, Azure Blob, HTTP/HTTPS, OSS, COS |
| Datalake formats | Lance, Iceberg |
| Databases | PostgreSQL, MySQL, SQLite, MongoDB, Redis |
Context file resolution (when --ctx is omitted): the CLI checks the SKARDICONFIG environment variable, then ~/.skardi/config/ctx.yaml. If no context file is found, the query runs without pre-registered tables (you can still query files directly by path).
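For example, assuming SKARDICONFIG points at a context file (the path below is illustrative), queries pick up its tables without --ctx:
# Point the CLI at a context file via the environment
export SKARDICONFIG=./ctx.yaml
# Tables registered in that context file are now available without --ctx
skardi query --sql "SELECT * FROM products LIMIT 10"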
For full details, see crates/cli/README.md.
cargo run --bin skardi-server -- \
--ctx <path-to-ctx.yaml> \
--pipeline <path-to-pipeline.yaml-or-directory> \
--port 8080
| Flag | Description |
|---|---|
| --ctx | Path to the context YAML file that defines data sources |
| --pipeline | Path to a pipeline YAML file or a directory of pipeline files |
| --port | Port to listen on (default: 8080) |
Once the server is running, open http://localhost:8080 in your browser to access the pipeline dashboard.
The dashboard lists every registered pipeline as a card showing:
- Endpoint URL — the POST path to call, with a one-click copy button
- Parameters — inferred parameter names and types from the pipeline SQL
- Example request — a ready-to-run curl command for the pipeline
- Try It — an interactive panel where you can edit the JSON body and execute the pipeline directly from the browser
No configuration required — the dashboard is built into skardi-server and updates automatically when pipelines are loaded.
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Pipeline dashboard UI |
| /health | GET | Service health check |
| /health/:name | GET | Per-pipeline health check (includes data source status) |
| /pipelines | GET | List all registered pipelines |
| /pipeline/:name | GET | Get specific pipeline info |
| /data_source | GET | List all data sources |
| /:name/execute | POST | Execute a pipeline by name |
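For example, you can list registered pipelines and check a specific pipeline's health straight from the shell (the pipeline name below is the demo pipeline; substitute your own):
# List all registered pipelines
curl http://localhost:8080/pipelines
# Per-pipeline health check, including data source status
curl http://localhost:8080/health/product-search-demo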
A context file (ctx.yaml) defines the data sources available to your pipelines. Each data source is registered as a table in the query engine.
data_sources:
- name: "products" # Table name used in SQL queries
type: "csv" # Data source type
path: "data/products.csv" # File path or connection string
options: # Type-specific options
has_header: true
delimiter: ","
schema_infer_max_records: 1000
description: "Product catalog"You can define multiple data sources of different types in a single context file:
data_sources:
- name: "users"
type: "postgres"
connection_string: "postgresql://localhost:5432/mydb?sslmode=disable"
options:
table: "users"
schema: "public"
user_env: "PG_USER"
pass_env: "PG_PASSWORD"
- name: "orders"
type: "csv"
path: "demo/sample_data/orders.csv"
options:
has_header: true
delimiter: ","By default, all data sources are read-only — only SELECT queries are allowed. To enable write operations (INSERT, UPDATE, DELETE), set access_mode: read_write on the data source. Only postgres, mysql, sqlite, mongo, and redis sources support read_write mode; setting it on other types will produce an error at startup.
data_sources:
- name: "users"
type: "postgres"
connection_string: "postgresql://localhost:5432/mydb?sslmode=disable"
access_mode: read_write # Enable INSERT/UPDATE/DELETE
options:
table: "users"
user_env: "PG_USER"
pass_env: "PG_PASSWORD"
- name: "products"
type: "csv"
path: "data/products.csv"
# access_mode defaults to read_only (CSV doesn't support writes)
If a pipeline attempts a write operation on a read_only source, the server returns an error:
Write operation not allowed on data source 'products'. The data source is configured with 'read_only' access mode.
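As a sketch, a pipeline query that writes to the read_write users source above could look like this (column names are illustrative, not taken from the demo):
INSERT INTO users (name, email)
VALUES ({name}, {email})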
For file-based sources (csv, parquet, iceberg), you can set enable_cache: true to load the entire dataset into memory at startup. This gives significantly faster query performance at the cost of memory usage.
data_sources:
- name: "products"
type: "csv"
path: "data/products.csv"
enable_cache: true # Load into memory at startup
options:
has_header: true
This is useful for datasets that are queried frequently and fit in memory. The cache is created once at startup and used for all subsequent queries.
A pipeline file defines a SQL query with parameter placeholders. Parameters are enclosed in {braces} and automatically extracted. Types and response schemas are inferred from the SQL and table schemas.
metadata:
name: product-search-demo
version: 1.0.0
description: "Product search and filtering"
query: |
SELECT
"Name" as product_name,
"Brand" as brand,
"Price" as price
FROM products
WHERE ({brand} IS NULL OR "Brand" = {brand})
AND ({max_price} IS NULL OR "Price" < {max_price})
ORDER BY "Price" ASC
LIMIT {limit}
Execute with:
curl -X POST http://localhost:8080/product-search-demo/execute \
-H "Content-Type: application/json" \
-d '{"brand": "Apple", "max_price": 500.0, "limit": 10}'Use the {param} IS NULL OR ... pattern for optional filters — pass null to skip a filter.
Success:
{
"success": true,
"data": [{"product_name": "Laptop", "price": 999.99}],
"rows": 1,
"execution_time_ms": 15,
"timestamp": "2025-01-15T12:00:00.000Z"
}
Error:
{
"success": false,
"error": "Missing required parameters: limit",
"error_type": "parameter_validation_error",
"details": {"missing_parameters": ["limit"]},
"timestamp": "2025-01-15T12:00:00.000Z"
}- name: "products"
type: "csv"
path: "data/products.csv"
options:
has_header: true
delimiter: ","
schema_infer_max_records: 1000
- name: "events"
type: "parquet"
path: "data/events.parquet"Full CRUD support (SELECT, INSERT, UPDATE, DELETE) with federated query capability.
- name: "users"
type: "postgres"
connection_string: "postgresql://localhost:5432/mydb?sslmode=disable"
options:
table: "users"
schema: "public" # Optional, default: "public"
user_env: "PG_USER" # Env var for username
pass_env: "PG_PASSWORD" # Env var for passwordexport PG_USER="myuser"
export PG_PASSWORD="mypassword"
For detailed setup, CRUD examples, and federated queries, see demo/postgres/README.md.
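As an illustration of a non-SELECT pipeline query against this source (a sketch; column names are hypothetical), an UPDATE is parameterized exactly like a SELECT:
UPDATE users
SET email = {email}
WHERE id = {user_id}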
Full CRUD support (SELECT, INSERT, UPDATE, DELETE) with federated query capability.
- name: "users"
type: "mysql"
connection_string: "mysql://localhost:3306/mydb"
options:
table: "users"
user_env: "MYSQL_USER"
pass_env: "MYSQL_PASSWORD"export MYSQL_USER="myuser"
export MYSQL_PASSWORD="mypassword"
For detailed setup, CRUD examples, and federated queries, see demo/mysql/README.md.
Full CRUD support (SELECT, INSERT, UPDATE, DELETE) with no external server required — just a local .db file.
- name: "users"
type: "sqlite"
path: "data/my_database.db"
options:
table: "users"
busy_timeout_ms: "5000" # Optional, default: 5000
SQLite requires no credentials — just the path to the database file.
CLI direct path query (no context file needed):
skardi query --sql "SELECT * FROM './data/my_database.db.users'"For detailed setup, CRUD examples, and federated queries, see demo/sqlite/README.md.
Full CRUD support with point lookups, full scans, and federated queries.
- name: "products"
type: "mongo"
connection_string: "mongodb://localhost:27017"
options:
database: "mydb"
collection: "products"
primary_key: "product_id"
user_env: "MONGO_USER"
pass_env: "MONGO_PASS"export MONGO_USER="myuser"
export MONGO_PASS="mypassword"
For detailed setup, CRUD examples, and federated queries, see demo/mongo/README.md.
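Point lookups use the configured primary_key, so a pipeline query like the following resolves to a single document fetch (a sketch using the products source above):
SELECT * FROM products WHERE product_id = {id}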
Full CRUD support with point lookups (O(1) via direct key construction), full scans, and federated queries. Redis hashes map directly to SQL rows.
- name: "products"
type: "redis"
connection_string: "redis://localhost:6379"
options:
key_space: "mydb"
table: "products"
key_column: "product_id"Redis keys follow the pattern {key_space}:{table}:{key_column_value}, where key_column is extracted from the key suffix and exposed as a SQL column. For initially empty tables, use the columns option to declare the schema upfront so INSERT operations work immediately.
For detailed setup, CRUD examples, and federated queries, see demo/redis/README.md.
Query Iceberg tables with support for schema evolution, partition pruning, and time travel.
- name: "nyc_taxi"
type: "iceberg"
path: "/path/to/iceberg-warehouse"
options:
namespace: "nyc"
table: "trips"For S3-backed Iceberg tables:
- name: "s3_iceberg"
type: "iceberg"
path: "s3://my-bucket/iceberg-warehouse"
options:
namespace: "production"
table: "events"
aws_region: "us-east-1"
aws_access_key_id_env: "AWS_ACCESS_KEY_ID"
aws_secret_access_key_env: "AWS_SECRET_ACCESS_KEY"
For detailed setup and examples, see demo/iceberg/README.md.
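Once registered, an Iceberg table is queried like any other source, and selective predicates let the engine benefit from partition pruning (a sketch; the column names below are illustrative for the nyc_taxi example):
SELECT vendor_id, COUNT(*) AS trips, AVG(fare_amount) AS avg_fare
FROM nyc_taxi
WHERE pickup_date >= {start_date}
GROUP BY vendor_id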
Native KNN (K-Nearest Neighbors) similarity search using the lance_knn table function, and BM25-scored full-text search using the lance_fts table function.
- name: "sift_items"
type: "lance"
path: "data/vec_data.lance/"
description: "Vector embeddings"SELECT knn.id, knn.item_id, knn._distance
FROM lance_knn(
'sift_items', -- table name
'vector', -- vector column
(SELECT vector FROM sift_items WHERE id = {ref_id}), -- query vector
{k} -- number of neighbors
) knn
WHERE knn.id != {ref_id}
| Dataset Size | Without Optimization | With Lance KNN | Speedup |
|---|---|---|---|
| 10K vectors | ~50ms | ~5ms | 10x |
| 100K vectors | ~500ms | ~8ms | 62x |
| 1M vectors | ~5000ms | ~15ms | 333x |
-- Basic term search (BM25 scored)
SELECT id, description, _score
FROM lance_fts('my_table', 'description', 'search terms', 10)
-- Phrase search
SELECT * FROM lance_fts('my_table', 'description', '"exact phrase"', 10)
-- With WHERE clause filter pushdown
SELECT * FROM lance_fts('my_table', 'description', 'search terms', 10)
WHERE category = 'food' AND price < 20
Requires a Lance INVERTED index on the text column. See demo/lance/README.md for full details on vector search and full-text search.
Read CSV, Parquet, and Lance files from S3. Credentials are loaded from environment variables — never from config files.
- name: "sales_data"
type: "parquet"
location: "remote_s3"
path: "s3://my-bucket/sales/data.parquet"
description: "Sales data in S3"Authentication methods: environment variables, AWS CLI profiles, IAM roles, or AWS SSO.
export AWS_ACCESS_KEY_ID="your_key"
export AWS_SECRET_ACCESS_KEY="your_secret"
# Or use: export AWS_PROFILE="your_profile"
For full S3 configuration, IAM permissions, and troubleshooting, see demo/S3_USAGE.md.
Note: ONNX support is behind a feature flag. Build with --features onnx to enable it:
cargo build --release -p skardi-server --features onnx
Run ONNX model predictions directly in SQL using the onnx_predict scalar UDF. Models are loaded lazily on first use and cached in memory.
onnx_predict('path/to/model.onnx', input1, input2, ...) -> FLOAT
- First argument: path to an .onnx file (relative to the server's working directory)
- Remaining arguments: model inputs (types are auto-detected from the ONNX model)
- Returns: FLOAT per row, or LIST(FLOAT) when inputs are aggregated lists
Example — score candidates with a Neural Collaborative Filtering model:
SELECT
item_id,
onnx_predict('models/ncf.onnx',
CAST({user_id} AS BIGINT),
CAST(item_id AS BIGINT)
) AS score
FROM candidates
ORDER BY score DESC
LIMIT 10
Pre-built models are available in the models/ directory (ncf.onnx, TinyTimeMixer.onnx).
For the full guide including the movie recommendation demo, see demo/onnx_predict/README.md.
One of Skardi's most powerful features is the ability to JOIN data across different source types in a single SQL query. DataFusion handles the federation transparently.
Example: Join a CSV file with a PostgreSQL table and write results back to PostgreSQL:
metadata:
name: "federated_join_and_insert"
version: "1.0"
query: |
INSERT INTO user_order_stats (user_id, user_name, total_orders, total_spent)
SELECT
u.id as user_id,
u.name as user_name,
COUNT(o.order_id) as total_orders,
SUM(o.amount) as total_spent
FROM users u -- PostgreSQL table
INNER JOIN csv_orders o -- CSV file
ON u.id = o.user_id
WHERE u.name = {name}
GROUP BY u.id, u.name
skardi-server exports traces and metrics via OpenTelemetry (OTLP gRPC), giving you full visibility into query execution inside Grafana.
- Traces — Every DataFusion execution plan node is wrapped with a span. You can see elapsed_compute, output_rows, spill counts, and optimizer rule timings per query.
- Pipeline metrics — Per-pipeline request count, latency, and error rate exported via OTLP on every /execute call.
- Logs — Structured logs forwarded to Loki via the OTel Collector.
Every call to /:name/execute records two OTel metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| pipeline_requests_total | Counter | pipeline, status (success/error), error_type | Total requests by outcome |
| pipeline_latency_ms | Histogram | pipeline | End-to-end handler latency in milliseconds |
Verify metrics are flowing — the OTel Collector exposes a Prometheus scrape endpoint at :8889. After firing at least one request, run:
curl -s http://localhost:8889/metrics | grep pipeline
Expected output:
pipeline_latency_ms_bucket{pipeline="my-pipeline",le="..."} 2
pipeline_latency_ms_count{pipeline="my-pipeline"} 2
pipeline_latency_ms_sum{pipeline="my-pipeline"} 45.3
pipeline_requests_total{pipeline="my-pipeline",status="success"} 1
pipeline_requests_total{pipeline="my-pipeline",status="error",error_type="parameter_validation_error"} 1
Note: The periodic exporter flushes every 60 seconds. If the output is empty, wait a moment and retry.
Example PromQL queries for Grafana dashboards:
# Requests per second per pipeline
rate(pipeline_requests_total[5m])
# Error rate (fraction of failing requests)
rate(pipeline_requests_total{status="error"}[5m])
/ rate(pipeline_requests_total[5m])
# p99 latency per pipeline
histogram_quantile(0.99, rate(pipeline_latency_ms_bucket[5m]))
# p50 latency per pipeline
histogram_quantile(0.50, rate(pipeline_latency_ms_bucket[5m]))
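These expressions also work as alert conditions. A minimal Prometheus alerting rule built on the error-rate query might look like this (threshold and names are illustrative):
groups:
  - name: skardi-pipelines
    rules:
      - alert: PipelineErrorRateHigh
        expr: |
          rate(pipeline_requests_total{status="error"}[5m])
            / rate(pipeline_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} error rate above 5% for 10 minutes"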
A docker-compose.yml is included with a pre-configured Grafana stack:
| Service | Role | Port |
|---|---|---|
| OTel Collector | Receives OTLP from the server, fans out to backends | 4317 (gRPC), 4318 (HTTP) |
| Grafana Tempo | Trace storage | — |
| Prometheus | Metric storage (scrapes collector at :8889) | 9090 |
| Grafana Loki | Log storage | — |
| Grafana | Visualization (Tempo + Prometheus + Loki pre-provisioned) | 3000 |
# Start the stack
docker-compose -f observability/docker-compose.yml up -d
# Run the server pointing at the collector
OTLP_ENDPOINT=http://localhost:4317 RUST_LOG=debug cargo run -p skardi-server -- --port 8080
Then open Grafana at http://localhost:3000 — all three datasources (Tempo, Prometheus, Loki) are pre-provisioned.
| Variable | Default | Description |
|---|---|---|
| OTLP_ENDPOINT | http://localhost:4317 | OTLP gRPC endpoint to export to |
| RUST_LOG | info | Log level. Use debug to see per-query DataFusion span detail |
- RUST_LOG=info — normal production operation; high-level server events only
- RUST_LOG=debug — shows per-query DataFusion execution spans (plan nodes, row counts, optimizer timing)
docker build -t skardi .
# With ONNX support
docker build -t skardi --build-arg FEATURES=onnx .
docker run --rm \
-v /path/to/your/ctx.yaml:/config/ctx.yaml \
-v /path/to/your/pipeline.yaml:/config/pipeline.yaml \
-p 8080:8080 \
skardi \
--ctx /config/ctx.yaml \
--pipeline /config/pipeline.yaml \
--port 8080
Mount an entire directory of pipeline files:
docker run --rm \
-v /path/to/your/ctx.yaml:/config/ctx.yaml \
-v /path/to/your/pipelines:/config/pipelines \
-p 8080:8080 \
skardi \
--ctx /config/ctx.yaml \
--pipeline /config/pipelines \
--port 8080
# Clone the repository
git clone https://github.com/SkardiLabs/skardi.git
cd skardi
# Build CLI
cargo build --release -p skardi-cli
# Or install CLI globally
cargo install --path crates/cli
# Build server
cargo build --release -p skardi-server
# Build server with ONNX model inference support
cargo build --release -p skardi-server --features onnx
The demo/ directory contains complete working examples:
| Directory | Description |
|---|---|
| demo/README.md | Product search demo (CSV/Parquet) |
| demo/postgres/ | PostgreSQL CRUD and federated query examples |
| demo/mysql/ | MySQL CRUD and federated query examples |
| demo/sqlite/ | SQLite CRUD and federated query examples |
| demo/mongo/ | MongoDB CRUD and federated query examples |
| demo/redis/ | Redis CRUD and federated query examples |
| demo/iceberg/ | Apache Iceberg integration examples |
| demo/lance/ | Lance vector search and full-text search examples |
| demo/onnx_predict/ | ONNX model inference in SQL |
| demo/S3_USAGE.md | S3 data source configuration guide |
See LICENSE for details.
