# flAPI: Instant APIs for Static Datasets

flAPI is a powerful service that automatically generates read-only APIs for datasets by utilizing SQL templates. Built on top of [DuckDB](https://duckdb.org/) and leveraging its SQL engine and extension ecosystem, flAPI offers a seamless way to connect to various data sources and expose them as RESTful APIs.

![overview of flAPI](https://i.imgur.com/m7UVZlR.png)

## ⚡ Features

- **Automatic API Generation**: Create APIs for your datasets without coding
- **Multiple Data Sources**: Connect to [BigQuery](https://github.com/hafenkran/duckdb-bigquery), SAP ERP & BW (via [ERPL](https://github.com/datazoode/erpl)), Parquet, [Iceberg](https://github.com/duckdb/duckdb_iceberg), [Postgres](https://github.com/duckdb/postgres_scanner), [MySQL](https://github.com/duckdb/duckdb_mysql), and more
- **SQL Templates**: Use Mustache-like syntax for dynamic queries
- **Caching**: Improve performance and reduce database costs with built-in caching mechanisms
- **Security**: Implement row-level and column-level security with ease
- **Monitoring**: Built-in access logging and monitoring capabilities
- **Easy deployment**: Deploy flAPI with a single binary file

## 🛠 Quick Start
The easiest way to get started with flAPI is to use the pre-built binary for your operating system.

#### 1. Download the binary for your operating system from the [Releases](https://github.com/datazoode/flapi/releases) page.

We currently provide pre-built binaries for the following operating systems:

- Linux
- macOS
- Windows

All binaries are statically linked against [DuckDB v1.1.2](https://github.com/duckdb/duckdb/releases/tag/v1.1.2).
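If you prefer the command line, here is a minimal download sketch for Linux (the asset name `flapi-linux-amd64` is an assumption; check the Releases page for the actual file matching your platform):

```bash
# Hypothetical asset name -- adjust to the file listed on the Releases page
curl -L -o flapi "https://github.com/datazoode/flapi/releases/latest/download/flapi-linux-amd64"
chmod +x flapi  # make the downloaded binary executable
```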

#### 2. Run flAPI:
Once you have downloaded the binary, you can run flAPI by executing the following command:

```bash
> ./flapi --config ../examples/flapi.yaml
```

#### 3. Test the API server:
If everything is set up correctly, you should be able to access the API at the URL specified in the configuration file.

```bash
> curl 'http://localhost:8080/'

    ___
___( o)>   Welcome to
\ <_. )    flAPI
 `---'
           Fast and Flexible API Framework
           powered by DuckDB
```
## 🎓 Example
Here's a simple example of how to create an API endpoint using flAPI:
### 1. Create a basic flAPI configuration
flAPI uses the popular [YAML](https://yaml.org/) format to configure the API endpoints. A basic configuration file looks like this:
```yaml
project_name: example-flapi-project
project_description: An example flAPI project demonstrating various configuration options
template:
  path: './sqls'                # Path where SQL templates and API endpoint configurations are stored
  environment-whitelist:        # Optional: regular expressions whitelisting env vars available in the templates
    - '^FLAPI_.*'

duckdb:                         # Configuration of the DuckDB instance embedded into flAPI
  db_path: ./flapi_cache.db     # Optional: remove or comment out for an in-memory database; this store also serves as the cache
  access_mode: READ_WRITE       # See https://duckdb.org/docs/configuration/overview for more details
  threads: 8
  max_memory: 8GB
  default_order: DESC

connections:                    # A YAML map of database connection configurations; an API endpoint references one of these
  bigquery-lakehouse:
    # SQL commands to initialize the connection (e.g. installing, loading and configuring the BigQuery DuckDB extension)
    init: |
      INSTALL 'bigquery' FROM 'http://storage.googleapis.com/hafenkran';
      LOAD 'bigquery';
    properties:                 # Connection-specific properties (accessible in templates via {{ context.conn.property_name }})
      project_id: 'my-project-id'

  customers-parquet:
    properties:
      path: './data/customers.parquet'

heartbeat:
  enabled: true                 # The heartbeat worker is a background thread that can periodically trigger endpoints
  worker-interval: 10           # Interval in seconds at which the heartbeat worker triggers endpoints

enforce-https:
  enabled: false                # Whether to force HTTPS; we strongly recommend a reverse proxy for SSL termination
  # ssl-cert-file: './ssl/cert.pem'
  # ssl-key-file: './ssl/key.pem'
```

After that, ensure that the template path (`./sqls` in this example) exists.
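On a Unix-like shell, creating it is a one-liner:

```bash
mkdir -p ./sqls  # create the template directory if it does not already exist
```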

### 2. Define your API endpoint (`./sqls/customers.yaml`)
Each endpoint is defined by at least a YAML file and a corresponding SQL template in the template path. For our example we create the file `./sqls/customers.yaml`:

```yaml
url-path: /customers/           # The URL path at which the endpoint will be available

request:                        # The request configuration defines the parameters that can be used in the query
  - field-name: id
    field-in: query             # The location of the parameter; other options are 'path' and 'body'
    description: Customer ID    # Used in the auto-generated API documentation
    required: false             # Whether the parameter is required
    validators:                 # A list of validators applied to the parameter
      - type: int
        min: 1
        max: 1000000
        preventSqlInjection: true

template-source: customers.sql  # The SQL template used to generate the endpoint
connection:
  - customers-parquet           # The connection used to execute the query

rate-limit:
  enabled: true                 # Whether rate limiting is enabled for the endpoint
  max: 100                      # The maximum number of requests per interval
  interval: 60                  # The interval in seconds

auth:
  enabled: true                 # Whether authentication is enabled for the endpoint
  type: basic                   # The type of authentication; the other option is 'bearer'
  users:                        # The users that are allowed to access the endpoint
    - username: admin
      password: secret
      roles: [admin]
    - username: user
      password: password
      roles: [read]

heartbeat:
  enabled: true                 # If the heartbeat worker is enabled, it triggers this endpoint periodically
  params:                       # A YAML map of parameters the heartbeat worker passes to the endpoint
    id: 123
```
There are many more configuration options available; see the [full documentation](link-to-your-docs) for details.

### 3. Configure the endpoint's SQL template (`./sqls/customers.sql`)
After creating the YAML endpoint configuration, we need to write the SQL template that connects the endpoint to the data connection.
The template files use the [Mustache templating language](https://mustache.github.io/) to dynamically generate the SQL query.

```sql
SELECT * FROM '{{{conn.path}}}'
WHERE 1=1
{{#params.id}}
  AND c_custkey = {{{ params.id }}}
{{/params.id}}
```

The above template uses the `path` property defined in the connection configuration to query a local Parquet file directly. If the `id` parameter is provided, it is used to filter the results.
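As an illustration, a request such as `GET /customers?id=123` would render to roughly the following SQL (a sketch; exact whitespace depends on the template engine):

```sql
-- Rendered sketch for GET /customers?id=123 against the customers-parquet connection
SELECT * FROM './data/customers.parquet'
WHERE 1=1
  AND c_custkey = 123
```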

### 4. Send a request
To test the endpoint, we can use `curl` and provide the correct basic auth credentials (`admin:secret` in this case). To make the JSON result easier to read, we pipe the output to [`jq`](https://jqlang.github.io/jq/).

```bash
> curl -X GET -u admin:secret "http://localhost:8080/customers?id=123" | jq .

{
  "next": "",
  "total_count": 1,
  "data": [
    {
      "c_mktsegment": "BUILDING",
      "c_acctbal": 5897.82999999999992724,
      "c_phone": "15-817-151-1168",
      "c_address": "YsOnaaER8MkvK5cpf4VSlq",
      "c_nationkey": 5,
      "c_name": "Customer#000000123",
      "c_comment": "ependencies. regular, ironic requests are fluffily regu",
      "c_custkey": 123
    }
  ]
}

```

## ⁉️ How does caching work?
flAPI implements a powerful and flexible caching mechanism to optimize query performance and reduce load on data sources. Here's how it works:

1. Cache Configuration:
In the endpoint YAML file (e.g., `products.yaml`), you can define caching parameters:
```yaml
cache:
  cache-table-name: 'products_cache'
  cache-source: products_cache.sql
  refresh-time: 15m
```
- `cache-table-name`: The name of the cache table.
- `cache-source`: SQL file to create and populate the cache.
- `refresh-time`: How often the cache should be refreshed.

2. Cache Creation:
The `cache-source` SQL file (e.g., `products_cache.sql`) defines how to create and populate the cache:
```sql
CREATE TABLE {{cache.schema}}.{{cache.table}} AS
SELECT
  p.category,
  p.brand,
  p.price_range,
  p.region,
  sum(p.sales) AS total_sales
FROM bigquery_scan('my-data-warehouse.sales.product_transactions') AS p
GROUP BY 1, 2, 3, 4
```
This query aggregates data from the source, potentially reducing the data volume and precomputing expensive operations.

3. Cache Usage:
The main query (e.g., `products.sql`) then uses the cache:
```sql
SELECT
  p.category,
  p.brand,
  p.price_range,
  p.region,
  p.total_sales
FROM {{{cache.table}}} AS p
WHERE 1=1
{{#params.category}}
  AND p.category LIKE '{{{ params.category }}}'
{{/params.category}}
{{#params.region}}
  AND p.region LIKE '{{{ params.region }}}'
{{/params.region}}
```
This query runs against the cache table instead of the original data source.
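Querying the cached endpoint then works like any other endpoint. A sketch, assuming a `/products` URL path and the same basic auth setup as above (the actual values depend on `products.yaml`, which is not shown in full here):

```bash
# Hypothetical request; URL path and credentials depend on products.yaml
curl -u admin:secret "http://localhost:8080/products?category=Electronics&region=EUROPE" | jq .
```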

4. Cache Management:
flAPI handles cache creation, refreshing, and cleanup:
- It checks if caches need refreshing based on the `refresh-time`.
- Creates new cache tables with timestamps in their names.
- Keeps a configurable number of previous cache versions.
- Performs garbage collection to remove old cache tables.
- On an `HTTP DELETE` request, the cache table is also invalidated (see the sketch below).
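A minimal sketch of such an invalidation request, assuming the `/customers` endpoint and credentials from the example above:

```bash
# Invalidate the cache backing the /customers endpoint via HTTP DELETE
curl -X DELETE -u admin:secret "http://localhost:8080/customers"
```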

5. Benefits:
- Improved query performance by querying pre-aggregated data.
- Reduced load (and thereby costs) on the original data source.
- Always fresh data with periodic cache refreshes.
- Ability to fall back to previous cache versions if needed.

This caching mechanism provides a balance between data freshness and query performance, making it ideal for scenarios where real-time data is not critical but fast query responses are important. For example, in our products scenario, it allows for quick analysis of sales data across different categories, brands, price ranges, and regions without repeatedly querying the entire transaction history.



## 🏭 Building from source
The source code of flAPI is written in C++ and the build closely resembles the [DuckDB build process](https://duckdb.org/docs/dev/building/overview). A good reference for the build process is the GitHub action in [`build.yaml`](.github/workflows/build.yaml). In essence, a few prerequisites need to be met:

- Install the dependencies: `sudo apt-get install -y build-essential cmake ninja-build`
- Checkout the repository and submodules: `git clone --recurse-submodules https://github.com/datazoode/flapi.git`
- Build the project: `make release`

The build process will download and build DuckDB v1.1.2 and install the vcpkg package manager. We depend on the following vcpkg ports:

- [`argparse`](https://github.com/p-ranav/argparse) - Command line argument parser
- [`crow`](https://github.com/CrowCpp/Crow) - Our REST web framework
- [`yaml-cpp`](https://github.com/jbeder/yaml-cpp) - YAML parser
- [`jwt-cpp`](https://github.com/Thalhammer/jwt-cpp) - JSON Web Token library
- [`openssl`](https://github.com/openssl/openssl) - Crypto library
- [`catch2`](https://github.com/catchorg/Catch2) - Testing framework

## 📚 Documentation
