Author: agavra | Release Target: 5.3 | Status: Merged | Discussion: #2693
tl;dr: Improve the interactivity of the development story by enabling CLI users to directly produce data to Kafka topics through KSQL
KSQL is a powerful tool for exploring Apache Kafka and stream processing paradigms. It allows a developer to easily create Streams and Tables from source topics, apply arbitrary transformations and inspect resulting data in Kafka. For developers with relational database experience, the SQL-like language provides a familiar workflow.
After installing KSQL, developers are met with an immediate hurdle - they must have existing data in Kafka that matches the data they expect to handle. In a relational database environment, developers can use `INSERT INTO ... VALUES` to quickly construct a toy environment for testing. Generating this data in KSQL, however, is cumbersome:
- you require a separate tool (e.g. `kafkacat`, `ksql-datagen`)
- you must know the deployment topology (e.g. URIs for ZK/Broker)
- there is no synergy with KSQL (e.g. tools are not schema aware and can produce arbitrary data)
This KLIP attempts to integrate producing test data into the KSQL REPL.
This KLIP is intended to improve the development and testing workflow as opposed to providing production-ready functionality. The following tasks will be addressed:
- KSQL will be able to create and produce data to arbitrary topics
- KSQL will be able to validate data when producing to a topic that is registered with a schema in KSQL (e.g. as a result of `CREATE STREAM`)
In the future, this functionality can be extended to be "production grade", allowing system operators or first-responders to quickly patch up data holes in case of emergency. This KLIP will explicitly discourage such usage in documentation (as it needs further testing before we can recommend it), but will not artificially impose a restriction on such usage.
This proposal will improve the first 15-minute experience that developers have with KSQL and will help seasoned KSQL developers test new functionality. Furthermore, this functionality can be leveraged to create a powerful testing harness for KSQL applications that can run in a sandbox without pre-existing data.
INSERT INTO <stream_name|table_name> [(column1, column2, ...)]
VALUES (value1, value2, ...)
The `INSERT VALUES` statement can be used to insert new records into a KSQL source (agnostic to the Stream/Table duality). If the columns are not present, it is assumed that the values are in the same order as the schema and contain every field. `ROWKEY` is always present as the first column, and is automatically populated with semantics described in the section on `ROWKEY` and `ROWTIME` below.
The value for `stream_name`/`table_name` must be a valid Stream/Table registered with a KSQL schema. The serialization format will be the same as the format specified in the `value_format` of the `CREATE` statement.
For the initial version, only primitive literal types will be supported (e.g. `INT`, `VARCHAR`, etc.) but the syntax is easily extensible to support struct, map and list (see the Complex Datatypes section below).
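For illustration, a minimal sketch of such an insert (the `users` stream, its columns, and the values are hypothetical and not part of this KLIP):

```sql
-- assuming: CREATE STREAM users (id VARCHAR, age INT)
--           WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='id');
INSERT INTO users (id, age) VALUES ('alice', 30);

-- with no column list, ROWKEY comes first, followed by every value column in schema order
INSERT INTO users VALUES ('alice', 'alice', 30);
```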
Future extensions:
- Support Structs/Complex Types
- Support some `AT INTERVAL` syntax to generalize the functionality to generate full stream data
- Read from file to allow reproducible insertions
- Headless mode support
DELETE FROM table_name WHERE ROWKEY = <rowkey_value>;
The `DELETE` statement can be used to insert tombstones into the Kafka topics that back tables. If the source is not a table, the statement will fail. For this KLIP, we will only support conditions of the form `ROWKEY = value`. While this may initially be confusing to users coming from a pure SQL background (as `DELETE FROM` supports arbitrary conditions), maintaining consistent documentation and examples, along with adequate error messages, will alleviate the concern.
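As a minimal sketch (the `user_profiles` table name and key are hypothetical), a tombstone for a single key would look like:

```sql
-- produces a record with key 'alice' and a null value (a tombstone) to the topic backing the table
DELETE FROM user_profiles WHERE ROWKEY = 'alice';
```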
CREATE STREAM stream_name ( { column_name data_type } [, ...] )
WITH ( property_name = expression [, ...] );
CREATE TABLE table_name ( { column_name data_type } [, ...] )
WITH ( property_name = expression [, ...] );
The `WITH` clause for CS/CT statements will now accept two additional property names: `PARTITIONS`/`REPLICAS`. If the topic does not exist and these values are specified, this command will create a topic with the specified number of partitions and replicas before executing the command. If the topic does not exist, and `PARTITIONS`/`REPLICAS` are not supplied, the command will fail with the following error message:
Topic 'foo' does not exist. If you require an empty topic to be created for the stream/table, please re-run the statement providing the required 'PARTITIONS' count and, optionally, the 'REPLICAS' count in the WITH clause, e.g.
CREATE STREAM BAR (f0, f1) WITH (KAFKA_TOPIC='foo', PARTITIONS=4, REPLICAS=2);
If the topic exists and has a different number of partitions/replicas, this command will fail with the following error message:
Cannot issue create command. Topic '<topic name>' exists with different number of partitions or replicas: <topic description>
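For illustration, a sketch of the success path (the stream name, columns and topic are hypothetical): if the topic `pageviews` does not yet exist, the following statement would create it with the requested partition/replica counts and then register the stream:

```sql
-- creates topic 'pageviews' with 4 partitions and 2 replicas, then registers the stream over it
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON', PARTITIONS=4, REPLICAS=2);
```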
The above APIs make various trade-offs, outlined below:
- The proposal to use `DELETE` encompasses various trade-offs:
  - We need a different mechanism to produce tombstones than `INSERT INTO` because there is no good way with standard SQL to differentiate between an entirely `NULL` value (i.e. a tombstone) and `NULL` values for every column.
  - Introducing a new syntax (e.g. `INSERT TOMBSTONE`) would require users to understand another language construct and make KSQL depart further from the SQL standard.
  - This is a step closer to the SQL standard, and we can always support more `DELETE` conditions in the future.
- Allowing `CREATE TABLE/STREAM` to specify a Kafka topic that does not yet exist makes it easy to use the new APIs without any other system. General sentiment was that it is important for users to understand the concept of `PARTITIONS` early in their journey of understanding Kafka. There are four alternatives:
  - Do not require `PARTITIONS`/`REPLICAS` to create the Kafka topic. This is a slight change that will make it easier to experiment, but may open the door to accidentally creating Kafka topics. With the release of 5.2, this will also not be consistent with CSAS/CTAS, which default to the source replication/partitioning instead of using some default configured values.
  - Introduce another `CREATE TOPIC` API which will create empty topics. I believe this will cause confusion, as it departs farther from the SQL standard and puts `TOPIC` on the same level as the duality of Tables/Streams (which it is not; it is a physical manifestation of a stream/table).
  - Don't fail when creating streams/tables with non-existing topics, and create those topics when `INSERT INTO ...` is called with a stream/table target that does not exist. This seems more "magical" and runs into issues when specifying partition/replica counts.
  - Introduce another `REGISTER TABLE/STREAM` statement to represent the difference. This moves away from the SQL standard and adds more complexity for the user to understand.
KSQL in interactive mode is already configured to produce data to a command topic. This KLIP can leverage that existing functionality, which will require piping through various configurations into the `ServiceContext`. Beyond that, we must introduce a new command and Statement type to reflect the new functionality.
In interactive mode, this will be handled without distributing the statement to other servers. This operation will not be available in headless or embedded mode for the first version of the feature.
`INSERT INTO ... VALUES` will only support inserting values into existing KSQL Streams/Tables, which already have a schema associated with them. Before producing a record to the underlying topic, the values for the fields will be verified against the schema.
JSON, AVRO and DELIMITED value formats will all be supported, the latter of which will not accept complex types. Since we know the schema of the inserts, we can initially load the values into a Connect data type. We already have the proper serializers to convert from Connect to each of the supported serialization formats.
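As a sketch of the validation behavior (the source and values are hypothetical), a value that cannot be coerced to the declared column type would be rejected before anything is produced:

```sql
-- assuming a source registered with schema (id VARCHAR, foo INT)
-- fails schema validation: 'not-an-int' is not an INT, so no record is produced
INSERT INTO bar (id, foo) VALUES ('abc', 'not-an-int');
```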
If the user specifies `AVRO` encoding in their `CREATE` statement, we will register the topic with the schema registry proactively if the topic did not previously exist. Note that this will not change how we handle `CREATE` statements for topics that already exist (today we do not validate that the schema supplied matches the schema in Schema Registry).
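A minimal sketch of that flow (names are hypothetical): the statement below would create the `orders` topic and proactively register its Avro value schema with the schema registry:

```sql
-- topic 'orders' does not exist yet; it is created and its Avro schema registered proactively
CREATE STREAM orders (id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='AVRO', PARTITIONS=1);
```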
The underlying data and the way it is represented in KSQL sometimes differ, and this `INSERT VALUES` functionality needs to bridge that gap. For example, take the following KSQL statement:
CREATE TABLE bar (id VARCHAR, foo INT) WITH (key="id")
In this case, we expect Kafka records that resemble:
{"key": "bar", "timestamp": 1, "value": {"id": "bar", "foo": 123}}
If this is not the case (e.g. the value of `"key"` and of `"value"->"id"` do not match), then the KSQL behavior is not well defined.
The column schema will always include the implicit `ROWKEY` and `ROWTIME` fields, and the user can choose to specify or ignore these fields. To prevent the inconsistencies described above, though, we will automatically enforce consistency whenever the key property is set in the `WITH` clause.
- If the user sets a `KEY` property, they do not need to supply `ROWKEY` in their `INSERT INTO` statement (e.g. `INSERT INTO foo (id, foo) VALUES ("key", 123);`). In this case, the Kafka message produced will automatically populate `ROWKEY` with the value of `id`: `{"key": "key", "timestamp": 1, "value": {"id": "key", "foo": 123}}`
- Analogous to the above, if the user provides `ROWKEY` but not the key column, the key column will be populated for them.
- If the user sets a `KEY` property and they supply `ROWKEY` in their `INSERT INTO` statement, KSQL will enforce that the value of `ROWKEY` and the value of the key column are identical: `INSERT INTO foo (ROWKEY, ID, VALUE) VALUES ("THIS WILL FAIL", "key", 123);`
- If the user does not set a `KEY` property, and they are inserting into a stream, they may choose to omit `ROWKEY` altogether. This will result in a `null` key.
- If the user does not set a `KEY` property, and they are inserting into a table, they must have a `ROWKEY` value in their `INSERT INTO` statement.
If a value for `ROWTIME` is supplied, then that will be populated into the Kafka record, otherwise broker time is used. If a timestamp extractor is also supplied, the `ROWTIME` value will still be sent, but the extractor will overwrite the data in `ROWTIME`.
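Putting the `ROWKEY`/`ROWTIME` rules together, a short sketch (re-using the hypothetical `foo` source with columns `id`/`foo` and `KEY='id'` from the bullets above):

```sql
-- ROWKEY is auto-populated from the key column 'id'; ROWTIME defaults to broker time
INSERT INTO foo (id, foo) VALUES ('key', 123);

-- an explicit ROWTIME is written into the record's timestamp
INSERT INTO foo (ROWTIME, id, foo) VALUES (1234, 'key', 123);

-- mismatched ROWKEY and key column values are rejected
INSERT INTO foo (ROWKEY, id, foo) VALUES ('other-key', 'key', 123);
```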
Although not in scope for V1, complex data types can all be supported following precedents of other SQL-like languages. For example, Google BigQuery supports constructing STRUCT using the syntax below (see link for full set of sample syntax):
| Syntax | Output |
|---|---|
| `STRUCT(1,2,3)` | `STRUCT<int64,int64,int64>` * |
| `STRUCT(1 AS a, 'abc' AS b)` | `STRUCT<a int64, b string>` |
| `STRUCT<x int64>(5)` | `STRUCT<x int64>` |
| `STRUCT(5, STRUCT(5))` | `STRUCT<int64, STRUCT<int64>>` |

* we may not want to support "unnamed" structs, though that decision can be made later
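Purely as a sketch of where the KSQL syntax could go (not in scope for V1, and the syntax is not settled), an insert into a struct-valued column might eventually look like:

```sql
-- hypothetical future syntax, following the BigQuery-style precedent above,
-- for a column 'address' of type STRUCT<street VARCHAR, zip INT>
INSERT INTO customers (id, address) VALUES ('c1', STRUCT('1 Main St' AS street, 94105 AS zip));
```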
This functionality is expected to work only in Interactive mode, and these tests must succeed:
- Test that values can be serialized into JSON/AVRO/DELIMITED
- Test that schemas are properly verified
- Test `NULL` and table tombstones
- Test `CREATE STREAM/TABLE` with topics that don't exist
- Test that `CREATE STREAM/TABLE` also creates schemas in Schema Registry for AVRO encodings
- Test producing data in topics that already exist
- Test producing data to topics that are not associated with a KSQL source (should fail)
- The `CREATE TABLE` and `CREATE STREAM` syntax references will have an updated table properties section that includes the following rows:
+=========================+========================================================================================================+
| KAFKA_TOPIC (required)  | The name of the Kafka topic that backs this source. The topic must either already exist in Kafka, or  |
|                         | PARTITIONS must be specified to create the topic. Command will fail if the topic exists with           |
|                         | different partition/replica counts.                                                                    |
+-------------------------+--------------------------------------------------------------------------------------------------------+
| PARTITIONS              | The number of partitions in the backing topic. This property must be set if creating a SOURCE without  |
|                         | an existing topic (the command will fail if the topic does not exist).                                  |
+-------------------------+--------------------------------------------------------------------------------------------------------+
| REPLICAS                | The number of replicas in the backing topic. If this property is not set but PARTITIONS is set,        |
|                         | then the default Kafka cluster configuration for replicas will be used for creating a new topic.       |
+-------------------------+--------------------------------------------------------------------------------------------------------+
- A new section for `INSERT INTO ... VALUES` will be added to `syntax.rst`:
**Synopsis**

.. code:: sql

    INSERT INTO <stream_name|table_name> [(column_name [, ...] )] VALUES (value [, ...])

**Description**

Produce a row into an existing stream or table and its underlying topic based on explicitly
specified values. The first ``column_name`` of every schema is ``ROWTIME``, which represents
the time at which the event was produced (simulated). The second is ``ROWKEY``, which defines
the corresponding Kafka key - it is expected to always match a corresponding column in the
value if the create statement specifies a key in its ``WITH`` clause.

Column values can be given in several ways:

* Any column not explicitly given a value is set to ``null`` (KSQL does not yet support
  DEFAULT values).
* If no columns are specified, a value for every column is expected in the same order as the
  schema, with ``ROWKEY`` as the first column and ``ROWTIME`` as the current time.

For example, the statements below would all be valid for a source with schema
``<KEY_COL VARCHAR, COL_A VARCHAR>`` with ``KEY=KEY_COL``:

.. code:: sql

    -- inserts (1234, "key", "key", "A")
    INSERT INTO foo (ROWTIME, ROWKEY, KEY_COL, COL_A) VALUES (1234, "key", "key", "A");

    -- inserts (current_time(), "key", "key", "A")
    INSERT INTO foo VALUES ("key", "key", "A");

    -- inserts (current_time(), "key", "key", "A")
    INSERT INTO foo (KEY_COL, COL_A) VALUES ("key", "A");

    -- inserts (current_time(), "key", "key", null)
    INSERT INTO foo (KEY_COL) VALUES ("key");

The values will serialize using the ``value_format`` specified in the original ``CREATE`` statement.
- A new section for `DELETE` will be added to `syntax.rst`:
**Synopsis**

.. code:: sql

    DELETE FROM <table_name> WHERE ROWKEY = <value>;

**Description**

Delete a row from an existing table. This will issue a tombstone, producing a record with the
specified key and a ``null`` value to the underlying Kafka topic. Deleting a value that does not
exist will not affect the table contents. This statement will fail if attempting to delete from a
``STREAM`` source.
- All previous commands that succeeded will still succeed after this KLIP
- `CREATE` commands that failed with "topic does not exist" may now fail with a more verbose and useful error message:

Topic 'foo' does not exist. If you require an empty topic to be created for the stream/table, please re-run the statement providing the required 'PARTITIONS' count and, optionally, the 'REPLICAS' count in the WITH clause, e.g.
CREATE STREAM BAR (f0, f1) WITH (KAFKA_TOPIC='foo', PARTITIONS=4, REPLICAS=2);
- A new configuration, `ksql.insert.into.values.enabled`, will guard this whole feature suite behind a config flag
N/A
- Authentication/Authorization: we will authenticate/authorize `INSERT INTO` statements under the same systems that we use for transient queries. Namely, the user needs `PRODUCE` permissions on the topic they are writing to.
- Audit: For V1, we will not be auditing `INSERT INTO` statements, but we will provide a config to disable this functionality completely (`ksql.insert.into.values.enabled`)