Tracking: Ad-hoc(batch) ingestion #18583

Open
4 of 28 tasks
st1page opened this issue Sep 18, 2024 · 10 comments
@st1page
Contributor

st1page commented Sep 18, 2024

We will enhance the ad-hoc ingestion capability in subsequent releases, with the expectation that users will eventually be able to run ad-hoc reads on data persisted in external systems.

Streaming storage

For streaming storage, predicate pushdown on the "offset" is required (a sketch follows the list below).

  • kafka
    • select from source
    • TVF
  • pulsar
    • select from source
    • TVF
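As a rough illustration of the offset-pushdown requirement, a batch read over a Kafka source could look like the sketch below; the source name and the hidden offset column name are assumptions for illustration, not a committed interface.

```sql
-- Minimal sketch (assumed names): an ad-hoc batch read from a Kafka source.
-- Instead of scanning the whole topic, the planner should push the offset
-- predicate down to the connector so only the requested range is fetched.
SELECT payload
FROM my_kafka_source
WHERE _rw_kafka_offset BETWEEN 10000 AND 20000;
```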

Lake

File source (object store)

  • select from source
  • TVF
    • only S3 is supported currently
  • optimization (a sketch follows this list)
    • column pruning
    • predicate pushdown
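A minimal sketch of what these optimizations buy for an ad-hoc TVF read; the `read_parquet` name and the S3 path follow the proposal later in this thread and are not a finalized signature.

```sql
-- Sketch: ad-hoc TVF read over an object store.
-- Column pruning: only user_id and event_time need to be fetched from the files.
-- Predicate pushdown: the filter on event_time can skip whole files / row groups.
SELECT user_id, event_time
FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE event_time >= '2024-09-01';
```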

Database

Currently we only support CREATE TABLE with a primary key on the CDC connector. To support ad-hoc ingestion, we need to design and introduce new syntax for CREATE SOURCE with a CDC connector; such a source could only be queried ad hoc. A rough sketch follows.
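As a loose illustration only (none of this syntax exists or is finalized; the property values and the DuckDB-style `postgres_query` TVF are assumptions):

```sql
-- Hypothetical: a CDC source created without a backing table; it would only
-- be usable for ad-hoc batch reads of the upstream tables.
CREATE SOURCE pg_mydb WITH (
    connector = 'postgres-cdc',
    hostname = '127.0.0.1',
    port = '5432',
    username = 'root',
    password = '123456',
    database.name = 'mydb'
);

-- One possible ad-hoc read path, borrowing DuckDB's postgres_query shape:
SELECT * FROM postgres_query('pg_mydb', 'SELECT * FROM orders LIMIT 10');
```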

Misc

@github-actions github-actions bot added this to the release-2.1 milestone Sep 18, 2024
@kwannoel
Contributor

Hi, I will help with this issue, starting with TVFs.

@xxchan
Member

xxchan commented Sep 27, 2024

Have we reached consensus on supporting TVFs? To me, their use cases duplicate those of Sources, so they seem unnecessary.

I'd like to see rationales and examples where they are more useful than sources before we add them.

@st1page
Contributor Author

st1page commented Sep 27, 2024

> Have we reached consensus on supporting TVFs? To me, their use cases duplicate those of Sources, so they seem unnecessary.
>
> I'd like to see rationales and examples where they are more useful than sources before we add them.

Additionally, one case is ad-hoc ingestion from databases. Currently we only support the CDC table and cannot create a source on an external database's table, so a TVF is the only clearly defined way to do ad-hoc ingestion from databases. We can refer to DuckDB's grammar for these cases:
duckdb.org/docs/extensions/postgres.html#the-postgres_query-table-function
duckdb.org/docs/extensions/mysql#the-mysql_query-table-function

@xxchan
Member

xxchan commented Sep 27, 2024

Thanks for the explanation!

> Currently we only support the CDC table and cannot create a source on an external database's table.

Makes me wonder whether this is also related to other shared sources, e.g. Kafka?

> We can refer to DuckDB's grammar for these cases

Compared with DuckDB:

  • They don't have sources at all, so it might be a little different.
  • Their syntax contains an ATTACH, which looks like the CREATE CONNECTION we might have in the future. So maybe we should design that first.
ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES);
SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM cars LIMIT 3');

@st1page
Contributor Author

st1page commented Sep 27, 2024

> Currently we only support the CDC table and cannot create a source on an external database's table.
>
> Makes me wonder whether this is also related to other shared sources, e.g. Kafka?

The issue is not related to "shared"; it is because the CDC source contains multiple tables' changes. Actually, that is a "CONNECTION".

@st1page
Contributor Author

st1page commented Sep 27, 2024

> Compared with DuckDB:
>
>   • They don't have sources at all, so it might be a little different.
>   • Their syntax contains an ATTACH, which looks like the CREATE CONNECTION we might have in the future. So maybe we should design that first.
>
> ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES);
> SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM cars LIMIT 3');

Agree with that. cc @chenzl25, do we have a plan to simplify the syntax of the TVF with a connection?

@chenzl25
Contributor

chenzl25 commented Sep 27, 2024

After connections are supported, in my mind a connection can be used in a TVF directly, like:

  • read_parquet(s3_connection, 's3://bucket/path/xxxx.parquet')
  • read_csv(s3_connection, 's3://bucket/path/xxxx.csv')
  • read_json(s3_connection, 's3://bucket/path/xxxx.json')
  • iceberg_scan(iceberg_connection, 'database_name.table_name')
  • postgres_query(pg_connection, 'select * from t')
  • mysql_query(my_connection, 'select * from t')

Connections contain the necessary information to allow a TVF to query the external system.
I think @tabVersion will support Connection this quarter.
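To make the proposal above concrete, here is a minimal sketch; the CREATE CONNECTION syntax and property names are assumptions, since connections had not been designed at this point:

```sql
-- Hypothetical CREATE CONNECTION holding the reusable S3 properties.
CREATE CONNECTION s3_conn WITH (
    type = 's3',
    region = 'us-east-1',
    access_key = 'xxx',
    secret_key = 'yyy'
);

-- The connection is then passed to the TVF instead of repeating credentials.
SELECT * FROM read_parquet(s3_conn, 's3://bucket/path/data.parquet');
```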

@tabVersion
Contributor

> After connections are supported, in my mind a connection can be used in a TVF directly, like:
>
>   • read_parquet(s3_connection, 's3://bucket/path/xxxx.parquet')
>   • read_csv(s3_connection, 's3://bucket/path/xxxx.csv')
>   • read_json(s3_connection, 's3://bucket/path/xxxx.json')
>   • iceberg_scan(iceberg_connection, 'database_name.table_name')
>   • postgres_query(pg_connection, 'select * from t')
>   • mysql_query(my_connection, 'select * from t')
>
> Connections contain the necessary information to allow a TVF to query the external system. I think @tabVersion will support Connection this quarter.

Yes for s3_connection and iceberg_connection.
I still have some concerns about the *-cdc connectors. The CDC source is conceptually a CONNECTION; it contains nearly the same info as a CONNECTION:

CREATE SOURCE pg_mydb WITH (
    connector = 'postgres-cdc',
    hostname = '127.0.0.1',
    port = '8306',
    username = 'root',
    password = '123456',
    database.name = 'mydb',
    slot.name = 'mydb_slot',
    debezium.schema.history.internal.skip.unparseable.ddl = 'true'
);

@tabVersion
Contributor

@st1page

> Additionally, one case is ad-hoc ingestion from databases. Currently we only support the CDC table and cannot create a source on an external database's table, so a TVF is the only clearly defined way to do ad-hoc ingestion from databases. We can refer to DuckDB's grammar for these cases:
> duckdb.org/docs/extensions/postgres.html#the-postgres_query-table-function
> duckdb.org/docs/extensions/mysql#the-mysql_query-table-function

The main idea for CONNECTION is minimizing the user's effort when creating new sources/tables. It stores some props and applies them to all sources/sinks/tables created from the CONNECTION.

Things get a little different here because MQs have looser ACL control than file systems, e.g. S3. So I'd propose we must define BUCKET in fs CONNECTIONs.

From my perspective, we can draw a line here (sketched below):

  • Ad-hoc ingest from MQs, e.g. Kafka → select * from source
  • Ad-hoc ingest from FS, e.g. Iceberg and S3 → TVF
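A minimal sketch of that split, with purely illustrative source names and paths; the TVF name follows the proposal earlier in this thread rather than a shipped API:

```sql
-- 1) Ad-hoc ingest from an MQ (e.g. Kafka): batch-read the existing source.
SELECT * FROM my_kafka_source LIMIT 100;

-- 2) Ad-hoc ingest from a file system / lake (e.g. S3): use a TVF.
SELECT * FROM read_parquet('s3://my-bucket/path/data.parquet');
```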

@wcy-fdu
Contributor

wcy-fdu commented Oct 14, 2024

> So I'd propose we must define BUCKET in fs CONNECTIONs.

+1 for this. We need the bucket name to validate that RisingWave can read from the specific bucket or data directory. Here the bucket in fs_connection is like db_name in db_connection.
