
Releases: MobileTeleSystems/onetl

0.13.0 (2025-02-24)


🎉 3 years since first release 0.1.0 🎉

Breaking Changes

  • Add Python 3.13 support. (#298)

  • Change the logic of FileConnection.walk and FileConnection.list_dir. (#327)

    Previously limits.stops_at(path) == True was treated as "return the current file and stop", which could lead to exceeding a limit. Now it means "stop immediately".

  • Change default value for FileDFWriter.Options(if_exists=...) from error to append, to make it consistent with other .Options() classes within onETL. (#343)

Features

  • Add support for FileModifiedTimeHWM HWM class (see etl-entities 2.5.0):

    from etl_entities.hwm import FileModifiedTimeHWM
    from onetl.file import FileDownloader
    from onetl.strategy import IncrementalStrategy
    
    downloader = FileDownloader(
        ...,
        hwm=FileModifiedTimeHWM(name="somename"),
    )
    
    with IncrementalStrategy():
        downloader.run()
  • Introduce FileSizeRange(min=..., max=...) filter class. (#325)

    Now users can set FileDownloader / FileMover to download/move only files within a specific size range:

    from onetl.file import FileDownloader
    from onetl.file.filter import FileSizeRange
    
    downloader = FileDownloader(
        ...,
        filters=[FileSizeRange(min="10KiB", max="1GiB")],
    )
  • Introduce TotalFilesSize(...) limit class. (#326)

    Now users can set FileDownloader / FileMover to stop downloading/moving files after reaching a certain amount of data:

    from onetl.file import FileDownloader
    from onetl.file.limit import TotalFilesSize
    
    downloader = FileDownloader(
        ...,
        limits=[TotalFilesSize("1GiB")],
    )
  • Implement FileModifiedTime(since=..., until=...) file filter. (#330)

    Now users can set FileDownloader / FileMover to download/move only files with a specific modification time:

    from datetime import datetime, timedelta
    from onetl.file import FileDownloader
    from onetl.file.filter import FileModifiedTime
    
    downloader = FileDownloader(
        ...,
        filters=[FileModifiedTime(until=datetime.now() - timedelta(hours=1))],
    )
  • Add SparkS3.get_exclude_packages() and Kafka.get_exclude_packages() methods. (#341)

    Using them allows skipping the download of dependencies that are not required by a specific connector, or that are already part of Spark/PySpark:

    from pyspark.sql import SparkSession

    from onetl.connection import SparkS3, Kafka
    
    maven_packages = [
        *SparkS3.get_packages(spark_version="3.5.4"),
        *Kafka.get_packages(spark_version="3.5.4"),
    ]
    exclude_packages = SparkS3.get_exclude_packages() + Kafka.get_exclude_packages()
    spark = (
        SparkSession.builder.appName("spark_app_onetl_demo")
        .config("spark.jars.packages", ",".join(maven_packages))
        .config("spark.jars.excludes", ",".join(exclude_packages))
        .getOrCreate()
    )

Improvements

  • All DB connections opened by JDBC.fetch(...), JDBC.execute(...) or JDBC.check() are now closed immediately after the statement is executed. (#334)

    Previously a Spark session with master=local[3] actually opened up to 5 connections to the target DB - one for JDBC.check(), another for the Spark driver interacting with the DB (e.g. to create tables), and one per Spark executor. Now at most 4 connections are opened, as JDBC.check() no longer holds an open connection.

    This is important for RDBMSes like Postgres or Greenplum, where the number of connections is strictly limited and the limit is usually quite low.

  • Set up ApplicationName (client info) for Clickhouse, MongoDB, MSSQL, MySQL and Oracle. (#339, #248)

    Also update ApplicationName format for Greenplum, Postgres, Kafka and SparkS3. Now all connectors have the same ApplicationName format: ${spark.applicationId} ${spark.appName} onETL/${onetl.version} Spark/${spark.version}

    The only connections not sending ApplicationName are Teradata and FileConnection implementations.
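
    A quick way to inspect the value on the DB side (a sketch, assuming an already configured postgres connection object):

    df = postgres.fetch("SELECT application_name FROM pg_stat_activity")
    df.show()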

  • Now DB.check() tests connection availability not only from the Spark driver, but also from a Spark executor. (#346)

    This allows failing immediately if the Spark driver host has network access to the target DB, but the Spark executors do not.

Bug Fixes

  • Avoid suppressing Hive Metastore errors while using DBWriter. (#329)

    Previously this was implemented as:

    try:
        spark.sql(f"SELECT * FROM {table}")
        table_exists = True
    except Exception:
        table_exists = False

    If Hive Metastore was overloaded and responded with an exception, the table was considered non-existing, resulting in a full table overwrite instead of appending or overwriting only a subset of partitions.

  • Fix using onETL to write data to PostgreSQL or Greenplum instances behind pgbouncer with pool_mode=transaction. (#336)

    Previously Postgres.check() opened a read-only transaction, pgbouncer then changed the entire connection from read-write to read-only, and DBWriter.run(df) executed in a read-only connection, producing errors like:

    org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction
    org.postgresql.util.PSQLException: ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
    

    Added a workaround by passing readOnly=True to JDBC params for read-only connections, so pgbouncer can properly distinguish read-only and read-write connections.

    After upgrading to onETL 0.13.x or higher, the same error may still appear if pgbouncer still holds read-only connections and returns them to DBWriter. To fix this, the user can manually convert a read-only connection to read-write:

    postgres.execute("BEGIN READ WRITE;")  # <-- add this line
    DBWriter(...).run(df)

    After all connections in the pgbouncer pool have been converted from read-only to read-write and the error is gone, this additional line can be removed.

    See Postgres JDBC driver documentation.

  • Fix MSSQL.fetch(...) and MySQL.fetch(...) opening a read-write connection instead of a read-only one. (#337)

    Now:

      • MSSQL.fetch(...) establishes a connection with ApplicationIntent=ReadOnly.
      • MySQL.fetch(...) executes a SET SESSION TRANSACTION READ ONLY statement.
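
    A hedged usage sketch, assuming an already configured mssql connection object (the table name is illustrative):

    df = mssql.fetch("SELECT COUNT(*) AS cnt FROM dbo.mytable")  # connection is opened as read-only
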
  • Fixed passing multiple filters to FileDownloader and FileMover. (#338) It was caused by sorting the filters list in an internal logging method, while FileFilter subclasses are not sortable.
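
    For example, combining several filters now works as expected (a sketch following the examples above):

    from onetl.file import FileDownloader
    from onetl.file.filter import FileSizeRange, Glob

    downloader = FileDownloader(
        ...,
        filters=[Glob("*.csv"), FileSizeRange(max="1GiB")],
    )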

  • Fix a false warning about a lot of parallel connections to Greenplum. (#342)

    Creating a Spark session with .master("local[5]") may open up to 6 connections to Greenplum (number of Spark executors + 1 for the driver), but onETL instead used the number of CPU cores on the host as the number of parallel connections.

    This led to a false warning that the number of Greenplum connections is too high, which should actually be shown only if the number of executors is higher than 30.

  • Fix MongoDB trying to use current database name as authSource. (#347)

    Now the connector's default value is used, which is the admin database. In previous onETL versions this could be worked around with:

    from onetl.connection import MongoDB
    
    mongodb = MongoDB(
        ...,
        database="mydb",
        extra={
            "authSource": "admin",
        },
    )

Dependencies

  • Minimal etl-entities version is now 2.5.0. (#331)

  • Update DB connectors/drivers to latest versions: (#345)

    • Clickhouse 0.6.5 -> 0.7.2
    • MongoDB 10.4.0 -> 10.4.1
    • MySQL 9.0.0 -> 9.2.0
    • Oracle 23.5.0.24.07 -> 23.7.0.25.01
    • Postgres 42.7.4 -> 42.7.5

Doc only Changes

  • Split large code examples to tabs. (#344)

0.12.5 (2024-12-03)


Improvements

  • Use sipHash64 instead of md5 in Clickhouse for reading data with {"partitioning_mode": "hash"}, as it is 5 times faster.
  • Use hashtext instead of md5 in Postgres for reading data with {"partitioning_mode": "hash"}, as it is 3-5 times faster.
  • Use BINARY_CHECKSUM instead of HASHBYTES in MSSQL for reading data with {"partitioning_mode": "hash"}, as it is 5 times faster.
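
For reference, a minimal sketch of reading with {"partitioning_mode": "hash"}, assuming an already configured clickhouse connection (table and column names are illustrative):

    from onetl.db import DBReader
    from onetl.connection import Clickhouse

    reader = DBReader(
        connection=clickhouse,
        source="schema.mytable",
        options=Clickhouse.ReadOptions(
            partitioning_mode="hash",
            partition_column="id",
            num_partitions=10,
        ),
    )
    df = reader.run()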

Bug Fixes

  • In JDBC sources, wrap MOD(partitionColumn, numPartitions) with ABS(...) to make all returned values positive. This prevents data skew.
  • Fix reading table data from MSSQL using {"partitioning_mode": "hash"} with partitionColumn of integer type.
  • Fix reading table data from Postgres using {"partitioning_mode": "hash"} leading to data skew (all the data was read into one Spark partition).

0.12.4 (2024-11-27)


Bug Fixes

  • Fix DBReader(conn=oracle, options={"partitioning_mode": "hash"}) leading to data skew in the last partition due to wrong ora_hash usage. (#319)

0.12.3 (2024-11-22)


Bug Fixes

  • Allow passing table names in format schema."table.with.dots" to DBReader(source=...) and DBWriter(target=...).
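
    A minimal sketch, assuming an already configured postgres connection (the table name is illustrative):

    from onetl.db import DBReader

    reader = DBReader(
        connection=postgres,
        source='schema."table.with.dots"',
    )
    df = reader.run()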

0.12.2 (2024-11-12)


Improvements

  • Change Spark jobDescription for DBReader & FileDFReader from DBReader.run() -> Connection to Connection -> DBReader.run().

Bug Fixes

  • Fix log_hwm output for KeyValueIntHWM (used by Kafka). (#316)
  • Fix log_collection hiding values in logs with INFO level. (#316)


Doc only Changes

  • Fix links to MSSQL date & time type documentation.

0.12.1 (2024-10-28)


Features

  • Log detected JDBC dialect while using DBWriter.

Bug Fixes

  • Fix SparkMetricsRecorder failing when receiving SparkListenerTaskEnd without taskMetrics (e.g. executor was killed by OOM). (#313)
  • Call kinit before checking for HDFS active namenode.
  • Wrap kinit with threading.Lock to avoid multithreading issues.
  • Immediately show kinit errors to user, instead of hiding them.
  • Use AttributeError instead of ImportError in module's __getattr__ method, to make code compliant with Python spec.


0.12.0 (2024-09-03)


Breaking Changes

  • Change connection URL used for generating HWM names of S3 and Samba sources:

    • smb://host:port -> smb://host:port/share
    • s3://host:port -> s3://host:port/bucket (#304)
  • Update DB connectors/drivers to latest versions:

    • Clickhouse 0.6.0-patch5 -> 0.6.5
    • MongoDB 10.3.0 -> 10.4.0
    • MSSQL 12.6.2 -> 12.8.1
    • MySQL 8.4.0 -> 9.0.0
    • Oracle 23.4.0.24.05 -> 23.5.0.24.07
    • Postgres 42.7.3 -> 42.7.4
  • Update Excel package from 0.20.3 to 0.20.4, to include Spark 3.5.1 support. (#306)

Features

  • Add support for specifying file formats (ORC, Parquet, CSV, etc.) in HiveWriteOptions.format (#292):

    from onetl.file.format import ORC

    Hive.WriteOptions(format=ORC(compression="snappy"))
  • Collect Spark execution metrics in the following methods, and log them in DEBUG mode:

    • DBWriter.run()
    • FileDFWriter.run()
    • Hive.sql()
    • Hive.execute()

    This is implemented using a custom SparkListener which wraps the entire method call and then reports the collected metrics. These metrics may sometimes be missing due to Spark architecture, so they are not a reliable source of information. That's why they are printed only in DEBUG logs and are not returned as a method call result. (#303)

  • Generate default jobDescription based on currently executed method. Examples:

    • DBWriter.run(schema.table) -> Postgres[host:5432/database]
    • MongoDB[localhost:27017/admin] -> DBReader.has_data(mycollection)
    • Hive[cluster].execute()

    If the user has already set a custom jobDescription, it will be left intact. (#304)

  • Add log.info about JDBC dialect usage (#305):

    |MySQL| Detected dialect: 'org.apache.spark.sql.jdbc.MySQLDialect'
    
  • Log estimated size of in-memory dataframe created by JDBC.fetch and JDBC.execute methods. (#303)

Bug Fixes

  • Fix passing Greenplum(extra={"options": ...}) during read/write operations. (#308)
  • Do not raise an exception if a yield-based hook has nothing past its (one and only) yield.

0.11.2 (2024-09-02)


Bug Fixes

  • Fix passing Greenplum(extra={"options": ...}) during read/write operations. (#308)

0.11.1 (2024-05-29)


Features

  • Change MSSQL.port default from 1433 to None, allowing use of instanceName to detect port number. (#287)
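
    A hedged sketch of relying on instanceName instead of an explicit port; connection details are placeholders, and instanceName is passed via extra as a standard MSSQL JDBC property:

    from onetl.connection import MSSQL

    mssql = MSSQL(
        host="mssql.domain.com",
        user="appuser",
        password="***",
        database="mydb",
        extra={"instanceName": "myinstance"},  # port is resolved from the instance name
        spark=spark,
    )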

Bug Fixes

  • Remove fetchsize from JDBC.WriteOptions. (#288)

0.11.0 (2024-05-27)


Breaking Changes

There can be some changes in connection behavior related to version upgrades, so we mark these changes as breaking, although most users will not see any difference.

  • Update Clickhouse JDBC driver to latest version (#249):

    • Package was renamed: ru.yandex.clickhouse:clickhouse-jdbc -> com.clickhouse:clickhouse-jdbc.
    • Package version changed: 0.3.2 -> 0.6.0-patch5.
    • Driver name changed: ru.yandex.clickhouse.ClickHouseDriver -> com.clickhouse.jdbc.ClickHouseDriver.

    This brings up several fixes for Spark <-> Clickhouse type compatibility, and also Clickhouse clusters support.

Warning

New JDBC driver has a more strict behavior regarding types:

  • Old JDBC driver applied max(1970-01-01T00:00:00, value) for Timestamp values, as this is a minimal supported value of DateTime32 Clickhouse type. New JDBC driver doesn't.
  • Old JDBC driver rounded values with higher precision than target column during write. New JDBC driver doesn't.
  • Old JDBC driver replaced NULLs as input for non-Nullable columns with column's DEFAULT value. New JDBC driver doesn't. To enable previous behavior, pass Clickhouse(extra={"nullsAsDefault": 2}) (see documentation).
  • Update other JDBC drivers to latest versions:

    • MSSQL 12.2.0 -> 12.6.2 (#254).
    • MySQL 8.0.33 -> 8.4.0 (#253, #285).
    • Oracle 23.2.0.0 -> 23.4.0.24.05 (#252, #284).
    • Postgres 42.6.0 -> 42.7.3 (#251).
  • Update MongoDB connector to latest version: 10.1.1 -> 10.3.0 (#255, #283).

    This brings up Spark 3.5 support.

  • Update XML package to latest version: 0.17.0 -> 0.18.0 (#259).

    This brings a few bugfixes for datetime format handling.

  • For JDBC connections add new SQLOptions class for DB.sql(query, options=...) method (#272).

    Firstly, to keep naming more consistent.

    Secondly, some options are not supported by the DB.sql(...) method, but are supported by DBReader. For example, SQLOptions does not support partitioning_mode and requires explicit definition of lower_bound and upper_bound when num_partitions is greater than 1. ReadOptions does support partitioning_mode and allows skipping lower_bound and upper_bound values.

    This requires some code changes. Before:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.sql(
        """
        SELECT *
        FROM some.mytable
        WHERE key = 'something'
        """,
        options=Postgres.ReadOptions(
            partitioning_mode="range",
            partition_column="id",
            num_partitions=10,
        ),
    )

    After:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.sql(
        """
        SELECT *
        FROM some.mytable
        WHERE key = 'something'
        """,
        options=Postgres.SQLOptions(
            # partitioning_mode is not supported!
            partition_column="id",
            num_partitions=10,
            lower_bound=0,  # <-- set explicitly
            upper_bound=1000,  # <-- set explicitly
        ),
    )

    For now, DB.sql(query, options=...) can accept ReadOptions to keep backward compatibility, but emits a deprecation warning. This support will be removed in v1.0.0.

  • Split up JDBCOptions class into FetchOptions and ExecuteOptions (#274).

    New classes are used by DB.fetch(query, options=...) and DB.execute(query, options=...) methods respectively. This is mostly to keep naming more consistent.

    This requires some code changes. Before:

    from onetl.connection import Postgres
    
    postgres = Postgres(...)
    df = postgres.fetch(
        "SELECT * FROM some.mytable WHERE key = 'something'",
        options=Postgres.JDBCOptions(
            fetchsize=1000,
            query_timeout=30,
        ),
    )
    
    postgres.execute(
        "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
        options=Postgres.JDBCOptions(query_timeout=30),
    )

    After:

    from onetl.connection import Postgres
    
    # Using FetchOptions for fetching data
    postgres = Postgres(...)
    df = postgres.fetch(
        "SELECT * FROM some.mytable WHERE key = 'something'",
        options=Postgres.FetchOptions(  # <-- change class name
            fetchsize=1000,
            query_timeout=30,
        ),
    )
    
    # Using ExecuteOptions for executing statements
    postgres.execute(
        "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
        options=Postgres.ExecuteOptions(query_timeout=30),  # <-- change class name
    )

    For now, DB.fetch(query, options=...) and DB.execute(query, options=...) can accept JDBCOptions, to keep backward compatibility, but emit a deprecation warning. The old class will be removed in v1.0.0.

  • Serialize ColumnDatetimeHWM to Clickhouse's DateTime64(6) (precision up to microseconds) instead of DateTime (precision up to seconds) (#267).

    In previous onETL versions, the ColumnDatetimeHWM value was rounded to the second, and thus some rows that had already been read in previous runs were read again, producing duplicates.

    For Clickhouse versions below 21.1, comparing a column of type DateTime with a value of type DateTime64 is not supported and returns an empty dataframe. To avoid this, replace:

    DBReader(
        ...,
        hwm=DBReader.AutoDetectHWM(
            name="my_hwm",
            expression="hwm_column",  # <--
        ),
    )

    with:

    DBReader(
        ...,
        hwm=DBReader.AutoDetectHWM(
            name="my_hwm",
            expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
        ),
    )
  • Pass JDBC connection extra params as properties dict instead of URL with query part (#268).

    This allows passing custom connection parameters like Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"}) without the need to urlencode the parameter value, like option1%3Dvalue1%2Coption2%3Dvalue2.
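
    For example (a sketch; connection details are placeholders):

    from onetl.connection import Clickhouse

    clickhouse = Clickhouse(
        host="clickhouse.domain.com",
        user="appuser",
        password="***",
        extra={"custom_http_options": "option1=value1,option2=value2"},  # no urlencode needed
        spark=spark,
    )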

Features

Improve user experience with Kafka messages and Database tables with serialized columns, like JSON/XML.

  • Allow passing custom package version as argument for DB.get_packages(...) method of several DB connectors:

    • Clickhouse.get_packages(package_version=..., apache_http_client_version=...) (#249).
    • MongoDB.get_packages(scala_version=..., spark_version=..., package_version=...) (#255).
    • MySQL.get_packages(package_version=...) (#253).
    • MSSQL.get_packages(java_version=..., package_version=...) (#254).
    • Oracle.get_packages(java_version=..., package_version=...) (#252).
    • Postgres.get_packages(package_version=...) (#251).
    • Teradata.get_packages(package_version=...) (#256).

    Now users can downgrade or upgrade a connector without waiting for the next onETL release. Previously only Kafka and Greenplum supported this feature.
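
    A short sketch (the pinned version is illustrative):

    from onetl.connection import Postgres

    # pin a specific driver version instead of the one bundled with this onETL release
    maven_packages = Postgres.get_packages(package_version="42.6.0")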

  • Add FileFormat.parse_column(...) method to several classes:

    • Avro.parse_column(col) (#265).
    • JSON.parse_column(col, schema=...) (#257).
    • CSV.parse_column(col, schema=...) (#258).
    • XML.parse_column(col, schema=...) (#269).

    This allows parsing data in the value field of a Kafka message, or a string/binary column of some table, as a nested Spark structure.
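
    A minimal sketch of parsing a JSON value column, e.g. read from Kafka (schema and column names are illustrative):

    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    from onetl.file.format import JSON

    schema = StructType(
        [
            StructField("id", IntegerType()),
            StructField("name", StringType()),
        ]
    )
    # "value" is the raw string/binary column of the source dataframe
    parsed_df = df.select(JSON().parse_column("value", schema=schema))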

  • Add FileFormat.serialize_column(...) method to several classes:

    • Avro.serialize_column(col) (#265).
    • JSON.serialize_column(col) (#257).
    • CSV.serialize_column(col) (#258).

    This allows saving Spark nested structures or arrays to the value field of a Kafka message, or a string/binary column of some table.
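
    A minimal sketch of serializing a nested struct back into a value column (column names are illustrative):

    from pyspark.sql import functions as F

    from onetl.file.format import JSON

    # pack selected columns into a struct and serialize it as a JSON string
    serialized_df = df.select(
        JSON().serialize_column(F.struct("id", "name")).alias("value"),
    )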

Improvements

A few documentation improvements.

  • Replace all assert in documentation with doctest syntax. This should make documentation more readable (#273).

  • Add generic Troubleshooting guide (#275).

  • Improve Kafka documentation:

    • Add "Prerequisites" page describing different aspects of connecting to Kafka.
    • Improve "Reading from" and "Writing to" page of Kafka documentation, add more examples and usage notes.
    • Add "Troubleshooting" page (#276).
  • Improve Hive documentation:

    • Add "Prerequisites" page describing different aspects of connecting to Hive.
    • Improve "Reading from" and "Writing to" page of Hive documentation, add more examples and recommendations.
    • Improve "Executing statements in Hive" page of Hive documentation. (#278).
  • Add "Prerequisites" page describing different aspects of using SparkHDFS and SparkS3 connectors. (#279).

  • Add note about connecting to Clickhouse cluster. (#280).

  • Add notes about versions when specific class/method/attribute/argument was added, renamed or changed behavior (#282).

Bug Fixes

  • Fix missing pysmb package after running pip install onetl[files].