Releases: MobileTeleSystems/onetl
0.13.0 (2025-02-24)
🎉 3 years since first release 0.1.0 🎉
Breaking Changes
- Add Python 3.13 support. (#298)
- Change the logic of FileConnection.walk and FileConnection.list_dir. (#327) Previously limits.stops_at(path) == True was treated as "return the current file and stop", which could lead to exceeding some limits. Now it means "stop immediately".
- Change the default value of FileDFWriter.Options(if_exists=...) from error to append, to make it consistent with other .Options() classes within onETL. (#343)
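  To keep the previous behavior, set the option explicitly (a minimal sketch; the connection object and target path are placeholders):

  ```python
  from onetl.file import FileDFWriter
  from onetl.file.format import CSV

  writer = FileDFWriter(
      connection=spark_hdfs,  # some existing file DF connection
      format=CSV(),
      target_path="/some/path",
      options=FileDFWriter.Options(if_exists="error"),  # restore pre-0.13.0 default
  )
  ```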
Features
- Add support for the FileModifiedTimeHWM HWM class (see etl-entities 2.5.0):

  ```python
  from etl_entities.hwm import FileModifiedTimeHWM

  from onetl.file import FileDownloader
  from onetl.strategy import IncrementalStrategy

  downloader = FileDownloader(
      ...,
      hwm=FileModifiedTimeHWM(name="somename"),
  )

  with IncrementalStrategy():
      downloader.run()
  ```
- Introduce FileSizeRange(min=..., max=...) filter class. (#325) Now users can set FileDownloader/FileMover to download/move only files within a specific size range:

  ```python
  from onetl.file import FileDownloader
  from onetl.file.filter import FileSizeRange

  downloader = FileDownloader(
      ...,
      filters=[FileSizeRange(min="10KiB", max="1GiB")],
  )
  ```
- Introduce TotalFilesSize(...) limit class. (#326) Now users can set FileDownloader/FileMover to stop downloading/moving files after reaching a certain amount of data:

  ```python
  from onetl.file import FileDownloader
  from onetl.file.limit import TotalFilesSize

  downloader = FileDownloader(
      ...,
      limits=[TotalFilesSize("1GiB")],
  )
  ```
- Implement FileModifiedTime(since=..., until=...) file filter. (#330) Now users can set FileDownloader/FileMover to download/move only files with a specific file modification time:

  ```python
  from datetime import datetime, timedelta

  from onetl.file import FileDownloader
  from onetl.file.filter import FileModifiedTime

  downloader = FileDownloader(
      ...,
      # only files modified more than an hour ago
      filters=[FileModifiedTime(until=datetime.now() - timedelta(hours=1))],
  )
  ```
- Add SparkS3.get_exclude_packages() and Kafka.get_exclude_packages() methods. (#341) Using them allows skipping the download of dependencies not required by this specific connector, or which are already a part of Spark/PySpark:

  ```python
  from pyspark.sql import SparkSession

  from onetl.connection import SparkS3, Kafka

  maven_packages = [
      *SparkS3.get_packages(spark_version="3.5.4"),
      *Kafka.get_packages(spark_version="3.5.4"),
  ]
  exclude_packages = SparkS3.get_exclude_packages() + Kafka.get_exclude_packages()

  spark = (
      SparkSession.builder.appName("spark_app_onetl_demo")
      .config("spark.jars.packages", ",".join(maven_packages))
      .config("spark.jars.excludes", ",".join(exclude_packages))
      .getOrCreate()
  )
  ```
Improvements
- All DB connections opened by JDBC.fetch(...), JDBC.execute(...) or JDBC.check() are immediately closed after the statement is executed. (#334) Previously a Spark session with master=local[3] actually opened up to 5 connections to the target DB: one for JDBC.check(), another for the Spark driver's interaction with the DB (e.g. creating tables), and one for each Spark executor. Now at most 4 connections are opened, as JDBC.check() does not hold an open connection. This is important for RDBMS like Postgres or Greenplum, where the number of connections is strictly limited and the limit is usually quite low.
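  For example, each of the calls below now opens a short-lived connection and closes it right after execution (a minimal sketch; the connection parameters are omitted):

  ```python
  from onetl.connection import Postgres

  postgres = Postgres(...)

  postgres.check()  # no longer keeps its connection open afterwards
  df = postgres.fetch("SELECT version()")
  postgres.execute("DROP TABLE IF EXISTS myschema.temp_table")
  ```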
- Set up ApplicationName (client info) for Clickhouse, MongoDB, MSSQL, MySQL and Oracle. (#339, #248) Also update the ApplicationName format for Greenplum, Postgres, Kafka and SparkS3. Now all connectors have the same ApplicationName format:

  ```
  ${spark.applicationId} ${spark.appName} onETL/${onetl.version} Spark/${spark.version}
  ```

  The only connections not sending ApplicationName are Teradata and the FileConnection implementations.
- Now DB.check() tests connection availability not only on the Spark driver, but also from a Spark executor. (#346) This allows failing immediately if the Spark driver host has network access to the target DB but the Spark executors do not.
Bug Fixes
- Avoid suppressing Hive Metastore errors while using DBWriter. (#329) Previously this was implemented as:

  ```python
  try:
      spark.sql(f"SELECT * FROM {table}")
      table_exists = True
  except Exception:
      table_exists = False
  ```

  If the Hive Metastore was overloaded and responded with an exception, the table was considered non-existent, resulting in a full table overwrite instead of appending or overwriting only a subset of partitions.
- Fix using onETL to write data to PostgreSQL or Greenplum instances behind pgbouncer with pool_mode=transaction. (#336) Previously Postgres.check() opened a read-only transaction, pgbouncer changed the entire connection type from read-write to read-only, and DBWriter.run(df) was then executed on a read-only connection, producing errors like:

  ```
  org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction
  org.postgresql.util.PSQLException: ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
  ```

  Added a workaround by passing readOnly=True to JDBC params for read-only connections, so pgbouncer can properly distinguish read-only and read-write connections. After upgrading to onETL 0.13.x or higher the same error may still appear if pgbouncer still holds read-only connections and returns them for DBWriter. To fix this, the user can manually convert a read-only connection to read-write:

  ```python
  postgres.execute("BEGIN READ WRITE;")  # <-- add this line
  DBWriter(...).run(df)
  ```

  After all connections in the pgbouncer pool have been converted from read-only to read-write and the error is gone, this additional line can be removed.
- Fix MSSQL.fetch(...) and MySQL.fetch(...) opening a read-write connection instead of a read-only one. (#337) Now this is fixed:
  - MSSQL.fetch(...) establishes a connection with ApplicationIntent=ReadOnly.
  - MySQL.fetch(...) executes a SET SESSION TRANSACTION READ ONLY statement.
- Fixed passing multiple filters to FileDownloader and FileMover. (#338) It was caused by sorting the filters list in an internal logging method, while FileFilter subclasses are not sortable.
- Fix a false warning about a lot of parallel connections to Greenplum. (#342) Creating a Spark session with .master("local[5]") may open up to 6 connections to Greenplum (= number of Spark executors + 1 for the driver), but onETL instead used the number of CPU cores on the host as the number of parallel connections. This led to showing a false warning that the number of Greenplum connections is too high, which actually should be the case only if the number of executors is higher than 30.
- Fix MongoDB trying to use the current database name as authSource. (#347) Now the connector's default value, which is the admin database, is used. Previous onETL versions can be fixed by:

  ```python
  from onetl.connection import MongoDB

  mongodb = MongoDB(
      ...,
      database="mydb",
      extra={
          "authSource": "admin",
      },
  )
  ```
Dependencies
- Update DB connectors/drivers to latest versions: (#345)
  - Clickhouse 0.6.5 → 0.7.2
  - MongoDB 10.4.0 → 10.4.1
  - MySQL 9.0.0 → 9.2.0
  - Oracle 23.5.0.24.07 → 23.7.0.25.01
  - Postgres 42.7.4 → 42.7.5
Doc only Changes
- Split large code examples into tabs. (#344)
0.12.5 (2024-12-03)
Improvements
- Use sipHash64 instead of md5 in Clickhouse for reading data with {"partitioning_mode": "hash"}, as it is 5 times faster.
- Use hashtext instead of md5 in Postgres for reading data with {"partitioning_mode": "hash"}, as it is 3-5 times faster.
- Use BINARY_CHECKSUM instead of HASHBYTES in MSSQL for reading data with {"partitioning_mode": "hash"}, as it is 5 times faster.
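For reference, hash partitioning is enabled via read options like this (a minimal sketch; the connection, table and column names are placeholders):

```python
from onetl.connection import Postgres
from onetl.db import DBReader

postgres = Postgres(...)

reader = DBReader(
    connection=postgres,
    source="myschema.mytable",  # placeholder table name
    options=Postgres.ReadOptions(
        partitioning_mode="hash",
        partition_column="id",  # placeholder column used for hashing
        num_partitions=10,
    ),
)
df = reader.run()
```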
Bug Fixes
- In JDBC sources, wrap MOD(partitionColumn, numPartitions) with ABS(...) to make all returned values positive. This prevents data skew.
- Fix reading table data from MSSQL using {"partitioning_mode": "hash"} with partitionColumn of integer type.
- Fix reading table data from Postgres using {"partitioning_mode": "hash"} leading to data skew (all the data was read into one Spark partition).
0.12.4 (2024-11-27)
Bug Fixes
- Fix DBReader(conn=oracle, options={"partitioning_mode": "hash"}) leading to data skew in the last partition due to wrong ora_hash usage. (#319)
0.12.3 (2024-11-22)
Bug Fixes
- Allow passing table names in the format schema."table.with.dots" to DBReader(source=...) and DBWriter(target=...).
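  For example (a sketch; the connection object and table name are placeholders):

  ```python
  from onetl.db import DBReader

  reader = DBReader(
      connection=postgres,  # some existing DB connection
      source='myschema."table.with.dots"',  # quoted part may contain dots
  )
  df = reader.run()
  ```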
0.12.2 (2024-11-12)
Improvements
- Change Spark jobDescription for DBReader & FileDFReader from DBReader.run() -> Connection to Connection -> DBReader.run().
Bug Fixes
- Fix log_hwm output for KeyValueIntHWM (used by Kafka). (#316)
- Fix log_collection hiding values in logs with INFO level. (#316)
Dependencies
- Allow using etl-entities==2.4.0.
Doc only Changes
- Fix links to MSSQL date & time type documentation.
0.12.1 (2024-10-28)
Features
- Log detected JDBC dialect while using DBWriter.
Bug Fixes
- Fix SparkMetricsRecorder failing when receiving SparkListenerTaskEnd without taskMetrics (e.g. executor was killed by OOM). (#313)
- Call kinit before checking for HDFS active namenode.
- Wrap kinit with threading.Lock to avoid multithreading issues.
- Immediately show kinit errors to user, instead of hiding them.
- Use AttributeError instead of ImportError in module's __getattr__ method, to make code compliant with Python spec.
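  For context, a module-level __getattr__ (PEP 562) is expected to raise AttributeError for unknown names; a generic sketch, not onETL's actual code:

  ```python
  # mypackage/__init__.py — generic sketch, not onETL's actual code


  def __getattr__(name: str):
      if name == "SomeLazyClass":
          # import heavy dependencies lazily, on first attribute access
          from mypackage._impl import SomeLazyClass

          return SomeLazyClass
      # raising AttributeError (not ImportError) keeps hasattr() and dir() behaving as expected
      raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
  ```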
Doc only Changes
- Add note about spark-dialect-extension package to Clickhouse connector documentation. (#310)
0.12.0 (2024-09-03)
Breaking Changes
- Change connection URL used for generating HWM names of S3 and Samba sources: (#304)
  - smb://host:port -> smb://host:port/share
  - s3://host:port -> s3://host:port/bucket
- Update DB connectors/drivers to latest versions:
  - Clickhouse 0.6.0-patch5 → 0.6.5
  - MongoDB 10.3.0 → 10.4.0
  - MSSQL 12.6.2 → 12.8.1
  - MySQL 8.4.0 → 9.0.0
  - Oracle 23.4.0.24.05 → 23.5.0.24.07
  - Postgres 42.7.3 → 42.7.4
- Update Excel package from 0.20.3 to 0.20.4, to include Spark 3.5.1 support. (#306)
Features
- Add support for specifying file formats (ORC, Parquet, CSV, etc.) in HiveWriteOptions.format (#292):

  ```python
  Hive.WriteOptions(format=ORC(compression="snappy"))
  ```
- Collect Spark execution metrics in the following methods, and log them in DEBUG mode:
  - DBWriter.run()
  - FileDFWriter.run()
  - Hive.sql()
  - Hive.execute()

  This is implemented using a custom SparkListener which wraps the entire method call and then reports the collected metrics. But these metrics may sometimes be missing due to Spark architecture, so they are not a reliable source of information. That's why logs are printed only in DEBUG mode, and are not returned as a method call result. (#303)
- Generate a default jobDescription based on the currently executed method. Examples:
  - DBWriter.run(schema.table) -> Postgres[host:5432/database]
  - MongoDB[localhost:27017/admin] -> DBReader.has_data(mycollection)
  - Hive[cluster].execute()

  If the user has already set a custom jobDescription, it is left intact. (#304)
- Add log.info about JDBC dialect usage (#305):

  ```
  |MySQL| Detected dialect: 'org.apache.spark.sql.jdbc.MySQLDialect'
  ```
- Log estimated size of the in-memory dataframe created by JDBC.fetch and JDBC.execute methods. (#303)
Bug Fixes
- Fix passing Greenplum(extra={"options": ...}) during read/write operations. (#308)
- Do not raise an exception if a yield-based hook has code after its first (and only) yield.
0.11.2 (2024-09-02)
Bug Fixes
- Fix passing Greenplum(extra={"options": ...}) during read/write operations. (#308)
0.11.1 (2024-05-29)
0.11.0 (2024-05-27)
Breaking Changes
There can be some changes in connection behavior related to version upgrades, so we mark these changes as breaking, although most users will not see any difference.
- Update Clickhouse JDBC driver to latest version (#249):
  - Package was renamed: ru.yandex.clickhouse:clickhouse-jdbc → com.clickhouse:clickhouse-jdbc.
  - Package version changed: 0.3.2 → 0.6.0-patch5.
  - Driver name changed: ru.yandex.clickhouse.ClickHouseDriver → com.clickhouse.jdbc.ClickHouseDriver.

  This brings several fixes for Spark <-> Clickhouse type compatibility, and also Clickhouse cluster support.
  Warning

  The new JDBC driver has stricter behavior regarding types:
  - The old JDBC driver applied max(1970-01-01T00:00:00, value) to Timestamp values, as this is the minimal supported value of the Clickhouse DateTime32 type. The new JDBC driver doesn't.
  - The old JDBC driver rounded values with higher precision than the target column during write. The new JDBC driver doesn't.
  - The old JDBC driver replaced NULLs passed to non-Nullable columns with the column's DEFAULT value. The new JDBC driver doesn't. To enable the previous behavior, pass Clickhouse(extra={"nullsAsDefault": 2}) (see documentation).
- Update other JDBC drivers to latest versions:
- Update MongoDB connector to latest version: 10.1.1 → 10.3.0 (#255, #283). This brings Spark 3.5 support.
- Update XML package to latest version: 0.17.0 → 0.18.0 (#259). This brings a few bugfixes in datetime format handling.
- For JDBC connections, add a new SQLOptions class for the DB.sql(query, options=...) method (#272). Firstly, to keep naming more consistent.

  Secondly, some options are not supported by the DB.sql(...) method, but are supported by DBReader. For example, SQLOptions does not support partitioning_mode and requires explicit definition of lower_bound and upper_bound when num_partitions is greater than 1. ReadOptions does support partitioning_mode and allows skipping lower_bound and upper_bound values.

  This requires some code changes. Before:

  ```python
  from onetl.connection import Postgres

  postgres = Postgres(...)

  df = postgres.sql(
      """
      SELECT *
      FROM some.mytable
      WHERE key = 'something'
      """,
      options=Postgres.ReadOptions(
          partitioning_mode="range",
          partition_column="id",
          num_partitions=10,
      ),
  )
  ```

  After:

  ```python
  from onetl.connection import Postgres

  postgres = Postgres(...)

  df = postgres.sql(
      """
      SELECT *
      FROM some.mytable
      WHERE key = 'something'
      """,
      options=Postgres.SQLOptions(
          # partitioning_mode is not supported!
          partition_column="id",
          num_partitions=10,
          lower_bound=0,  # <-- set explicitly
          upper_bound=1000,  # <-- set explicitly
      ),
  )
  ```

  For now, DB.sql(query, options=...) can accept ReadOptions to keep backward compatibility, but emits a deprecation warning. The support will be removed in v1.0.0.
- Split up the JDBCOptions class into FetchOptions and ExecuteOptions (#274). The new classes are used by the DB.fetch(query, options=...) and DB.execute(query, options=...) methods respectively. This is mostly to keep naming more consistent.

  This requires some code changes. Before:

  ```python
  from onetl.connection import Postgres

  postgres = Postgres(...)

  df = postgres.fetch(
      "SELECT * FROM some.mytable WHERE key = 'something'",
      options=Postgres.JDBCOptions(
          fetchsize=1000,
          query_timeout=30,
      ),
  )

  postgres.execute(
      "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
      options=Postgres.JDBCOptions(query_timeout=30),
  )
  ```

  After:

  ```python
  from onetl.connection import Postgres

  postgres = Postgres(...)

  # Using FetchOptions for fetching data
  df = postgres.fetch(
      "SELECT * FROM some.mytable WHERE key = 'something'",
      options=Postgres.FetchOptions(  # <-- change class name
          fetchsize=1000,
          query_timeout=30,
      ),
  )

  # Using ExecuteOptions for executing statements
  postgres.execute(
      "UPDATE some.mytable SET value = 'new' WHERE key = 'something'",
      options=Postgres.ExecuteOptions(query_timeout=30),  # <-- change class name
  )
  ```

  For now, DB.fetch(query, options=...) and DB.execute(query, options=...) can accept JDBCOptions to keep backward compatibility, but emit a deprecation warning. The old class will be removed in v1.0.0.
- Serialize ColumnDatetimeHWM to Clickhouse's DateTime64(6) (precision up to microseconds) instead of DateTime (precision up to seconds) (#267).

  In previous onETL versions, the ColumnDatetimeHWM value was rounded to the second, and thus some rows that had already been read in previous runs were read again, producing duplicates.

  For Clickhouse versions below 21.1, comparing a column of type DateTime with a value of type DateTime64 is not supported, returning an empty dataframe. To avoid this, replace:

  ```python
  DBReader(
      ...,
      hwm=DBReader.AutoDetectHWM(
          name="my_hwm",
          expression="hwm_column",  # <--
      ),
  )
  ```

  with:

  ```python
  DBReader(
      ...,
      hwm=DBReader.AutoDetectHWM(
          name="my_hwm",
          expression="CAST(hwm_column AS DateTime64)",  # <-- add explicit CAST
      ),
  )
  ```
- Pass JDBC connection extra params as a properties dict instead of a URL with a query part (#268). This allows passing custom connection parameters like Clickhouse(extra={"custom_http_options": "option1=value1,option2=value2"}) without the need to urlencode the parameter value, like option1%3Dvalue1%2Coption2%3Dvalue2.
Features
Improve user experience with Kafka messages and Database tables with serialized columns, like JSON/XML.
- Allow passing a custom package version as an argument to the DB.get_packages(...) method of several DB connectors:
  - Clickhouse.get_packages(package_version=..., apache_http_client_version=...) (#249).
  - MongoDB.get_packages(scala_version=..., spark_version=..., package_version=...) (#255).
  - MySQL.get_packages(package_version=...) (#253).
  - MSSQL.get_packages(java_version=..., package_version=...) (#254).
  - Oracle.get_packages(java_version=..., package_version=...) (#252).
  - Postgres.get_packages(package_version=...) (#251).
  - Teradata.get_packages(package_version=...) (#256).

  Now users can downgrade or upgrade a connection without waiting for the next onETL release. Previously only Kafka and Greenplum supported this feature. See the sketch below.
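  A minimal sketch of pinning a specific driver version (the version value is a placeholder, not a recommendation):

  ```python
  from pyspark.sql import SparkSession

  from onetl.connection import Postgres

  # pin a specific JDBC driver version instead of the onETL default (placeholder value)
  maven_packages = Postgres.get_packages(package_version="42.7.3")

  spark = (
      SparkSession.builder.appName("onetl_demo")
      .config("spark.jars.packages", ",".join(maven_packages))
      .getOrCreate()
  )
  ```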
- Add a FileFormat.parse_column(...) method to several classes:
  - Avro.parse_column(col) (#265).
  - JSON.parse_column(col, schema=...) (#257).
  - CSV.parse_column(col, schema=...) (#258).
  - XML.parse_column(col, schema=...) (#269).

  This allows parsing data in the value field of a Kafka message, or a string/binary column of some table, as a nested Spark structure. See the sketch below.
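  A minimal sketch of parsing a JSON-encoded value column, assuming df is a DataFrame read from Kafka; the schema and column names are placeholders, and the exact signature may differ between versions:

  ```python
  from pyspark.sql.types import IntegerType, StringType, StructField, StructType

  from onetl.file.format import JSON

  # expected structure of the serialized payload (placeholder schema)
  schema = StructType(
      [
          StructField("id", IntegerType()),
          StructField("name", StringType()),
      ]
  )

  # turn the string/binary "value" column into a nested struct column
  parsed_df = df.select(JSON().parse_column("value", schema=schema))
  ```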
- Add a FileFormat.serialize_column(...) method to several classes:
  - Avro.serialize_column(col) (#265).
  - JSON.serialize_column(col) (#257).
  - CSV.serialize_column(col) (#258).

  This allows saving Spark nested structures or arrays to the value field of a Kafka message, or a string/binary column of some table. See the sketch below.
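  And the reverse direction, a sketch of serializing a nested struct before writing to Kafka (column names are placeholders, and the exact signature may differ between versions):

  ```python
  from pyspark.sql import functions as F

  from onetl.file.format import JSON

  # pack nested columns into a struct and serialize it into a JSON string column named "value"
  serialized_df = df.select(
      JSON().serialize_column(F.struct("id", "name")).alias("value"),
  )
  ```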
Improvements
A few documentation improvements.
- Replace all assert statements in documentation with doctest syntax. This should make documentation more readable (#273).
- Add generic Troubleshooting guide (#275).
- Improve Kafka documentation:
  - Add "Prerequisites" page describing different aspects of connecting to Kafka.
  - Improve "Reading from" and "Writing to" pages of Kafka documentation, add more examples and usage notes.
  - Add "Troubleshooting" page. (#276)
- Improve Hive documentation:
  - Add "Prerequisites" page describing different aspects of connecting to Hive.
  - Improve "Reading from" and "Writing to" pages of Hive documentation, add more examples and recommendations.
  - Improve "Executing statements in Hive" page of Hive documentation. (#278)
- Add "Prerequisites" page describing different aspects of using SparkHDFS and SparkS3 connectors. (#279)
- Add note about connecting to Clickhouse cluster. (#280)
- Add notes about versions when specific class/method/attribute/argument was added, renamed or changed behavior (#282).
Bug Fixes
- Fix missing pysmb package after installing pip install onetl[files].