Data import functionality for ThemisDB.
Provides data import functionality for ThemisDB, supporting PostgreSQL (v2.1), MySQL/MariaDB, MongoDB, SQLite, Oracle, Apache Kafka, S3-compatible object storage, BigQuery, and CSV/TSV/Parquet flat files with schema mapping, batch import, incremental import, MDM entity deduplication, and advanced data quality capabilities.
In scope: Database dump import with schema mapping, batch import operations, incremental import via change tracking, streaming row callbacks, dry-run mode, MDM entity deduplication and golden-record selection, CDC logical replication, data quality assessment, adaptive batch optimization, CRDT-based parallel import, GUI import wizard, plugin API for third-party connectors.
Out of scope: Data transformation beyond schema mapping (handled by content module),
export functionality (handled by exporters module), CDC-based ongoing sync as a primary
interface (CDC module handles long-running replication; postgres_cdc.cpp here is for
import-time CDC bootstrapping).
- `postgres_importer.cpp`: PostgreSQL pg_dump source connector with schema mapping, FK preservation (v2.0+), relationship mapping, COPY-protocol support, checkpoint/resume
- `mysql_importer.cpp`: MySQL/MariaDB mysqldump source connector with parameterised queries
- `mongo_importer.cpp`: MongoDB mongoexport JSON/NDJSON source connector
- `sqlite_importer.cpp`: SQLite `.dump` source connector (type-affinity mapping)
- `oracle_importer.cpp`: Oracle ODBC importer with CLOB/BLOB streaming
- `kafka_importer.cpp`: Apache Kafka consumer (JSON, Avro, plaintext); requires `THEMIS_ENABLE_KAFKA`
- `s3_importer.cpp`: S3-compatible object-storage source; requires `THEMIS_ENABLE_S3`
- `flatfile_importer.cpp`: CSV/TSV/Parquet/NDJSON flat-file importer with BOM/encoding detection
- `conflict_resolver.cpp`: pluggable conflict resolution strategies (SKIP, OVERWRITE, MERGE, ERROR)
- `adaptive_import.cpp`: FK-topology-aware import ordering and runtime batch-size adaptation
- `mdm_engine.cpp`: probabilistic matching, deduplication, merge policy, golden-record selection
- `mdm_audit_trail.cpp`: immutable append-only MDM entity lifecycle event log
- `mdm_metrics.cpp`: Prometheus-compatible MDM metrics exporter
- `deterministic_matcher.cpp`: exact-key entity matching for MDM deduplication
- `entity_linker.cpp`: cross-source entity resolution and ID alignment
- `canonical_resolver.cpp`: MDM golden-record selection and merge arbitration
- `postgres_cdc.cpp`: PostgreSQL logical replication CDC (pgoutput); live stream requires `THEMIS_ENABLE_CDC`
- `crdt_importer.cpp`: CRDT-based import with LWW, set-union, and multi-value register strategies
- `data_quality.cpp`: NIST SP 800-188 six-dimension quality scoring
- `audit_trail.cpp`: SOX/HIPAA-compliant Merkle-chained immutable audit log
- `schema_inference.cpp`: implicit FK discovery, semantic type detection, cardinality estimation
- `schema_validator.cpp`: strict/lenient schema validation before apply
- `polyglot_mapper.cpp`: recommends the optimal data model per table
- `temporal_support.cpp`: SQL:2011 temporal detection and point-in-time query builder
- `blockchain_integrity.cpp`: SHA-256 Merkle tree with optional blockchain anchoring
- `federated_learning.cpp`: FedAvg + Gaussian ε-δ differential privacy (experimental)
- `graphql_federation.cpp`: Apollo Federation v2 SDL generation
- `column_importance.cpp`: Shannon entropy, Gini impurity, information gain column ranking
- `gui_import_wizard.cpp`: step-by-step GUI import wizard with source configuration and dry-run preview
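
The four conflict resolution strategies named for `conflict_resolver.cpp` can be illustrated with a minimal sketch. The types and the function `resolve_conflict` below are hypothetical and are not taken from the ThemisDB sources; in particular, MERGE is assumed here to mean "incoming values win column by column", which may differ from the actual merge policy.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical types for illustration only; not ThemisDB's actual API.
enum class ConflictStrategy { Skip, Overwrite, Merge, Error };

using Row = std::map<std::string, std::string>;  // column name -> value

// Decide what to store when an incoming row collides with an existing row
// that has the same key. Returns the row to write, or std::nullopt when the
// existing row should be left untouched.
std::optional<Row> resolve_conflict(ConflictStrategy strategy,
                                    const Row& existing,
                                    const Row& incoming) {
    switch (strategy) {
        case ConflictStrategy::Skip:
            return std::nullopt;          // keep the existing row as-is
        case ConflictStrategy::Overwrite:
            return incoming;              // replace the row wholesale
        case ConflictStrategy::Merge: {
            Row merged = existing;        // start from the existing columns
            for (const auto& [column, value] : incoming) {
                merged[column] = value;   // assumption: incoming wins per column
            }
            return merged;
        }
        case ConflictStrategy::Error:
            throw std::runtime_error("conflicting row for existing key");
    }
    throw std::logic_error("unknown conflict strategy");
}
```

With an existing row `{id: 1, name: old, city: Berlin}` and an incoming row `{id: 1, name: new}`, MERGE in this sketch keeps `city` and updates `name`, while OVERWRITE drops `city` entirely.
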
Maturity: 🟢 Production — PostgreSQL (v2.1), MySQL/MariaDB, MongoDB, SQLite, Oracle, Kafka, S3, and flat-file importers operational. MDM engine, GUI wizard, CDC interface, and advanced analytics capabilities (v2.1+) production-ready.
- PostgreSQL importer (v2.1): FK preservation, relationship mapping, extended constraints
- MySQL/MariaDB importer
- MongoDB importer
- SQLite importer
- Oracle importer
- Apache Kafka consumer importer (real-time streaming; requires `THEMIS_ENABLE_KAFKA`)
- S3-compatible object-storage source connector (requires `THEMIS_ENABLE_S3`)
- CSV/TSV/Parquet/NDJSON flat-file importer
- Conflict resolver (SKIP, OVERWRITE, MERGE, ERROR)
- Adaptive import optimizer (FK-topology ordering, dynamic batch sizing)
- MDM engine (probabilistic matching, deduplication, golden-record selection)
- MDM audit trail and metrics
- Deterministic and semantic entity matchers
- Entity linker and canonical resolver
- PostgreSQL CDC logical replication interface
- CRDT-based parallel import
- Data quality framework (NIST SP 800-188)
- SOX/HIPAA audit trail
- Schema inference engine
- Schema validator
- Polyglot persistence mapper
- Temporal database support
- Blockchain integrity verifier
- Federated learning coordinator (experimental)
- GraphQL federation SDL generator
- Column importance analyzer
- GUI import wizard
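
FK-topology ordering, as listed for the adaptive import optimizer, means importing referenced (parent) tables before the tables that point at them. One common way to compute such an order is Kahn's topological sort; the sketch below uses it as an assumption, since the source does not say which algorithm `adaptive_import.cpp` uses. The names `FkGraph` and `fk_import_order` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical input: table name -> set of tables it references via FK.
using FkGraph = std::map<std::string, std::set<std::string>>;

// Return an import order in which every referenced table precedes the tables
// that reference it (Kahn's algorithm). Throws if the FK graph has a cycle.
std::vector<std::string> fk_import_order(const FkGraph& references) {
    std::map<std::string, std::size_t> pending;                // unmet parents per table
    std::map<std::string, std::vector<std::string>> children;  // parent -> dependents

    for (const auto& [table, parents] : references) {
        pending.try_emplace(table, 0);
        for (const auto& parent : parents) {
            if (parent == table) continue;  // a self-reference does not constrain order
            pending.try_emplace(parent, 0);
            ++pending[table];
            children[parent].push_back(table);
        }
    }

    std::queue<std::string> ready;  // tables whose parents are all imported
    for (const auto& [table, count] : pending) {
        if (count == 0) ready.push(table);
    }

    std::vector<std::string> order;
    while (!ready.empty()) {
        std::string table = ready.front();
        ready.pop();
        order.push_back(table);
        for (const auto& child : children[table]) {
            if (--pending[child] == 0) ready.push(child);
        }
    }

    if (order.size() != pending.size()) {
        throw std::runtime_error("cyclic foreign-key dependency");
    }
    return order;
}
```

A cycle between tables cannot be ordered this way; a real importer would need a fallback such as deferring constraint checks, which this sketch does not attempt.
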
- Import data from PostgreSQL, MySQL/MariaDB, MongoDB, SQLite, Oracle, Kafka, S3, and flat files
- Real-time streaming ingestion from Apache Kafka topics (JSON, Avro, plaintext)
- Foreign key preservation and relationship mapping (PostgreSQL v2.0+)
- Schema mapping, type-affinity transformation, and inference
- Batch import operations with configurable chunk size and adaptive tuning
- Incremental import support (watermark-based change tracking, checkpoint/resume)
- CDC logical replication interface for PostgreSQL (`THEMIS_ENABLE_CDC`)
- MDM entity deduplication, golden-record selection, immutable audit trail
- CRDT-based conflict-free parallel import
- NIST SP 800-188 data quality assessment
- Dry-run mode (validate without writing data)
- Streaming row callback for real-time progress
- Include/exclude table filtering
- Permission-check callback (ACL enforcement)
- Plugin API for third-party importers (stable C-linkage ABI `THEMIS_IMPORTER_PLUGIN_V1`)
- GUI step-by-step import wizard with source configuration and dry-run preview
- Metrics and distributed-tracing observability hooks
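
Watermark-based incremental import and dry-run mode, both listed above, combine naturally: only rows newer than the last recorded watermark are selected, and a dry run counts them without writing or advancing the watermark. The following is a minimal sketch under assumptions. `SourceRow`, `ImportResult`, and `incremental_import` are hypothetical names, and the change marker is assumed to be a monotonically increasing integer such as a sequence number.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical row shape: each source row carries a monotonically
// increasing change marker (for example a sequence number).
struct SourceRow {
    std::int64_t change_marker;
    std::string payload;
};

struct ImportResult {
    std::int64_t new_watermark;  // watermark to persist after this run
    std::size_t rows_selected;   // rows newer than the previous watermark
    std::size_t rows_written;    // always 0 in dry-run mode
};

// Import only rows whose marker is greater than `watermark`. In dry-run mode
// rows are counted but `write` is never called and the watermark stays
// unchanged, so a later real run still picks the same rows up.
ImportResult incremental_import(const std::vector<SourceRow>& source,
                                std::int64_t watermark,
                                bool dry_run,
                                const std::function<void(const SourceRow&)>& write) {
    ImportResult result{watermark, 0, 0};
    std::int64_t highest = watermark;
    for (const auto& row : source) {
        if (row.change_marker <= watermark) continue;  // already imported earlier
        ++result.rows_selected;
        if (row.change_marker > highest) highest = row.change_marker;
        if (!dry_run) {
            write(row);
            ++result.rows_written;
        }
    }
    if (!dry_run) result.new_watermark = highest;
    return result;
}
```

The `write` callback stands in for the streaming row callback; persisting `new_watermark` between runs is what would make checkpoint/resume possible in this model.
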
For importer documentation, see:
- Architecture Guide
- Roadmap
- Changelog
- Security
- Audit Report
- PostgreSQL Importer
- Importers Runbook
- Importers Roadmap
- Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual Modeling for ETL Processes. Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (DOLAP), 14–21. https://doi.org/10.1145/583890.583893
- Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley. ISBN: 978-0-764-57923-5
- Stonebraker, M., Bruckner, D., Ilyas, I. F., Beschastnikh, I., Cherniack, M., & Xu, N. (2013). Data Curation at Scale: The Data Tamer System. Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR). https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper28.pdf
- Doan, A., Halevy, A., & Ives, Z. (2012). Principles of Data Integration. Morgan Kaufmann. ISBN: 978-0-124-16248-4
- Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-Free Replicated Data Types. Proceedings of the 13th International Conference on Stabilization, Safety, and Security of Distributed Systems (SSS), 386–400. https://doi.org/10.1007/978-3-642-24550-3_29