Merge pull request #106 from microsoft/v3
v2.0.8
bluewatersql authored Jan 10, 2025
2 parents c221433 + a9b9b91 commit 4d504b5
Showing 15 changed files with 426 additions and 121 deletions.
21 changes: 21 additions & 0 deletions Docs/DataExpiration.md
@@ -29,3 +29,24 @@ The sync happens in two steps, when enabled.
2. After the schedule load completes successfully, expiration policies are enforced if required.
- Tables that have expired are dropped from the Fabric Lakehouse. A dropped table requires no further maintenance consideration since the table and its underlying files are deleted.
- Partitions that have expired are deleted from the table. When partitions are deleted, table maintenance should be considered. The partition is removed from the table but OneLake storage is not recovered until a <code>VACUUM</code> is performed after the data falls outside the retention window. For more information, see the Data Maintenance docs.

### Example Configuration

Data expiration is enabled globally, and enforcement is turned on by default for all tables through <code>table_defaults</code>. Table 1 inherits the default. Table 2 overrides the default behavior and disables data expiration.
```
{
    "enable_data_expiration": true,
    "table_defaults": {
        "enforce_expiration": true
    },
    "tables": [
        {
            "table_name": "<<TABLE 1>>"
        },
        {
            "table_name": "<<TABLE 2>>",
            "enforce_expiration": false
        }
    ]
}
```
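
To make the override behavior concrete, here is a minimal sketch of how a per-table setting might fall back to the <code>table_defaults</code> value. The function and lookup logic are illustrative assumptions for this document, not the accelerator's actual configuration model.

```
# Illustrative sketch only: resolves the effective expiration setting for a table,
# using the merge behavior described above (table setting overrides table_defaults).
def effective_expiration(config: dict, table_name: str) -> bool:
    if not config.get("enable_data_expiration", False):
        return False

    default = config.get("table_defaults", {}).get("enforce_expiration", False)
    table = next((t for t in config.get("tables", []) if t.get("table_name") == table_name), {})
    return table.get("enforce_expiration", default)

# With the example configuration above, "<<TABLE 1>>" resolves to True (inherited default)
# and "<<TABLE 2>>" resolves to False (explicit override).
```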
98 changes: 87 additions & 11 deletions Docs/DataMaintenance.md
@@ -20,30 +20,33 @@ To address these scenarios, Delta has built-in capabilities to <code>OPTIMIZE</code>
### Data Maintenance for Fabric Lakehouse
There are a number of options and approaches for manual or semi-automatic data maintenance at the table level within Fabric. The Fabric Sync accelerator provides two built-in options for handling data maintenance as part of your BigQuery sync process.

1. Time-based Maintenance
1. Schedule Maintenance

Time-based maintenance is a simple schedule-driven maintenance where OPTIMIZE and VACUUM is run for your table at pre-defined intervals. The schedule-based intervals available are:
Scheduled maintenance is simple time-based maintenance where OPTIMIZE and VACUUM are run for your table at pre-defined intervals. The schedule-based intervals available are:

- AUTO - Every schedule run
- DAY - 24 hours
- WEEK - 7 days
- MONTH - 30 days
- QUARTER - 90 days
- YEAR - 365 days

The intervals are time-period based and maintenance is calculated from the last time maitenance was performed. For example, if maintenance was last run on January 1st @ 8:00 AM then with a DAY scheduled it would be eligible to run again January 2nd @ 8:00am. If maintenance has never been run for a table, it will run immediately in the next maintenance window.
The intervals are time-period based, and maintenance eligibility is calculated from the last time maintenance was performed. For example, if maintenance was last run on January 1st @ 8:00 AM, then with a DAY schedule it would be eligible to run again on January 2nd @ 8:00 AM. If maintenance has never been run for a table, it will run immediately in the next maintenance window.
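
As a rough illustration of this eligibility calculation, the sketch below maps each interval to an elapsed-time requirement. The mapping and function are assumptions for illustration only, not the accelerator's implementation.

```
from datetime import datetime, timedelta
from typing import Optional

# Illustrative mapping of schedule intervals to the required elapsed time.
INTERVAL_HOURS = {
    "AUTO": 0,       # eligible on every schedule run
    "DAY": 24,
    "WEEK": 24 * 7,
    "MONTH": 24 * 30,
    "QUARTER": 24 * 90,
    "YEAR": 24 * 365,
}

def is_maintenance_due(last_run: Optional[datetime], interval: str, now: datetime) -> bool:
    """Return True when enough time has elapsed since the last maintenance run."""
    if last_run is None:
        return True  # never maintained: run in the next maintenance window
    return now >= last_run + timedelta(hours=INTERVAL_HOURS[interval])

# Last run January 1st @ 8:00 AM with a DAY interval is eligible again January 2nd @ 8:00 AM.
print(is_maintenance_due(datetime(2025, 1, 1, 8), "DAY", datetime(2025, 1, 2, 8)))  # True
```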

2. Intelligent Maintenance

Intelligent maintenance utilizes the Delta log and configurable threshholds to determine if maintenance should be run for any given table. This adaptive process looks at data growth, number of files, file size overall table size including out-of-scope files to be more selective for maintenance operations.
Intelligent maintenance utilizes an inventory of the Delta log and configurable thresholds to determine if maintenance should be run for any given table.

Intelligent maintenance and can specify either an <code>OPTIMIZE</code>, a <code>VACUUM</code> or both as required. Thresholds that influence or trigger this smart maintenance process are:
This adaptive process looks at data growth, number of files, overall file size, and table size including out-of-scope files to be more selective about maintenance operations.

Intelligent maintenance can specify either <code>OPTIMIZE</code>, <code>VACUUM</code> or both when configured thresholds are exceeded. Thresholds that influence or trigger this smart maintenance process are:

- <code>rows_changed</code> - ratio of rows that changed (inserted, updated or delete) versus the total table rows
- <code>table_size_growth</code> - percentage growth in overall table size
- <code>file_fragmentation</code> - ratio of files that are not optimally sized
- <code>out_of_scope_size</code>- ratio of out of scope data to total table size
- <code>rows_changed</code> - ratio of rows that changed (inserted, updated or deleted) versus the total table rows. This triggers an <code>OPTIMIZE</code>.
- <code>table_size_growth</code> - percentage growth in overall table size. This triggers an <code>OPTIMIZE</code>.
- <code>file_fragmentation</code> - ratio of files that are not optimally sized. This triggers an <code>OPTIMIZE</code>.
- <code>out_of_scope_size</code> - ratio of out-of-scope data to total table size. This triggers a <code>VACUUM</code>.

Intelligent maintenance requires a storage inventory process to run and collect the Delta metadata for each table in you mirrored Lakehouse. This process runs as part of the larger Fabric Sync Data Maintenance process.
Intelligent maintenance runs a storage inventory process against the Delta log to collect the Delta metadata for each table in your mirrored Lakehouse. This process runs as part of the Fabric Sync Data Maintenance process.

<mark><b><u>Note:</u></b> The storage inventory data is collected in your Fabric Sync Metadata Lakehouse and can be used for further analysis of your OneLake storage usage.</mark>
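
For illustration, the decision logic might look roughly like the sketch below. The metric fields and the way thresholds map to operations follow the descriptions above; everything else (names, structure) is an assumption, not the accelerator's code.

```
from dataclasses import dataclass

@dataclass
class TableInventory:
    rows_changed_ratio: float     # changed rows / total table rows
    size_growth_ratio: float      # growth in overall table size
    fragmentation_ratio: float    # files that are not optimally sized / total files
    out_of_scope_ratio: float     # out-of-scope data size / total table size

def plan_maintenance(inv: TableInventory, thresholds: dict) -> list:
    """Illustrative only: decide which maintenance operations a table needs."""
    ops = []
    if (inv.rows_changed_ratio >= thresholds["rows_changed"]
            or inv.size_growth_ratio >= thresholds["table_size_growth"]
            or inv.fragmentation_ratio >= thresholds["file_fragmentation"]):
        ops.append("OPTIMIZE")
    if inv.out_of_scope_ratio >= thresholds["out_of_scope_size"]:
        ops.append("VACUUM")
    return ops

# Using the 50% / 3x thresholds from the Intelligent Maintenance example below:
thresholds = {"rows_changed": 0.5, "table_size_growth": 0.5,
              "file_fragmentation": 0.5, "out_of_scope_size": 3.0}
print(plan_maintenance(TableInventory(0.6, 0.1, 0.2, 0.5), thresholds))  # ['OPTIMIZE']
```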

@@ -82,4 +85,77 @@ The Fabric Sync accelerator uses a default <code>retention_hours</code> of <code>

If you want to leverage Delta Time Travel capabilities or need support for versioning and rollbacks, it is recommended that you adjust the <code>retention_hours</code> setting to <code>168</code> to retain 7 days of history.

<mark><b><u>Note: </u></b> Defining a large retention period can significantly degrade overall performance for large tables. It could also substantially increase the overall OneLake storage cost.</mark>
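
For context, <code>retention_hours</code> corresponds to the retention window Delta uses when vacuuming. A minimal manual equivalent from a Fabric notebook is sketched below; the table name is a placeholder, and how the accelerator issues the command internally may differ.

```
# Illustrative only: manually vacuuming a Lakehouse table with a 7-day (168-hour) window.
# "<<LAKEHOUSE_TABLE>>" is a placeholder; `spark` is the SparkSession available in Fabric notebooks.
spark.sql("VACUUM <<LAKEHOUSE_TABLE>> RETAIN 168 HOURS")

# Retaining fewer than the Delta default of 168 hours requires relaxing the safety check first:
# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
```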

### Example Configurations
#### Scheduled Maintenance

Scheduled maintenance with a WEEK default interval and a history retention default of 0 hours. Table 1 uses the default configuration. Table 2 overrides the default to a MONTH interval. Table 3 disables maintenance.

```
{
    "maintenance": {
        "enabled": true,
        "interval": "WEEK",
        "strategy": "SCHEDULED",
        "retention_hours": 0
    },
    "tables": [
        {
            "table_name": "<<TABLE 1>>"
        },
        {
            "table_name": "<<TABLE 2>>",
            "table_maintenance": {
                "enabled": true,
                "interval": "MONTH"
            }
        },
        {
            "table_name": "<<TABLE 3>>",
            "table_maintenance": {
                "enabled": false
            }
        }
    ]
}
```
#### Intelligent Maintenance

Intelligent maintenance with history retention set to 7 days (168 hours). The thresholds allow changed rows, table size growth, and file fragmentation to reach 50% before <code>OPTIMIZE</code> is triggered. <code>VACUUM</code> is triggered when the out-of-scope file size is 3x the table size.

In this example, Table 2 ensures regular monthly maintenance is performed even if the thresholds are never met.
```
{
    "maintenance": {
        "enabled": true,
        "interval": "AUTO",
        "strategy": "INTELLIGENT",
        "retention_hours": 168,
        "thresholds": {
            "rows_changed": 0.5,
            "table_size_growth": 0.5,
            "file_fragmentation": 0.5,
            "out_of_scope_size": 3.0
        }
    },
    "table_defaults": {
        "table_maintenance": {
            "enabled": true,
            "interval": "AUTO"
        }
    },
    "tables": [
        {
            "table_name": "<<TABLE 1>>"
        },
        {
            "table_name": "<<TABLE 2>>",
            "table_maintenance": {
                "enabled": true,
                "interval": "MONTH"
            }
        }
    ]
}
```
6 changes: 3 additions & 3 deletions Notebooks/v2.0.0/BQ-Sync-Maintenance.ipynb
@@ -82,7 +82,7 @@
},
"outputs": [],
"source": [
"%pip install FabricSync --quiet --disable-pip-version-check"
"%pip install FabricSync=='<<<VERSION>>>' --quiet --disable-pip-version-check"
]
},
{
@@ -109,7 +109,7 @@
},
"outputs": [],
"source": [
"config_json_path = \"<<<FILE API PATH TO FABRIC SYNC CONFIG FILE>>>\""
"config_json_path = \"<<<PATH_TO_USER_CONFIG>>>\""
]
},
{
@@ -133,7 +133,7 @@
},
"outputs": [],
"source": [
"from FabricSync.BQ.Sync import *\n",
"from FabricSync.BQ.Maintenance import *\n",
"\n",
"token = mssparkutils.credentials.getToken(\"https://api.fabric.microsoft.com/\")\n",
"fabric_maintenance = FabricSyncMaintenance(spark, config_json_path, token)\n",
4 changes: 1 addition & 3 deletions Notebooks/v2.0.0/BQ-Sync.ipynb
@@ -56,9 +56,7 @@
"%%configure -f \n",
"{\n",
" \"defaultLakehouse\": {\n",
" \"name\": \"<<<METADATA_LAKEHOUSE_NAME>>>\",\n",
" \"id\": \"<<<METADATA_LAKEHOUSE_ID>>>\",\n",
" \"workspaceId\": \"<<<FABRIC_WORKSPACE_ID>>>\"\n",
" \"name\": \"<<<METADATA_LAKEHOUSE_NAME>>>\"\n",
" },\n",
" \"conf\": {\n",
" \"spark.jars\": \"<<<PATH_SPARK_BQ_JAR>>>\"\n",
14 changes: 7 additions & 7 deletions Packages/FabricSync/FabricSync.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Metadata-Version: 2.2
Name: FabricSync
Version: 2.0.7
Version: 2.0.8
Summary: Fabric BigQuery Data Sync Utility
Author-email: Microsoft GBBs North America <chriprice@microsoft.com>
License: MIT License
@@ -23,11 +23,11 @@ Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic==2.10.4
Requires-Dist: PyGithub==2.5.0
Requires-Dist: mlflow-skinny==2.19.0
Requires-Dist: google-cloud-bigquery==3.27.0
Requires-Dist: google-cloud-bigquery[pandas]==3.27.0
Requires-Dist: pydantic
Requires-Dist: PyGithub
Requires-Dist: mlflow-skinny
Requires-Dist: google-cloud-bigquery
Requires-Dist: db-dtypes

# Fabric BQ (BigQuery) Sync

10 changes: 5 additions & 5 deletions Packages/FabricSync/FabricSync.egg-info/requires.txt
@@ -1,5 +1,5 @@
pydantic==2.10.4
PyGithub==2.5.0
mlflow-skinny==2.19.0
google-cloud-bigquery==3.27.0
google-cloud-bigquery[pandas]==3.27.0
pydantic
PyGithub
mlflow-skinny
google-cloud-bigquery
db-dtypes
4 changes: 2 additions & 2 deletions Packages/FabricSync/FabricSync/BQ/Logging.py
@@ -10,7 +10,7 @@
from .Enum import *
from .Constants import SyncConstants
from .Utils import *
from ..Meta import Version as SyncVersion
from ..Meta import *

class SyncLogger:
def __init__(self, context:SparkSession):
@@ -62,7 +62,7 @@ def telemetry(self, message, *args, **kwargs):
if (self.logger.isEnabledFor(SyncLogLevel.TELEMETRY.value)):
if self.Telemetry:
message["correlation_id"] = self.ApplicationID
message["sync_version"] = SyncVersion.CurrentVersion
message["sync_version"] = Version.CurrentVersion

self.send_telemetry(json.dumps(message))
#self.logger._log(SyncLogLevel.SYNC_STATUS.value, f"Telemetry: {message}", args, **kwargs)