diff --git a/docs/tutorials.md b/docs/tutorials.md
index 601724f5..9920e44c 100644
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -27,3 +27,4 @@ Explore synthetic data tutorials with the option to run them **either in Google
| Enrich Sensitive Data with LLMs using Synthetic Replicas | [](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-enrich/synthetic-enrich.ipynb) | [View Notebook](./tutorials/synthetic-enrich/synthetic-enrich.ipynb) |
| MOSTLY AI vs. SDV comparison: single-table scenario | [](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) |
| MOSTLY AI vs. SDV comparison: sequential scenario | [](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) |
+| MOSTLY AI vs. SDV comparison: non-context foreign keys | [](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/foreign-key/foreign-key.ipynb) | [View Notebook](./tutorials/sdv-comparison/foreign-key/foreign-key.ipynb) |
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_organizations.parquet b/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_organizations.parquet
new file mode 100644
index 00000000..df9d8ffa
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_organizations.parquet differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_relations.parquet b/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_relations.parquet
new file mode 100644
index 00000000..04221a19
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/mostly/mostlyai_relations.parquet differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_organizations.parquet b/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_organizations.parquet
new file mode 100644
index 00000000..941f68bf
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_organizations.parquet differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_relations.parquet b/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_relations.parquet
new file mode 100644
index 00000000..7d6d25cd
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/sdv/sdv_relations.parquet differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/organizations.csv.gz b/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/organizations.csv.gz
new file mode 100644
index 00000000..836d7936
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/organizations.csv.gz differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/relations.csv.gz b/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/relations.csv.gz
new file mode 100644
index 00000000..f67b1018
Binary files /dev/null and b/docs/tutorials/sdv-comparison/foreign-key/data/subject-data/relations.csv.gz differ
diff --git a/docs/tutorials/sdv-comparison/foreign-key/foreign-key.ipynb b/docs/tutorials/sdv-comparison/foreign-key/foreign-key.ipynb
new file mode 100644
index 00000000..6aa91dfd
--- /dev/null
+++ b/docs/tutorials/sdv-comparison/foreign-key/foreign-key.ipynb
@@ -0,0 +1,729 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "24b43283",
+ "metadata": {},
+ "source": [
+ "# Non-Context Foreign Keys with MOSTLY AI & SDV\n",
+ "\n",
+ "A column in one table, Table A, which references a column in another table, Table B, is called a foreign key. In most Synthetic Data generator engines, when you have more than one foreign key in a single table, the foreign key whose parent contains the other foreign keys also included in your table, this foreign key is called the Context Foreign Key.\n",
+ "\n",
+ "In this notebook we compare two synthetic data generation engines, The Synthetic Data Vault (SDV) and the Synthetic Data SDK from MOSTLY AI to demonstrate how each of the two platforms handles non-context foreign keys when generating Synthetic Data."
+ ]
+ },
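+ {
+ "cell_type": "markdown",
+ "id": "a7f3c2d1",
+ "metadata": {},
+ "source": [
+ "As a minimal, made-up sketch of the schema used throughout this notebook: `relations` carries two foreign keys, `START_ID` and `END_ID`, which both reference `organizations.ID`, and only one of them can serve as the context foreign key. The column names below match the GLEIF data loaded later; the sample values are invented purely for illustration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b8e4d3f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Toy schema sketch (values are invented): `relations` has two foreign keys,\n",
+ "# START_ID and END_ID, and both reference organizations.ID.\n",
+ "organizations_demo = pd.DataFrame({\"ID\": [\"ORG-1\", \"ORG-2\", \"ORG-3\"]})\n",
+ "relations_demo = pd.DataFrame(\n",
+ " {\n",
+ " \"START_ID\": [\"ORG-1\", \"ORG-2\"], # context foreign key\n",
+ " \"END_ID\": [\"ORG-3\", \"ORG-1\"], # non-context foreign key\n",
+ " }\n",
+ ")\n",
+ "\n",
+ "# Both key columns must resolve to a row in `organizations`:\n",
+ "print(relations_demo[\"START_ID\"].isin(organizations_demo[\"ID\"]).all())\n",
+ "print(relations_demo[\"END_ID\"].isin(organizations_demo[\"ID\"]).all())"
+ ]
+ },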
+ {
+ "cell_type": "markdown",
+ "id": "b2c59d61",
+ "metadata": {},
+ "source": [
+ "## Contents\n",
+ "\n",
+ "1. [Set up Environment](#set-up)\n",
+ " - [Install SDV](#install-sdv)\n",
+ " - [Install MOSTLY AI](#install-mostly-ai)\n",
+ "2. [Data Preparation](#data-preparation)\n",
+ " - [Download Data](#download-data)\n",
+ " - [Save Data in Environment Memory](#save-data-in-environment-memory)\n",
+ "3. [SDV Implementation](#sdv-implementation)\n",
+ " - [SDV Configuration](#sdv-configuration)\n",
+ " - [SDV Model Training](#sdv-model-training)\n",
+ " - [SDV Synthetic Data Generation](#sdv-synthetic-data-generation)\n",
+ " - [SDV Synthetic Data Preview](#sdv-synthetic-data-preview)\n",
+ " - [Save SDV Synthetic Data](#save-sdv-synthetic-data)\n",
+ " 4. [MOSTLY AI Implementation](#mostly-ai-implementation)\n",
+ " - [MOSTLY AI Configuration](#mostly-ai-configuration)\n",
+ " - [MOSTLY AI Generator Training](#mostly-ai-generator-training)\n",
+ " - [MOSTLY AI Synthetic Data Generation](#mostly-ai-synthetic-data-generation)\n",
+ " - [MOSTLY AI Synthetic Data Preview](#mostly-ai-synthetic-data-preview)\n",
+ " - [Save MOSTLY AI Synthetic Data](#save-mostly-ai-synthetic-data)\n",
+ " 5. [MOSTLY AI Synthetic Data Quality Assurance](#mostly-ai-synthetic-data-quality-assurance)\n",
+ " - [Instantiate the MOSTLY AI Synthetic Data QA Library](#instantiate-the-mostly-ai-synthetic-data-qa-library)\n",
+ " - [SDV Synthetic Data Quality](#sdv-synthetic-data-quality)\n",
+ " - [SDV - START_ID](#sdv---start_id)\n",
+ " - [SDV - END_ID](#sdv---end_id)\n",
+ " - [MOSTLY AI Synthetic Data Quality](#mostly-ai-synthetic-data-quality)\n",
+ " - [MOSTLY AI - START_ID](#mostly-ai---start_id)\n",
+ " - [MOSTLY AI - END_ID](#mostly-ai---end_id)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c540fec3",
+ "metadata": {},
+ "source": [
+ "## Set up Environment"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0dbf579",
+ "metadata": {},
+ "source": [
+ "### Install SDV"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fb377980",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install The Synthetic Data Vault\n",
+ "%pip install sdv==1.24.0 -qqq"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "366d4c94",
+ "metadata": {},
+ "source": [
+ "### Install MOSTLY AI"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "de26b9ea",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install the Synthetic Data SDK from MOSTLY AI\n",
+ "%pip install -U \"mostlyai[local]\" -qqq"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97d3dd98",
+ "metadata": {},
+ "source": [
+ "## Data Preparation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c70eb207",
+ "metadata": {},
+ "source": [
+ "### Download Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d52aca18",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "BASE = \"https://raw.githubusercontent.com/mostly-ai/public-demo-data/dev/gleif/\"\n",
+ "URL_ORGS = BASE + \"organizations.csv.gz\"\n",
+ "URL_RELS = BASE + \"relations.csv.gz\"\n",
+ "\n",
+ "organizations = pd.read_csv(URL_ORGS, compression=\"infer\", low_memory=False)\n",
+ "relations = pd.read_csv(URL_RELS, compression=\"infer\", low_memory=False)\n",
+ "\n",
+ "\n",
+ "def inspect_df(df, name):\n",
+ " \"\"\"\n",
+ " Comprehensive data inspection function to understand:\n",
+ " - Dataset dimensions and structure\n",
+ " - Column names and data types\n",
+ " - Sample data for manual review\n",
+ " \"\"\"\n",
+ " print(f\"--- {name} ---\")\n",
+ " print(f\"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns\")\n",
+ " print(\"Columns:\", df.columns.tolist())\n",
+ " print(\"Dtypes:\", df.dtypes)\n",
+ "\n",
+ "\n",
+ "data = {\"organizations\": organizations, \"relations\": relations}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1b60bdd",
+ "metadata": {},
+ "source": [
+ "### Save Data in Environment Memory"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "78601bc8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# Ensure the target directory exists\n",
+ "os.makedirs(\"./data/subject-data\", exist_ok=True)\n",
+ "\n",
+ "# Save CSV files\n",
+ "organizations.to_csv(\"./data/subject-data/organizations.csv.gz\", index=False, compression=\"gzip\")\n",
+ "relations.to_csv(\"./data/subject-data/relations.csv.gz\", index=False, compression=\"gzip\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c40ea8f",
+ "metadata": {},
+ "source": [
+ "## SDV Implementation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "841b5551",
+ "metadata": {},
+ "source": [
+ "### SDV Configuration\n",
+ "\n",
+ "As noted by the SDV team, SDV supports multiple foreign keys by invoking the `metadata` method multiple times."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6fda2fa9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sdv.metadata import Metadata\n",
+ "\n",
+ "metadata = Metadata.detect_from_dataframes(data, infer_keys=\"primary_and_foreign\")\n",
+ "\n",
+ "metadata.add_relationship(\n",
+ " parent_table_name=\"organizations\",\n",
+ " parent_primary_key=\"ID\",\n",
+ " child_table_name=\"relations\",\n",
+ " child_foreign_key=\"START_ID\",\n",
+ ")\n",
+ "\n",
+ "metadata.add_relationship(\n",
+ " parent_table_name=\"organizations\", parent_primary_key=\"ID\", child_table_name=\"relations\", child_foreign_key=\"END_ID\"\n",
+ ")\n",
+ "\n",
+ "metadata.validate()\n",
+ "metadata.validate_data(data)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "43166ac9",
+ "metadata": {},
+ "source": [
+ "### SDV Model Training\n",
+ "\n",
+ "An interesting comparison beyond simply the validity of the generated synthetic data is the time required to train a model to create it.\n",
+ "\n",
+ "We'll use the `time` library to compare performance of the two tools against the full dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "603006de",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "from sdv.multi_table import HMASynthesizer\n",
+ "\n",
+ "synthesizer = HMASynthesizer(metadata)\n",
+ "\n",
+ "start = time.time()\n",
+ "synthesizer.fit(data)\n",
+ "end = time.time()\n",
+ "\n",
+ "print(\"Fitting time:\", round(end - start, 2), \"seconds\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bdddc688",
+ "metadata": {},
+ "source": [
+ "### SDV Synthetic Data Generation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "753a0450",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "start = time.time()\n",
+ "synthetic_data = synthesizer.sample(scale=0.10)\n",
+ "end = time.time()\n",
+ "\n",
+ "print(\"Sampling time:\", round(end - start, 2), \"seconds\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0eb3e91",
+ "metadata": {},
+ "source": [
+ "#### SDV Synthetic Data Preview"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "87ab0e40",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "synthetic_data[\"organizations\"].head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ac1d688c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "synthetic_data[\"relations\"].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d1e0a637",
+ "metadata": {},
+ "source": [
+ "### Save SDV Synthetic Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bbe1cd51",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "# Create target directory\n",
+ "os.makedirs(\"./data/sdv\", exist_ok=True)\n",
+ "\n",
+ "# Define file paths\n",
+ "orgs_output_file = \"./data/sdv/sdv_organizations.parquet\"\n",
+ "rels_output_file = \"./data/sdv/sdv_relations.parquet\"\n",
+ "\n",
+ "# Save tables\n",
+ "synthetic_data[\"organizations\"].to_parquet(orgs_output_file, index=False)\n",
+ "synthetic_data[\"relations\"].to_parquet(rels_output_file, index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "392c5c21",
+ "metadata": {},
+ "source": [
+ "## MOSTLY AI Implementation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "097239e7",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI Configuration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f631da1d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from mostlyai.sdk import MostlyAI\n",
+ "\n",
+ "mostly = MostlyAI(local=True)\n",
+ "\n",
+ "config = {\n",
+ " \"name\": \"GLEIF Organizations & Relations Generator\",\n",
+ " \"tables\": [\n",
+ " {\n",
+ " \"name\": \"organizations\",\n",
+ " \"data\": organizations,\n",
+ " \"primary_key\": \"ID\",\n",
+ " \"tabular_model_configuration\": {\"enable_model_report\": False},\n",
+ " },\n",
+ " {\n",
+ " \"name\": \"relations\",\n",
+ " \"data\": relations,\n",
+ " \"foreign_keys\": [\n",
+ " {\"column\": \"START_ID\", \"referenced_table\": \"organizations\", \"is_context\": True},\n",
+ " {\"column\": \"END_ID\", \"referenced_table\": \"organizations\", \"is_context\": False},\n",
+ " ],\n",
+ " \"tabular_model_configuration\": {\"enable_model_report\": False},\n",
+ " },\n",
+ " ],\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c910d96a",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI Generator Training"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9278958b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Launch MOSTLY AI generator training job\n",
+ "start_time = time.time()\n",
+ "g = mostly.train(config=config, start=True, wait=True)\n",
+ "end_time = time.time()\n",
+ "\n",
+ "# Measure and print elapsed time for generator training\n",
+ "elapsed = end_time - start_time\n",
+ "print(f\"Training completed in {elapsed:.2f} seconds ({elapsed / 60:.2f} minutes).\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "20d47f96",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI Synthetic Data Generation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c73fbb8a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Synthetic data generation\n",
+ "start_time = time.time()\n",
+ "sd = mostly.generate(g, size=int(0.10 * len(organizations)))\n",
+ "mostlyai_synthetic_data = sd.data()\n",
+ "end_time = time.time()\n",
+ "\n",
+ "# Measure and print elapsed time for data generation\n",
+ "elapsed = end_time - start_time\n",
+ "print(f\"Generation completed in {elapsed:.2f} seconds ({elapsed / 60:.2f} minutes).\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bff97b4e",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI Synthetic Data Preview"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9d230336",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mostlyai_synthetic_data[\"organizations\"].head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4215bf4d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mostlyai_synthetic_data[\"relations\"].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5d2467ca",
+ "metadata": {},
+ "source": [
+ "### Save MOSTLY AI Synthetic Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e10733c7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "os.makedirs(\"./data/mostly\", exist_ok=True)\n",
+ "\n",
+ "orgs_train_output_file = \"./data/mostly/mostlyai_organizations.parquet\"\n",
+ "rels_train_output_file = \"./data/mostly/mostlyai_relations.parquet\"\n",
+ "mostlyai_synthetic_data[\"organizations\"].to_parquet(orgs_train_output_file, index=False)\n",
+ "mostlyai_synthetic_data[\"relations\"].to_parquet(rels_train_output_file, index=False)\n",
+ "print(f\"💾 MOSTLY AI synthetic data saved to: {orgs_train_output_file} and {rels_train_output_file}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0b9cc634",
+ "metadata": {},
+ "source": [
+ "## MOSTLY AI Synthetic Data Quality Assurance\n",
+ "\n",
+ "As the SDV team has already demonstrated that the generated synthetic data maintains referential integrity, we'll dive deeper and explore the quality of the generated data. If you are interesting in seeing the referential integrity of the generated datasets, please refer to the [Confirming Referential Integrity](#confirming-referential-integity) where we'll use [SDMetrics](https://docs.sdv.dev/sdmetrics) to confirm the referential intrigty of all generated data.\n",
+ "\n",
+ "And while referential integrity is, of course, an important piece of the puzzle when generating synthetic data, one of the key advantages of synthetic data (as compared to [homomorphic encryption](https://en.wikipedia.org/wiki/Homomorphic_encryption#:~:text=Homomorphic%20encryption%20is%20a%20form%20of%20encryption%20with%20an%20additional,extension%20of%20public%2Dkey%20cryptography.), for example) is its ability to not just maintain privacy protections but also resemble the subject data-- not just to a machine, but indeed, to a human as well.\n",
+ "\n",
+ "We'll see that while the data generated by SDV indeed maintained referential integrity, it failed to maintain observable features of the subject dataset that are essential to creating realistic synthetic data.\n"
+ ]
+ },
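+ {
+ "cell_type": "markdown",
+ "id": "c9f5e4a3",
+ "metadata": {},
+ "source": [
+ "Before looking at statistical quality, here is the quick pandas-based referential-integrity check mentioned above. It is a lightweight stand-in for a full SDMetrics diagnostic and assumes the `synthetic_data` and `mostlyai_synthetic_data` dictionaries from the earlier cells are still in memory: every `START_ID` and `END_ID` in each synthetic `relations` table should resolve to an `ID` in the corresponding synthetic `organizations` table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d0a6f5b4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Simple referential-integrity check on both synthetic datasets:\n",
+ "# every foreign-key value must appear among the synthetic organization IDs.\n",
+ "for label, tables in [(\"SDV\", synthetic_data), (\"MOSTLY AI\", mostlyai_synthetic_data)]:\n",
+ " org_ids = set(tables[\"organizations\"][\"ID\"])\n",
+ " for fk in [\"START_ID\", \"END_ID\"]:\n",
+ "  ok = tables[\"relations\"][fk].isin(org_ids).all()\n",
+ "  print(f\"{label}: all {fk} values reference a synthetic organization -> {ok}\")"
+ ]
+ },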
+ {
+ "cell_type": "markdown",
+ "id": "90725150",
+ "metadata": {},
+ "source": [
+ "### Initialize MOSTLY AI Synthetic Data QA Library"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "603bd5b8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from mostlyai import qa\n",
+ "\n",
+ "qa.init_logging()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d616e5c4",
+ "metadata": {},
+ "source": [
+ "### SDV Synthetic Data Quality"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "698bd031",
+ "metadata": {},
+ "source": [
+ "### SDV - START_ID"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c6179994",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sdv_relations = pd.read_parquet(\"./data/sdv/sdv_relations.parquet\")\n",
+ "sdv_organizations = pd.read_parquet(\"./data/sdv/sdv_organizations.parquet\")\n",
+ "\n",
+ "id_columns_to_exclude = [\"ID\", \"END_ID\"]\n",
+ "\n",
+ "\n",
+ "def remove_id_columns(df, columns_to_remove):\n",
+ " \"\"\"Remove specified columns if they exist in the dataframe\"\"\"\n",
+ " return df.drop(columns=[col for col in columns_to_remove if col in df.columns])\n",
+ "\n",
+ "\n",
+ "sdv_relations = remove_id_columns(sdv_relations, id_columns_to_exclude)\n",
+ "rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)\n",
+ "\n",
+ "report_path, metrics = qa.report(\n",
+ " syn_tgt_data=sdv_relations,\n",
+ " trn_tgt_data=rels_train_qa,\n",
+ " syn_ctx_data=sdv_organizations,\n",
+ " trn_ctx_data=organizations,\n",
+ " ctx_primary_key=\"ID\",\n",
+ " tgt_context_key=\"START_ID\",\n",
+ " max_sample_size_embeddings=10_000,\n",
+ " report_path=\"sdv_relations_qa_report_start_id.html\",\n",
+ ")\n",
+ "\n",
+ "print(f\"SDV Relations Quality Report saved to: {report_path}\")\n",
+ "print(\"\\nSDV Relations Quality Metrics:\")\n",
+ "print(metrics.model_dump_json(indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb11da28",
+ "metadata": {},
+ "source": [
+ "### SDV - END_ID"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f20a0448",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sdv_relations = pd.read_parquet(\"./data/sdv/sdv_relations.parquet\")\n",
+ "sdv_organizations = pd.read_parquet(\"./data/sdv/sdv_organizations.parquet\")\n",
+ "\n",
+ "id_columns_to_exclude = [\"ID\", \"START_ID\"]\n",
+ "\n",
+ "\n",
+ "def remove_id_columns(df, columns_to_remove):\n",
+ " \"\"\"Remove specified columns if they exist in the dataframe\"\"\"\n",
+ " return df.drop(columns=[col for col in columns_to_remove if col in df.columns])\n",
+ "\n",
+ "\n",
+ "sdv_relations = remove_id_columns(sdv_relations, id_columns_to_exclude)\n",
+ "rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)\n",
+ "\n",
+ "report_path, metrics = qa.report(\n",
+ " syn_tgt_data=sdv_relations,\n",
+ " trn_tgt_data=rels_train_qa,\n",
+ " syn_ctx_data=sdv_organizations,\n",
+ " trn_ctx_data=organizations,\n",
+ " ctx_primary_key=\"ID\",\n",
+ " tgt_context_key=\"END_ID\",\n",
+ " max_sample_size_embeddings=10_000,\n",
+ " report_path=\"sdv_relations_qa_report_end_id.html\",\n",
+ ")\n",
+ "\n",
+ "print(f\"SDV Relations Quality Report saved to: {report_path}\")\n",
+ "print(\"\\nSDV Relations Quality Metrics:\")\n",
+ "print(metrics.model_dump_json(indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bf95507e",
+ "metadata": {},
+ "source": [
+ "## MOSTLY AI Synthetic Data Quality"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10d840af",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI - START_ID"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6e769f1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mostlyai_relations = pd.read_parquet(\"./data/mostly/mostlyai_relations.parquet\")\n",
+ "mostlyai_organizations = pd.read_parquet(\"./data/mostly/mostlyai_organizations.parquet\")\n",
+ "\n",
+ "id_columns_to_exclude = [\"ID\", \"END_ID\"]\n",
+ "\n",
+ "\n",
+ "def remove_id_columns(df, columns_to_remove):\n",
+ " \"\"\"Remove specified columns if they exist in the dataframe\"\"\"\n",
+ " return df.drop(columns=[col for col in columns_to_remove if col in df.columns])\n",
+ "\n",
+ "\n",
+ "mostlyai_relations = remove_id_columns(mostlyai_relations, id_columns_to_exclude)\n",
+ "rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)\n",
+ "\n",
+ "report_path, metrics = qa.report(\n",
+ " syn_tgt_data=mostlyai_relations,\n",
+ " trn_tgt_data=rels_train_qa,\n",
+ " syn_ctx_data=mostlyai_organizations,\n",
+ " trn_ctx_data=organizations,\n",
+ " ctx_primary_key=\"ID\",\n",
+ " tgt_context_key=\"START_ID\",\n",
+ " max_sample_size_embeddings=10_000,\n",
+ " report_path=\"mostlyai_relations_qa_report_start_id.html\",\n",
+ ")\n",
+ "\n",
+ "print(f\"MOSTLY AI START_ID Quality Report saved to: {report_path}\")\n",
+ "print(\"\\nMOSTLY AI START_ID Quality Metrics:\")\n",
+ "print(metrics.model_dump_json(indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f075e1b9",
+ "metadata": {},
+ "source": [
+ "### MOSTLY AI - END_ID"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6f54a10f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mostlyai_relations = pd.read_parquet(\"./data/mostly/mostlyai_relations.parquet\")\n",
+ "mostlyai_organizations = pd.read_parquet(\"./data/mostly/mostlyai_organizations.parquet\")\n",
+ "\n",
+ "id_columns_to_exclude = [\"ID\", \"START_ID\"]\n",
+ "\n",
+ "\n",
+ "def remove_id_columns(df, columns_to_remove):\n",
+ " \"\"\"Remove specified columns if they exist in the dataframe\"\"\"\n",
+ " return df.drop(columns=[col for col in columns_to_remove if col in df.columns])\n",
+ "\n",
+ "\n",
+ "mostlyai_relations = remove_id_columns(mostlyai_relations, id_columns_to_exclude)\n",
+ "rels_train_qa = remove_id_columns(relations, id_columns_to_exclude)\n",
+ "\n",
+ "report_path, metrics = qa.report(\n",
+ " syn_tgt_data=mostlyai_relations,\n",
+ " trn_tgt_data=rels_train_qa,\n",
+ " syn_ctx_data=mostlyai_organizations,\n",
+ " trn_ctx_data=organizations,\n",
+ " ctx_primary_key=\"ID\",\n",
+ " tgt_context_key=\"END_ID\",\n",
+ " max_sample_size_embeddings=10_000,\n",
+ " report_path=\"mostlyai_relations_qa_report_end_id.html\",\n",
+ ")\n",
+ "\n",
+ "print(f\"MOSTLY AI END_ID Quality Report saved to: {report_path}\")\n",
+ "print(\"\\nMOSTLY AI END_ID Quality Metrics:\")\n",
+ "print(metrics.model_dump_json(indent=4))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/tutorials/sdv-comparison/foreign-key/mostlyai_relations_qa_report_end_id.html b/docs/tutorials/sdv-comparison/foreign-key/mostlyai_relations_qa_report_end_id.html
new file mode 100644
index 00000000..06913e1c
--- /dev/null
+++ b/docs/tutorials/sdv-comparison/foreign-key/mostlyai_relations_qa_report_end_id.html
@@ -0,0 +1,4797 @@
[Generated Model Report HTML omitted: auto-generated QA report markup with accuracy (univariate, bivariate, trivariate, coherence), similarity (cosine similarity, discriminator AUC), and distance (identical matches, DCR, NNDR) metrics, plus distribution and coherence plots.]