2 changes: 1 addition & 1 deletion python/di_to_cu_migration_tool/.sample_env
@@ -1,6 +1,6 @@
# Rename to .env
HOST="<fill in your target endpoint here>"

API_VERSION = "2025-05-01-preview"
API_VERSION = "2025-11-01"

SUBSCRIPTION_KEY = "<fill in your API Key here>" # This is your API Key if you have one or can be your Subscription ID
31 changes: 25 additions & 6 deletions python/di_to_cu_migration_tool/README.md
@@ -1,13 +1,13 @@
# Document Intelligence to Content Understanding Migration Tool (Python)

-Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as used in AI Foundry. The following DI versions are supported:
+Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **GA** 2025-11-01 format, as used in AI Foundry. The following DI versions are supported:

- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (Document Intelligence Studio) → DI-version = neural
- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (AI Foundry / AI Services / Vision + Document / Document Field Extraction) → DI-version = generative

To identify the version of your Document Intelligence dataset, please consult the sample documents in this folder to match your format. You can also verify the version by reviewing your DI project's user experience. For instance, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry's preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction).

-For migrating from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset into a CU-compatible format. After conversion, you can create a Content Understanding Analyzer trained on your converted CU dataset. Additionally, you have the option to test its quality against any sample documents.
+For migrating from these DI versions to Content Understanding GA (2025-11-01), this tool first converts the DI dataset into a CU-compatible format. After conversion, you can create a Content Understanding Analyzer trained on your converted CU dataset. Additionally, you have the option to test its quality against any sample documents.

## Details About the Tools

@@ -27,8 +27,26 @@ Here is a detailed breakdown of the three CLI tools and their functionality:
* **call_analyze.py**
* This CLI tool verifies that the migration completed successfully and assesses the quality of the created analyzer.


## Setup

### Prerequisites

⚠️ **IMPORTANT: Before using this migration tool**, ensure your Azure AI Foundry resource is properly configured for Content Understanding:

1. **Configure Default Model Deployments**: You must configure default model deployments for Content Understanding in your Azure AI Foundry resource before creating or running analyzers.

To do this, walk through the prerequisites here:
- [REST API Quickstart Guide](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-rest-api?tabs=portal%2Cdocument)

For more details about defaults, check out this documentation:
- [Models and Deployments Documentation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/models-deployments)

2. **Verify you can create and use a basic Content Understanding analyzer** in your Azure AI Foundry resource before attempting migration. This ensures all prerequisites are met.

3. Complete all setup steps outlined in the REST API documentation above, including authentication and model deployment configuration.
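
If you want to script that verification, the rough sketch below creates a throwaway analyzer over REST. It is an illustration only: the analyzer ID and field schema are placeholders, and the `Ocp-Apim-Subscription-Key` header and PUT-to-create pattern follow standard Azure AI Services REST conventions rather than code from this repo.

```python
# Hypothetical smoke test -- not part of this repo. Confirms the resource can
# create a basic Content Understanding analyzer before you attempt migration.
import os

import requests

endpoint = os.environ["HOST"].rstrip("/")   # from your .env
api_key = os.environ["SUBSCRIPTION_KEY"]    # from your .env
analyzer_id = "migration-smoke-test"        # placeholder ID

url = f"{endpoint}/contentunderstanding/analyzers/{analyzer_id}?api-version=2025-11-01"
body = {
    "baseAnalyzerId": "prebuilt-document",  # base analyzer used by this PR
    "fieldSchema": {
        "fields": {"Title": {"type": "string", "method": "extract"}}  # assumed schema shape
    },
}

resp = requests.put(url, json=body, headers={"Ocp-Apim-Subscription-Key": api_key})
print(resp.status_code, resp.text)  # 200/201 indicates the prerequisites are in place
```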

### Tool Setup
Please follow these steps to set up the tool:

1. Install dependencies by running:
@@ -43,7 +61,7 @@ Please follow these steps to set up the tool:
- **SUBSCRIPTION_KEY:** Update to your Azure AI Service API Key or Subscription ID to authenticate the API requests.
- Locate your API Key here: ![Azure AI Service Endpoints With Keys](assets/endpoint-with-keys.png)
- If using Azure Active Directory (AAD), please refer to your Subscription ID: ![Azure AI Service Subscription ID](assets/subscription-id.png)
-- **API_VERSION:** This is preset to the CU Preview.2 version; no changes are needed.
+- **API_VERSION:** This is preset to the CU GA version (2025-11-01); no changes are needed.
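
For reference, here is a minimal sketch of how these three values are typically consumed; the tool's actual loading code may differ, and `python-dotenv` is an assumption.

```python
# Hypothetical illustration of consuming the .env values; the tool's actual
# loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads HOST, API_VERSION, and SUBSCRIPTION_KEY from .env

host = os.environ["HOST"].rstrip("/")
api_version = os.getenv("API_VERSION", "2025-11-01")
key = os.environ["SUBSCRIPTION_KEY"]

# Requests target URLs of this shape:
url = f"{host}/contentunderstanding/analyzers/<analyzerId>?api-version={api_version}"
```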

## How to Locate Your Document Field Extraction Dataset for Migration

@@ -73,6 +91,7 @@ To obtain SAS URLs for a file or folder for any container URL arguments, please
3. Configure permissions and expiry for your SAS URL as follows:

- For the **DI source dataset**, please select permissions: _**Read & List**_

- For the **CU target dataset**, please select permissions: _**Read, Add, Create, & Write**_

After configuring, click **Generate SAS Token and URL** and copy the URL shown under **Blob SAS URL**.
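
If you prefer to script this instead of using the portal, the sketch below generates equivalent SAS URLs with the `azure-storage-blob` package; the storage account name, key, and container names are placeholders you must replace.

```python
# Hypothetical alternative to the portal steps above; substitute your own
# storage account name, key, and container names.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

ACCOUNT, ACCOUNT_KEY = "mystorageaccount", "<account-key>"

def container_sas_url(container: str, permission: ContainerSasPermissions) -> str:
    token = generate_container_sas(
        account_name=ACCOUNT,
        container_name=container,
        account_key=ACCOUNT_KEY,
        permission=permission,
        expiry=datetime.now(timezone.utc) + timedelta(hours=8),
    )
    return f"https://{ACCOUNT}.blob.core.windows.net/{container}?{token}"

# DI source dataset: Read & List
source_sas = container_sas_url("di-source", ContainerSasPermissions(read=True, list=True))
# CU target dataset: Read, Add, Create & Write
target_sas = container_sas_url(
    "cu-target", ContainerSasPermissions(read=True, add=True, create=True, write=True)
)
```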
@@ -155,7 +174,7 @@ Below are common issues you might encounter when creating an analyzer or running
- **400 Bad Request** errors:
Please validate the following:
- The endpoint URL is valid. Example:
-`https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview`
+`https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-11-01`
- Your converted CU dataset respects the naming constraints below. If needed, please manually correct the `analyzer.json` fields:
- Field names start with a letter or underscore
- Field name length must be between 1 and 64 characters
@@ -174,7 +193,7 @@ Below are common issues you might encounter when creating an analyzer or running

- **400 Bad Request**:
This implies that you might have an incorrect endpoint or SAS URL. Please ensure that your endpoint is valid and that you are using the correct SAS URL for the document:
-`https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview`
+`https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-11-01`

- **401 Unauthorized**:
@@ -189,4 +208,4 @@ Below are common issues you might encounter when creating an analyzer or running
2. Signature field types (present in previous DI versions) are not yet supported in Content Understanding. These will be ignored during migration when creating the analyzer.
3. The content of your training documents is retained in the CU model's metadata, under storage specifically. You can find more details at:
https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
-4. All conversions are for Content Understanding preview.2 version only.
+4. All conversions are for the Content Understanding GA (2025-11-01) version.
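
Since the 400-level items above mostly come down to a malformed analyze request, here is a rough sketch of the call that `call_analyze.py` issues, useful for isolating such errors outside the tool. The request body shape and polling pattern are assumptions based on standard Azure AI Services async conventions, not code copied from this repo.

```python
# Hypothetical reproduction of the analyze call for troubleshooting; the body
# shape and polling loop are assumptions, not this repo's exact code.
import os
import time

import requests

endpoint = os.environ["HOST"].rstrip("/")
headers = {"Ocp-Apim-Subscription-Key": os.environ["SUBSCRIPTION_KEY"]}
url = f"{endpoint}/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-11-01"

resp = requests.post(url, json={"url": "<document SAS URL>"}, headers=headers)
resp.raise_for_status()  # a 400 here usually means a bad endpoint or SAS URL

poll_url = resp.headers["Operation-Location"]  # async: poll until done
while True:
    result = requests.get(poll_url, headers=headers).json()
    if result.get("status", "").lower() not in ("notstarted", "running"):
        break
    time.sleep(2)
print(result.get("status"), result.get("result", {}))
```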
8 changes: 7 additions & 1 deletion python/di_to_cu_migration_tool/constants.py
@@ -1,17 +1,23 @@
# Supported DI versions
DI_VERSIONS = ["generative", "neural"]
CU_API_VERSION = "2025-05-01-preview"
CU_API_VERSION = "2025-11-01"

# Models
COMPLETION_MODEL = "gpt-4.1"
EMBEDDING_MODEL = "text-embedding-3-large"

# constants
MAX_FIELD_COUNT = 100
MAX_FIELD_LENGTH = 64

# standard file names
FIELDS_JSON = "fields.json"
ANALYZER_JSON = "analyzer.json"
LABELS_JSON = ".labels.json"
VALIDATION_TXT = "validation.txt"
PDF = ".pdf"
OCR_JSON = ".ocr.json"
RESULT_JSON = ".result.json"

# for field type conversion
SUPPORT_FIELD_TYPE = [
46 changes: 38 additions & 8 deletions python/di_to_cu_migration_tool/cu_converter_generative.py
@@ -12,7 +12,7 @@
from rich import print # For colored output

# imports from same project
-from constants import CU_API_VERSION, MAX_FIELD_LENGTH, VALID_CU_FIELD_TYPES
+from constants import CU_API_VERSION, MAX_FIELD_LENGTH, VALID_CU_FIELD_TYPES, COMPLETION_MODEL, EMBEDDING_MODEL
**Review thread on this import:**

> **chienyuanchang (Collaborator, author):** Hi @aainav269, I found we only validate the length of the field name and do not check or normalize the field name against our current field limitations. It seems we also don't check/remove the field format. Do you recall the discussion of field name normalization in this tool?
>
> **Contributor:** Yes, we decided then to remove the fields that exceed the field name length. One point of discussion was: if we shorten the field name, could there be another field with that name? Ex: if we have ...._Yes and ...._No and we shorten both, it would be ....
>
> I don't think we ever validated the field format. I think we assumed that if the field was already generated by DI, the format would apply to CU as well. What are you thinking of enforcing for this?
>
> **chienyuanchang (Jan 21, 2026):** CU has more limitations on field names than DI, such as no white space and only underscores, no other symbols. If we didn't ignore this intentionally, I will add some logic to do the validation and modification.
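
A minimal sketch of the normalization proposed in this thread (hypothetical, not yet in the PR), using the constraints this README already states: names start with a letter or underscore, are 1–64 characters, and contain only word characters.

```python
# Hypothetical sketch of the normalization proposed above -- not yet in this PR.
import re

MAX_FIELD_LENGTH = 64  # mirrors constants.MAX_FIELD_LENGTH

def normalize_field_name(name: str, seen: set) -> str:
    candidate = re.sub(r"[^A-Za-z0-9_]", "_", name)   # replace spaces/symbols with underscores
    if not re.match(r"[A-Za-z_]", candidate):
        candidate = "_" + candidate                   # must start with a letter or underscore
    candidate = candidate[:MAX_FIELD_LENGTH]
    base, suffix = candidate, 1
    while candidate in seen:                          # dedupe post-truncation collisions, e.g. "..._Yes"/"..._No"
        tail = f"_{suffix}"
        candidate = base[:MAX_FIELD_LENGTH - len(tail)] + tail
        suffix += 1
    seen.add(candidate)
    return candidate
```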

from field_definitions import FieldDefinitions

# schema constants subject to change
@@ -48,7 +48,7 @@ def format_angle(angle: float) -> float:
formatted_num = f"{rounded_angle:.7f}".rstrip('0') # Remove trailing zeros
return float(formatted_num)

-def convert_fields_to_analyzer(fields_json_path: Path, analyzer_prefix: Optional[str], target_dir: Path, field_definitions: FieldDefinitions) -> dict:
+def convert_fields_to_analyzer(fields_json_path: Path, analyzer_prefix: Optional[str], target_dir: Path, field_definitions: FieldDefinitions, target_container_sas_url: str = None, target_blob_folder: str = None) -> dict:
"""
Convert DI 4.0 preview Custom Document fields.json to analyzer.json format.
Args:
@@ -79,7 +79,11 @@ def convert_fields_to_analyzer(fields_json_path: Path, analyzer_prefix: Optional
# build analyzer.json appropriately
analyzer_data = {
"analyzerId": analyzer_id,
"baseAnalyzerId": "prebuilt-documentAnalyzer",
"baseAnalyzerId": "prebuilt-document",
"models": {
"completion": COMPLETION_MODEL,
"embedding": EMBEDDING_MODEL
},
"config": {
"returnDetails": True,
# Add the following line as a temp workaround before service issue is fixed.
@@ -121,6 +125,17 @@ def convert_fields_to_analyzer(fields_json_path: Path, analyzer_prefix: Optional
else:
analyzer_json_path = fields_json_path.parent / 'analyzer.json'

# Add knowledgeSources section if container info is provided
if target_container_sas_url and target_blob_folder:
analyzer_data["knowledgeSources"] = [
{
"kind": "labeledData",
"containerUrl": target_container_sas_url,
"prefix": target_blob_folder,
"fileListPath": ""
}
]

# Ensure target directory exists
analyzer_json_path.parent.mkdir(parents=True, exist_ok=True)

@@ -287,7 +302,11 @@ def recursive_convert_di_label_to_cu_helper(value: dict) -> dict:
di_label["valueDate"] = date_string # going with the default
elif value_type == "number":
try:
di_label["valueNumber"] = float(value.get("content")) # content can be easily converted to a float
content_val = value.get("content")
**Review thread on this change:**

> **chienyuanchang (Collaborator, author):** Hi @aainav269, I encountered some errors when I tried to convert fields labeled by region in DI Studio, which would not have content. I'm wondering if we encountered this error before and if we are good to set the value as None.
>
> **Contributor:** I do remember seeing these region fields before, but only in DI 3.1. I think we decided to just ignore these region fields when converting to CU.
>
> **chienyuanchang:** I see. I need to set the value to None to avoid the errors.

if not content_val:
di_label["valueNumber"] = None
else:
di_label["valueNumber"] = float(content_val) # content can be easily converted to a float
except Exception as ex:
# strip the string of all non-numerical values and periods
string_value = value.get("content")
@@ -296,16 +315,27 @@ def recursive_convert_di_label_to_cu_helper(value: dict) -> dict:
# if more than one period exists, remove them all
if cleaned_string.count('.') > 1:
print("More than one decimal point exists, so will be removing them all.")
-cleaned_string = cleaned_string = re.sub(r'\.', '', string_value)
-di_label["valueNumber"] = float(cleaned_string)
+cleaned_string = re.sub(r'\.', '', string_value)
+
+if not cleaned_string:
+di_label["valueNumber"] = None
+else:
+di_label["valueNumber"] = float(cleaned_string)
elif value_type == "integer":
try:
di_label["valueInteger"] = int(value.get("content")) # content can be easily converted to an int
content_val = value.get("content")
if not content_val:
di_label["valueInteger"] = None
else:
di_label["valueInteger"] = int(content_val) # content can be easily converted to an int
except Exception as ex:
# strip the string of all non-numerical values
string_value = value.get("content")
cleaned_string = re.sub(r'[^0-9]', '', string_value)
di_label["valueInteger"] = int(cleaned_string)
if not cleaned_string:
di_label["valueInteger"] = None
else:
di_label["valueInteger"] = int(cleaned_string)
else:
di_label[value_part] = value.get("content")
di_label["spans"] = value.get("spans", [])
42 changes: 33 additions & 9 deletions python/di_to_cu_migration_tool/cu_converter_neural.py
@@ -12,7 +12,7 @@
from rich import print # For colored output

# imports from same project
-from constants import COMPLETE_DATE_FORMATS, CU_API_VERSION, MAX_FIELD_LENGTH, VALID_CU_FIELD_TYPES
+from constants import COMPLETE_DATE_FORMATS, CU_API_VERSION, MAX_FIELD_LENGTH, VALID_CU_FIELD_TYPES, COMPLETION_MODEL, EMBEDDING_MODEL, ANALYZER_JSON
from field_definitions import FieldDefinitions

# schema constants subject to change
@@ -37,7 +37,7 @@ def convert_bounding_regions_to_source(page_number: int, polygon: list) -> str:
source = f"D({page_number},{polygon_str})"
return source

-def convert_fields_to_analyzer_neural(fields_json_path: Path, analyzer_prefix: Optional[str], target_dir: Optional[Path], field_definitions: FieldDefinitions) -> Tuple[dict, dict]:
+def convert_fields_to_analyzer_neural(fields_json_path: Path, analyzer_prefix: Optional[str], target_dir: Optional[Path], field_definitions: FieldDefinitions, target_container_sas_url: str = None, target_blob_folder: str = None) -> Tuple[dict, dict]:
"""
Convert DI 3.1/4.0GA Custom Neural fields.json to analyzer.json format.
Args:
@@ -67,7 +67,11 @@ def convert_fields_to_analyzer_neural(fields_json_path: Path, analyzer_prefix: O
# Build analyzer.json content
analyzer_data = {
"analyzerId": analyzer_prefix,
"baseAnalyzerId": "prebuilt-documentAnalyzer",
"baseAnalyzerId": "prebuilt-document",
"models": {
"completion": COMPLETION_MODEL,
"embedding": EMBEDDING_MODEL
},
"config": {
"returnDetails": True,
# Add the following line as a temp workaround before service issue is fixed.
@@ -128,10 +132,21 @@ def convert_fields_to_analyzer_neural(fields_json_path: Path, analyzer_prefix: O

# Determine output path
if target_dir:
-analyzer_json_path = target_dir / 'analyzer.json'
+analyzer_json_path = target_dir / ANALYZER_JSON
else:
-analyzer_json_path = fields_json_path.parent / 'analyzer.json'
+analyzer_json_path = fields_json_path.parent / ANALYZER_JSON

# Add knowledgeSources section if container info is provided
if target_container_sas_url and target_blob_folder:
analyzer_data["knowledgeSources"] = [
{
"kind": "labeledData",
"containerUrl": target_container_sas_url,
"prefix": target_blob_folder,
"fileListPath": ""
}
]

# Ensure target directory exists
analyzer_json_path.parent.mkdir(parents=True, exist_ok=True)

@@ -405,16 +420,25 @@ def creating_cu_label_for_neural(label:dict, label_type: str) -> dict:
# if more than one period exists, remove them all
if cleaned_string.count('.') > 1:
print("More than one decimal point exists, so will be removing them all.")
-cleaned_string = cleaned_string = re.sub(r'\.', '', string_value)
-final_content = float(cleaned_string)
+cleaned_string = re.sub(r'\.', '', string_value)
+if not cleaned_string:
+final_content = None
+else:
+final_content = float(cleaned_string)
elif label_type == "integer":
try:
-final_content = int(final_content)
+if not final_content:
+final_content = None
+else:
+final_content = int(final_content)
except Exception as ex:
# strip the string of all non-numerical values
string_value = final_content
cleaned_string = re.sub(r'[^0-9]', '', string_value)
-final_content = int(cleaned_string)
+if not cleaned_string:
+final_content = None
+else:
+final_content = int(cleaned_string)
elif label_type == "date":
# dates can be dmy, mdy, ydm, or not specified
# for CU, the format of our dates should be "%Y-%m-%d"