@jfilcik jfilcik commented Dec 17, 2025

Updated the DI to CU labeled data migration tool from Azure AI Content Understanding Preview API (2025-05-01-preview) to General Availability API (2025-11-01).

Key Changes Implemented

  1. API Version Update

Changed API version from 2025-05-01-preview to 2025-11-01 across all configuration files
Updated .sample_env template with new GA version

  2. Analyzer Template Breaking Changes
    Updated 18 analyzer template JSON files with GA API requirements:

Added models section: Required object specifying completion and embedding models
Updated baseAnalyzerId naming: Changed from preview naming (e.g., prebuilt-documentAnalyzer) to GA naming (e.g., prebuilt-document)
Removed deprecated properties: Eliminated scenario property and pro mode configurations not supported in GA
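
The template changes listed above could be applied with a helper along these lines. This is a minimal sketch, not the code in this PR: the function name `upgrade_template`, the rename table, and the embedding model value are assumptions for illustration. Only the `prebuilt-documentAnalyzer` to `prebuilt-document` rename, the `models` object, and the removal of `scenario` come from the description. Pro mode settings are left out because the description does not name their keys.

```python
import copy

# Preview -> GA base analyzer names. Only this pair appears in the PR;
# any further entries would need to be confirmed against the GA docs.
BASE_ANALYZER_RENAMES = {
    "prebuilt-documentAnalyzer": "prebuilt-document",
}

# Hypothetical defaults: "gpt-4.1" is shown in the diff, the embedding
# model name is an assumption.
DEFAULT_MODELS = {"completion": "gpt-4.1", "embedding": "text-embedding-3-large"}

# Properties the description says are not supported in GA.
DEPRECATED_KEYS = ("scenario",)


def upgrade_template(template: dict) -> dict:
    """Return a GA-style copy of a preview analyzer template."""
    upgraded = copy.deepcopy(template)  # leave the caller's dict untouched

    base = upgraded.get("baseAnalyzerId")
    if base in BASE_ANALYZER_RENAMES:
        upgraded["baseAnalyzerId"] = BASE_ANALYZER_RENAMES[base]

    # Add the required models section only if the template lacks one.
    upgraded.setdefault("models", dict(DEFAULT_MODELS))

    for key in DEPRECATED_KEYS:
        upgraded.pop(key, None)
    return upgraded
```

Returning a copy rather than mutating in place makes it easy to diff the preview and GA versions of each template before writing the files back.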

  3. OCR Extraction Improvements

Modified get_ocr.py to use prebuilt-read analyzer directly instead of creating temporary analyzers
Streamlined layout result generation by calling built-in analyzer API endpoint
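
Calling the built-in analyzer directly means the script only needs the analyze endpoint for `prebuilt-read`. The sketch below shows how such a URL might be assembled. The function name `build_analyze_url` and the URL path are assumptions based on the service's REST conventions, not taken from `get_ocr.py`, so check them against the GA API reference before relying on them.

```python
from urllib.parse import quote, urlencode


def build_analyze_url(
    endpoint: str,
    analyzer_id: str = "prebuilt-read",
    api_version: str = "2025-11-01",
) -> str:
    """Build the analyze URL for a built-in analyzer (assumed path layout)."""
    base = endpoint.rstrip("/")  # tolerate a trailing slash in the endpoint
    # Assumed path: /contentunderstanding/analyzers/{analyzerId}:analyze
    path = f"/contentunderstanding/analyzers/{quote(analyzer_id, safe='')}:analyze"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"
```

Because no temporary analyzer is created, there is nothing to clean up after the layout result is fetched.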

  4. Training Data Integration

Updated converter files (cu_converter_generative.py, cu_converter_neural.py) to add training data reference during conversion
Implemented correct knowledgeSources array format per GA API specification:
Added optional parameters for target container SAS URL and blob folder to support training data linking

  5. Type Safety

Added missing Optional import to di_to_cu_converter.py for proper type hints

  6. Documentation Updates

Updated README.md to reflect GA API version and documentation links
Added prerequisite warning about configuring default model deployments before migration
Clarified analyzer creation requirements and setup steps

Contributor commented:

Do we need this file? I saw that it came from an October commit.

@aainav269 aainav269 self-requested a review December 18, 2025 21:54
# Document Intelligence to Content Understanding Migration Tool (Python)

- Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as used in AI Foundry. The following DI versions are supported:
+ Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **GA** 2025-11-01 format, as used in AI Foundry. The following DI versions are supported:
Contributor commented:

I think it would be best to support both CU Preview and GA conversions?

3. Configure permissions and expiry for your SAS URL as follows:

- For the **DI source dataset**, please select permissions: _**Read & List**_
https://jfilcikditestdata.blob.core.windows.net/didata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A17%3A06Z&se=2025-12-17T22%3A17%3A06Z&sr=c&sp=rl&sig=nvUIelZQ9yWEJx3jA%2FjUOIdHn6OVnp5gvKSJ3zgzwvE%3D
Collaborator commented:

Need to remove this secret SAS URL.


- For the **CU target dataset**, please select permissions: _**Read, Add, Create, & Write**_

https://jfilcikditestdata.blob.core.windows.net/cudata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A19%3A39Z&se=2025-12-17T22%3A19%3A39Z&sr=c&sp=racwl&sig=K82dxEFNpYhuf5JRq3xJ4vc5SYE8A7FfsBnTJbB1VJY%3D
Collaborator commented:

We won't want to check in the secret blob SAS URL.

- "baseAnalyzerId": "prebuilt-documentAnalyzer",
+ "baseAnalyzerId": "prebuilt-document",
+ "models": {
+ "completion": "gpt-4.1",
Collaborator commented:

We will need to set the completion and embedding models in multiple converters. It would be good to keep these default values in constants.py so they can be reused. Another option is to let users pass them as arguments when running the converter.
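
Both options could be combined, roughly as sketched below. This is only an illustration: the constant names, the `--completion-model` and `--embedding-model` flags, the use of `argparse`, and the embedding default are assumptions. Only `gpt-4.1` comes from the diff.

```python
import argparse

# These two constants would live in constants.py.
DEFAULT_COMPLETION_MODEL = "gpt-4.1"  # value shown in the diff
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-large"  # assumed default


def build_models_section(
    completion: str = DEFAULT_COMPLETION_MODEL,
    embedding: str = DEFAULT_EMBEDDING_MODEL,
) -> dict:
    """Return the `models` object shared by every converter."""
    return {"completion": completion, "embedding": embedding}


def parse_model_args(argv=None) -> dict:
    """Read optional model overrides from the command line."""
    parser = argparse.ArgumentParser(description="DI to CU converter (sketch)")
    parser.add_argument("--completion-model", default=DEFAULT_COMPLETION_MODEL)
    parser.add_argument("--embedding-model", default=DEFAULT_EMBEDDING_MODEL)
    args = parser.parse_args(argv)
    return build_models_section(args.completion_model, args.embedding_model)
```

With this shape, each converter calls `build_models_section()` instead of hard-coding the values, and users who deploy different models override them per run.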
