Update the DI to CU Labeled data migration code to work with CU GA #128
base: main
… scripts with "knowledge source" property from the GA API
Do we need this file? It looks like it came from an October commit.
  # Document Intelligence to Content Understanding Migration Tool (Python)

- Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as used in AI Foundry. The following DI versions are supported:
+ Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **GA** 2025-11-01 format, as used in AI Foundry. The following DI versions are supported:
I think it would be best to support both CU Preview and GA conversions.
  3. Configure permissions and expiry for your SAS URL as follows:

  - For the **DI source dataset**, please select permissions: _**Read & List**_
+ https://jfilcikditestdata.blob.core.windows.net/didata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A17%3A06Z&se=2025-12-17T22%3A17%3A06Z&sr=c&sp=rl&sig=nvUIelZQ9yWEJx3jA%2FjUOIdHn6OVnp5gvKSJ3zgzwvE%3D
Need to remove this secret SAS URL.

  - For the **CU target dataset**, please select permissions: _**Read, Add, Create, & Write**_

+ https://jfilcikditestdata.blob.core.windows.net/cudata?sv=2025-07-05&spr=https&st=2025-12-16T22%3A19%3A39Z&se=2025-12-17T22%3A19%3A39Z&sr=c&sp=racwl&sig=K82dxEFNpYhuf5JRq3xJ4vc5SYE8A7FfsBnTJbB1VJY%3D
We don't want to check in the secret blob SAS URL.
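One way to keep signed URLs out of the README is to mask the signature before pasting an example. The following is only a minimal sketch of that idea, using the standard library; the helper name `redact_sas_url` and the placeholder text are hypothetical and not part of this PR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_sas_url(url: str, placeholder: str = "REDACTED") -> str:
    """Return the URL with the value of the SAS 'sig' query parameter masked.

    Hypothetical helper for documentation examples; all other query
    parameters are kept (re-encoded) in their original order.
    """
    parts = urlsplit(url)
    query = [
        (key, placeholder if key.lower() == "sig" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Made-up account, container, and signature for illustration only.
    example = "https://example.blob.core.windows.net/data?sp=rl&sig=abc%3D"
    print(redact_sas_url(example))
```

A plain placeholder such as `<your-SAS-URL>` in the README would avoid the problem altogether; the helper is only useful if real URLs need to appear in logs or examples.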
- "baseAnalyzerId": "prebuilt-documentAnalyzer",
+ "baseAnalyzerId": "prebuilt-document",
+ "models": {
+   "completion": "gpt-4.1",
We will need to set the completion and embedding models in multiple converters. It would be good to keep these default values in constants.py so they can be reused. Another option is to let users pass them as arguments when running the converter.
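A rough sketch of what this suggestion could look like, combining shared defaults with optional command-line overrides. Everything here is an assumption for illustration: the constant names, the `--completion-model` and `--embedding-model` flags, the `embedding` key, and the embedding model name are not taken from the PR; only the `gpt-4.1` completion value appears in the diff.

```python
import argparse

# Hypothetical defaults that could live in constants.py.
DEFAULT_COMPLETION_MODEL = "gpt-4.1"  # value shown in the template diff
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-large"  # assumed, not from the PR


def build_models_section(
    completion: str = DEFAULT_COMPLETION_MODEL,
    embedding: str = DEFAULT_EMBEDDING_MODEL,
) -> dict:
    """Build the 'models' object shared by all converters.

    The 'embedding' key name is an assumption based on the PR description.
    """
    return {"completion": completion, "embedding": embedding}


def parse_model_args(argv=None) -> argparse.Namespace:
    """Parse optional model overrides, ignoring unrelated arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--completion-model", default=DEFAULT_COMPLETION_MODEL)
    parser.add_argument("--embedding-model", default=DEFAULT_EMBEDDING_MODEL)
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_model_args()
    print(build_models_section(args.completion_model, args.embedding_model))
```

Keeping the defaults in one module means each converter imports the same values, while the flags let a user override them per run without editing code.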
Updated the DI to CU labeled data migration tool from the Azure AI Content Understanding Preview API (2025-05-01-preview) to the General Availability API (2025-11-01).
Key Changes Implemented
1. API Version Update
Changed API version from 2025-05-01-preview to 2025-11-01 across all configuration files
Updated .sample_env template with new GA version
2. Analyzer Template Updates
Updated 18 analyzer template JSON files with GA API requirements:
Added models section: Required object specifying completion and embedding models
Updated baseAnalyzerId naming: Changed from preview naming (e.g., prebuilt-documentAnalyzer) to GA naming (e.g., prebuilt-document)
Removed deprecated properties: Eliminated scenario property and pro mode configurations not supported in GA
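The template changes listed above can be sketched as a small transformation on a template dictionary. This is an illustrative sketch under stated assumptions, not the code in this PR: the function name `migrate_template` is hypothetical, the rename table contains only the one example pair given above, `scenario` is assumed to be a top-level key, and the pro mode configuration is not handled because its property names are not given here.

```python
import copy

# Only the rename named in the description; other prebuilt IDs would need entries.
BASE_ANALYZER_RENAMES = {"prebuilt-documentAnalyzer": "prebuilt-document"}

# Assumed to be top-level keys; pro mode settings are intentionally not covered.
DEPRECATED_KEYS = ("scenario",)


def migrate_template(template: dict, models: dict) -> dict:
    """Return a GA-style copy of a preview analyzer template (hypothetical helper)."""
    migrated = copy.deepcopy(template)

    # Rename the base analyzer ID if a GA name is known for it.
    base_id = migrated.get("baseAnalyzerId")
    if base_id in BASE_ANALYZER_RENAMES:
        migrated["baseAnalyzerId"] = BASE_ANALYZER_RENAMES[base_id]

    # Add the models section unless the template already defines one.
    migrated.setdefault("models", dict(models))

    # Drop properties that the description says are not supported in GA.
    for key in DEPRECATED_KEYS:
        migrated.pop(key, None)

    return migrated


if __name__ == "__main__":
    preview = {"baseAnalyzerId": "prebuilt-documentAnalyzer", "scenario": "document"}
    print(migrate_template(preview, {"completion": "gpt-4.1"}))
```

Working on a deep copy leaves the original template untouched, which makes it easier to compare the preview and GA versions side by side.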
3. OCR Extraction Improvements
Modified get_ocr.py to use the prebuilt-read analyzer directly instead of creating temporary analyzers
Streamlined layout result generation by calling the built-in analyzer's API endpoint
4. Training Data Integration
Updated converter files (cu_converter_generative.py, cu_converter_neural.py) to add training data reference during conversion
Implemented correct knowledgeSources array format per GA API specification:
Added optional parameters for target container SAS URL and blob folder to support training data linking
5. Type Safety
Added missing Optional import to di_to_cu_converter.py for proper type hints
6. Documentation Updates
Updated README.md to reflect GA API version and documentation links
Added prerequisite warning about configuring default model deployments before migration
Clarified analyzer creation requirements and setup steps