From 07e8d8104b5ed21140cc39fbf732b8cb35059728 Mon Sep 17 00:00:00 2001 From: Cristian Venticinque Date: Fri, 27 Jun 2025 11:16:56 -0300 Subject: [PATCH 1/3] Adding new topic about document quality and performance Created a reusable section about preparing and testing documents for extraction. --- .../pages/_partials/document-preparation.adoc | 11 ++++ ...ocument-quality-and-model-performance.adoc | 63 +++++++++++++++++++ 2 files changed, 74 insertions(+) create mode 100644 modules/ROOT/pages/_partials/document-preparation.adoc create mode 100644 modules/ROOT/pages/document-quality-and-model-performance.adoc diff --git a/modules/ROOT/pages/_partials/document-preparation.adoc b/modules/ROOT/pages/_partials/document-preparation.adoc new file mode 100644 index 0000000..f6a1b9c --- /dev/null +++ b/modules/ROOT/pages/_partials/document-preparation.adoc @@ -0,0 +1,11 @@ +// tag::documentPreparation[] +== Document Preparation and Testing + +Before creating document actions, ensure your sample documents represent the quality and variety of documents to process in production. The accuracy of your document actions depends significantly on the quality and diversity of your sample documents. + +Include both high-quality and challenging examples in your test set. Test with various document layouts and formats. Use documents with different font styles and sizes. Include examples with tables, forms, and complex layouts. + +Start with high-quality native digital PDFs to establish baseline accuracy. Gradually test with more challenging documents such as scanned PDFs or images. Monitor confidence scores across different document types. Adjust prompts and thresholds based on results. 
+ +For additional details, see xref:document-quality-and-model-performance.adoc[]. +// end::documentPreparation[] \ No newline at end of file diff --git a/modules/ROOT/pages/document-quality-and-model-performance.adoc b/modules/ROOT/pages/document-quality-and-model-performance.adoc new file mode 100644 index 0000000..3eeadae --- /dev/null +++ b/modules/ROOT/pages/document-quality-and-model-performance.adoc @@ -0,0 +1,63 @@ += Document Quality and Model Performance + +The accuracy of data extraction in MuleSoft IDP depends significantly on the quality and type of documents you process. Understanding these factors helps you set realistic expectations and achieve optimal results. Document types and their characteristics have a significant impact on extraction accuracy. + +== Native Digital Documents + +Native digital documents contain embedded text that is directly accessible within the document's internal structure. When processing these documents: + +* LLMs can extract text without requiring OCR (Optical Character Recognition) processing +* Extraction typically yields high-accuracy results with confidence scores of 90% or higher +* These documents are recommended for achieving the best extraction performance + +== Scanned Documents and Images + +Scanned documents and images require OCR processing to convert visual elements into machine-readable text. When processing these documents: + +* Model accuracy depends heavily on the performance of the underlying OCR technology +* Results vary based on image quality and document complexity +* These documents may require human review more frequently than native digital documents + +== Factors Affecting Data Extraction + +The following factors impact the accuracy of data extraction from scanned documents and images: + +* *Image Quality* ++ +Higher-resolution images provide better results. Clear, sharp images with good contrast improve extraction accuracy. Background artifacts, shadows, or blurring reduce accuracy. 
+ +* *Document Layout* ++ +Documents with multiple columns, overlapping elements, or irregular layouts are more challenging to process. Inconsistent spacing, unusual fonts, or mixed formatting styles can affect results. Skewed or rotated documents may require preprocessing. + +* *Text Characteristics* ++ +Standard fonts are easier to process than decorative or unusual fonts. Very small or very large text may be difficult to extract accurately. Most models struggle with handwritten content. + +== Improving Extraction Results + +When you encounter inaccurate extraction results, consider these aspects: + +* *Document Quality Improvements* + +** Use higher quality source documents when possible. +** Improve scanning resolution and quality. +** Standardize document formats across your organization. + +* *Prompt Optimization* + +** Be specific about field locations and expected formats. +** Include examples in prompts for complex fields. +** Test prompts with various document qualities. +** Iterate and refine based on results. + +* *Model Selection* + +** Test different models with your specific document types. + +== See Also + +* xref:document-processing.adoc[] +* xref:creating-document-actions.adoc[] +* xref:supported-models.adoc[] +* xref:analyzing-documents-with-einstein.adoc[] \ No newline at end of file From 58f9bfe118decbdd319c1c45a5b03d4fcb7cbda3 Mon Sep 17 00:00:00 2001 From: Cristian Venticinque Date: Thu, 17 Jul 2025 16:28:46 -0300 Subject: [PATCH 2/3] Added links from other topics and nav. 
--- modules/ROOT/nav.adoc | 1 + modules/ROOT/pages/analyzing-documents-with-einstein.adoc | 3 +++ .../ROOT/pages/automate-document-processing-with-rpa.adoc | 5 +++-- .../pages/automate-document-processing-with-the-idp-api.adoc | 5 +++-- modules/ROOT/pages/creating-document-actions.adoc | 3 +++ modules/ROOT/pages/document-processing.adoc | 1 + .../ROOT/pages/enhancing-data-extraction-with-einstein.adoc | 1 + modules/ROOT/pages/index.adoc | 1 + modules/ROOT/pages/reviewing-processed-documents.adoc | 1 + modules/ROOT/pages/supported-models.adoc | 1 + 10 files changed, 18 insertions(+), 4 deletions(-) diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 6cdfcff..6338ccd 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -2,6 +2,7 @@ * xref:index.adoc[IDP Overview] * xref:release-notes.adoc[Release Notes] * xref:document-processing.adoc[] +* xref:document-quality-and-model-performance.adoc[] * xref:analyzing-documents-with-einstein.adoc[] * xref:creating-document-actions.adoc[] ** xref:enhancing-data-extraction-with-einstein.adoc[] diff --git a/modules/ROOT/pages/analyzing-documents-with-einstein.adoc b/modules/ROOT/pages/analyzing-documents-with-einstein.adoc index 56eb579..627bfd4 100644 --- a/modules/ROOT/pages/analyzing-documents-with-einstein.adoc +++ b/modules/ROOT/pages/analyzing-documents-with-einstein.adoc @@ -24,6 +24,9 @@ include::partial$permissions.adoc[tag=permissionBuild] include::partial$einstein.adoc[tags=einsteinRequisites;!shortIntro] +//Document Preparation and Testing +include::partial$document-preparation.adoc[tag=documentPreparation] + == Create a Generic Document Action and Enable Customize Schema To analyze documents and fully customize the output structure, create a document action of the Generic type and enable *Customize Schema*: diff --git a/modules/ROOT/pages/automate-document-processing-with-rpa.adoc b/modules/ROOT/pages/automate-document-processing-with-rpa.adoc index 2fe2deb..126f2a6 100644 --- 
a/modules/ROOT/pages/automate-document-processing-with-rpa.adoc +++ b/modules/ROOT/pages/automate-document-processing-with-rpa.adoc @@ -24,7 +24,7 @@ include::partial$document-action.adoc[tag=modelUsage] The Submit Document to MuleSoft IDP action step executes document actions by impersonating a user in your organization. Therefore, you must use authentication credentials of a user that has the Execute Published Actions permission in Anypoint Platform. -See xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[] for configuration details. +See xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[MuleSoft RPA: Submit Document to MuleSoft IDP] for configuration details. == Retrieve the Results of the Execution @@ -32,9 +32,10 @@ To retrieve the results of a document action execution, use the Retrieve Results include::partial$document-action.adoc[tag=modelUsage] -See xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[] for configuration details. +See xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[MuleSoft RPA: Retrieve Results from MuleSoft IDP] for configuration details. 
== See Also +* xref:document-quality-and-model-performance.adoc[] * xref:creating-document-actions.adoc[] * xref:publishing-document-actions.adoc[] diff --git a/modules/ROOT/pages/automate-document-processing-with-the-idp-api.adoc b/modules/ROOT/pages/automate-document-processing-with-the-idp-api.adoc index 4c4ff49..22c8542 100644 --- a/modules/ROOT/pages/automate-document-processing-with-the-idp-api.adoc +++ b/modules/ROOT/pages/automate-document-processing-with-the-idp-api.adoc @@ -106,7 +106,8 @@ To confirm the endpoints to call to trigger document action executions and retri == See Also -* xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[] -* xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[] +* xref:document-quality-and-model-performance.adoc[] +* xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[MuleSoft RPA: Submit Document to MuleSoft IDP] +* xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[MuleSoft RPA: Retrieve Results from MuleSoft IDP] * xref:creating-document-actions.adoc[] * xref:publishing-document-actions.adoc[] diff --git a/modules/ROOT/pages/creating-document-actions.adoc b/modules/ROOT/pages/creating-document-actions.adoc index 5a71d4d..f97f6d8 100644 --- a/modules/ROOT/pages/creating-document-actions.adoc +++ b/modules/ROOT/pages/creating-document-actions.adoc @@ -28,6 +28,9 @@ include::partial$permissions.adoc[tag=permissionManage] include::partial$permissions.adoc[tag=permissionBuild] +//Document Preparation and Testing +include::partial$document-preparation.adoc[tag=documentPreparation] + [[upload-files]] == Upload Sample Files and Preview the Results diff --git a/modules/ROOT/pages/document-processing.adoc b/modules/ROOT/pages/document-processing.adoc index 08b8021..f4b48eb 100644 --- a/modules/ROOT/pages/document-processing.adoc +++ b/modules/ROOT/pages/document-processing.adoc @@ -41,6 +41,7 @@ For configuration and usage 
instructions, see: xref:automate-document-processing == See Also +* xref:document-quality-and-model-performance.adoc[] * xref:creating-document-actions.adoc[] * xref:publishing-document-actions.adoc[] * xref:reviewing-processed-documents.adoc[] diff --git a/modules/ROOT/pages/enhancing-data-extraction-with-einstein.adoc b/modules/ROOT/pages/enhancing-data-extraction-with-einstein.adoc index eb3a2d3..9a395e4 100644 --- a/modules/ROOT/pages/enhancing-data-extraction-with-einstein.adoc +++ b/modules/ROOT/pages/enhancing-data-extraction-with-einstein.adoc @@ -30,4 +30,5 @@ include::partial$document-action.adoc[tag=modelUsage] == See Also * xref:example-einstein-prompts.adoc[] +* xref:document-quality-and-model-performance.adoc[] * xref:creating-document-actions.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/index.adoc b/modules/ROOT/pages/index.adoc index ae44abd..f81dfcf 100644 --- a/modules/ROOT/pages/index.adoc +++ b/modules/ROOT/pages/index.adoc @@ -37,6 +37,7 @@ Einstein doesn't use customer data to train any models for document analysis in == See Also * xref:document-processing.adoc[] +* xref:document-quality-and-model-performance.adoc[] * xref:analyzing-documents-with-einstein.adoc[] * xref:creating-document-actions.adoc[] * xref:publishing-document-actions.adoc[] diff --git a/modules/ROOT/pages/reviewing-processed-documents.adoc b/modules/ROOT/pages/reviewing-processed-documents.adoc index bc0d6bb..826ad14 100644 --- a/modules/ROOT/pages/reviewing-processed-documents.adoc +++ b/modules/ROOT/pages/reviewing-processed-documents.adoc @@ -30,6 +30,7 @@ If there are more than one results page, click *Submit and Next* and continue th == See Also +* xref:document-quality-and-model-performance.adoc[] * xref:automate-document-processing-with-the-idp-api.adoc[] * xref:automate-document-processing-with-rpa.adoc[] * xref:adding-reviewers.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/supported-models.adoc 
b/modules/ROOT/pages/supported-models.adoc index 571b21c..0053a61 100644 --- a/modules/ROOT/pages/supported-models.adoc +++ b/modules/ROOT/pages/supported-models.adoc @@ -62,5 +62,6 @@ Select *Show properties* under *choices* to see the details. == See Also +* xref:document-quality-and-model-performance.adoc[] * https://platform.openai.com/docs/guides/text#prompt-engineering[OpenAI's Prompt Engineering Guide] * xref:analyzing-documents-with-einstein.adoc[] From d5a5f15482c0fd6bfc98180a6e30049f1d3e953a Mon Sep 17 00:00:00 2001 From: Cristian Venticinque Date: Thu, 17 Jul 2025 16:44:58 -0300 Subject: [PATCH 3/3] minor update to automation credits page. --- modules/ROOT/pages/ms-automation-credits-2.adoc | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/modules/ROOT/pages/ms-automation-credits-2.adoc b/modules/ROOT/pages/ms-automation-credits-2.adoc index 5720a4d..c85a89c 100644 --- a/modules/ROOT/pages/ms-automation-credits-2.adoc +++ b/modules/ROOT/pages/ms-automation-credits-2.adoc @@ -12,6 +12,11 @@ Analyzing documents with Einstein consumes Automation Credits and Einstein Reque The services contain features that use generative AI technology that may be provided by one or more third-parties as listed in the documentation applicable to the services. This documentation provides information and product requirements specific to these generative AI features and providers, including applicable third-party acceptable use policies which the customer must comply with when using the generative AI technology. Due to the nature of generative AI, the output that it generates may be unpredictable, and may include inaccurate or harmful responses. Before using any generative AI output, the customer is solely responsible for reviewing the output for accuracy, safety, and compliance with applicable laws and third-party acceptable use policies. 
The customer assumes all responsibility for output generated by the services and, as between Salesforce and the customer, such output is customer data. +include::idp::partial$einstein-model.adoc[] + +[NOTE] +Einstein doesn't use customer data to train any models for document analysis in IDP. + == See Also * xref:ms-automation-credits-usage-types.adoc[]