W-18902119 Adding new topic about document quality and performance #75

Open

Cristian-Venticinque wants to merge 3 commits into latest from W-18902119-model-accuracy-performance

modules/ROOT/nav.adoc

@@ -2,6 +2,7 @@
     * xref:index.adoc[IDP Overview]
     * xref:release-notes.adoc[Release Notes]
     * xref:document-processing.adoc[]
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:analyzing-documents-with-einstein.adoc[]
     * xref:creating-document-actions.adoc[]
     ** xref:enhancing-data-extraction-with-einstein.adoc[]

modules/ROOT/pages/_partials/document-preparation.adoc

@@ -0,0 +1,11 @@
+    // tag::documentPreparation[]
+    == Document Preparation and Testing
+    Before creating document actions, ensure your sample documents represent the quality and variety of the documents you expect to process in production. The accuracy of your document actions depends significantly on the quality and diversity of your sample documents.
+    Include both high-quality and challenging examples in your test set. Test with various document layouts and formats. Use documents with different font styles and sizes. Include examples with tables, forms, and complex layouts.
+    Start with high-quality native digital PDFs to establish baseline accuracy. Gradually test with more challenging documents such as scanned PDFs or images. Monitor confidence scores across different document types. Adjust prompts and thresholds based on results.
+    For additional details, see xref:document-quality-and-model-performance.adoc[].
+    // end::documentPreparation[]
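
The monitoring step described in this partial ("Monitor confidence scores across different document types") could be approached along these lines. This is a minimal Python sketch, not part of MuleSoft IDP: it assumes the confidence scores have already been collected from test runs as `(document_type, confidence)` pairs, and the function name `summarize_confidence`, the document type labels, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_confidence(results):
    """Group confidence scores by document type and report count, mean, and minimum.

    `results` is an iterable of (document_type, confidence) pairs, with
    confidence expressed as a fraction between 0 and 1. This input shape is an
    assumption made for illustration, not an IDP export format.
    """
    by_type = defaultdict(list)
    for doc_type, confidence in results:
        by_type[doc_type].append(confidence)
    return {
        doc_type: {"count": len(scores), "mean": mean(scores), "min": min(scores)}
        for doc_type, scores in by_type.items()
    }


# Hypothetical sample values, invented for this example.
sample = [
    ("native_pdf", 0.97),
    ("native_pdf", 0.93),
    ("scanned_pdf", 0.81),
    ("scanned_pdf", 0.64),
    ("image", 0.58),
]

summary = summarize_confidence(sample)
for doc_type, stats in summary.items():
    print(f"{doc_type}: n={stats['count']} mean={stats['mean']:.2f} min={stats['min']:.2f}")
```

Comparing the per-type minimum against the mean is one way to spot document types where a few poor samples pull results down, which is where prompt or threshold adjustments are most likely to matter.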

modules/ROOT/pages/analyzing-documents-with-einstein.adoc

@@ -24,6 +24,9 @@ include::partial$permissions.adoc[tag=permissionBuild]
     include::partial$einstein.adoc[tags=einsteinRequisites;!shortIntro]
+    //Document Preparation and Testing
+    include::partial$document-preparation.adoc[tag=documentPreparation]
     == Create a Generic Document Action and Enable Customize Schema
     To analyze documents and fully customize the output structure, create a document action of the Generic type and enable *Customize Schema*:

modules/ROOT/pages/automate-document-processing-with-rpa.adoc

@@ -24,17 +24,18 @@ include::partial$document-action.adoc[tag=modelUsage]
     The Submit Document to MuleSoft IDP action step executes document actions by impersonating a user in your organization. Therefore, you must use authentication credentials of a user that has the Execute Published Actions permission in Anypoint Platform.
-    See xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[] for configuration details.
+    See xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[MuleSoft RPA: Submit Document to MuleSoft IDP] for configuration details.
     == Retrieve the Results of the Execution
     To retrieve the results of a document action execution, use the Retrieve Results from MuleSoft IDP action step in RPA. This action step enables you to query the results of a document action execution by providing an Execution ID that you used before in the corresponding Submit Document to MuleSoft IDP execution.
     include::partial$document-action.adoc[tag=modelUsage]
-    See xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[] for configuration details.
+    See xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[MuleSoft RPA: Retrieve Results from MuleSoft IDP] for configuration details.
     == See Also
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:creating-document-actions.adoc[]
     * xref:publishing-document-actions.adoc[]

modules/ROOT/pages/automate-document-processing-with-the-idp-api.adoc

     == See Also
-    * xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[]
-    * xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[]
+    * xref:document-quality-and-model-performance.adoc[]
+    * xref:rpa-builder::toolbox-mulesoft-idp-submit-document-to-mulesoft-idp.adoc[MuleSoft RPA: Submit Document to MuleSoft IDP]
+    * xref:rpa-builder::toolbox-mulesoft-idp-retrieve-results-from-mulesoft-idp.adoc[MuleSoft RPA: Retrieve Results from MuleSoft IDP]
     * xref:creating-document-actions.adoc[]
     * xref:publishing-document-actions.adoc[]

modules/ROOT/pages/creating-document-actions.adoc

@@ -28,6 +28,9 @@ include::partial$permissions.adoc[tag=permissionManage]
     include::partial$permissions.adoc[tag=permissionBuild]
+    //Document Preparation and Testing
+    include::partial$document-preparation.adoc[tag=documentPreparation]
     [[upload-files]]
     == Upload Sample Files and Preview the Results

modules/ROOT/pages/document-processing.adoc

     == See Also
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:creating-document-actions.adoc[]
     * xref:publishing-document-actions.adoc[]
     * xref:reviewing-processed-documents.adoc[]

modules/ROOT/pages/document-quality-and-model-performance.adoc

@@ -0,0 +1,63 @@
+    = Document Quality and Model Performance
+    The accuracy of data extraction in MuleSoft IDP depends significantly on the quality and type of the documents you process. Understanding these factors helps you set realistic expectations and achieve optimal results.
+    == Native Digital Documents
+    Native digital documents contain embedded text that is directly accessible within the document's internal structure. When processing these documents:
+    * LLMs can extract text without requiring optical character recognition (OCR) processing
+    * Extraction typically yields high-accuracy results with confidence scores of 90% or higher
+    * These documents are recommended for achieving the best extraction performance
+    == Scanned Documents and Images
+    Scanned documents and images require OCR processing to convert visual elements into machine-readable text. When processing these documents:
+    * Model accuracy depends heavily on the performance of the underlying OCR technology
+    * Results vary based on image quality and document complexity
+    * These documents may require human review more frequently than native digital documents
+    == Factors Affecting Data Extraction
+    The following factors impact the accuracy of data extraction from scanned documents and images:
+    * *Image Quality*
+    +
+    Higher resolution images provide better results. Clear, sharp images with good contrast improve extraction accuracy. Background artifacts, shadows, or blurring reduce accuracy.
+    * *Document Layout*
+    +
+    Documents with multiple columns, overlapping elements, or irregular layouts are more challenging to process. Inconsistent spacing, unusual fonts, or mixed formatting styles can affect results. Skewed or rotated documents may require preprocessing.
+    * *Text Characteristics*
+    +
+    Standard fonts are easier to process than decorative or unusual fonts. Very small or very large text may be difficult to extract accurately. Most models struggle with handwritten content.
+    == Improving Extraction Results
+    When you encounter inaccurate extraction results, consider these aspects:
+    * *Document Quality Improvements*
+    ** Use higher quality source documents when possible.
+    ** Improve scanning resolution and quality.
+    ** Standardize document formats across your organization.
+    * *Prompt Optimization*
+    ** Be specific about field locations and expected formats.
+    ** Include examples in prompts for complex fields.
+    ** Test prompts with various document qualities.
+    ** Iterate and refine based on results.
+    * *Model Selection*
+    ** Test different models with your specific document types.
+    == See Also
+    * xref:document-processing.adoc[]
+    * xref:creating-document-actions.adoc[]
+    * xref:supported-models.adoc[]
+    * xref:analyzing-documents-with-einstein.adoc[]
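
The new page notes that scanned documents and images may need human review more often than native digital documents. One way to picture that routing decision is the following Python sketch. It is an illustration under stated assumptions, not IDP behavior: the function `fields_needing_review`, the field names, and the scores are hypothetical, and the default threshold of 0.9 simply mirrors the 90% figure the page cites for native digital documents rather than any documented setting.

```python
def fields_needing_review(extracted_fields, threshold=0.9):
    """Return the sorted names of fields whose confidence is below `threshold`.

    `extracted_fields` maps a field name to a confidence score between 0 and 1.
    Both the mapping shape and the default threshold are assumptions made for
    this example.
    """
    return sorted(
        name
        for name, confidence in extracted_fields.items()
        if confidence < threshold
    )


# Hypothetical extraction result for a scanned invoice.
scanned_invoice = {
    "invoice_number": 0.96,
    "total_amount": 0.72,
    "due_date": 0.88,
}

flagged = fields_needing_review(scanned_invoice)
print(flagged)
```

A field exactly at the threshold is treated as passing here; a stricter policy would use `<=` instead.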

modules/ROOT/pages/enhancing-data-extraction-with-einstein.adoc

@@ -30,4 +30,5 @@ include::partial$document-action.adoc[tag=modelUsage]
     == See Also
     * xref:example-einstein-prompts.adoc[]
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:creating-document-actions.adoc[]

modules/ROOT/pages/index.adoc

     == See Also
     * xref:document-processing.adoc[]
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:analyzing-documents-with-einstein.adoc[]
     * xref:creating-document-actions.adoc[]
     * xref:publishing-document-actions.adoc[]

modules/ROOT/pages/ms-automation-credits-2.adoc

     The services contain features that use generative AI technology that may be provided by one or more third-parties as listed in the documentation applicable to the services. This documentation provides information and product requirements specific to these generative AI features and providers, including applicable third-party acceptable use policies which the customer must comply with when using the generative AI technology. Due to the nature of generative AI, the output that it generates may be unpredictable, and may include inaccurate or harmful responses. Before using any generative AI output, the customer is solely responsible for reviewing the output for accuracy, safety, and compliance with applicable laws and third-party acceptable use policies. The customer assumes all responsibility for output generated by the services and, as between Salesforce and the customer, such output is customer data.
+    include::idp::partial$einstein-model.adoc[]
+    [NOTE]
+    Einstein doesn't use customer data to train any models for document analysis in IDP.
     == See Also
     * xref:ms-automation-credits-usage-types.adoc[]

modules/ROOT/pages/reviewing-processed-documents.adoc

     == See Also
+    * xref:document-quality-and-model-performance.adoc[]
     * xref:automate-document-processing-with-the-idp-api.adoc[]
     * xref:automate-document-processing-with-rpa.adoc[]
     * xref:adding-reviewers.adoc[]

modules/ROOT/pages/supported-models.adoc

@@ -62,5 +62,6 @@ Select *Show properties* under *choices* to see the details.
     == See Also
+    * xref:document-quality-and-model-performance.adoc[]
     * https://platform.openai.com/docs/guides/text#prompt-engineering[OpenAI's Prompt Engineering Guide]
     * xref:analyzing-documents-with-einstein.adoc[]
