The Parserate document parsing feature uses Docling to extract and validate structured data from documents like PDFs, images, and forms. This helps automate HR document processing and reduces manual data entry errors.
- Parses documents: PDFs, images (PNG, JPG, TIFF, BMP)
- Extracts structured data: Names, SSNs, addresses, dates, etc.
- Validates field formats: Ensures data meets expected patterns
- Template support: Pre-configured templates for common forms
- Auto-detection: Can identify fields without templates
- Validation reporting: Shows what's missing or incorrect
cd backend
pip install docling
The requirements.txt already includes docling. If you need to install manually:
pip install -r requirements.txt
uvicorn server:app --host 0.0.0.0 --port 8000 --reload
POST /parse-document
Headers: X-API-Key: your_api_key
Content-Type: multipart/form-data
Form Data:
- file: Document file (PDF, PNG, JPG, etc.)
- template_type: Optional (employment_form, tax_form, address_verification)
Response Example:
{
  "success": true,
  "filename": "employment_form.pdf",
  "template_type": "employment_form",
  "page_count": 2,
  "fields": [
    {
      "page": 1,
      "title": "Employment Application Form",
      "question": "Full Name",
      "answer": "John Smith",
      "optional": false,
      "required": true,
      "field_type": "NAME",
      "format_expectation": ".*",
      "confidence": 0.8,
      "validation_status": "PASSED",
      "validation_errors": []
    }
  ],
  "validation_summary": {
    "total_fields": 5,
    "passed": 4,
    "failed": 1,
    "warnings": 0,
    "missing_required": 1,
    "validation_percentage": 80.0,
    "ready_for_processing": false
  }
}
GET /document-templates
Headers: X-API-Key: your_api_key
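The template listing endpoint can be called the same way as the other endpoints. The following is a minimal sketch using only the standard library; it assumes the local development URL used elsewhere in this document, and `build_templates_request` and `fetch_templates` are hypothetical helper names. The response schema is not documented here, so the sketch simply returns whatever JSON the server sends.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local dev server from the examples


def build_templates_request(api_key, base_url=BASE_URL):
    """Build (but do not send) a GET request for /document-templates."""
    return urllib.request.Request(
        f"{base_url}/document-templates",
        headers={"X-API-Key": api_key},
        method="GET",
    )


def fetch_templates(api_key, base_url=BASE_URL):
    """Send the request and decode the JSON body; no response schema is assumed."""
    request = build_templates_request(api_key, base_url)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)
```

Calling `fetch_templates("your_api_key")` requires the server to be running.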
POST /validate-document-field
Headers: X-API-Key: your_api_key
Content-Type: application/x-www-form-urlencoded
Form Data:
- field_type: SSN, EMAIL, PHONE, DATE, etc.
- field_value: Value to validate
- Full Name (required)
- Social Security Number (required)
- Address (required)
- Phone Number (required)
- Email Address (required)
- Emergency Contact (optional)
- Previous Employer (optional)
- SSN (required)
- Filing Status (required)
- Income (required)
- Street Address (required)
- City (required)
- State (required)
- ZIP Code (required)
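The required fields listed above can also be checked on the client side once a document has been parsed. The sketch below is illustrative only: it assumes the three lists correspond, in order, to the employment_form, tax_form, and address_verification templates, and that each extracted field carries its label under the "question" key as in the response example. `TEMPLATE_REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names, not part of the service.

```python
# Hypothetical client-side table; the template-to-field mapping is an assumption.
TEMPLATE_REQUIRED_FIELDS = {
    "employment_form": [
        "Full Name",
        "Social Security Number",
        "Address",
        "Phone Number",
        "Email Address",
    ],
    "tax_form": ["SSN", "Filing Status", "Income"],
    "address_verification": ["Street Address", "City", "State", "ZIP Code"],
}


def missing_required_fields(template_type, fields):
    """Return required labels that have no non-empty answer in the parsed fields."""
    answered = {
        field["question"]
        for field in fields
        if str(field.get("answer") or "").strip()
    }
    required = TEMPLATE_REQUIRED_FIELDS[template_type]
    return [label for label in required if label not in answered]


# Example: one answered field and one blank field from a parsed employment form.
parsed = [
    {"question": "Full Name", "answer": "John Smith"},
    {"question": "Address", "answer": ""},
]
print(missing_required_fields("employment_form", parsed))
```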
- TEXT: General text fields
- SSN: Social Security Numbers (XXX-XX-XXXX)
- DATE: Dates (MM/DD/YYYY or YYYY-MM-DD)
- EMAIL: Email addresses
- PHONE: Phone numbers (+1-XXX-XXX-XXXX)
- ADDRESS_*: Street, City, State, ZIP
- SIGNATURE: Signature fields
- CHECKBOX: Checkbox values
- NUMBER: Numeric values
- CURRENCY: Currency amounts
- NAME: Person names
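The formats named in the list above can be approximated with regular expressions. This is a rough sketch for local pre-checks, not the service's actual validation rules: the patterns are inferred from the shapes shown (XXX-XX-XXXX, MM/DD/YYYY or YYYY-MM-DD, +1-XXX-XXX-XXXX), and `FORMAT_PATTERNS` and `matches_format` are hypothetical names.

```python
import re

# Illustrative patterns only; the service's real validation rules may differ.
FORMAT_PATTERNS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "DATE": re.compile(r"\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2}"),
    "PHONE": re.compile(r"\+1-\d{3}-\d{3}-\d{4}"),
}


def matches_format(field_type, value):
    """Check a value against the documented shape for its field type.

    Field types without a pattern here (TEXT, NAME, ...) are accepted as-is.
    Only the shape is checked: a DATE is not verified against the calendar.
    """
    pattern = FORMAT_PATTERNS.get(field_type)
    if pattern is None:
        return True
    return pattern.fullmatch(value.strip()) is not None


print(matches_format("SSN", "123-45-6789"))
print(matches_format("PHONE", "555-123-4567"))
```

For an authoritative answer, the /validate-document-field endpoint remains the source of truth.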
Run the document parsing test:
python test_document_parsing.py
This will test:
- Health check and availability
- Template retrieval
- Field validation
- Document parsing with sample data
import requests

# Parse a document against the employment_form template.
headers = {"X-API-Key": "your_api_key"}
data = {"template_type": "employment_form"}
# Open the file in a context manager so the handle is closed after the upload.
with open("employment_form.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/parse-document",
        files={"file": f},
        data=data,
        headers=headers,
    )
result = response.json()
print(f"Validation: {result['validation_summary']['ready_for_processing']}")

# Validate a single field value.
response = requests.post(
    "http://localhost:8000/validate-document-field",
    data={
        "field_type": "SSN",
        "field_value": "123-45-6789",
    },
    headers={"X-API-Key": "your_api_key"},
)
print(f"Valid: {response.json()['is_valid']}")

Add document parsing to your Cortex.ai workflow:
Endpoint: https://your-app.onrender.com/parse-document
Method: POST (Multipart Form)
Headers: X-API-Key: your_api_key
Form Fields:
- file: Document file
- template_type: employment_form (or other template)
- Automated Data Extraction: No manual typing from forms
- Format Validation: Catches errors before processing
- Completeness Check: Identifies missing required fields
- Quality Assurance: Reduces data entry errors
- Processing Ready: Clear indication when documents are complete
- Template Flexibility: Supports various document types
- PASSED: Field is valid and properly formatted
- FAILED: Field has validation errors (wrong format, etc.)
- WARNING: Field is missing but optional
- PENDING: Field not yet validated
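The four statuses above can be tallied from the "fields" list of a parse response to reproduce a summary similar to "validation_summary". This is a sketch under stated assumptions: the rule used for "ready_for_processing" (no FAILED or PENDING fields) is a guess rather than the service's documented logic, the "pending" count is an extra not present in the response example, and `summarize_statuses` is a hypothetical helper.

```python
from collections import Counter


def summarize_statuses(fields):
    """Tally validation_status values into a dict shaped like validation_summary."""
    counts = Counter(f.get("validation_status", "PENDING") for f in fields)
    total = len(fields)
    passed = counts["PASSED"]
    failed = counts["FAILED"]
    pending = counts["PENDING"]
    return {
        "total_fields": total,
        "passed": passed,
        "failed": failed,
        "warnings": counts["WARNING"],
        "pending": pending,  # extra key, not in the documented response
        "validation_percentage": round(100.0 * passed / total, 1) if total else 0.0,
        # Assumption: ready only when nothing failed and nothing is still pending.
        "ready_for_processing": total > 0 and failed == 0 and pending == 0,
    }


# Four passing fields and one failure, as in the response example's summary.
sample = [{"validation_status": "PASSED"}] * 4 + [{"validation_status": "FAILED"}]
print(summarize_statuses(sample))
```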
- Install docling: pip install docling
- Restart the server
- Check health endpoint: /health
- Use higher quality scans
- Ensure text is clearly readable
- Consider using appropriate templates
- Check field format expectations
- Use /validate-document-field to test individual values
- Review validation error messages
- Custom template creation
- Machine learning for better field detection
- Batch document processing
- Integration with external validation services
- Advanced OCR for handwritten text
Your DocAlert system now has powerful document parsing capabilities!