Overview
Goal: Ensure the AI correctly extracts structured data from unstructured text.
Time Investment: 30 minutes to first insights
The Challenge
You’re building a data extraction service for your business. Questions:
- How accurate should extraction be?
- What should happen if data is ambiguous?
- How should the AI handle missing information?
- What format should extracted data use?
Quick Example: Invoice Data Extraction
Step 1: Create Project
System Prompt:
Step 2: Add Scenarios
Step 3: Generate, Rate, Extract
Rate based on:
- Accuracy: Are extracted values correct?
- Completeness: Did it extract all available data?
- Handling Missing Data: Properly uses null or defaults?
- Format: Valid JSON, correct structure?
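The Format criterion is the easiest one to check mechanically before rating by hand. The sketch below is one possible check, not part of any tool's API; the field names in `EXPECTED_FIELDS` are assumptions based on the invoice fields tracked later in this guide.

```python
import json

# Hypothetical invoice schema; these key names are assumptions for illustration.
EXPECTED_FIELDS = {"invoice_number", "date", "total_amount", "vendor", "items"}

def check_format(raw_output: str) -> list[str]:
    """Return a list of format problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    missing = EXPECTED_FIELDS - set(data)
    extra = set(data) - EXPECTED_FIELDS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    return problems
```

An output that fails this check can be given a low Format rating without further reading. Accuracy and completeness still need a comparison against the source text.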
Step 4: Get Insights
Patterns reveal:
- All 5-star: Extract dates in ISO format
- All 5-star: Use null for missing fields
- Low-star: Infer missing data instead of using null
- Low-star: Inconsistent number formatting
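To compare outputs fairly while rating, it can help to normalize dates and amounts into one canonical form first. The helpers below are a minimal sketch under stated assumptions: the list of accepted input formats is invented for illustration, and ambiguous dates such as 03/04/2024 are read as month/day/year.

```python
from datetime import datetime
from typing import Optional

# Input date formats to try, in order. This list is an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")

def to_iso_date(text: Optional[str]) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if it is missing or unparseable."""
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def to_amount(text: Optional[str]) -> Optional[float]:
    """Parse strings like "$1,250.00" into a number; None when missing or unparseable."""
    if not text:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Both helpers return `None` rather than a guess when the input cannot be parsed, which mirrors the "use null for missing fields" pattern above.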
Scenarios by Extraction Type
Invoice/Receipt Extraction
Form Data Extraction
Entity Extraction
Structured Conversion
Key Metrics for Extraction
What Makes Extraction “5-Star”?
- Accuracy: Values are correct
- Completeness: Extracts all available data
- Proper Handling of Missing Data: Uses null, not guesses
- Consistent Formatting: All values in correct format
- Proper Type Conversion: Numbers as numbers, not strings
Common Failure Patterns
Pattern 1: Incorrect Values
- 5-star: Extracts correct number
- 1-star: Off-by-one or misread value
Pattern 2: Incomplete Extraction
- 5-star: Extracts all available fields
- 1-star: Skips optional fields that are present
Pattern 3: Mishandled Missing Data
- 5-star: Uses null for missing data
- 1-star: Invents reasonable-sounding values
Pattern 4: Inconsistent Formatting
- 5-star: All dates in ISO 8601 format
- 1-star: Mixed date formats
Pattern 5: Improper Type Conversion
- 5-star: "amount": 100 (number)
- 1-star: "amount": "100" (string)
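When a reference (golden) extraction exists, the 5-star versus 1-star contrasts above can be turned into per-field labels. The sketch below is one way to do that; the label names and the assumption that the date field is called `date` are illustrative, not taken from any tool.

```python
import re
from typing import Any

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def _kind(value: Any) -> str:
    """Group int and float together as 'number'; keep bool separate."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return type(value).__name__

def classify_field(name: str, extracted: Any, expected: Any) -> str:
    """Label one field as 'ok' or with the failure pattern it shows."""
    if expected is None:
        # Reference says the data is absent: anything else was invented.
        return "ok" if extracted is None else "invented_value"
    if extracted is None:
        return "missed_field"
    if _kind(extracted) != _kind(expected):
        return "wrong_type"
    if name == "date" and isinstance(extracted, str) and not ISO_DATE.match(extracted):
        return "inconsistent_format"
    if extracted != expected:
        return "incorrect_value"
    return "ok"

def classify(extracted: dict, expected: dict) -> dict:
    """Classify every field named in the reference extraction."""
    return {k: classify_field(k, extracted.get(k), v) for k, v in expected.items()}
```

Counting these labels across many scenarios shows which failure pattern dominates, which in turn suggests what rule to add to the prompt.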
Evaluation Tips
Validate Extracted Data:
- Check against source text (accurate?)
- Check completeness (did it get everything?)
- Check types (are they correct?)
- Check nulls (properly handles missing data?)
Test Edge Cases:
- Missing fields
- Ambiguous data
- Typos/misspellings
- Multiple formats
- Different languages
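Some of the checks in this section can be automated without a reference answer. The following is a rough sketch, assuming hypothetical field names (`invoice_number`, `vendor`, `total_amount`) and a small, invented set of placeholder strings; it is a starting point and does not replace reading the output against the source.

```python
from typing import Any

# Strings that suggest missing data was filled in instead of set to null.
# This set is an assumption; extend it for your own data.
PLACEHOLDERS = {"", "n/a", "unknown", "none"}

def validate_extraction(source_text: str, data: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one extraction; empty means none found."""
    problems = []
    # Accuracy: identifying strings should appear verbatim in the source text.
    for field in ("invoice_number", "vendor"):
        value = data.get(field)
        if isinstance(value, str) and value.lower() not in source_text.lower():
            problems.append(f"{field}: {value!r} not found in source")
    # Types: amounts should be numbers (bool is excluded on purpose).
    amount = data.get("total_amount")
    if amount is not None and (isinstance(amount, bool) or not isinstance(amount, (int, float))):
        problems.append("total_amount: expected a number or null")
    # Nulls: placeholder strings should have been null.
    for field, value in data.items():
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            problems.append(f"{field}: placeholder {value!r} used instead of null")
    return problems
```

Running this over scenarios that cover the edge cases above (missing fields, typos, mixed formats) gives a quick first pass before manual rating.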
Iteration Example
Iteration 1 (70% success):
- Issue: Hallucinating missing amounts
- Fix: Add “Use null for missing fields, don’t guess”
- Issue: Inconsistent date formats
- Fix: Add “Always return dates in ISO 8601 format: YYYY-MM-DD”
- Issue: Some numeric values as strings
- Fix: Add “Return amounts as numbers, not strings. Example: ‘amount’: 100”
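One way to keep these fixes manageable is to store each one as a separate rule and assemble the system prompt from the list, so a rule can be added or removed between iterations. This is only a sketch: `BASE_PROMPT` is a placeholder, not this guide's actual system prompt, and the rule wording is copied from the fixes above.

```python
# Placeholder base prompt; the real system prompt is not shown in this guide.
BASE_PROMPT = "Extract invoice data from the text and return it as JSON."

# Rules quoted from the fixes above, in the order they were added.
RULES = [
    "Use null for missing fields, don't guess.",
    "Always return dates in ISO 8601 format: YYYY-MM-DD.",
    'Return amounts as numbers, not strings. Example: "amount": 100',
]

def build_prompt(base: str, rules: list[str]) -> str:
    """Join the base prompt and a numbered list of rules into one prompt string."""
    lines = [base, "", "Rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    return "\n".join(lines)

# Prompt after the first fix only, then after all three.
prompt_after_first_fix = build_prompt(BASE_PROMPT, RULES[:1])
prompt_after_all_fixes = build_prompt(BASE_PROMPT, RULES)
```

Keeping the rules in a list also makes it easy to re-run the same scenarios against each prompt version and compare success rates.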
Validation Rules
After extraction, validate:
Export for Engineering
- Export golden examples (correct extractions)
- Extract patterns (formatting rules discovered)
- Use in:
  - Data validation rules
  - Test cases
  - Documentation
  - Error handling guidelines
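If golden examples are exported as plain records, they can be converted into a test fixture with a few lines of code. The sketch below assumes each record is a dict with hypothetical `input`, `output`, and `rating` keys; the actual export format may differ.

```python
import json
from typing import Iterable

def golden_to_jsonl(examples: Iterable[dict], min_rating: int = 5) -> str:
    """Serialize examples rated at least min_rating as JSON Lines (one case per line)."""
    lines = [
        json.dumps({"input": ex["input"], "expected": ex["output"]}, ensure_ascii=False)
        for ex in examples
        if ex["rating"] >= min_rating
    ]
    return "\n".join(lines)
```

Each line of the result is one test case: feed `input` to the extraction service and compare its output with `expected`.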
Performance Metrics
Track extraction accuracy:
| Field | Accuracy |
|---|---|
| Invoice Number | 98% |
| Date | 95% |
| Total Amount | 94% |
| Items | 92% |
| Vendor | 91% |
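A per-field table like the one above can be computed from pairs of extracted and reference results. The following is a minimal sketch, assuming exact-match scoring with strict type comparison; field names are illustrative.

```python
from collections import defaultdict
from typing import Iterable

def field_accuracy(pairs: Iterable[tuple[dict, dict]]) -> dict[str, float]:
    """pairs: (extracted, expected) dicts. Returns {field: fraction correct}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for extracted, expected in pairs:
        for field, want in expected.items():
            total[field] += 1
            got = extracted.get(field)
            # Require matching types so the string "100" does not count as 100.
            # Note this also treats 100 and 100.0 as different.
            if type(got) is type(want) and got == want:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Fields with the lowest accuracy are the natural place to focus the next prompt iteration.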