Skip to content

When to Use Which Evaluator

This guide helps you choose the right evaluator for your optimization needs. Evaluators measure how well extracted data matches expected data, guiding the optimization process.

Quick Decision Tree

flowchart TD
    A[Need to Evaluate Field] --> B{Exact Match Required?}
    B -->|Yes| C[Use Exact Evaluator]
    B -->|No| D{Minor Variations OK?}
    D -->|Yes| E[Use Levenshtein]
    D -->|No| F{Meaning Matters?}
    F -->|Yes| G[Use Text Similarity]
    F -->|No| H{Complex Logic?}
    H -->|Yes| I[Use LLM-based or Custom]
    H -->|No| J[Use Exact]

Evaluator Comparison

Evaluator Speed Accuracy Use Case Data Types Best For
Exact (exact) Fast Exact Precise values that must match exactly Strings IDs, codes, exact strings
Levenshtein (levenshtein) Fast Fuzzy Text with minor spelling/formatting differences Strings Names, addresses with typos
Text Similarity (text_similarity) Medium Semantic Text where meaning matters more than exact wording Strings Descriptions, summaries
Score Judge (score_judge) Slow LLM-based Numeric scores needing quality assessment Numbers Ratings, scores
Label Model Grader (label_model_grader) Slow LLM-based Classification labels needing context-aware evaluation Labels/Categories Sentiment, categories
Python Code (python_code) Medium Custom Custom evaluation logic for complex business rules Any Business rules, thresholds
Predefined Score (predefined_score) Fastest Pre-computed Pre-computed scores (no evaluation needed) Any Ground truth exists

Field Type → Evaluator Mapping

Field Type Recommended Evaluator Alternative Options Reason
IDs, SKUs, Codes exact - Must match exactly
Names, Addresses levenshtein exact if no typos expected Handles minor variations
Descriptions, Summaries text_similarity levenshtein for short text Meaning matters more than wording
Ratings, Scores score_judge python_code for simple rules Context-aware quality assessment
Categories, Labels label_model_grader exact if unambiguous Context-dependent classification
Numeric Values python_code exact for integers Custom thresholds, ranges
Timestamps, Dates exact python_code for formats Standardized formats

Performance Comparison

Evaluator Speed API Calls Cost Best For
exact Fastest None Free Most fields
levenshtein Fast None Free Text with variations
text_similarity Medium Embedding API Low Semantic matching
score_judge Slow LLM API High Complex scoring
label_model_grader Slow LLM API High Complex classification
python_code Medium None Free Custom logic
predefined_score Fastest None Free Pre-computed scores

Common Patterns

Pattern 1: Most Fields Exact, Some Semantic

evaluator_config = {
    "default": "exact",  # Most fields
    "field_overrides": {
        "description": "text_similarity",  # Meaning matters
    },
}

Use when: Most fields are precise, but descriptions need semantic matching.

Pattern 2: Text Fields with Variations

evaluator_config = {
    "default": "levenshtein",  # Handle typos
    "field_overrides": {
        "id": "exact",  # IDs must match exactly
    },
}

Use when: Text fields may have typos, but IDs must be exact.

Pattern 3: Complex Evaluation

evaluator_config = {
    "default": "exact",
    "field_overrides": {
        "rating": "score_judge",  # Context-aware
        "sentiment": "label_model_grader",  # Context-aware
    },
}

Use when: Some fields need LLM-based evaluation for context.

Pattern 4: Custom Business Rules

evaluator_config = {
    "default": "exact",
    "field_overrides": {
        "age": {
            "type": "python_code",
            "config": {
                "code": "def evaluate(extracted, expected): return 1.0 if abs(extracted - expected) <= 2 else 0.0",
            },
        },
    },
}

Use when: You have specific business rules (e.g., age within 2 years is acceptable).

Quick Reference

Need Evaluator Example
Exact match exact Product SKU: "SKU-12345"
Handle typos levenshtein Name: "John Smith" vs "Jon Smith"
Semantic similarity text_similarity Description: "excellent camera" vs "great photo quality"
Rate quality score_judge Rating: 4.5 vs 5.0 (is 4.5 reasonable?)
Classify context label_model_grader Sentiment: "positive" (is it correct for this review?)
Custom rules python_code Age: within 2 years is acceptable
Pre-computed predefined_score Already have scores from previous evaluation

Tips

  • Start with exact for most fields - it's fastest and most reliable
  • Use levenshtein when you expect minor variations
  • Use text_similarity for longer text where meaning matters
  • Use LLM-based evaluators sparingly - they're slower and more expensive
  • Use predefined_score when you already have ground truth scores
  • Test evaluator choice with a few examples before full optimization

See Also