Configure Evaluators

This guide shows you how to configure evaluators for different fields and use cases. Evaluators guide the optimization process by measuring how well extracted data matches expected data.

Problem

You need different evaluation strategies for different fields, or want to use pre-computed scores instead of running evaluations during optimization.

Solution

Use evaluator_config to configure evaluators per field or use PredefinedScoreEvaluator for pre-computed scores.

Available Evaluator Types

| Evaluator | Alias | Use Case | Data Types | Speed |
| --- | --- | --- | --- | --- |
| exact | exact | Precise values that must match exactly | Strings | Fast |
| levenshtein | levenshtein | Text with minor spelling/formatting differences | Strings | Fast |
| text_similarity | text_similarity | Text where meaning matters more than exact wording | Strings | Medium |
| score_judge | score_judge | Numeric scores needing quality assessment | Numbers | Slow |
| label_model_grader | label_model_grader | Classification labels needing context-aware evaluation | Labels/Categories | Slow |
| python_code | python_code | Custom evaluation logic for complex business rules | Any | Medium |
| predefined_score | predefined_score | Pre-computed scores (no evaluation needed) | Any | Fastest |
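
As the examples later in this guide show, an alias can be referenced either as a bare string or as a dict with a type and an optional config. A minimal sketch of both forms:

# Shorthand: alias as a plain string (uses that evaluator's defaults)
evaluator_config = {"default": "levenshtein"}

# Longhand: alias plus an explicit config dict
evaluator_config = {
    "default": {
        "type": "levenshtein",
        "config": {},  # evaluator-specific options go here
    },
}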

Common Configuration Patterns

| Pattern | Configuration | Use Case |
| --- | --- | --- |
| Most Fields Exact | default: "exact" | Most fields need exact matching |
| Text with Variations | default: "levenshtein" | Text fields may have typos |
| Semantic Matching | default: "text_similarity" | Meaning matters more than wording |
| Mixed Strategy | default + field_overrides | Different fields need different evaluators |
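
For instance, the Mixed Strategy row corresponds to an evaluator_config shaped like the sketch below (the field name is a placeholder; the full example appears in Per-Field Evaluator Configuration):

evaluator_config = {
    # Applied to every field unless overridden
    "default": "exact",
    # Per-field overrides, keyed by field name
    "field_overrides": {
        "description": {
            "type": "text_similarity",
            "config": {"threshold": 0.7},
        },
    },
}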

Using Pre-defined Scores

If you already have scores, use PredefinedScoreEvaluator:

from dspydantic import Prompter
from dspydantic.evaluators import PredefinedScoreEvaluator

# Pre-computed scores
scores = [0.95, 0.87, 0.92, 1.0, 0.78]
evaluator = PredefinedScoreEvaluator(config={"scores": scores})

# Configure DSPy first
import dspy
lm = dspy.LM("openai/gpt-4o", api_key="your-api-key")
dspy.configure(lm=lm)

prompter = Prompter(model=MyModel)

result = prompter.optimize(
    examples=examples,
    evaluate_fn=evaluator,
)

Per-Field Evaluator Configuration

Configure different evaluators for different fields:

# Configure DSPy first
import dspy
lm = dspy.LM("openai/gpt-4o", api_key="your-api-key")
dspy.configure(lm=lm)

prompter = Prompter(model=User)

result = prompter.optimize(
    examples=examples,
    evaluator_config={
        "default": {
            "type": "exact",
            "config": {"case_sensitive": False},
        },
        "field_overrides": {
            "name": {
                "type": "exact",
                "config": {"case_sensitive": True},  # Names must match exactly
            },
            "description": {
                "type": "text_similarity",
                "config": {
                    "model": "sentence-transformers/all-MiniLM-L6-v2",
                    "threshold": 0.7,
                },
            },
            "rating": {
                "type": "score_judge",
                "config": {
                    "criteria": "Rate the quality of this rating on a scale of 0-1",
                    "temperature": 0.0,
                },
            },
        },
    },
)
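
For context, the User model referenced above might look something like the following sketch. The class and field definitions here are illustrative only; the field names simply match the overrides shown above.

from pydantic import BaseModel

class User(BaseModel):
    """Illustrative model; field names match the evaluator overrides."""
    name: str
    description: str
    rating: float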

Configuration Examples Table

| Field Type | Evaluator | Configuration | Reason |
| --- | --- | --- | --- |
| ID, SKU | exact | case_sensitive: True | Must match exactly |
| Name | exact | case_sensitive: False | Case variations OK |
| Description | text_similarity | threshold: 0.7 | Meaning matters |
| Rating | score_judge | Custom criteria | Context-aware |
| Age | python_code | Custom function | Business rules |

Custom Evaluator Class

Create a custom evaluator:

# Configure DSPy first
import dspy
lm = dspy.LM("openai/gpt-4o", api_key="your-api-key")
dspy.configure(lm=lm)

class ThresholdEvaluator:
    """Custom evaluator that checks if values are within a threshold."""

    def __init__(self, config: dict) -> None:
        self.threshold = config.get("threshold", 0.1)

    def evaluate(
        self,
        extracted: float,
        expected: float,
        input_data: dict | None = None,
        field_path: str | None = None,
    ) -> float:
        """Check if extracted value is within threshold of expected."""
        diff = abs(extracted - expected)
        return 1.0 if diff <= self.threshold else max(0.0, 1.0 - (diff / expected))

prompter = Prompter(model=RatingModel)

result = prompter.optimize(
    examples=examples,
    evaluator_config={
        "default": {
            "class": ThresholdEvaluator,
            "config": {"threshold": 0.05},
        },
    },
)
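
Because ThresholdEvaluator is a plain Python class, you can sanity-check its scoring directly before wiring it into optimize:

evaluator = ThresholdEvaluator({"threshold": 0.05})

print(evaluator.evaluate(extracted=4.02, expected=4.0))  # 1.0 (within threshold)
print(evaluator.evaluate(extracted=3.0, expected=4.0))   # 0.75 (penalized by relative error)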

Python Code Evaluator

Use a callable for custom evaluation logic:

def age_evaluator(extracted, expected, input_data=None, field_path=None):
    """Custom evaluation function for age field."""
    if field_path == "age":
        diff = abs(extracted - expected)
        if diff == 0:
            return 1.0
        elif diff <= 2:
            return 0.8
        else:
            return max(0.0, 1.0 - (diff / 10))
    # For other fields, use exact match
    return 1.0 if extracted == expected else 0.0

# Configure DSPy first
import dspy
lm = dspy.LM("openai/gpt-4o", api_key="your-api-key")
dspy.configure(lm=lm)

prompter = Prompter(model=SimpleUser)

result = prompter.optimize(
    examples=examples,
    evaluator_config={
        "default": "exact",
        "field_overrides": {
            "age": {
                "type": "python_code",
                "config": {
                    "function": age_evaluator,
                },
            },
        },
    },
)
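
The evaluator is a plain function, so its scoring can be checked directly as well:

print(age_evaluator(30, 30, field_path="age"))         # 1.0 (exact match)
print(age_evaluator(29, 30, field_path="age"))         # 0.8 (off by 2 or less)
print(age_evaluator(24, 30, field_path="age"))         # 0.4 (penalized: 1 - 6/10)
print(age_evaluator("Bob", "Bob", field_path="name"))  # 1.0 (other fields: exact match)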

Tips

  • Use default for most fields, override specific fields as needed
  • Pre-defined scores are fastest when you have ground truth
  • Text similarity works well for semantic matching
  • See When to Use Which for evaluator selection guidance
  • See Reference: Evaluators for all options

See Also