Evaluators¶
Evaluation system for measuring extraction quality.
Evaluator Overview¶
| Evaluator | Alias | When to Use | Data Types | Speed | Accuracy |
|---|---|---|---|---|---|
| StringCheckEvaluator | exact | Precise values that must match exactly (IDs, codes, exact strings) | Strings | Fast | Exact |
| LevenshteinEvaluator | levenshtein | Text with minor spelling or formatting differences | Strings | Fast | Fuzzy |
| TextSimilarityEvaluator | text_similarity | Text where meaning matters more than exact wording | Strings | Medium | Semantic |
| ScoreJudge | score_judge | Numeric scores or ratings needing quality assessment | Numbers | Slow | LLM-based |
| LabelModelGrader | label_model_grader | Classification labels needing context-aware evaluation | Labels/Categories | Slow | LLM-based |
| PythonCodeEvaluator | python_code | Custom evaluation logic for complex business rules | Any | Medium | Custom |
| PredefinedScoreEvaluator | predefined_score | Pre-computed scores (no evaluation needed) | Any | Fastest | Pre-computed |
Quick Selection Guide¶
- Exact match needed? → Use exact (StringCheckEvaluator)
- Minor variations OK? → Use levenshtein (LevenshteinEvaluator)
- Semantic similarity? → Use text_similarity (TextSimilarityEvaluator)
- Complex evaluation? → Use score_judge or label_model_grader
- Custom logic? → Use python_code (PythonCodeEvaluator)
- Already have scores? → Use predefined_score (PredefinedScoreEvaluator)
API Reference¶
BaseEvaluator ¶
Bases: Protocol
Protocol for all evaluators.
All evaluators must implement the evaluate method that takes extracted and expected values and returns a score between 0.0 and 1.0.
Functions¶
evaluate ¶
Evaluate extracted value against expected value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | The extracted value to evaluate. | required |
| expected | Any | The expected value to compare against. | required |
| input_data | dict[str, Any] \| None | Optional input data dictionary for context. | None |
| field_path | str \| None | Optional field path (e.g., "name", "address.street") for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0, where 1.0 is a perfect match. |
Source code in src/dspydantic/evaluators/config.py
StringCheckEvaluator ¶
Evaluator that performs exact string matching.
Best for IDs, codes, enums, and other values that must match exactly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: case_sensitive (bool): whether comparison is case-sensitive (default: True); strip_whitespace (bool): whether to strip whitespace (default: True) | required |

Example

```python
>>> evaluator = StringCheckEvaluator(config={})
>>> evaluator.evaluate("ABC123", "ABC123")
1.0
>>> evaluator.evaluate("abc123", "ABC123")  # Case mismatch
0.0
>>> evaluator.evaluate(" ABC123 ", "ABC123")  # Whitespace stripped
1.0
```

Case-insensitive matching:

```python
>>> evaluator = StringCheckEvaluator(config={"case_sensitive": False})
>>> evaluator.evaluate("abc123", "ABC123")
1.0
```

Initialize StringCheckEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with case_sensitive and strip_whitespace options. | required |
Source code in src/dspydantic/evaluators/string_check.py
Functions¶
evaluate ¶
Evaluate using exact string matching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Score 1.0 if match, 0.0 otherwise. |
Source code in src/dspydantic/evaluators/string_check.py
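As a rough sketch of what exact matching with the documented options amounts to (an illustration of the behavior described above, not the library's actual implementation):

```python
from typing import Any


def string_check(
    extracted: Any,
    expected: Any,
    case_sensitive: bool = True,
    strip_whitespace: bool = True,
) -> float:
    """Exact-match scoring with the documented normalization options."""
    a, b = str(extracted), str(expected)
    if strip_whitespace:
        a, b = a.strip(), b.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0


print(string_check(" ABC123 ", "ABC123"))                      # 1.0
print(string_check("abc123", "ABC123"))                        # 0.0
print(string_check("abc123", "ABC123", case_sensitive=False))  # 1.0
```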
LevenshteinEvaluator ¶
Evaluator that uses Levenshtein distance for fuzzy string matching.
Useful when extracted values may have minor typos or formatting differences compared to expected values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: threshold (float): minimum similarity threshold (0-1, default: 0.0); values below the threshold return 0.0 | required |

Example

```python
>>> evaluator = LevenshteinEvaluator(config={})
>>> evaluator.evaluate("John Doe", "John Doe")
1.0
>>> evaluator.evaluate("Jon Doe", "John Doe")  # Minor typo
0.875
>>> evaluator.evaluate("Jane Smith", "John Doe")  # Very different
0.25
```

With threshold:

```python
>>> evaluator = LevenshteinEvaluator(config={"threshold": 0.8})
>>> evaluator.evaluate("Jon Doe", "John Doe")  # Above threshold
0.875
>>> evaluator.evaluate("Jane", "John")  # Below threshold, returns 0
0.0
```

Initialize LevenshteinEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with threshold option. | required |
Source code in src/dspydantic/evaluators/levenshtein.py
Functions¶
evaluate ¶
Evaluate using Levenshtein distance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Similarity score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/levenshtein.py
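The scores above follow from normalizing edit distance by the longer string's length. A minimal sketch of that computation, assuming the conventional dynamic-programming formulation (the library's implementation may differ in detail):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_score(extracted: str, expected: str, threshold: float = 0.0) -> float:
    """Normalized similarity; scores below the threshold collapse to 0.0."""
    if not extracted and not expected:
        return 1.0
    dist = levenshtein_distance(extracted, expected)
    score = 1.0 - dist / max(len(extracted), len(expected))
    return score if score >= threshold else 0.0


# "Jon Doe" -> "John Doe" is one insertion over 8 characters: 1 - 1/8
print(levenshtein_score("Jon Doe", "John Doe"))                 # 0.875
print(levenshtein_score("Jon Doe", "John Doe", threshold=0.9))  # 0.0
```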
TextSimilarityEvaluator ¶
Evaluator that uses embeddings for semantic similarity.
Best for text where meaning matters more than exact wording. Uses embedding models to compute cosine similarity between extracted and expected values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: model (str): embedding model (default: "sentence-transformers/all-MiniLM-L6-v2"); provider (str): "sentence-transformers" or "openai" (default: "sentence-transformers"); api_key (str): API key for the OpenAI provider; threshold (float): minimum similarity (0-1, default: 0.0) | required |

Raises:

| Type | Description |
|---|---|
| ImportError | If sentence-transformers is not installed when using that provider. |

Example

Requires: pip install sentence-transformers

```python
>>> evaluator = TextSimilarityEvaluator(config={})
>>> evaluator.evaluate("CEO", "Chief Executive Officer")
0.82  # Semantically similar
```

With OpenAI embeddings:

```python
>>> evaluator = TextSimilarityEvaluator(config={
...     "provider": "openai",
...     "model": "text-embedding-ada-002",
...     "api_key": "your-key"
... })
```

Initialize TextSimilarityEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with model, provider, api_key, threshold options. | required |
Source code in src/dspydantic/evaluators/text_similarity.py
Functions¶
evaluate ¶
Evaluate using semantic similarity via embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Similarity score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/text_similarity.py
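The embedding step requires a model, but the comparison step is plain cosine similarity. A self-contained sketch of that step, using toy 3-dimensional vectors in place of real model embeddings:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy "embeddings"; real vectors come from the configured embedding model
# and typically have hundreds of dimensions.
ceo = [0.9, 0.1, 0.2]
chief_executive_officer = [0.85, 0.15, 0.25]
print(cosine_similarity(ceo, chief_executive_officer))  # close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

Identical vectors score 1.0, orthogonal vectors 0.0, which is why semantically close phrases like "CEO" and "Chief Executive Officer" land near the top of the range.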
ScoreJudge ¶
Evaluator that uses an LLM to assign a numeric score.
Uses a language model to evaluate extraction quality when expected values are not available or when semantic judgment is needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: criteria (str): scoring criteria/prompt (default: "Rate the quality on a scale of 0-1"); lm (dspy.LM \| None): custom LM instance (default: uses dspy.settings.lm); temperature (float): LLM temperature (default: 0.0); system_prompt (str \| None): custom system prompt for the judge | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If no LM is available. |

Example

```python
>>> import dspy
>>> dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
>>> evaluator = ScoreJudge(config={
...     "criteria": "Rate how well the extracted summary captures the key points"
... })
>>> evaluator.evaluate(
...     extracted="Company reported strong Q3 earnings",
...     expected=None,  # No expected value - judge evaluates quality
...     input_data={"text": "Acme Corp announced record Q3 profits..."}
... )
0.85
```

Initialize ScoreJudge.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with criteria, lm, temperature, system_prompt options. | required |
Source code in src/dspydantic/evaluators/score_judge.py
Functions¶
evaluate ¶
Evaluate using LLM-based scoring.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/score_judge.py
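A judge must ultimately return a float in [0, 1] even though an LLM replies with free text. The helper below is a hypothetical illustration of that post-processing step (parsing and clamping), not the library's actual parsing logic:

```python
def clamp_judge_score(raw: str) -> float:
    """Hypothetical: turn a judge's text reply into the required 0.0-1.0 float."""
    try:
        value = float(raw.strip())
    except ValueError:
        return 0.0  # an unparseable reply is treated as a failed evaluation
    return min(1.0, max(0.0, value))  # clamp out-of-range replies


print(clamp_judge_score("0.85"))   # 0.85
print(clamp_judge_score("1.7"))    # 1.0 (clamped)
print(clamp_judge_score("great"))  # 0.0 (not a number)
```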
LabelModelGrader ¶
Evaluator that uses an LLM to compare categorical labels.
Best for classification fields where labels may have semantic equivalence (e.g., "urgent" vs "high priority") that exact matching would miss.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with: allowed_labels (list[str]): valid categorical labels (required); lm (dspy.LM \| None): custom LM instance (default: uses dspy.settings.lm); exact_match_score (float): score for exact matches (default: 1.0); partial_match_score (float): score for partial matches (default: 0.5) | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If allowed_labels is not provided or empty. |

Example

```python
>>> import dspy
>>> dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
>>> evaluator = LabelModelGrader(config={
...     "allowed_labels": ["positive", "neutral", "negative"]
... })
>>> evaluator.evaluate("positive", "positive")  # Exact match
1.0
>>> evaluator.evaluate("good", "positive")  # Semantic match via LLM
0.5
```

Initialize LabelModelGrader.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with allowed_labels, lm, exact_match_score, partial_match_score options. | required |
Source code in src/dspydantic/evaluators/label_model_grader.py
Functions¶
evaluate ¶
Evaluate using LLM-based label selection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value (should be one of allowed_labels). | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/label_model_grader.py
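The documented scoring scheme (exact match → exact_match_score, LLM-judged semantic match → partial_match_score, otherwise 0.0) can be sketched with the LLM call stubbed out. The synonym table below stands in for the model and is purely hypothetical:

```python
def grade_label(
    extracted: str,
    expected: str,
    allowed_labels: list[str],
    exact_match_score: float = 1.0,
    partial_match_score: float = 0.5,
) -> float:
    """Sketch of the documented scoring scheme with the LLM call stubbed out."""
    if not allowed_labels:
        raise ValueError("allowed_labels must be provided and non-empty")
    if extracted == expected:
        return exact_match_score
    # Stand-in for the LLM judgment: a tiny synonym table, for illustration only.
    synonyms = {"good": "positive", "bad": "negative"}
    if synonyms.get(extracted) == expected:
        return partial_match_score
    return 0.0


labels = ["positive", "neutral", "negative"]
print(grade_label("positive", "positive", labels))  # 1.0 (exact)
print(grade_label("good", "positive", labels))      # 0.5 (semantic, via stub)
print(grade_label("bad", "positive", labels))       # 0.0 (no match)
```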
PythonCodeEvaluator ¶
Evaluator that uses a callable for custom evaluation logic.
Use this when built-in evaluators don't match your requirements, such as domain-specific validation rules or complex business logic.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with: function (Callable): function that takes (extracted, expected, input_data, field_path) and returns a float score between 0.0 and 1.0 | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If 'function' is not provided or not callable. |
| RuntimeError | If the function raises an exception during evaluation. |

Example

```python
>>> def age_evaluator(extracted, expected, input_data=None, field_path=None):
...     if extracted == expected:
...         return 1.0
...     diff = abs(int(extracted) - int(expected))
...     return max(0.0, 1.0 - (diff / 10))
>>> evaluator = PythonCodeEvaluator(config={"function": age_evaluator})
>>> evaluator.evaluate(30, 30)
1.0
>>> evaluator.evaluate(28, 30)  # Off by 2 years
0.8
>>> evaluator.evaluate(20, 30)  # Off by 10 years
0.0
```

Initialize PythonCodeEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with 'function' key containing a callable. | required |
Source code in src/dspydantic/evaluators/python_code.py
Functions¶
evaluate ¶
Evaluate using the provided callable.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/python_code.py
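The documented validation and error behavior (ValueError at construction, RuntimeError wrapping failures during evaluation) can be sketched as a thin wrapper around the user callable. This is an illustrative stand-in, not the library's implementation:

```python
from typing import Any


class PythonCodeEvaluatorSketch:
    """Rough sketch of wrapping a user callable, per the documented behavior."""

    def __init__(self, config: dict[str, Any]):
        function = config.get("function")
        if not callable(function):
            raise ValueError("'function' must be provided and callable")
        self.function = function

    def evaluate(self, extracted, expected, input_data=None, field_path=None) -> float:
        try:
            return float(self.function(extracted, expected, input_data, field_path))
        except Exception as exc:
            # Documented behavior: user-function failures surface as RuntimeError.
            raise RuntimeError(f"custom evaluation function failed: {exc}") from exc


evaluator = PythonCodeEvaluatorSketch(
    config={"function": lambda e, x, i, f: 1.0 if e == x else 0.0}
)
print(evaluator.evaluate(30, 30))  # 1.0
```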
PredefinedScoreEvaluator ¶
Evaluator that uses pre-computed scores from a list.
This evaluator pops scores from a provided list in order as examples are evaluated. Useful when you already have ground truth scores and don't want to recompute them.
Supports:

- Float scores (0.0-1.0): used directly
- Bool values: True → 1.0, False → 0.0
- Numbers: normalized to the 0.0-1.0 range (assumes the max is 100 if not specified)
Thread-safe for parallel evaluation using thread-local storage.
Examples:

```python
# Float scores
scores = [0.95, 0.87, 0.92, 1.0, 0.78]
evaluator = PredefinedScoreEvaluator(config={"scores": scores})

# Bool values
bool_scores = [True, False, True, True]
evaluator = PredefinedScoreEvaluator(config={"scores": bool_scores})

# Numbers (normalized)
numeric_scores = [95, 87, 92, 100]
evaluator = PredefinedScoreEvaluator(config={"scores": numeric_scores, "max_value": 100})
```

Initialize PredefinedScoreEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] \| None | Configuration dictionary with: "scores": list of scores (float, bool, or numbers); "max_value": optional max value for normalization (default: 100) | None |
Source code in src/dspydantic/evaluators/predefined_score.py
Functions¶
evaluate ¶
Evaluate using pre-defined score.
This method ignores extracted/expected values and returns the next pre-defined score from the list.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | The extracted value (ignored). | required |
| expected | Any | The expected value (ignored). | required |
| input_data | dict[str, Any] \| None | Optional input data (ignored). | None |
| field_path | str \| None | Optional field path (ignored). | None |

Returns:

| Type | Description |
|---|---|
| float | Pre-defined score between 0.0 and 1.0. |
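The documented conversion rules (floats used directly, bools mapped to 0/1, other numbers divided by max_value) can be sketched as a single normalization helper. Edge-case handling here is an assumption for illustration; the library may differ:

```python
def normalize_score(raw, max_value: float = 100) -> float:
    """Sketch of the documented conversion rules for pre-computed scores."""
    if isinstance(raw, bool):      # check bool first: bool is a subclass of int
        return 1.0 if raw else 0.0
    value = float(raw)
    if 0.0 <= value <= 1.0:
        return value               # floats already in range are used directly
    return min(1.0, max(0.0, value / max_value))  # normalize, then clamp


print(normalize_score(0.95))  # 0.95
print(normalize_score(True))  # 1.0
print(normalize_score(87))    # 0.87
```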