Evaluators¶
Evaluation system for measuring extraction quality.
Evaluator Overview¶
| Evaluator | Alias | When to Use | Data Types | Speed | Accuracy |
|---|---|---|---|---|---|
| StringCheckEvaluator | exact | Precise values that must match exactly (IDs, codes, exact strings) | Strings | Fast | Exact |
| LevenshteinEvaluator | levenshtein | Text with minor spelling or formatting differences | Strings | Fast | Fuzzy |
| TextSimilarityEvaluator | text_similarity | Text where meaning matters more than exact wording | Strings | Medium | Semantic |
| ScoreJudge | score_judge | Numeric scores or ratings needing quality assessment | Numbers | Slow | LLM-based |
| LabelModelGrader | label_model_grader | Classification labels needing context-aware evaluation | Labels/Categories | Slow | LLM-based |
| PythonCodeEvaluator | python_code | Custom evaluation logic for complex business rules | Any | Medium | Custom |
| PredefinedScoreEvaluator | predefined_score | Pre-computed scores (no evaluation needed) | Any | Fastest | Pre-computed |
Quick Selection Guide¶
- Exact match needed? → Use exact (StringCheckEvaluator)
- Minor variations OK? → Use levenshtein (LevenshteinEvaluator)
- Semantic similarity? → Use text_similarity (TextSimilarityEvaluator)
- Complex evaluation? → Use score_judge or label_model_grader
- Custom logic? → Use python_code (PythonCodeEvaluator)
- Already have scores? → Use predefined_score (PredefinedScoreEvaluator)
API Reference¶
BaseEvaluator ¶
Bases: Protocol
Protocol for all evaluators.
All evaluators must implement the evaluate method that takes extracted and expected values and returns a score between 0.0 and 1.0.
Functions¶
evaluate ¶
Evaluate extracted value against expected value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | The extracted value to evaluate. | required |
| expected | Any | The expected value to compare against. | required |
| input_data | dict[str, Any] \| None | Optional input data dictionary for context. | None |
| field_path | str \| None | Optional field path (e.g., "name", "address.street") for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0, where 1.0 is a perfect match. |
Source code in src/dspydantic/evaluators/config.py
StringCheckEvaluator ¶
Evaluator that performs exact string matching.
Best for IDs, codes, enums, and other values that must match exactly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: case_sensitive (bool): whether comparison is case-sensitive (default: True); strip_whitespace (bool): whether to strip whitespace (default: True) | required |

Example

```python
>>> evaluator = StringCheckEvaluator(config={})
>>> evaluator.evaluate("ABC123", "ABC123")
1.0
>>> evaluator.evaluate("abc123", "ABC123")  # Case mismatch
0.0
>>> evaluator.evaluate(" ABC123 ", "ABC123")  # Whitespace stripped
1.0
```

Case-insensitive matching:

```python
>>> evaluator = StringCheckEvaluator(config={"case_sensitive": False})
>>> evaluator.evaluate("abc123", "ABC123")
1.0
```

Initialize StringCheckEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with case_sensitive and strip_whitespace options. | required |
Source code in src/dspydantic/evaluators/string_check.py
Functions¶
evaluate ¶
Evaluate using exact string matching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Score 1.0 if match, 0.0 otherwise. |
Source code in src/dspydantic/evaluators/string_check.py
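As a rough sketch of what exact matching with the documented options amounts to (an illustration of the behavior described above, not the library's actual implementation):

```python
from typing import Any


def string_check(
    extracted: Any,
    expected: Any,
    case_sensitive: bool = True,
    strip_whitespace: bool = True,
) -> float:
    """Exact-match scoring with the documented normalization options."""
    a, b = str(extracted), str(expected)
    if strip_whitespace:
        a, b = a.strip(), b.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0


print(string_check(" ABC123 ", "ABC123"))                      # 1.0
print(string_check("abc123", "ABC123"))                        # 0.0
print(string_check("abc123", "ABC123", case_sensitive=False))  # 1.0
```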
LevenshteinEvaluator ¶
Evaluator that uses Levenshtein distance for fuzzy string matching.
Useful when extracted values may have minor typos or formatting differences compared to expected values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: threshold (float): minimum similarity threshold (0-1, default: 0.0); values below the threshold return 0.0 | required |

Example

```python
>>> evaluator = LevenshteinEvaluator(config={})
>>> evaluator.evaluate("John Doe", "John Doe")
1.0
>>> evaluator.evaluate("Jon Doe", "John Doe")  # Minor typo
0.875
>>> evaluator.evaluate("Jane Smith", "John Doe")  # Very different
0.25
```

With threshold:

```python
>>> evaluator = LevenshteinEvaluator(config={"threshold": 0.8})
>>> evaluator.evaluate("Jon Doe", "John Doe")  # Above threshold
0.875
>>> evaluator.evaluate("Jane", "John")  # Below threshold, returns 0
0.0
```

Initialize LevenshteinEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with threshold option. | required |
Source code in src/dspydantic/evaluators/levenshtein.py
Functions¶
evaluate ¶
Evaluate using Levenshtein distance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Similarity score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/levenshtein.py
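The scores above follow from normalizing edit distance by the longer string's length. A minimal sketch of that computation, assuming the conventional dynamic-programming formulation (the library's implementation may differ in detail):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_score(extracted: str, expected: str, threshold: float = 0.0) -> float:
    """Normalized similarity; scores below the threshold collapse to 0.0."""
    if not extracted and not expected:
        return 1.0
    dist = levenshtein_distance(extracted, expected)
    score = 1.0 - dist / max(len(extracted), len(expected))
    return score if score >= threshold else 0.0


# "Jon Doe" -> "John Doe" is one insertion over 8 characters: 1 - 1/8
print(levenshtein_score("Jon Doe", "John Doe"))                 # 0.875
print(levenshtein_score("Jon Doe", "John Doe", threshold=0.9))  # 0.0
```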
TextSimilarityEvaluator ¶
Evaluator that uses embeddings for semantic similarity.
Best for text where meaning matters more than exact wording. Uses embedding models to compute cosine similarity between extracted and expected values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: model (str): embedding model (default: "sentence-transformers/all-MiniLM-L6-v2"); provider (str): "sentence-transformers" or "openai" (default: "sentence-transformers"); api_key (str): API key for the OpenAI provider; threshold (float): minimum similarity (0-1, default: 0.0) | required |

Raises:

| Type | Description |
|---|---|
| ImportError | If sentence-transformers is not installed when using that provider. |

Example

Requires: pip install sentence-transformers

```python
>>> evaluator = TextSimilarityEvaluator(config={})
>>> evaluator.evaluate("CEO", "Chief Executive Officer")
0.82  # Semantically similar
```

With OpenAI embeddings:

```python
>>> evaluator = TextSimilarityEvaluator(config={
...     "provider": "openai",
...     "model": "text-embedding-ada-002",
...     "api_key": "your-key"
... })
```

Initialize TextSimilarityEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with model, provider, api_key, threshold options. | required |
Source code in src/dspydantic/evaluators/text_similarity.py
Functions¶
evaluate ¶
Evaluate using semantic similarity via embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data (not used). | None |
| field_path | str \| None | Optional field path (not used). | None |

Returns:

| Type | Description |
|---|---|
| float | Similarity score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/text_similarity.py
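The embedding step requires a model, but the comparison step is plain cosine similarity. A self-contained sketch of that step, using toy 3-dimensional vectors in place of real model embeddings:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy "embeddings"; real vectors come from the configured embedding model
# and typically have hundreds of dimensions.
ceo = [0.9, 0.1, 0.2]
chief_executive_officer = [0.85, 0.15, 0.25]
print(cosine_similarity(ceo, chief_executive_officer))  # close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

Identical vectors score 1.0, orthogonal vectors 0.0, which is why semantically close phrases like "CEO" and "Chief Executive Officer" land near the top of the range.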
ScoreJudge ¶
Evaluator that uses an LLM to assign a numeric score.
Uses a language model to evaluate extraction quality when expected values are not available or when semantic judgment is needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with options: criteria (str): scoring criteria/prompt (default: "Rate the quality on a scale of 0-1"); lm (dspy.LM \| None): custom LM instance (default: uses dspy.settings.lm); temperature (float): LLM temperature (default: 0.0); system_prompt (str \| None): custom system prompt for the judge | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If no LM is available. |

Example

```python
>>> import dspy
>>> dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
>>> evaluator = ScoreJudge(config={
...     "criteria": "Rate how well the extracted summary captures the key points"
... })
>>> evaluator.evaluate(
...     extracted="Company reported strong Q3 earnings",
...     expected=None,  # No expected value - judge evaluates quality
...     input_data={"text": "Acme Corp announced record Q3 profits..."}
... )
0.85
```

Initialize ScoreJudge.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with criteria, lm, temperature, system_prompt options. | required |
Source code in src/dspydantic/evaluators/score_judge.py
Functions¶
evaluate ¶
Evaluate using LLM-based scoring.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/score_judge.py
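A judge must ultimately return a float in [0, 1] even though an LLM replies with free text. The helper below is a hypothetical illustration of that post-processing step (parsing and clamping), not the library's actual parsing logic:

```python
def clamp_judge_score(raw: str) -> float:
    """Hypothetical: turn a judge's text reply into the required 0.0-1.0 float."""
    try:
        value = float(raw.strip())
    except ValueError:
        return 0.0  # an unparseable reply is treated as a failed evaluation
    return min(1.0, max(0.0, value))  # clamp out-of-range replies


print(clamp_judge_score("0.85"))   # 0.85
print(clamp_judge_score("1.7"))    # 1.0 (clamped)
print(clamp_judge_score("great"))  # 0.0 (not a number)
```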
LabelModelGrader ¶
Evaluator that uses an LLM to compare categorical labels.
Best for classification fields where labels may have semantic equivalence (e.g., "urgent" vs "high priority") that exact matching would miss.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with: allowed_labels (list[str]): valid categorical labels (required); lm (dspy.LM \| None): custom LM instance (default: uses dspy.settings.lm); exact_match_score (float): score for exact matches (default: 1.0); partial_match_score (float): score for partial matches (default: 0.5) | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If allowed_labels is not provided or empty. |

Example

```python
>>> import dspy
>>> dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
>>> evaluator = LabelModelGrader(config={
...     "allowed_labels": ["positive", "neutral", "negative"]
... })
>>> evaluator.evaluate("positive", "positive")  # Exact match
1.0
>>> evaluator.evaluate("good", "positive")  # Semantic match via LLM
0.5
```

Initialize LabelModelGrader.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with allowed_labels, lm, exact_match_score, partial_match_score options. | required |
Source code in src/dspydantic/evaluators/label_model_grader.py
Functions¶
evaluate ¶
Evaluate using LLM-based label selection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value (should be one of allowed_labels). | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/label_model_grader.py
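The documented scoring scheme (exact match → exact_match_score, LLM-judged semantic match → partial_match_score, otherwise 0.0) can be sketched with the LLM call stubbed out. The synonym table below stands in for the model and is purely hypothetical:

```python
def grade_label(
    extracted: str,
    expected: str,
    allowed_labels: list[str],
    exact_match_score: float = 1.0,
    partial_match_score: float = 0.5,
) -> float:
    """Sketch of the documented scoring scheme with the LLM call stubbed out."""
    if not allowed_labels:
        raise ValueError("allowed_labels must be provided and non-empty")
    if extracted == expected:
        return exact_match_score
    # Stand-in for the LLM judgment: a tiny synonym table, for illustration only.
    synonyms = {"good": "positive", "bad": "negative"}
    if synonyms.get(extracted) == expected:
        return partial_match_score
    return 0.0


labels = ["positive", "neutral", "negative"]
print(grade_label("positive", "positive", labels))  # 1.0 (exact)
print(grade_label("good", "positive", labels))      # 0.5 (semantic, via stub)
print(grade_label("bad", "positive", labels))       # 0.0 (no match)
```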
PythonCodeEvaluator ¶
Evaluator that uses a callable for custom evaluation logic.
Use this when built-in evaluators don't match your requirements, such as domain-specific validation rules or complex business logic.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with: function (Callable): function that takes (extracted, expected, input_data, field_path) and returns a float score between 0.0 and 1.0 | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If 'function' is not provided or not callable. |
| RuntimeError | If the function raises an exception during evaluation. |

Example

```python
>>> def age_evaluator(extracted, expected, input_data=None, field_path=None):
...     if extracted == expected:
...         return 1.0
...     diff = abs(int(extracted) - int(expected))
...     return max(0.0, 1.0 - (diff / 10))
>>> evaluator = PythonCodeEvaluator(config={"function": age_evaluator})
>>> evaluator.evaluate(30, 30)
1.0
>>> evaluator.evaluate(28, 30)  # Off by 2 years
0.8
>>> evaluator.evaluate(20, 30)  # Off by 10 years
0.0
```

Initialize PythonCodeEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with 'function' key containing a callable. | required |
Source code in src/dspydantic/evaluators/python_code.py
Functions¶
evaluate ¶
Evaluate using the provided callable.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | Extracted value. | required |
| expected | Any | Expected value. | required |
| input_data | dict[str, Any] \| None | Optional input data for context. | None |
| field_path | str \| None | Optional field path for context. | None |

Returns:

| Type | Description |
|---|---|
| float | Score between 0.0 and 1.0. |
Source code in src/dspydantic/evaluators/python_code.py
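The documented validation and error behavior (ValueError at construction, RuntimeError wrapping failures during evaluation) can be sketched as a thin wrapper around the user callable. This is an illustrative stand-in, not the library's implementation:

```python
from typing import Any


class PythonCodeEvaluatorSketch:
    """Rough sketch of wrapping a user callable, per the documented behavior."""

    def __init__(self, config: dict[str, Any]):
        function = config.get("function")
        if not callable(function):
            raise ValueError("'function' must be provided and callable")
        self.function = function

    def evaluate(self, extracted, expected, input_data=None, field_path=None) -> float:
        try:
            return float(self.function(extracted, expected, input_data, field_path))
        except Exception as exc:
            # Documented behavior: user-function failures surface as RuntimeError.
            raise RuntimeError(f"custom evaluation function failed: {exc}") from exc


evaluator = PythonCodeEvaluatorSketch(
    config={"function": lambda e, x, i, f: 1.0 if e == x else 0.0}
)
print(evaluator.evaluate(30, 30))  # 1.0
```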
PredefinedScoreEvaluator ¶
Evaluator that uses pre-computed scores from a list.
This evaluator pops scores from a provided list in order as examples are evaluated. Useful when you already have ground truth scores and don't want to recompute them.
Supports:

- Float scores (0.0-1.0): used directly
- Bool values: True → 1.0, False → 0.0
- Numbers: normalized to the 0.0-1.0 range (assumes the max is 100 if not specified)
Thread-safe for parallel evaluation using thread-local storage.
Examples:

```python
# Float scores
scores = [0.95, 0.87, 0.92, 1.0, 0.78]
evaluator = PredefinedScoreEvaluator(config={"scores": scores})

# Bool values
bool_scores = [True, False, True, True]
evaluator = PredefinedScoreEvaluator(config={"scores": bool_scores})

# Numbers (normalized)
numeric_scores = [95, 87, 92, 100]
evaluator = PredefinedScoreEvaluator(config={"scores": numeric_scores, "max_value": 100})
```

Initialize PredefinedScoreEvaluator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] \| None | Configuration dictionary with: "scores": list of scores (float, bool, or numbers); "max_value": optional max value for normalization (default: 100) | None |
Source code in src/dspydantic/evaluators/predefined_score.py
Functions¶
evaluate ¶
Evaluate using pre-defined score.
This method ignores extracted/expected values and returns the next pre-defined score from the list.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extracted | Any | The extracted value (ignored). | required |
| expected | Any | The expected value (ignored). | required |
| input_data | dict[str, Any] \| None | Optional input data (ignored). | None |
| field_path | str \| None | Optional field path (ignored). | None |

Returns:

| Type | Description |
|---|---|
| float | Pre-defined score between 0.0 and 1.0. |
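The documented conversion rules (floats used directly, bools mapped to 0/1, other numbers divided by max_value) can be sketched as a single normalization helper. Edge-case handling here is an assumption for illustration; the library may differ:

```python
def normalize_score(raw, max_value: float = 100) -> float:
    """Sketch of the documented conversion rules for pre-computed scores."""
    if isinstance(raw, bool):      # check bool first: bool is a subclass of int
        return 1.0 if raw else 0.0
    value = float(raw)
    if 0.0 <= value <= 1.0:
        return value               # floats already in range are used directly
    return min(1.0, max(0.0, value / max_value))  # normalize, then clamp


print(normalize_score(0.95))  # 0.95
print(normalize_score(True))  # 1.0
print(normalize_score(87))    # 0.87
```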