Getting Started
Get from zero to optimized extraction in 5 minutes.
Prerequisites
Set your API key:
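One way to confirm the key is visible from Python before going further. This is a minimal sketch: it assumes the key lives in the `OPENAI_API_KEY` environment variable (the OpenAI convention, matching the `openai/` model IDs used in this guide), and `require_api_key` is a hypothetical helper, not part of dspydantic. Other providers use different variable names.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`.

    Assumption: the provider reads its key from this variable.
    Raises RuntimeError if the variable is missing or blank.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell, for example: "
            f"export {name}=<your key>"
        )
    return key
```

Calling `require_api_key()` once at startup fails fast with a readable message instead of surfacing the problem on the first API call.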
Step 1: Define Your Model
Create a Pydantic model describing what you want to extract:
from pydantic import BaseModel, Field
from typing import Literal

class JobPosting(BaseModel):
    """Extract structured data from job postings."""

    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: str | None = Field(description="Salary range if mentioned")
    experience_years: str | None = Field(description="Required years of experience")
    employment_type: Literal["full_time", "part_time", "contract", "internship"] = Field(
        description="Type of employment"
    )
    remote: bool = Field(description="Whether remote work is available")
    skills: list[str] = Field(description="Required skills or technologies")
Tips:
- Field descriptions guide the LLM, so be specific
- Use `Literal` for categorical fields
- Use `| None` for optional fields
- Lists work for multi-value fields
Step 2: Create Examples
Provide examples of input text and expected output:
from dspydantic import Example

examples = [
    Example(
        text="""
        Senior Software Engineer at TechCorp
        Location: San Francisco, CA (Hybrid - 3 days onsite)
        Salary: $180,000 - $220,000
        We're looking for an experienced engineer with 5+ years of experience
        in Python and cloud infrastructure. Strong background in AWS, Kubernetes,
        and CI/CD pipelines required.
        Full-time position with competitive benefits.
        """,
        expected_output={
            "title": "Senior Software Engineer",
            "company": "TechCorp",
            "location": "San Francisco, CA",
            "salary_range": "$180,000 - $220,000",
            "experience_years": "5+ years",
            "employment_type": "full_time",
            "remote": True,
            "skills": ["Python", "AWS", "Kubernetes", "CI/CD"]
        }
    ),
    Example(
        text="""
        Data Analyst Intern - FinanceHub
        NYC Office, No Remote
        3-month internship for current students. Must know SQL and Excel.
        Experience with Tableau is a plus.
        """,
        expected_output={
            "title": "Data Analyst Intern",
            "company": "FinanceHub",
            "location": "NYC Office",
            "salary_range": None,
            "experience_years": None,
            "employment_type": "internship",
            "remote": False,
            "skills": ["SQL", "Excel", "Tableau"]
        }
    ),
    Example(
        text="""
        Contract DevOps Engineer
        RemoteFirst Inc. | 100% Remote | $85-95/hr
        6-month contract. Looking for someone with 3 years experience in
        Terraform, Docker, and GitHub Actions. Azure certification preferred.
        """,
        expected_output={
            "title": "Contract DevOps Engineer",
            "company": "RemoteFirst Inc.",
            "location": "100% Remote",
            "salary_range": "$85-95/hr",
            "experience_years": "3 years",
            "employment_type": "contract",
            "remote": True,
            "skills": ["Terraform", "Docker", "GitHub Actions", "Azure"]
        }
    ),
]
How many examples?
- 5-10: Good for simple models
- 10-20: Recommended for most cases
- 20+: For complex schemas or edge cases
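Because optimization scores are computed against your examples, a wrong `expected_output` can quietly hurt results. The sketch below is one way to sanity-check each dict before wrapping it in `Example`. It uses only the standard library; `check_expected_output`, `EmploymentType`, and `JOB_POSTING_FIELDS` are hypothetical names that mirror the `JobPosting` model from Step 1 and are not part of dspydantic.

```python
from typing import Any, Literal, get_args

# Mirrors the employment_type annotation from Step 1.
EmploymentType = Literal["full_time", "part_time", "contract", "internship"]

# Field names of the JobPosting model from Step 1.
JOB_POSTING_FIELDS = {
    "title", "company", "location", "salary_range",
    "experience_years", "employment_type", "remote", "skills",
}


def check_expected_output(expected: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one expected_output dict."""
    problems = []
    missing = JOB_POSTING_FIELDS - set(expected)
    extra = set(expected) - JOB_POSTING_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    # get_args() lists the values allowed by the Literal type.
    if expected.get("employment_type") not in get_args(EmploymentType):
        problems.append(f"invalid employment_type: {expected.get('employment_type')!r}")
    if not isinstance(expected.get("remote"), bool):
        problems.append("remote must be True or False")
    return problems
```

An empty list means the dict passed these checks. With Pydantic v2, the field set could instead be derived from `JobPosting.model_fields` so it never drifts from the model.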
Step 3: Optimize
from dspydantic import Prompter

prompter = Prompter(
    model=JobPosting,
    model_id="openai/gpt-4o-mini",
)

result = prompter.optimize(examples=examples, verbose=True)
Optimization takes 1-5 minutes depending on example count. With `verbose=True`, you'll see real-time progress, including:
- Rich-formatted headers showing your model and optimization mode
- Field-by-field optimization progress with improved/unchanged indicators
- The actual optimized descriptions and prompts
- Final summary table with scores and API costs
See Configure Optimizations for options like `sequential`, `early_stopping_patience`, `auto_generate_prompts`, `compile_kwargs`, `max_val_examples`, `include_fields`, `exclude_fields`, and custom progress callbacks.
Step 4: Check Results
print(f"Before: {result.baseline_score:.0%}")
print(f"After: {result.optimized_score:.0%}")
print(f"API calls: {result.api_calls}")
print(f"Tokens: {result.total_tokens:,}")
Typical output:
View optimized descriptions:
Step 5: Extract
Use your optimized prompter:
job = prompter.run("""
    ML Engineer - AI Startup
    Boston, MA or Remote
    $150K-200K base + equity
    Join our team building next-gen recommendation systems.
    Need 4+ years with PyTorch, transformers, and production ML.
    Full-time. Start immediately.
""")

print(job)
# JobPosting(
#     title='ML Engineer',
#     company='AI Startup',
#     location='Boston, MA or Remote',
#     salary_range='$150K-200K base + equity',
#     experience_years='4+ years',
#     employment_type='full_time',
#     remote=True,
#     skills=['PyTorch', 'transformers', 'production ML']
# )
Step 6: Save for Production
# Save the optimized prompter
prompter.save("./job_parser")

# Later, in production:
prompter = Prompter.load(
    "./job_parser",
    model=JobPosting,
    model_id="openai/gpt-4o-mini"
)

job = prompter.run(new_posting_text)
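In production, one malformed posting usually should not abort a whole batch. The following is a minimal sketch, assuming extraction failures surface as ordinary Python exceptions (this guide does not say which exceptions `prompter.run` raises). `run_all` is a hypothetical helper that accepts any callable, so `prompter.run` can be passed in directly.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_all(
    run: Callable[[str], T], texts: Iterable[str]
) -> tuple[list[T], list[tuple[str, Exception]]]:
    """Apply `run` to each text, collecting results and failures separately."""
    results: list[T] = []
    failures: list[tuple[str, Exception]] = []
    for text in texts:
        try:
            results.append(run(text))
        except Exception as exc:  # assumption: failures raise ordinary exceptions
            failures.append((text, exc))
    return results, failures
```

For example, `jobs, failed = run_all(prompter.run, posting_texts)`, where `posting_texts` is your own list of strings. The `failed` list keeps each input next to its exception so it can be logged or retried.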
Quick Reference
| Method | Purpose |
|---|---|
| `Prompter(model, model_id)` | Create prompter |
| `prompter.optimize(examples)` | Optimize with examples |
| `prompter.run(text)` | Extract from text |
| `prompter.predict_batch(texts)` | Batch extraction |
| `prompter.save(path)` | Save optimized state |
| `Prompter.load(path, model, model_id)` | Load saved prompter |
Next Steps
| Topic | Guide |
|---|---|
| Different input types | Modalities |
| Images and PDFs | Modalities - Images/PDFs |
| Customize evaluation | Configure Evaluators |
| Complex schemas | Nested Models |
| Production deployment | Save and Load |
| Integration patterns | Integration Patterns |
Troubleshooting
Low accuracy after optimization?
- Add more diverse examples
- Check that examples are correct
- Try a more capable model (`gpt-4o` vs `gpt-4o-mini`)
Optimization takes too long?
- Reduce example count for initial testing
- Use `gpt-4o-mini` for faster iterations
- Use `compile_kwargs={"num_trials": 5}` to limit MIPROv2 trials
- Use `early_stopping_patience=2` in sequential mode
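When reducing the example count for initial testing, a reproducible subset keeps quick runs comparable with each other. This is a minimal sketch using only the standard library; `sample_examples` is a hypothetical helper and the default seed is arbitrary.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def sample_examples(examples: Sequence[T], k: int, seed: int = 0) -> list[T]:
    """Return up to `k` examples, chosen reproducibly for a given seed."""
    if k >= len(examples):
        return list(examples)
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(examples, k)
```

A quick trial run could then look like `prompter.optimize(examples=sample_examples(examples, 5))`, switching back to the full list once the setup works.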
API key issues?
See Configure Models for more options.