Extract Structured Data

Learn the complete workflow: define a schema, optimize with examples, and extract with high accuracy.

When to Use

  • You want structured output (a typed object, not just text)
  • Your data has multiple fields with specific types
  • You need validated extraction (Pydantic ensures correct types)

If you just want text output, see Extract Free-form Text instead.


Step 1: Define Your Model

Create a Pydantic model describing what you want to extract:

from pydantic import BaseModel, Field
from typing import Literal

class JobPosting(BaseModel):
    """Extract structured data from job postings."""
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: str | None = Field(description="Salary range if mentioned")
    experience_years: str | None = Field(description="Required years of experience")
    employment_type: Literal["full_time", "part_time", "contract", "internship"] = Field(
        description="Type of employment"
    )
    remote: bool = Field(description="Whether remote work is available")
    skills: list[str] = Field(description="Required skills or technologies")

Tips:

  • Field descriptions guide the LLM — be specific
  • Use Literal for categorical fields with known values
  • Use | None for fields the source text may not mention
  • Use lists for multi-value fields

Step 2: Create Examples

Pair each input text with its expected output, given as a dict whose keys match the model's fields:

from dspydantic import Example

examples = [
    Example(
        text="""
        Senior Software Engineer at TechCorp

        Location: San Francisco, CA (Hybrid - 3 days onsite)
        Salary: $180,000 - $220,000

        We're looking for an experienced engineer with 5+ years of experience
        in Python and cloud infrastructure. Strong background in AWS, Kubernetes,
        and CI/CD pipelines required.

        Full-time position with competitive benefits.
        """,
        expected_output={
            "title": "Senior Software Engineer",
            "company": "TechCorp",
            "location": "San Francisco, CA",
            "salary_range": "$180,000 - $220,000",
            "experience_years": "5+ years",
            "employment_type": "full_time",
            "remote": True,
            "skills": ["Python", "AWS", "Kubernetes", "CI/CD"]
        }
    ),
    Example(
        text="""
        Data Analyst Intern - FinanceHub

        NYC Office, No Remote

        3-month internship for current students. Must know SQL and Excel.
        Experience with Tableau is a plus.
        """,
        expected_output={
            "title": "Data Analyst Intern",
            "company": "FinanceHub",
            "location": "NYC Office",
            "salary_range": None,
            "experience_years": None,
            "employment_type": "internship",
            "remote": False,
            "skills": ["SQL", "Excel", "Tableau"]
        }
    ),
    Example(
        text="""
        Contract DevOps Engineer

        RemoteFirst Inc. | 100% Remote | $85-95/hr

        6-month contract. Looking for someone with 3 years experience in
        Terraform, Docker, and GitHub Actions. Azure certification preferred.
        """,
        expected_output={
            "title": "Contract DevOps Engineer",
            "company": "RemoteFirst Inc.",
            "location": "100% Remote",
            "salary_range": "$85-95/hr",
            "experience_years": "3 years",
            "employment_type": "contract",
            "remote": True,
            "skills": ["Terraform", "Docker", "GitHub Actions", "Azure"]
        }
    ),
]

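How many examples?

  • 5-10: Good for simple models
  • 10-20: Recommended for most cases
  • 20+: For complex schemas or edge cases

The three examples above only illustrate the format; add more before relying on the optimized result.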
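Before optimizing, it can help to confirm that every expected output actually satisfies the schema. The sketch below works on plain dicts rather than Example objects. The helper find_invalid_outputs and the Mini model are hypothetical names introduced for illustration, not part of dspydantic.

```python
from pydantic import BaseModel, ValidationError


def find_invalid_outputs(
    model: type[BaseModel], outputs: list[dict]
) -> list[tuple[int, str]]:
    """Return (index, error message) for every dict that fails validation."""
    problems = []
    for index, output in enumerate(outputs):
        try:
            model(**output)
        except ValidationError as exc:
            problems.append((index, str(exc)))
    return problems


class Mini(BaseModel):
    """Hypothetical stand-in for a real schema such as JobPosting."""

    title: str
    remote: bool


candidates = [
    {"title": "Data Analyst Intern", "remote": False},
    {"title": "DevOps Engineer", "remote": "sometimes"},  # not a valid bool
]
problems = find_invalid_outputs(Mini, candidates)
print([index for index, _ in problems])  # [1]
```

With the tutorial's model, the same helper could be called as find_invalid_outputs(JobPosting, list_of_expected_output_dicts).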

Step 3: Optimize

Create a prompter and optimize with your examples:

import dspy
from dspydantic import Prompter

# Configure language model (see Configure a Language Model tutorial)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", api_key="your-api-key"))

# Create prompter
prompter = Prompter(
    model=JobPosting,
    model_id="openai/gpt-4o-mini",
)

# Optimize with examples
result = prompter.optimize(examples=examples)

What gets optimized:

What                          Impact
Field descriptions            High: clarifies what each field should extract
System/instruction prompts    Medium: guides overall extraction behavior

How the optimization modes work:

By default (sequential=False), all field descriptions and prompts are optimized together in a single pass, which is faster. With sequential=True, each field is optimized independently, starting with the deepest-nested fields, and the system and instruction prompts are optimized afterwards. This is slower but can give higher quality.
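To make the "deepest-nested fields first" ordering concrete, here is a small standalone sketch. It is not dspydantic's implementation. It only shows one plausible way to order dotted field paths by nesting depth, using a hypothetical nested schema written as plain dicts.

```python
def field_paths(schema: dict, prefix: str = "") -> list[str]:
    """Collect a dotted path for every field in a nested schema dict."""
    paths = []
    for name, value in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        paths.append(path)
        if isinstance(value, dict):
            # A dict value stands for a nested model; descend into it.
            paths.extend(field_paths(value, path))
    return paths


def deepest_first(schema: dict) -> list[str]:
    """Order field paths so the most deeply nested ones come first."""
    # sorted() is stable, so fields at the same depth keep schema order.
    return sorted(field_paths(schema), key=lambda p: p.count("."), reverse=True)


# Hypothetical schema: a posting with a nested company and address.
schema = {
    "title": "str",
    "company": {"name": "str", "address": {"city": "str"}},
}
order = deepest_first(schema)
print(order[0])  # company.address.city
```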

See Configure Optimization Parameters for more options like early_stopping_patience, auto_generate_prompts, and compile_kwargs.

Optimization typically takes 1-5 minutes, depending on the number of examples and the model.


Step 4: Check Results

print(f"Before: {result.baseline_score:.0%}")
print(f"After:  {result.optimized_score:.0%}")
print(f"API calls: {result.api_calls}")
print(f"Tokens: {result.total_tokens:,}")

Typical output:

Before: 72%
After:  91%
API calls: 47
Tokens: 28,450
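The same four attributes can feed a simple acceptance check. The sketch below is an assumption-laden example: summarize and its min_gain threshold are hypothetical, scores are taken to be fractions between 0 and 1 (as the :.0% formatting above suggests), and a SimpleNamespace stands in for the real result object.

```python
from types import SimpleNamespace


def summarize(result, min_gain: float = 0.05) -> dict:
    """Summarize an optimization run and decide whether to keep it."""
    gain = result.optimized_score - result.baseline_score
    return {
        # Improvement in percentage points.
        "gain_points": round(gain * 100, 1),
        # Rough cost indicator: average tokens per API call.
        "tokens_per_call": round(result.total_tokens / result.api_calls),
        # Keep the optimized prompter only if the gain clears the threshold.
        "keep": gain >= min_gain,
    }


# Stand-in carrying the numbers from the typical output above.
stand_in = SimpleNamespace(
    baseline_score=0.72, optimized_score=0.91, api_calls=47, total_tokens=28450
)
summary = summarize(stand_in)
print(summary)
```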

View optimized descriptions:

for field, desc in result.optimized_descriptions.items():
    print(f"{field}: {desc}")

You'll see how the optimizer refined the field descriptions to guide the model better.


Step 5: Extract

Use your optimized prompter on new data:

job = prompter.run("""
    ML Engineer - AI Startup

    Boston, MA or Remote
    $150K-200K base + equity

    Join our team building next-gen recommendation systems.
    Need 4+ years with PyTorch, transformers, and production ML.
    Full-time. Start immediately.
""")

print(job)
# JobPosting(
#     title='ML Engineer',
#     company='AI Startup',
#     location='Boston, MA or Remote',
#     salary_range='$150K-200K base + equity',
#     experience_years='4+ years',
#     employment_type='full_time',
#     remote=True,
#     skills=['PyTorch', 'transformers', 'production ML']
# )

The result is a fully typed JobPosting instance. Pydantic validates all fields before returning.


Step 6: Save for Production

# Save the optimized prompter
prompter.save("./job_parser")

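# Later, in production (JobPosting must be importable here too):
from dspydantic import Prompter

prompter = Prompter.load(
    "./job_parser",
    model=JobPosting,
    model_id="openai/gpt-4o-mini"
)

# new_posting_text holds the raw text of the posting to parse
job = prompter.run(new_posting_text)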

See Save and Load a Prompter for detailed deployment instructions.


Going Further: Images and PDFs

The same workflow applies to images and PDFs; only the input arguments change:

Images

from pydantic import BaseModel, Field
from dspydantic import Example, Prompter

class Digit(BaseModel):
    digit: int = Field(description="The handwritten digit (0-9)")

examples = [
    Example(image_path="digit_5.png", expected_output={"digit": 5}),
    Example(image_path="digit_3.png", expected_output={"digit": 3}),
]

prompter = Prompter(model=Digit)
result = prompter.optimize(examples=examples)
digit = prompter.run(image_path="new_digit.png")
print(digit.digit)  # e.g. 7

PDFs

from pydantic import BaseModel, Field
from dspydantic import Example, Prompter

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID")
    total: str = Field(description="Total amount")

examples = [
    Example(pdf_path="invoice_001.pdf", expected_output={"invoice_number": "INV-001", "total": "$500"}),
    Example(pdf_path="invoice_002.pdf", expected_output={"invoice_number": "INV-002", "total": "$750"}),
]

prompter = Prompter(model=Invoice)
result = prompter.optimize(examples=examples)
inv = prompter.run(pdf_path="new_invoice.pdf")
print(inv.invoice_number)  # e.g. "INV-003"

Full input format details are in Use Images and PDFs.


Quick Reference

Method                                  Purpose
Prompter(model, model_id)               Create prompter
prompter.optimize(examples)             Optimize with examples
prompter.run(text)                      Extract from text
prompter.predict_batch(texts)           Batch extraction
prompter.save(path)                     Save optimized state
Prompter.load(path, model, model_id)    Load saved prompter

Next Steps

Topic                    Guide
Extract text instead     Extract Free-form Text
Dynamic prompts          Optimize with Prompt Templates
Nested models            Optimize Nested Models
Customize evaluation     Configure Evaluators
Production deployment    Save and Load a Prompter
Integration patterns     Integrate with Applications

Troubleshooting

Low accuracy after optimization?

  • Add more diverse examples (aim for 10-20)
  • Check that your examples are correct
  • Try a more capable model (gpt-4o vs gpt-4o-mini)
  • Review the optimized field descriptions to see if they make sense
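One way to see where accuracy is being lost is to compare predictions with expected outputs field by field. The sketch below is a hypothetical exact-match helper working on plain dicts. It is not dspydantic's evaluator, and it assumes every expected dict has the same keys.

```python
from collections import defaultdict


def per_field_accuracy(predictions: list[dict], expected: list[dict]) -> dict:
    """Return the fraction of exact matches for each field."""
    hits = defaultdict(int)
    for pred, exp in zip(predictions, expected):
        for field, value in exp.items():
            # Count a hit only when the predicted value matches exactly.
            hits[field] += int(pred.get(field) == value)
    total = len(expected)
    return {field: count / total for field, count in hits.items()}


expected = [
    {"title": "ML Engineer", "remote": True},
    {"title": "Data Analyst Intern", "remote": False},
]
predictions = [
    {"title": "ML Engineer", "remote": True},
    {"title": "Data Analyst Intern", "remote": True},  # wrong on "remote"
]
accuracy = per_field_accuracy(predictions, expected)
print(accuracy)  # {'title': 1.0, 'remote': 0.5}
```

Fields with a low score are the ones whose descriptions or examples most likely need attention.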

Optimization takes too long?

  • Reduce example count for initial testing
  • Use gpt-4o-mini for faster iterations
  • Use single-pass mode (default) or limit trials with compile_kwargs={"num_trials": 5}
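For the first suggestion, a reproducible subset keeps early runs comparable. A minimal sketch using only the standard library follows; sample_examples is a hypothetical helper, and it works on any list, including a list of Example objects.

```python
import random


def sample_examples(examples: list, k: int, seed: int = 0) -> list:
    """Pick k examples reproducibly; return all of them if k is too large."""
    if k >= len(examples):
        return list(examples)
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(examples, k)


# Strings stand in for real examples in this demonstration.
all_examples = [f"example_{i}" for i in range(20)]
subset = sample_examples(all_examples, k=5)
print(len(subset))  # 5
```

The subset could then be passed as prompter.optimize(examples=subset) while iterating, before switching back to the full list.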

API key issues?

Pass the key explicitly when configuring the language model:

import dspy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", api_key="sk-..."))

See Configure a Language Model for all options.
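To keep the key out of source code, it can be read from an environment variable and checked up front. The sketch below assumes the key is stored in OPENAI_API_KEY. The helper require_api_key is hypothetical and not part of dspy or dspydantic.

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before configuring "
            "the language model."
        )
    return key


# Intended use, mirroring the configuration call above:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", api_key=require_api_key()))
```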