Structured Outputs > Prompting: How I Make LLMs Deterministic
Moving beyond prompt engineering to contract-first design with Pydantic schemas, validation layers, and structured outputs.
Every LLM tutorial focuses on writing better prompts. "Be specific!" "Use few-shot examples!" "Add system instructions!"
These help, but they're not enough. In production, prompts alone give you maybe 85% reliability. The other 15% will haunt your on-call rotations.
The solution: structured outputs. Stop asking the LLM to return formatted text and start forcing it to return typed data.
The Problem with String Parsing
Here's how most tutorials teach you to extract data:
```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"""Extract the following from this job posting:
- Job title
- Salary range
- Required skills
Format as JSON.

Job posting: {job_text}"""
    }]
)

# Now parse the response... 🙏
data = json.loads(response.choices[0].message.content)
```
This breaks in 100 ways:
- The LLM wraps the output in markdown code fences: `` ```json ... ``` ``
- The LLM adds explanatory text before or after the JSON
- The LLM uses the wrong field names (`job_title` vs `title`)
- The LLM returns null as the string `"null"`
- The LLM forgets a comma and returns invalid JSON
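If you've shipped this pattern, you've probably ended up writing a "defensive" parser. Here is a minimal sketch of what that looks like (my illustration, not ApplyPass code); it papers over the markdown-fence and extra-text cases and still fails on the rest:

```python
import json
import re

def fragile_parse(raw: str) -> dict:
    """Strip markdown fences and hunt for the first JSON object. Still brittle."""
    # Remove ```json / ``` wrappers the model likes to add
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Grab everything between the first '{' and the last '}'
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {raw[:80]!r}")
    # Still raises on trailing commas, single quotes, truncated output, ...
    return json.loads(match.group(0))
```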
At ApplyPass, we process 800K job listings per month. A 1% parse failure rate = 8,000 failed requests. Unacceptable.
Solution 1: Function Calling
OpenAI's function calling forces the model to output valid JSON matching your schema:
```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "classify_job",
        "description": "Classify a job posting",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "seniority": {
                    "type": "string",
                    "enum": ["junior", "mid", "senior", "staff"]
                },
                "salary_min": {"type": "integer"},
                "salary_max": {"type": "integer"},
                "remote": {"type": "boolean"},
                "skills": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["title", "seniority", "remote", "skills"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": job_text}],
    tools=tools,
    # Force the model to call our function instead of replying in free text
    tool_choice={"type": "function", "function": {"name": "classify_job"}}
)

# Guaranteed valid JSON matching the schema above
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```
Result: Parse errors dropped from ~5% to ~0.1%.
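Even with a forced `tool_choice`, I'd keep a small guard around the unwrap step, since edge cases like refusals or content filters can still come back without a tool call. A minimal sketch; the helper name and error handling are mine, not part of the API:

```python
import json

def classify_job_args(response) -> dict:
    """Unwrap the forced tool call, failing loudly instead of silently mis-parsing."""
    message = response.choices[0].message
    if not message.tool_calls:
        # Rare, but the response can come back as plain text instead of a tool call
        raise ValueError(f"No tool call in response: {message.content!r}")
    call = message.tool_calls[0]
    if call.function.name != "classify_job":
        raise ValueError(f"Unexpected tool: {call.function.name}")
    return json.loads(call.function.arguments)

args = classify_job_args(response)
```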
Solution 2: Pydantic + Response Format
With GPT-4o and later, you can use Pydantic directly:
```python
from pydantic import BaseModel, Field
from typing import Literal, Optional
from openai import OpenAI

client = OpenAI()

class JobClassification(BaseModel):
    """Structured classification of a job posting."""
    title: str = Field(description="The job title")
    seniority: Literal["junior", "mid", "senior", "staff", "principal"]
    remote: bool = Field(description="Whether the job is remote")
    salary_min: Optional[int] = Field(default=None, ge=0)
    salary_max: Optional[int] = Field(default=None, ge=0)
    required_skills: list[str] = Field(default_factory=list)

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You classify job postings."},
        {"role": "user", "content": job_text}
    ],
    response_format=JobClassification
)

job = response.choices[0].message.parsed  # Type-safe Pydantic model!
```
This is my preferred approach because:
- Type safety: IDE autocomplete works
- Validation built-in: Pydantic validates ranges, enums, etc.
- Self-documenting: Schema IS the documentation
- Easy iteration: Change schema, re-run, done
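One caveat with the `parse` helper: when the model refuses a request, `message.parsed` comes back as `None` and the refusal text lands on `message.refusal`. A minimal guard, with a helper name of my own choosing:

```python
def parsed_or_raise(response) -> JobClassification:
    """Return the parsed model, or fail loudly on a refusal."""
    message = response.choices[0].message
    if message.parsed is None:
        raise ValueError(f"Model refused to classify: {message.refusal!r}")
    return message.parsed

job = parsed_or_raise(response)
print(job.title, job.seniority, job.required_skills)
```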
Handling Edge Cases
Even with structured outputs, LLMs can produce semantically wrong data. Add validation layers:
```python
from typing import Optional
from pydantic import BaseModel, field_validator, model_validator

class JobClassification(BaseModel):
    title: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None

    @model_validator(mode='after')
    def validate_salary_range(self) -> 'JobClassification':
        if self.salary_min is not None and self.salary_max is not None:
            if self.salary_min > self.salary_max:
                # Swap if the model reversed the bounds
                self.salary_min, self.salary_max = self.salary_max, self.salary_min
        return self

    @field_validator('title')
    @classmethod
    def title_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Title cannot be empty")
        return v.strip()
```
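A quick check that the validators behave as intended (the sample values are made up):

```python
# Reversed salary bounds get swapped instead of slipping downstream
job = JobClassification(title="  Backend Engineer ", salary_min=180_000, salary_max=140_000)
assert (job.salary_min, job.salary_max) == (140_000, 180_000)
assert job.title == "Backend Engineer"

# Semantically empty titles are rejected outright
try:
    JobClassification(title="   ")
except ValueError as exc:
    print(exc)  # includes "Title cannot be empty"
```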
For complex validation, I use a two-pass approach:
```python
async def classify_with_validation(job_text: str) -> JobClassification:
    # First pass: basic extraction
    result = await extract_job_data(job_text)

    # Validation
    issues = validate_job_data(result)
    if issues:
        # Second pass: fix specific issues
        result = await fix_job_data(job_text, result, issues)

    return result
```
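The helpers above are left undefined; here is one way `validate_job_data` and `fix_job_data` could look. This is my own sketch, not the ApplyPass implementation: the validator returns human-readable issues, and the fixer feeds them back to the model along with the draft, using an `AsyncOpenAI` client and the same `JobClassification` schema.

```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

def validate_job_data(job: JobClassification) -> list[str]:
    """Collect semantic problems the schema alone cannot catch."""
    issues: list[str] = []
    if job.salary_min is not None and job.salary_min < 10_000:
        issues.append("salary_min looks like an hourly rate, not an annual salary")
    if not job.required_skills:
        issues.append("required_skills is empty; re-read the posting for skills")
    return issues

async def fix_job_data(job_text: str, draft: JobClassification, issues: list[str]) -> JobClassification:
    """Second pass: ask the model to correct only the flagged fields."""
    response = await async_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Fix the listed problems in this job classification."},
            {"role": "user", "content": f"Posting:\n{job_text}\n\nDraft:\n{draft.model_dump_json()}\n\nProblems:\n- " + "\n- ".join(issues)},
        ],
        response_format=JobClassification,
    )
    return response.choices[0].message.parsed
```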
Schema Evolution
Production schemas change. Handle it gracefully:
Versioning
```python
class JobClassificationV1(BaseModel):
    title: str
    remote: bool

class JobClassificationV2(BaseModel):
    title: str
    remote: bool
    hybrid: bool  # New field!

    @classmethod
    def from_v1(cls, v1: JobClassificationV1) -> 'JobClassificationV2':
        return cls(
            title=v1.title,
            remote=v1.remote,
            hybrid=False  # Default for old data
        )
```
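Migrating an old record is then a one-liner; a quick illustration with made-up data:

```python
old = JobClassificationV1(title="Data Engineer", remote=True)
new = JobClassificationV2.from_v1(old)
print(new.hybrid)  # False -> sensible default for records classified before the field existed
```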
Backwards Compatibility
Use Optional with defaults for new fields:
```python
class JobClassification(BaseModel):
    # Original fields
    title: str
    remote: bool

    # Added in v2 - optional with default
    hybrid: Optional[bool] = False
    work_location: Optional[str] = None
```
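The payoff is that records stored before the new fields existed still validate under the current schema. A quick check (the stored JSON is illustrative):

```python
# A record persisted before hybrid/work_location were added
old_record = '{"title": "Platform Engineer", "remote": true}'

job = JobClassification.model_validate_json(old_record)
print(job.hybrid, job.work_location)  # False None -> defaults fill the gaps
```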
Real Results
After implementing structured outputs at ApplyPass:
| Metric | Before | After |
|--------|--------|-------|
| Parse errors | 5.2% | 0.1% |
| Schema violations | 8.1% | 0.3% |
| Downstream failures | 12% | 0.5% |
| Engineer on-call pages | Weekly | Monthly |
The investment in proper schemas paid for itself in the first month.
Key Takeaways
- Stop parsing strings: Use function calling or structured outputs
- Pydantic is your friend: Type safety + validation in one
- Add validation layers: LLMs can return valid JSON with invalid data
- Plan for evolution: Version schemas, use optional fields
- Measure everything: Know your error rates before and after
Prompting is an art. Structured outputs are engineering. In production, engineering wins.
Questions? Reach out on LinkedIn.