Structured Outputs > Prompting: How I Make LLMs Deterministic
Moving beyond prompt engineering to contract-first design with Pydantic schemas, validation layers, and structured outputs.
Every LLM tutorial focuses on writing better prompts. "Be specific!" "Use few-shot examples!" "Add system instructions!"
These help, but they're not enough. In production, prompts alone give you maybe 85% reliability. The other 15% will haunt your on-call rotations.
The solution: structured outputs. Stop asking the LLM to return formatted text and start forcing it to return typed data.
The Problem with String Parsing
Here's how most tutorials teach you to extract data:
```python
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"""Extract the following from this job posting:
- Job title
- Salary range
- Required skills
Format as JSON.

Job posting: {job_text}"""
    }]
)

# Now parse the response... 🙏
data = json.loads(response.choices[0].message.content)
```
This breaks in 100 ways:
- The LLM wraps the output in markdown code fences: `` ```json ... ``` ``
- The LLM adds explanatory text before or after the JSON
- The LLM uses the wrong field names (`job_title` vs `title`)
- The LLM returns null as the string `"null"`
- The LLM forgets a comma and returns invalid JSON
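If you've shipped this pattern, you've probably ended up writing a "defensive" parser. Here is a minimal sketch of what that looks like (my illustration, not ApplyPass code); it papers over the markdown-fence and extra-text cases and still fails on the rest:

```python
import json
import re

def fragile_parse(raw: str) -> dict:
    """Strip markdown fences and hunt for the first JSON object. Still brittle."""
    # Remove ```json / ``` wrappers the model likes to add
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Grab everything between the first '{' and the last '}'
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {raw[:80]!r}")
    # Still raises on trailing commas, single quotes, truncated output, ...
    return json.loads(match.group(0))
```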
At ApplyPass, we process 800K job listings per month. A 1% parse failure rate = 8,000 failed requests. Unacceptable.
Solution 1: Function Calling
OpenAI's function calling forces the model to output valid JSON matching your schema:
```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "classify_job",
        "description": "Classify a job posting",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "seniority": {
                    "type": "string",
                    "enum": ["junior", "mid", "senior", "staff"]
                },
                "salary_min": {"type": "integer"},
                "salary_max": {"type": "integer"},
                "remote": {"type": "boolean"},
                "skills": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["title", "seniority", "remote", "skills"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": job_text}],
    tools=tools,
    # Force the model to call our function instead of replying in free text
    tool_choice={"type": "function", "function": {"name": "classify_job"}}
)

# Guaranteed valid JSON matching the schema above
args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```
Result: Parse errors dropped from ~5% to ~0.1%.
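Even with a forced `tool_choice`, I'd keep a small guard around the unwrap step, since edge cases like refusals or content filters can still come back without a tool call. A minimal sketch; the helper name and error handling are mine, not part of the API:

```python
import json

def classify_job_args(response) -> dict:
    """Unwrap the forced tool call, failing loudly instead of silently mis-parsing."""
    message = response.choices[0].message
    if not message.tool_calls:
        # Rare, but the response can come back as plain text instead of a tool call
        raise ValueError(f"No tool call in response: {message.content!r}")
    call = message.tool_calls[0]
    if call.function.name != "classify_job":
        raise ValueError(f"Unexpected tool: {call.function.name}")
    return json.loads(call.function.arguments)

args = classify_job_args(response)
```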
Solution 2: Pydantic + Response Format
With GPT-4o and later, you can use Pydantic directly:
```python
from pydantic import BaseModel, Field
from typing import Literal, Optional
from openai import OpenAI

client = OpenAI()

class JobClassification(BaseModel):
    """Structured classification of a job posting."""
    title: str = Field(description="The job title")
    seniority: Literal["junior", "mid", "senior", "staff", "principal"]
    remote: bool = Field(description="Whether the job is remote")
    salary_min: Optional[int] = Field(default=None, ge=0)
    salary_max: Optional[int] = Field(default=None, ge=0)
    required_skills: list[str] = Field(default_factory=list)

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You classify job postings."},
        {"role": "user", "content": job_text}
    ],
    response_format=JobClassification
)

job = response.choices[0].message.parsed  # Type-safe Pydantic model!
```
This is my preferred approach because:
- Type safety: IDE autocomplete works
- Validation built-in: Pydantic validates ranges, enums, etc.
- Self-documenting: Schema IS the documentation
- Easy iteration: Change schema, re-run, done
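One caveat with the `parse` helper: when the model refuses a request, `message.parsed` comes back as `None` and the refusal text lands on `message.refusal`. A minimal guard, with a helper name of my own choosing:

```python
def parsed_or_raise(response) -> JobClassification:
    """Return the parsed model, or fail loudly on a refusal."""
    message = response.choices[0].message
    if message.parsed is None:
        raise ValueError(f"Model refused to classify: {message.refusal!r}")
    return message.parsed

job = parsed_or_raise(response)
print(job.title, job.seniority, job.required_skills)
```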
Handling Edge Cases
Even with structured outputs, LLMs can produce semantically wrong data. Add validation layers:
```python
from typing import Optional
from pydantic import BaseModel, field_validator, model_validator

class JobClassification(BaseModel):
    title: str
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None

    @model_validator(mode='after')
    def validate_salary_range(self) -> 'JobClassification':
        if self.salary_min is not None and self.salary_max is not None:
            if self.salary_min > self.salary_max:
                # Swap if the model reversed the bounds
                self.salary_min, self.salary_max = self.salary_max, self.salary_min
        return self

    @field_validator('title')
    @classmethod
    def title_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Title cannot be empty")
        return v.strip()
```
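A quick check that the validators behave as intended (the sample values are made up):

```python
# Reversed salary bounds get swapped instead of slipping downstream
job = JobClassification(title="  Backend Engineer ", salary_min=180_000, salary_max=140_000)
assert (job.salary_min, job.salary_max) == (140_000, 180_000)
assert job.title == "Backend Engineer"

# Semantically empty titles are rejected outright
try:
    JobClassification(title="   ")
except ValueError as exc:
    print(exc)  # includes "Title cannot be empty"
```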
For complex validation, I use a two-pass approach:
```python
async def classify_with_validation(job_text: str) -> JobClassification:
    # First pass: basic extraction
    result = await extract_job_data(job_text)

    # Validation
    issues = validate_job_data(result)
    if issues:
        # Second pass: fix specific issues
        result = await fix_job_data(job_text, result, issues)

    return result
```
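The helpers above are left undefined; here is one way `validate_job_data` and `fix_job_data` could look. This is my own sketch, not the ApplyPass implementation: the validator returns human-readable issues, and the fixer feeds them back to the model along with the draft, using an `AsyncOpenAI` client and the same `JobClassification` schema.

```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

def validate_job_data(job: JobClassification) -> list[str]:
    """Collect semantic problems the schema alone cannot catch."""
    issues: list[str] = []
    if job.salary_min is not None and job.salary_min < 10_000:
        issues.append("salary_min looks like an hourly rate, not an annual salary")
    if not job.required_skills:
        issues.append("required_skills is empty; re-read the posting for skills")
    return issues

async def fix_job_data(job_text: str, draft: JobClassification, issues: list[str]) -> JobClassification:
    """Second pass: ask the model to correct only the flagged fields."""
    response = await async_client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Fix the listed problems in this job classification."},
            {"role": "user", "content": f"Posting:\n{job_text}\n\nDraft:\n{draft.model_dump_json()}\n\nProblems:\n- " + "\n- ".join(issues)},
        ],
        response_format=JobClassification,
    )
    return response.choices[0].message.parsed
```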
Schema Evolution
Production schemas change. Handle it gracefully:
Versioning
```python
class JobClassificationV1(BaseModel):
    title: str
    remote: bool

class JobClassificationV2(BaseModel):
    title: str
    remote: bool
    hybrid: bool  # New field!

    @classmethod
    def from_v1(cls, v1: JobClassificationV1) -> 'JobClassificationV2':
        return cls(
            title=v1.title,
            remote=v1.remote,
            hybrid=False  # Default for old data
        )
```
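Migrating an old record is then a one-liner; a quick illustration with made-up data:

```python
old = JobClassificationV1(title="Data Engineer", remote=True)
new = JobClassificationV2.from_v1(old)
print(new.hybrid)  # False -> sensible default for records classified before the field existed
```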
Backwards Compatibility
Use Optional with defaults for new fields:
```python
class JobClassification(BaseModel):
    # Original fields
    title: str
    remote: bool

    # Added in v2 - optional with default
    hybrid: Optional[bool] = False
    work_location: Optional[str] = None
```
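The payoff is that records stored before the new fields existed still validate under the current schema. A quick check (the stored JSON is illustrative):

```python
# A record persisted before hybrid/work_location were added
old_record = '{"title": "Platform Engineer", "remote": true}'

job = JobClassification.model_validate_json(old_record)
print(job.hybrid, job.work_location)  # False None -> defaults fill the gaps
```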
Real Results
After implementing structured outputs at ApplyPass:
| Metric | Before | After |
|--------|--------|-------|
| Parse errors | 5.2% | 0.1% |
| Schema violations | 8.1% | 0.3% |
| Downstream failures | 12% | 0.5% |
| Engineer on-call pages | Weekly | Monthly |
The investment in proper schemas paid for itself in the first month.
Key Takeaways
- Stop parsing strings: Use function calling or structured outputs
- Pydantic is your friend: Type safety + validation in one
- Add validation layers: LLMs can return valid JSON with invalid data
- Plan for evolution: Version schemas, use optional fields
- Measure everything: Know your error rates before and after
Prompting is an art. Structured outputs are engineering. In production, engineering wins.
Questions? Reach out on LinkedIn.