Production LLM Architecture Patterns
How I design LLM pipelines that scale—preprocessing, chunking strategies, prompt engineering, and system architecture that survives production.
Building LLM applications that work in production requires a fundamentally different mindset than prototyping in notebooks. After shipping systems handling 800K+ requests/month at ApplyPass and building enterprise AI tools at Verizon, here are the patterns I've found essential.
The Production Mindset Shift
Most LLM tutorials show you how to call openai.chat.completions.create() and parse the response. That's maybe 5% of the actual work. The other 95%:
- Preprocessing: How do you handle 50-page PDFs? Messy HTML? Scanned documents?
- Reliability: What happens when OpenAI returns a 500? Or rate limits you?
- Cost: How do you not bankrupt your company on API calls?
- Latency: How do you make it feel fast to users?
- Quality: How do you know if it's working?
Let me walk through each layer.
1. The Request Pipeline
Every production LLM system I've built follows this pattern:
User Input → Preprocessing → Context Assembly → LLM Call → Postprocessing → Response
                                                   ↓
                                             Caching Layer
Preprocessing Layer
This is where most of the complexity lives. At ApplyPass, we process job listings from 20+ different ATS platforms. Each has different HTML structure, different encoding, different edge cases.
Key decisions:
- Document parsing: I use unstructured or custom parsers depending on document type. PDFs go through OCR if needed.
- Chunking strategy: For RAG, I chunk by semantic sections, not arbitrary character counts. Headers and structure matter.
- Cleaning: Strip boilerplate, normalize whitespace, handle encoding issues. This sounds boring, but it's the source of about 30% of our production bugs.
# Bad: Arbitrary chunking
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]

# Better: Semantic chunking with overlap
chunks = semantic_chunker.chunk(
    text,
    max_tokens=500,
    overlap_tokens=50,
    split_on=["## ", "\n\n", ". "]
)
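The semantic_chunker above is whatever chunking implementation you wire in. As a rough illustration of the idea, here is a minimal sketch that splits on the highest-priority separator present and packs pieces under a token budget, using a crude four-characters-per-token estimate instead of a real tokenizer:

def semantic_chunk(
    text: str,
    max_tokens: int = 500,
    overlap_tokens: int = 50,
    split_on: tuple[str, ...] = ("## ", "\n\n", ". "),
) -> list[str]:
    """Split on the highest-priority separator present, then pack pieces into
    token-budgeted chunks with a small overlap. ~4 chars/token is a rough
    estimate; swap in a real tokenizer for accuracy."""
    est = lambda s: len(s) // 4 + 1

    # Split on the first separator that actually appears in the text
    pieces = [text]
    for sep in split_on:
        if sep in text:
            pieces = [p for p in text.split(sep) if p.strip()]
            break

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and est(current) + est(piece) > max_tokens:
            chunks.append(current)
            current = current[-overlap_tokens * 4:]  # carry the tail forward as overlap
        current = (current + "\n\n" + piece).strip() if current else piece
    if current:
        chunks.append(current)
    return chunks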
Context Assembly
Once you have clean inputs, you need to assemble the context window intelligently:
- Prioritize: Most relevant content goes first (LLMs pay more attention to the start)
- Truncate intelligently: Don't cut off mid-sentence
- Leave headroom: Always leave room for the model to respond
At ApplyPass, we rank chunks by embedding similarity AND recency, then pack them into the context window with metadata.
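A minimal sketch of that ranking-and-packing step; the Chunk shape, scoring weights, and headroom value here are illustrative, not the production ones:

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    similarity: float  # cosine similarity between chunk and query embeddings
    age_days: float    # recency signal for the source document
    source: str        # metadata that travels into the prompt

def pack_context(chunks: list[Chunk], budget_tokens: int = 8000,
                 headroom_tokens: int = 1500) -> str:
    """Rank by similarity and recency, then pack whole chunks until the budget
    (minus response headroom) runs out. ~4 chars/token as a rough estimate."""
    ranked = sorted(chunks, key=lambda c: c.similarity - 0.01 * c.age_days, reverse=True)

    remaining = budget_tokens - headroom_tokens
    packed: list[str] = []
    for chunk in ranked:
        block = f"[source: {chunk.source}]\n{chunk.text}"
        cost = len(block) // 4 + 1
        if cost > remaining:
            continue  # skip whole chunks rather than cutting mid-sentence
        packed.append(block)
        remaining -= cost
    return "\n\n".join(packed)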
2. The LLM Call Layer
This is the "simple" part, but production requires several patterns:
Retry Logic with Exponential Backoff
from openai import AsyncOpenAI, APITimeoutError, RateLimitError
from openai.types.chat import ChatCompletion
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncOpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError))
)
async def call_llm(messages: list, **kwargs) -> ChatCompletion:
    return await client.chat.completions.create(
        messages=messages,
        **kwargs
    )
Timeout Management
Set aggressive timeouts and have fallback strategies:
import asyncio

try:
    response = await asyncio.wait_for(
        call_llm(messages, model="gpt-4-turbo"),
        timeout=30.0
    )
except asyncio.TimeoutError:
    # Fallback to faster model
    response = await call_llm(messages, model="gpt-3.5-turbo")
Model Cascading
Not every request needs GPT-4. We use a tiered approach:
- Fast path: GPT-3.5-turbo for simple, well-defined tasks
- Quality path: GPT-4-turbo for complex reasoning
- Fallback: Fine-tuned models for specific task types
The router decides based on input complexity and user tier.
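A sketch of what that router can look like; the thresholds, tier names, task names, and the fine-tuned model identifier are placeholders, not the real routing rules:

def route_model(prompt: str, user_tier: str, task: str) -> str:
    """Choose a model tier from rough complexity signals and user tier.
    Model names, task names, and thresholds are illustrative."""
    approx_tokens = len(prompt) // 4

    # Narrow, well-defined tasks: a fine-tuned or cheaper model is usually enough
    if task in {"classify_job", "extract_entities"}:
        return "ft:gpt-3.5-turbo:acme::job-tasks"  # placeholder fine-tune ID

    # Complex reasoning or paying users: take the quality path
    if user_tier == "pro" or approx_tokens > 2000:
        return "gpt-4-turbo"

    # Everything else: fast path
    return "gpt-3.5-turbo"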
3. Caching Strategies
Caching is the single biggest lever for cost and latency. We use multiple layers:
Semantic Caching
Exact match caching misses 90% of opportunities. Instead, we embed the query and find "close enough" cached responses:
from typing import Optional

async def get_cached_response(query: str) -> Optional[str]:
    query_embedding = await embed(query)
    # Find similar queries in cache
    similar = await vector_store.search(
        query_embedding,
        threshold=0.95,  # High threshold for cache hits
        limit=1
    )
    if similar:
        return similar[0].cached_response
    return None
Response Caching
For deterministic queries (same input → same output), cache aggressively (see the sketch after this list):
- Job classification: Same listing → same structured output
- Entity extraction: Same document → same entities
- FAQ responses: Common questions → cached answers
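A minimal sketch of that exact-match layer, keyed on a hash of the normalized input; the in-process dict stands in for Redis or whatever store you use, and extract_entities_llm is a hypothetical LLM-backed call:

import hashlib

_response_cache: dict[str, str] = {}  # stand-in for Redis or similar

def cache_key(task: str, payload: str) -> str:
    # Normalize whitespace and case so trivially different inputs share an entry
    normalized = " ".join(payload.split()).lower()
    return hashlib.sha256(f"{task}:{normalized}".encode()).hexdigest()

async def extract_entities_cached(document: str) -> str:
    key = cache_key("extract_entities", document)
    if key not in _response_cache:
        _response_cache[key] = await extract_entities_llm(document)  # hypothetical LLM call
    return _response_cache[key]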
At ApplyPass, semantic caching reduced our LLM costs by 40%.
4. Structured Outputs
Stop parsing strings. Use function calling or structured outputs:
from typing import Literal, Optional

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class JobClassification(BaseModel):
    title: str
    seniority: Literal["junior", "mid", "senior", "staff", "principal"]
    remote: bool
    salary_min: Optional[int]
    salary_max: Optional[int]
    required_skills: list[str]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": job_description}],
    response_format=JobClassification
)
job = response.choices[0].message.parsed  # Type-safe!
This eliminated 90% of our parsing bugs and made the system dramatically more reliable.
5. Observability
You can't improve what you can't measure. Every LLM call should log:
- Input tokens: Track context window usage
- Output tokens: Track generation costs
- Latency: Time-to-first-token and total time
- Model used: Which model handled this request
- Cache hit: Did we avoid an LLM call?
- Quality score: If you have ground truth, measure accuracy
We use a decorator pattern:
@observe(name="classify_job")
async def classify_job(listing: str) -> JobClassification:
    # Your implementation
    pass
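Whether observe comes from an observability SDK or is hand-rolled, the shape is similar. Here is a minimal hand-rolled sketch that logs latency, token usage, and errors; token counts are read from a usage attribute when the wrapped function returns an OpenAI-style response:

import functools
import logging
import time

logger = logging.getLogger("llm")

def observe(name: str):
    """Log latency, token usage, and failures for an async LLM-backed function."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
            except Exception:
                logger.exception("%s failed after %.2fs", name, time.perf_counter() - start)
                raise
            usage = getattr(result, "usage", None)
            logger.info(
                "%s ok in %.2fs (prompt=%s, completion=%s tokens)",
                name,
                time.perf_counter() - start,
                getattr(usage, "prompt_tokens", "n/a"),
                getattr(usage, "completion_tokens", "n/a"),
            )
            return result
        return wrapper
    return decorator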
This feeds into dashboards that show:
- Cost per request over time
- P50/P95/P99 latency
- Error rates by error type
- Cache hit rates
6. Error Handling
LLMs fail in weird ways. Be defensive:
Content Filter Errors
OpenAI sometimes flags benign content. Have a retry strategy:
if response.choices[0].finish_reason == "content_filter":
    # Log for review, return safe fallback
    logger.warning(f"Content filter triggered: {input_hash}")
    return default_response
Malformed Outputs
Even with structured outputs, LLMs can produce garbage. Validate:
from pydantic import ValidationError

try:
    result = JobClassification.model_validate(raw_output)
except ValidationError as e:
    # Retry with more explicit instructions
    result = await classify_with_retry(listing, error_context=str(e))
Graceful Degradation
When the LLM layer fails completely (see the sketch after this list):
- Return cached responses if available (even stale)
- Fall back to rule-based systems
- Queue for async processing
- Show user-friendly error with ETA
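Put together, that chain might look like the following sketch; classify_job, cache, classify_with_rules, and retry_queue are hypothetical stand-ins for the real components:

import logging

logger = logging.getLogger("llm")

async def classify_with_degradation(listing: str) -> JobClassification | None:
    """Degrade in stages: stale cache, rule-based fallback, async queue.
    Helper names are illustrative, not a real API."""
    try:
        return await classify_job(listing)                  # normal LLM path
    except Exception:
        logger.exception("LLM classification failed; degrading")

    if (stale := await cache.get(listing, allow_stale=True)) is not None:
        return stale                                        # stale beats nothing

    if (rules := classify_with_rules(listing)) is not None:
        return rules                                        # rule-based fallback

    await retry_queue.enqueue("classify_job", listing)      # retry asynchronously
    return None  # caller renders a friendly "still processing" message with an ETA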
Key Takeaways
- Preprocessing is half the battle: Clean inputs → better outputs
- Cache everything: Semantic caching is a superpower
- Use structured outputs: Stop parsing strings
- Build for failure: Retries, fallbacks, graceful degradation
- Measure obsessively: You can't optimize what you can't see
The difference between a prototype and production is handling the 1000 edge cases that tutorials never mention. Start building these patterns in from day one.
Questions or want to discuss? Reach out on LinkedIn.