Production LLM Architecture Patterns
How I design LLM pipelines that scale—preprocessing, chunking strategies, prompt engineering, and system architecture that survives production.
Building LLM applications that work in production requires a fundamentally different mindset than prototyping in notebooks. After shipping systems handling 800K+ requests/month at ApplyPass and building enterprise AI tools at Verizon, here are the patterns I've found essential.
The Production Mindset Shift
Most LLM tutorials show you how to call openai.chat.completions.create() and parse the response. That's maybe 5% of the actual work. The other 95%:
- Preprocessing: How do you handle 50-page PDFs? Messy HTML? Scanned documents?
- Reliability: What happens when OpenAI returns a 500? Or rate limits you?
- Cost: How do you not bankrupt your company on API calls?
- Latency: How do you make it feel fast to users?
- Quality: How do you know if it's working?
Let me walk through each layer.
1. The Request Pipeline
Every production LLM system I've built follows this pattern:
User Input → Preprocessing → Context Assembly → LLM Call → Postprocessing → Response
                                                   ↓
                                             Caching Layer
Preprocessing Layer
This is where most of the complexity lives. At ApplyPass, we process job listings from 20+ different ATS platforms. Each has different HTML structure, different encoding, different edge cases.
Key decisions:
- Document parsing: I use unstructured or custom parsers depending on document type. PDFs go through OCR if needed.
- Chunking strategy: For RAG, I chunk by semantic sections, not arbitrary character counts. Headers and structure matter.
- Cleaning: Strip boilerplate, normalize whitespace, handle encoding issues. This sounds boring, but it's the source of about 30% of our production bugs.
# Bad: Arbitrary chunking
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]

# Better: Semantic chunking with overlap
chunks = semantic_chunker.chunk(
    text,
    max_tokens=500,
    overlap_tokens=50,
    split_on=["## ", "\n\n", ". "]
)
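The semantic_chunker above is whatever chunking implementation you wire in. As a rough illustration of the idea, here is a minimal sketch that splits on the highest-priority separator present and packs pieces under a token budget, using a crude four-characters-per-token estimate instead of a real tokenizer:

def semantic_chunk(
    text: str,
    max_tokens: int = 500,
    overlap_tokens: int = 50,
    split_on: tuple[str, ...] = ("## ", "\n\n", ". "),
) -> list[str]:
    """Split on the highest-priority separator present, then pack pieces into
    token-budgeted chunks with a small overlap. ~4 chars/token is a rough
    estimate; swap in a real tokenizer for accuracy."""
    est = lambda s: len(s) // 4 + 1

    # Split on the first separator that actually appears in the text
    pieces = [text]
    for sep in split_on:
        if sep in text:
            pieces = [p for p in text.split(sep) if p.strip()]
            break

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and est(current) + est(piece) > max_tokens:
            chunks.append(current)
            current = current[-overlap_tokens * 4:]  # carry the tail forward as overlap
        current = (current + "\n\n" + piece).strip() if current else piece
    if current:
        chunks.append(current)
    return chunks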
Context Assembly
Once you have clean inputs, you need to assemble the context window intelligently:
- Prioritize: Most relevant content goes first (LLMs pay more attention to the start)
- Truncate intelligently: Don't cut off mid-sentence
- Leave headroom: Always leave room for the model to respond
At ApplyPass, we rank chunks by embedding similarity AND recency, then pack them into the context window with metadata.
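A minimal sketch of that ranking-and-packing step; the Chunk shape, scoring weights, and headroom value here are illustrative, not the production ones:

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    similarity: float  # cosine similarity between chunk and query embeddings
    age_days: float    # recency signal for the source document
    source: str        # metadata that travels into the prompt

def pack_context(chunks: list[Chunk], budget_tokens: int = 8000,
                 headroom_tokens: int = 1500) -> str:
    """Rank by similarity and recency, then pack whole chunks until the budget
    (minus response headroom) runs out. ~4 chars/token as a rough estimate."""
    ranked = sorted(chunks, key=lambda c: c.similarity - 0.01 * c.age_days, reverse=True)

    remaining = budget_tokens - headroom_tokens
    packed: list[str] = []
    for chunk in ranked:
        block = f"[source: {chunk.source}]\n{chunk.text}"
        cost = len(block) // 4 + 1
        if cost > remaining:
            continue  # skip whole chunks rather than cutting mid-sentence
        packed.append(block)
        remaining -= cost
    return "\n\n".join(packed)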
2. The LLM Call Layer
This is the "simple" part, but production requires several patterns:
Retry Logic with Exponential Backoff
from openai import AsyncOpenAI, APITimeoutError, RateLimitError
from openai.types.chat import ChatCompletion
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncOpenAI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError))
)
async def call_llm(messages: list, **kwargs) -> ChatCompletion:
    return await client.chat.completions.create(
        messages=messages,
        **kwargs
    )
Timeout Management
Set aggressive timeouts and have fallback strategies:
import asyncio

try:
    response = await asyncio.wait_for(
        call_llm(messages, model="gpt-4-turbo"),
        timeout=30.0
    )
except asyncio.TimeoutError:
    # Fallback to faster model
    response = await call_llm(messages, model="gpt-3.5-turbo")
Model Cascading
Not every request needs GPT-4. We use a tiered approach:
- Fast path: GPT-3.5-turbo for simple, well-defined tasks
- Quality path: GPT-4-turbo for complex reasoning
- Fallback: Fine-tuned models for specific task types
The router decides based on input complexity and user tier.
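A sketch of what that router can look like; the thresholds, tier names, task names, and the fine-tuned model identifier are placeholders, not the real routing rules:

def route_model(prompt: str, user_tier: str, task: str) -> str:
    """Choose a model tier from rough complexity signals and user tier.
    Model names, task names, and thresholds are illustrative."""
    approx_tokens = len(prompt) // 4

    # Narrow, well-defined tasks: a fine-tuned or cheaper model is usually enough
    if task in {"classify_job", "extract_entities"}:
        return "ft:gpt-3.5-turbo:acme::job-tasks"  # placeholder fine-tune ID

    # Complex reasoning or paying users: take the quality path
    if user_tier == "pro" or approx_tokens > 2000:
        return "gpt-4-turbo"

    # Everything else: fast path
    return "gpt-3.5-turbo"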
3. Caching Strategies
Caching is the single biggest lever for cost and latency. We use multiple layers:
Semantic Caching
Exact match caching misses 90% of opportunities. Instead, we embed the query and find "close enough" cached responses:
from typing import Optional

async def get_cached_response(query: str) -> Optional[str]:
    query_embedding = await embed(query)
    # Find similar queries in cache
    similar = await vector_store.search(
        query_embedding,
        threshold=0.95,  # High threshold for cache hits
        limit=1
    )
    if similar:
        return similar[0].cached_response
    return None
Response Caching
For deterministic queries (same input → same output), cache aggressively (see the sketch after this list):
- Job classification: Same listing → same structured output
- Entity extraction: Same document → same entities
- FAQ responses: Common questions → cached answers
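A minimal sketch of that exact-match layer, keyed on a hash of the normalized input; the in-process dict stands in for Redis or whatever store you use, and extract_entities_llm is a hypothetical LLM-backed call:

import hashlib

_response_cache: dict[str, str] = {}  # stand-in for Redis or similar

def cache_key(task: str, payload: str) -> str:
    # Normalize whitespace and case so trivially different inputs share an entry
    normalized = " ".join(payload.split()).lower()
    return hashlib.sha256(f"{task}:{normalized}".encode()).hexdigest()

async def extract_entities_cached(document: str) -> str:
    key = cache_key("extract_entities", document)
    if key not in _response_cache:
        _response_cache[key] = await extract_entities_llm(document)  # hypothetical LLM call
    return _response_cache[key]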
At ApplyPass, semantic caching reduced our LLM costs by 40%.
4. Structured Outputs
Stop parsing strings. Use function calling or structured outputs:
from typing import Literal, Optional

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class JobClassification(BaseModel):
    title: str
    seniority: Literal["junior", "mid", "senior", "staff", "principal"]
    remote: bool
    salary_min: Optional[int]
    salary_max: Optional[int]
    required_skills: list[str]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": job_description}],
    response_format=JobClassification
)
job = response.choices[0].message.parsed  # Type-safe!
This eliminated 90% of our parsing bugs and made the system dramatically more reliable.
5. Observability
You can't improve what you can't measure. Every LLM call should log:
- Input tokens: Track context window usage
- Output tokens: Track generation costs
- Latency: Time-to-first-token and total time
- Model used: Which model handled this request
- Cache hit: Did we avoid an LLM call?
- Quality score: If you have ground truth, measure accuracy
We use a decorator pattern:
@observe(name="classify_job")
async def classify_job(listing: str) -> JobClassification:
    # Your implementation
    pass
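Whether observe comes from an observability SDK or is hand-rolled, the shape is similar. Here is a minimal hand-rolled sketch that logs latency, token usage, and errors; token counts are read from a usage attribute when the wrapped function returns an OpenAI-style response:

import functools
import logging
import time

logger = logging.getLogger("llm")

def observe(name: str):
    """Log latency, token usage, and failures for an async LLM-backed function."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
            except Exception:
                logger.exception("%s failed after %.2fs", name, time.perf_counter() - start)
                raise
            usage = getattr(result, "usage", None)
            logger.info(
                "%s ok in %.2fs (prompt=%s, completion=%s tokens)",
                name,
                time.perf_counter() - start,
                getattr(usage, "prompt_tokens", "n/a"),
                getattr(usage, "completion_tokens", "n/a"),
            )
            return result
        return wrapper
    return decorator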
This feeds into dashboards that show:
- Cost per request over time
- P50/P95/P99 latency
- Error rates by error type
- Cache hit rates
6. Error Handling
LLMs fail in weird ways. Be defensive:
Content Filter Errors
OpenAI sometimes flags benign content. Have a retry strategy:
if response.choices[0].finish_reason == "content_filter":
    # Log for review, return safe fallback
    logger.warning(f"Content filter triggered: {input_hash}")
    return default_response
Malformed Outputs
Even with structured outputs, LLMs can produce garbage. Validate:
from pydantic import ValidationError

try:
    result = JobClassification.model_validate(raw_output)
except ValidationError as e:
    # Retry with more explicit instructions
    result = await classify_with_retry(listing, error_context=str(e))
Graceful Degradation
When the LLM layer fails completely (see the sketch after this list):
- Return cached responses if available (even stale)
- Fall back to rule-based systems
- Queue for async processing
- Show user-friendly error with ETA
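Put together, that chain might look like the following sketch; classify_job, cache, classify_with_rules, and retry_queue are hypothetical stand-ins for the real components:

import logging

logger = logging.getLogger("llm")

async def classify_with_degradation(listing: str) -> JobClassification | None:
    """Degrade in stages: stale cache, rule-based fallback, async queue.
    Helper names are illustrative, not a real API."""
    try:
        return await classify_job(listing)                  # normal LLM path
    except Exception:
        logger.exception("LLM classification failed; degrading")

    if (stale := await cache.get(listing, allow_stale=True)) is not None:
        return stale                                        # stale beats nothing

    if (rules := classify_with_rules(listing)) is not None:
        return rules                                        # rule-based fallback

    await retry_queue.enqueue("classify_job", listing)      # retry asynchronously
    return None  # caller renders a friendly "still processing" message with an ETA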
Key Takeaways
- Preprocessing is half the battle: Clean inputs → better outputs
- Cache everything: Semantic caching is a superpower
- Use structured outputs: Stop parsing strings
- Build for failure: Retries, fallbacks, graceful degradation
- Measure obsessively: You can't optimize what you can't see
The difference between a prototype and production is handling the 1000 edge cases that tutorials never mention. Start building these patterns in from day one.
Questions or want to discuss? Reach out on LinkedIn.