Migrating a Production LLM Backend from AWS to DigitalOcean
We just finished moving ApplyPass's AI backend — the FastAPI service that does job classification, answer generation, and embedding-based matching — off AWS ECS Fargate and onto DigitalOcean App Platform. The cutover was zero-downtime and the service has been stable since.
It was also not as boring as a "lift-and-shift" is supposed to be. Three things broke in ways that didn't show up until production traffic hit them, and the most expensive one didn't show up in our logs at all. This is the write-up I wish I'd had before starting.
Why move at all
The honest answer: cost and operational overhead, not any single AWS failure.
ECS Fargate is a fine platform, but our setup had accumulated the usual tax — a Terraform stack spanning VPCs, subnets, security groups, an ALB, target groups, ECR lifecycle policies, and CodeBuild. For a small team running one primary service, that is a lot of surface area to keep patched and reasoned about.
DigitalOcean App Platform collapses most of that into a single app spec: you point it at a repo, declare the run command and environment, and it handles the build, the registry, TLS, and the load balancer. The trade is real — you give up fine-grained control — but for our workload that control was overhead, not leverage.
This was a replatform, not a rewrite. (If you know the AWS "7 Rs" of migration, this is the Replatform R — keep the application essentially as-is, swap the platform underneath it.) One deliberate exception: we left model training on AWS SageMaker. Training is a periodic batch job with very different requirements from a latency-sensitive inference service, and there was no reason to couple the two migrations.
The platform mapping
Most of the migration was mechanical — a translation table:
| AWS | DigitalOcean | Notes |
|---|---|---|
| ECS Fargate task | App Platform service | Run command + instance size |
| Application Load Balancer | Built into App Platform | TLS handled for you |
| ECR | DigitalOcean Container Registry (or build-from-source) | |
| RDS / Aurora PostgreSQL | DO Managed PostgreSQL | See below |
| S3 | DO Spaces | S3-compatible API |
| Secrets Manager | App-level encrypted env vars | |
| CodeBuild / CodePipeline | GitHub Actions | |
The translation table is the easy part. The interesting part is everything the table doesn't tell you — the behavioral differences between two services that present the same API.
Bug 1: prepared statements vs. a transaction-mode pooler
The first failure showed up immediately in the test environment: roughly every
other database query failed with a DuplicatePreparedStatementError.
The cause is a classic, and it's worth understanding because it bites anyone running async Python against a managed Postgres. DO Managed PostgreSQL puts a PgBouncer connection pooler in front of the database, in transaction mode. In transaction mode, a client connection is only yours for the duration of a single transaction — the next query may land on a completely different backend connection.
asyncpg doesn't know that. By default it prepares statements and caches them,
keyed to what it believes is a stable connection. Behind a transaction-mode
pooler that assumption is false: a statement prepared on backend A gets
re-used on backend B, where it doesn't exist, or gets prepared again under the
same name on a backend that already holds it, which is the case that raises
DuplicatePreparedStatementError.
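To make the mechanism concrete, here is a toy model of it. Nothing in it is asyncpg or PgBouncer code: `Backend`, `CachingClient`, and the round-robin hand-out are illustrative assumptions, and a real pooler assigns whichever server connection happens to be free.

```python
import itertools


class Backend:
    """One server-side Postgres connection with its own prepared statements."""

    def __init__(self) -> None:
        self.prepared = set()

    def prepare(self, name: str) -> None:
        if name in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" already exists')
        self.prepared.add(name)

    def execute(self, name: str) -> None:
        if name not in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" does not exist')


class CachingClient:
    """Caches statement names as if it always talked to the same backend."""

    def __init__(self, backends: list, cache_enabled: bool = True) -> None:
        # Round-robin stands in for "whichever backend the pooler hands out".
        self._next_backend = itertools.cycle(backends)
        self._cache = {}
        self._cache_enabled = cache_enabled

    def query(self, sql: str) -> None:
        backend = next(self._next_backend)  # a different backend per call
        if not self._cache_enabled:
            return  # unnamed statement: nothing to look up on the backend
        name = self._cache.get(sql)
        if name is None:
            name = f"stmt_{len(self._cache) + 1}"
            backend.prepare(name)
            self._cache[sql] = name
        backend.execute(name)  # fails if this backend never saw `name`
```

With two backends, the first `query("SELECT 1")` succeeds and the second raises, because the cached name only exists on the first backend. Constructing the client with `cache_enabled=False` makes every call succeed, which is the toy equivalent of turning the statement cache off.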
The fix is one argument:
import asyncpg
# DO Managed Postgres routes through PgBouncer in transaction mode.
# asyncpg's prepared-statement cache assumes a session-pinned connection,
# so it must be disabled — otherwise statements prepared on one backend
# connection are re-used on another, where they don't exist.
pool = await asyncpg.create_pool(
    dsn=DATABASE_URL,
    statement_cache_size=0,  # required behind a transaction-mode pooler
)
There is a second option worth knowing: managed Postgres usually also exposes a direct (non-pooled) connection on a different port. For a service that holds its own long-lived connection pool — which an async FastAPI app does — connecting directly and letting the application pool do the pooling avoids the PgBouncer quirk entirely and keeps prepared statements working. We disabled the statement cache as the fast fix and noted the direct-connection route as a follow-up.
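One way to keep both routes available is to make the pooler-specific setting an explicit switch rather than a hard-coded argument. This is a minimal sketch, not our production code: the helper name `asyncpg_pool_kwargs` and the `behind_pgbouncer` flag are hypothetical, and only `dsn` and `statement_cache_size` are real `asyncpg.create_pool` arguments.

```python
def asyncpg_pool_kwargs(dsn: str, behind_pgbouncer: bool) -> dict:
    """Build keyword arguments for asyncpg.create_pool().

    behind_pgbouncer is a hypothetical flag: True when the DSN points at the
    transaction-mode pooler, False when it points at the direct port.
    """
    kwargs = {"dsn": dsn}
    if behind_pgbouncer:
        # Transaction-mode pooling: disable the prepared-statement cache.
        kwargs["statement_cache_size"] = 0
    return kwargs


# Usage inside an async startup hook (asyncpg assumed installed):
#   pool = await asyncpg.create_pool(**asyncpg_pool_kwargs(DATABASE_URL, True))
```

Switching to the direct connection then becomes a change to the DSN and one flag, and the statement cache comes back on its own.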
The lesson: "PostgreSQL" is not one thing. Two providers can both be spec-compliant and still differ in pooling behavior in ways that break your driver. Test the driver against the actual managed instance, not against a local Postgres.
Bug 2: the env var that wasn't carried over
This one produced no errors at all; it just made everything 2.4× slower.
After the test environment was up, latency was visibly worse than AWS. Same code, same model, same database. The service worked; it was just sluggish under any real concurrency.
The cause was embarrassingly small. Our container starts the app like this:
uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers ${WORKERS:-1}
On AWS, the task definition set WORKERS=4. When we wrote the DigitalOcean app
spec, that one variable didn't make the jump — and the :-1 default quietly took
over. The service had been running on a single worker process. One worker
can't overlap request handling across CPU cores, so under concurrency requests
queued behind each other. Hence the ~2.4× latency gap.
Setting WORKERS=4 in the app spec closed it.
  envs:
  - key: DATABASE_URL
    scope: RUN_TIME
+ - key: WORKERS
+   value: "4"
+   scope: RUN_TIME
The lesson: environment parity is part of the migration, not a detail. A missing env var won't crash — it'll silently pick a default and degrade you. Before any cutover, diff the full resolved environment of the old platform against the new one, variable by variable. We now treat the env matrix as a first-class migration artifact.
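The variable-by-variable diff is easy to automate once both environments are exported as plain key/value mappings. A minimal sketch, assuming you can dump each platform's resolved environment into a dict; the function name, the sample values, and the report keys are all illustrative:

```python
def diff_env(old: dict, new: dict) -> dict:
    """Compare two resolved environments by variable name.

    Only names are reported, never values, so the report is safe to paste
    into a ticket even when the variables hold secrets.
    """
    old_keys, new_keys = set(old), set(new)
    return {
        "missing_in_new": sorted(old_keys - new_keys),
        "only_in_new": sorted(new_keys - old_keys),
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }


# Hypothetical sample data mirroring the WORKERS incident.
aws_env = {"DATABASE_URL": "postgres://aws-host/db", "WORKERS": "4"}
do_env = {"DATABASE_URL": "postgres://do-host/db"}
report = diff_env(aws_env, do_env)
```

Here `report["missing_in_new"]` contains `WORKERS`, which is exactly the gap that cost us the latency. Entries under `changed` are expected for things like database URLs, but each one still deserves a deliberate sign-off.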
Bug 3: the silent fallback that hid a broken path
This is the one I think about most, because it's the one that didn't show up in the AI backend's own logs.
The matching service loads a small projection model. On AWS it loaded from S3. On DigitalOcean we pointed it at DO Spaces — set the endpoint, set the bucket — and matching kept working. Migration done. Or so it looked.
It wasn't. The model object had never actually been synced to the Spaces bucket
at the expected key. Every model-version check was doing a head_object on a key
that returned HTTP 404. Matching still worked only because of this:
# Before: a catch-all except turns a broken config into an invisible fallback.
def get_latest_model_version() -> int:
    try:
        obj = s3.head_object(Bucket=BUCKET, Key=MODEL_KEY)
        return parse_version(obj)
    except Exception:  # swallows a 404 from a misconfigured bucket...
        return 0       # ...and "0" happens to mean "load from MongoDB instead"
The object-storage path was dead, and the system was quietly running on a MongoDB fallback that was only ever meant to be a safety net. Nothing was down, so nothing paged. The misconfiguration was completely invisible.
The fix isn't "add object storage back" — it's to stop the except from lying:
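# After: fall back only on the expected error; fail loud on everything else.
from botocore.exceptions import ClientError

def get_latest_model_version() -> int:
    try:
        obj = s3.head_object(Bucket=BUCKET, Key=MODEL_KEY)
        return parse_version(obj)
    except ClientError as err:
        # head_object has no response body, so a missing key surfaces as a
        # generic ClientError with code "404" rather than as NoSuchKey.
        if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise  # auth failures, wrong endpoint: propagate
        log.warning("model object missing; using MongoDB fallback", key=MODEL_KEY)
        return 0
    # network errors are not ClientError, so they propagate as well.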
A catch-all except that returns a fallback value is one of the most dangerous
patterns in production code. It collapses "the expected thing is absent" and
"something is fundamentally misconfigured" into the same silent outcome. The
first deserves a fallback; the second deserves a page.
The lesson: catch the specific exception you have a plan for. Everything else should be loud. A fallback that works is worse than an error that fires — the fallback hides the bug until something else makes it expensive.
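The "which error do I have a plan for" decision is worth isolating so it can be unit-tested without a bucket. A minimal sketch, assuming `s3` is a boto3 client: there, a missing key on `head_object` arrives as a botocore `ClientError` whose `response["Error"]["Code"]` is `"404"`. The helper name and the set of codes are my own choices, not part of any library.

```python
# Error codes that mean "the object is absent", as opposed to
# "the storage configuration is broken".
MISSING_OBJECT_CODES = frozenset({"404", "NoSuchKey"})


def is_missing_object(error_response: dict) -> bool:
    """Return True only for the one failure the fallback is meant to cover.

    error_response is the dict found on a botocore ClientError as
    `err.response`; anything without a recognised code counts as unexpected.
    """
    code = str(error_response.get("Error", {}).get("Code", ""))
    return code in MISSING_OBJECT_CODES
```

In the handler, the fallback runs only when `is_missing_object(err.response)` is true, and every other error is re-raised. A 403 from bad credentials or an empty response from a wrong endpoint both return False here, so they stay loud.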
The cutover — and the incident after it
The cutover itself went the way you want: DNS moved over with both platforms warm, traffic shifted, the old ECS service scaled to zero as a standby, no downtime.
The lesson came a few hours later.
The new service wedged. Throughput didn't error — it collapsed. Log volume fell from thousands of lines per ten minutes to almost nothing and stayed flat. The app wasn't crash-looping; it wasn't OOM-killed; it just stopped doing work. It took a manual redeploy, hours later, to bring it back.
The part worth internalizing: the AI backend's own logs and traces were
clean. When an App Platform component hangs, the platform's edge returns
504 Gateway Timeout to callers — but those 5xx responses are generated at the
edge and never reach the application. The app has no idea it's failing.
The evidence was entirely in the wrong place. The real signal was in the
calling service, which had logged thousands of 504s against the
answer-generation endpoint, and in the throughput graph, which showed the
work-stoppage cliff. If you only watch a service's own error rate, a clean hang
is invisible.
The lesson — and this is now baked into our runbook:
- Monitor from the caller's side. A synthetic check or an upstream service's error rate catches hangs that the hung service itself cannot report.
- Alert on the absence of work, not just on errors. A throughput floor ("requests per minute < N") catches a silent wedge that an error-rate alert never will.
- A clean hang and a healthy service look identical from the inside. Decide in advance where you'd look, because the logs won't volunteer it.
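A throughput floor is simple enough to sketch. This is an illustration under assumptions, not our monitoring code: the class name, the sliding-window approach, and the injectable clock are mine, and it only helps if it runs in the caller or a separate monitor, since a wedged process can't evaluate its own alert.

```python
import time
from collections import deque


class ThroughputFloor:
    """Flags a stall when fewer than `floor` requests landed in the window."""

    def __init__(self, floor: int, window_seconds: float, clock=time.monotonic) -> None:
        self._floor = floor
        self._window = window_seconds
        self._clock = clock  # injectable so tests don't have to sleep
        self._started = clock()
        self._events = deque()

    def record(self) -> None:
        """Call once per successful response observed from the service."""
        self._events.append(self._clock())

    def is_stalled(self) -> bool:
        now = self._clock()
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= now - self._window:
            self._events.popleft()
        # Don't alert before one full window of data has been observed.
        if now - self._started < self._window:
            return False
        return len(self._events) < self._floor
```

The caller records every successful response and a periodic job checks `is_stalled()`. The floor has to sit below your quietest legitimate traffic, otherwise the alert fires every night.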
Root-causing the hang itself is ongoing. The signature (total work stoppage, no crash loop, recovery only by redeploy) points at a deadlock rather than resource exhaustion. But the operational lesson didn't need the root cause.
A migration checklist
If I did this again, this is the list I'd start from:
- Build the environment matrix first. Every variable, both platforms, side by side. Treat any difference as a bug until proven intentional.
- Test your DB driver against the actual managed instance. Pooling mode, prepared statements, connection limits — local Postgres won't reveal these.
- Audit every except on a migrated I/O path. Anything that points at storage, a queue, or a model artifact. Narrow the catch; log the fallback.
- Load-test the new platform before the cutover, not after. The single-worker latency gap would have been obvious under load on day one.
- Keep the old platform warm as a standby. Scale it to zero, don't delete it. Rollback should be a DNS change, not a redeploy.
- Add caller-side and throughput-floor monitoring before you cut over. The failure mode you can't see is the one that will hurt.
- Don't migrate what doesn't need to move. Leaving SageMaker training on AWS removed an entire dimension of risk for zero benefit lost.
Takeaways
The mechanical part of a replatform — the service mapping, the Terraform teardown, the CI/CD rewrite — is the part that looks hard and isn't. The hard part is the behavioral gap between two platforms that claim to offer the same thing, and the failure modes your existing observability was never built to see.
Every real bug here was a visibility failure as much as a technical one. The pooler incompatibility was loud and easy. The missing worker, the dead storage path, and the silent hang were all quiet — and quiet is what costs you. Migrate the monitoring with the same rigor you migrate the code, and a replatform is genuinely routine.
Building or migrating an LLM backend? I'm always happy to compare notes — reach out on LinkedIn.