Infrastructure · Migration · FastAPI · DigitalOcean

Migrating a Production LLM Backend from AWS to DigitalOcean

May 20, 2026 · 11 min

We just finished moving ApplyPass's AI backend — the FastAPI service that does job classification, answer generation, and embedding-based matching — off AWS ECS Fargate and onto DigitalOcean App Platform. The cutover was zero-downtime and the service has been stable since.

It was also not as boring as a "lift-and-shift" is supposed to be. Three things broke in ways that didn't show up until production traffic hit them, and the most expensive one didn't show up in our logs at all. This is the write-up I wish I'd had before starting.


Why move at all

The honest answer: cost and operational overhead, not any single AWS failure.

ECS Fargate is a fine platform, but our setup had accumulated the usual tax — a Terraform stack spanning VPCs, subnets, security groups, an ALB, target groups, ECR lifecycle policies, and CodeBuild. For a small team running one primary service, that is a lot of surface area to keep patched and reasoned about.

DigitalOcean App Platform collapses most of that into a single app spec: you point it at a repo, declare the run command and environment, and it handles the build, the registry, TLS, and the load balancer. The trade is real — you give up fine-grained control — but for our workload that control was overhead, not leverage.

This was a replatform, not a rewrite. (If you know the AWS "7 Rs" of migration, this is the Replatform R — keep the application essentially as-is, swap the platform underneath it.) One deliberate exception: we left model training on AWS SageMaker. Training is a periodic batch job with very different requirements from a latency-sensitive inference service, and there was no reason to couple the two migrations.


The platform mapping

Most of the migration was mechanical — a translation table:

| AWS | DigitalOcean | Notes |
|---|---|---|
| ECS Fargate task | App Platform service | Run command + instance size |
| Application Load Balancer | Built into App Platform | TLS handled for you |
| ECR | DigitalOcean Container Registry (or build-from-source) | |
| RDS / Aurora PostgreSQL | DO Managed PostgreSQL | See below |
| S3 | DO Spaces | S3-compatible API |
| Secrets Manager | App-level encrypted env vars | |
| CodeBuild / CodePipeline | GitHub Actions | |

The translation table is the easy part. The interesting part is everything the table doesn't tell you — the behavioral differences between two services that present the same API.


Bug 1: prepared statements vs. a transaction-mode pooler

The first failure showed up immediately in the test environment: roughly every other database query failed with a DuplicatePreparedStatementError.

The cause is a classic, and it's worth understanding because it bites anyone running async Python against a managed Postgres. DO Managed PostgreSQL puts a PgBouncer connection pooler in front of the database, in transaction mode. In transaction mode, a client connection is only yours for the duration of a single transaction — the next query may land on a completely different backend connection.

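asyncpg doesn't know that. By default it prepares statements and caches them, keyed to what it believes is a stable connection. Behind a transaction-mode pooler that assumption is false: a statement prepared on backend A gets re-used on backend B, where it doesn't exist, or gets prepared again on a backend where another client has already registered the same statement name. The second case is the one that raises DuplicatePreparedStatementError.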
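To make the mechanism concrete, here is a toy model in plain Python. It is not how PgBouncer or asyncpg are implemented: the class names are invented for illustration, and the round-robin hand-out is a simplifying assumption standing in for "the next query may land on a different backend". It models the case where a cached statement is re-used on a backend that never prepared it.

```python
import itertools


class Backend:
    """One server-side connection with its own set of prepared statements."""

    def __init__(self, name: str):
        self.name = name
        self.prepared = set()


class TransactionPooler:
    """Toy transaction-mode pooler: each query gets the next backend in turn."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def next_backend(self) -> Backend:
        return next(self._cycle)


class CachingClient:
    """Toy driver that remembers which statements it has already prepared."""

    def __init__(self, pooler: TransactionPooler, cache_statements: bool = True):
        self.pooler = pooler
        self.cache_statements = cache_statements
        self.cache = set()  # names the client believes are prepared

    def execute(self, stmt_name: str) -> str:
        backend = self.pooler.next_backend()
        # With the cache on, the client prepares only on first use.
        # With it off, it prepares on whatever backend it was given.
        if not self.cache_statements or stmt_name not in self.cache:
            backend.prepared.add(stmt_name)
            self.cache.add(stmt_name)
        if stmt_name not in backend.prepared:
            raise LookupError(
                f"prepared statement {stmt_name!r} does not exist on {backend.name}"
            )
        return f"ok on {backend.name}"
```

With the cache enabled, the first call succeeds on backend A and the second fails on backend B. With `cache_statements=False`, every call succeeds regardless of which backend it lands on, which is the behavior the fix relies on.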

The fix is one argument:

import os

import asyncpg

DATABASE_URL = os.environ["DATABASE_URL"]


async def create_db_pool() -> asyncpg.Pool:
    # DO Managed Postgres routes through PgBouncer in transaction mode.
    # asyncpg's prepared-statement cache assumes a session-pinned connection,
    # so it must be disabled. Otherwise statements prepared on one backend
    # connection are re-used on another, where they don't exist.
    return await asyncpg.create_pool(
        dsn=DATABASE_URL,
        statement_cache_size=0,  # required behind a transaction-mode pooler
    )

There is a second option worth knowing: managed Postgres usually also exposes a direct (non-pooled) connection on a different port. For a service that holds its own long-lived connection pool — which an async FastAPI app does — connecting directly and letting the application pool do the pooling avoids the PgBouncer quirk entirely and keeps prepared statements working. We disabled the statement cache as the fast fix and noted the direct-connection route as a follow-up.
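A minimal sketch of how the two routes could be kept side by side, assuming the choice is made explicitly per environment. The helper names, the pool sizes, and the example DSNs are placeholders rather than our actual configuration; `statement_cache_size`, `min_size`, and `max_size` are real `asyncpg.create_pool` arguments.

```python
from typing import Any


def pool_options(
    dsn: str, *, via_transaction_pooler: bool, max_size: int = 10
) -> dict[str, Any]:
    """Build keyword arguments for asyncpg.create_pool()."""
    options: dict[str, Any] = {"dsn": dsn, "min_size": 1, "max_size": max_size}
    if via_transaction_pooler:
        # Behind a transaction-mode pooler the statement cache must be off.
        options["statement_cache_size"] = 0
    # On a direct connection the default cache is left alone, so prepared
    # statements keep working.
    return options


async def create_pool(dsn: str, *, via_transaction_pooler: bool):
    # Imported here so pool_options() stays usable without the driver installed.
    import asyncpg

    return await asyncpg.create_pool(
        **pool_options(dsn, via_transaction_pooler=via_transaction_pooler)
    )


# Placeholder DSNs: the pooled and direct endpoints differ by port.
POOLED = pool_options("postgresql://app@db.example.internal:6432/app", via_transaction_pooler=True)
DIRECT = pool_options("postgresql://app@db.example.internal:5432/app", via_transaction_pooler=False)
```

One thing to check before taking the direct route: the application pool's `max_size`, multiplied by the number of worker processes, has to fit under the managed instance's connection limit.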

The lesson: "PostgreSQL" is not one thing. Two providers can both be spec-compliant and still differ in pooling behavior in ways that break your driver. Test the driver against the actual managed instance, not against a local Postgres.


Bug 2: the env var that wasn't carried over

This one produced no errors at all. It just made everything 2.4× slower.

After the test environment was up, latency was visibly worse than AWS. Same code, same model, same database. The service worked; it was just sluggish under any real concurrency.

The cause was embarrassingly small. Our container starts the app like this:

uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers ${WORKERS:-1}

On AWS, the task definition set WORKERS=4. When we wrote the DigitalOcean app spec, that one variable didn't make the jump, and the :-1 default quietly took over. The service had been running on a single worker process. A single worker is one process with one event loop: it can't spread request handling across CPU cores, so under concurrency requests queued behind each other. Hence the ~2.4× latency gap.

Setting WORKERS=4 in the app spec closed it.

  envs:
    - key: DATABASE_URL
      scope: RUN_TIME
+   - key: WORKERS
+     value: "4"
+     scope: RUN_TIME

The lesson: environment parity is part of the migration, not a detail. A missing env var won't crash — it'll silently pick a default and degrade you. Before any cutover, diff the full resolved environment of the old platform against the new one, variable by variable. We now treat the env matrix as a first-class migration artifact.
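The variable-by-variable diff can be a few lines of Python. This is a sketch under assumptions: it expects each platform's resolved environment as a plain mapping (how you export that from a task definition or an app spec is up to you), and `diff_env` and `require_int_env` are names made up for this example. It reports variable names only, so secret values never end up in the output.

```python
import os
from typing import Mapping


def diff_env(old: Mapping[str, str], new: Mapping[str, str]) -> dict[str, list[str]]:
    """Compare two resolved environments and report the differing names."""
    return {
        "missing": sorted(old.keys() - new.keys()),  # set on old, absent on new
        "added": sorted(new.keys() - old.keys()),    # only on the new platform
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def require_int_env(name: str, environ: Mapping[str, str] = os.environ) -> int:
    """Read a required integer setting, failing at startup instead of defaulting."""
    raw = environ.get(name)
    if raw is None:
        raise RuntimeError(f"required environment variable {name} is not set")
    return int(raw)


aws_env = {"DATABASE_URL": "postgres://old-host/app", "WORKERS": "4"}
do_env = {"DATABASE_URL": "postgres://new-host/app"}
report = diff_env(aws_env, do_env)
```

Here `report["missing"]` contains `WORKERS`, which is exactly the gap that went unnoticed. The second helper is the complementary guard: resolving the worker count through something like `require_int_env("WORKERS")` turns a forgotten variable into a failed deploy rather than a slow one.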


Bug 3: the silent fallback that hid a broken path

This is the one I think about most, because it's the one that didn't show up in the AI backend's own logs.

The matching service loads a small projection model. On AWS it loaded from S3. On DigitalOcean we pointed it at DO Spaces — set the endpoint, set the bucket — and matching kept working. Migration done. Or so it looked.

It wasn't. The model object had never actually been synced to the Spaces bucket at the expected key. Every model-version check was doing a head_object on a key that returned HTTP 404. Matching still worked only because of this:

# Before: a catch-all except turns a broken config into an invisible fallback.
def get_latest_model_version() -> int:
    try:
        obj = s3.head_object(Bucket=BUCKET, Key=MODEL_KEY)
        return parse_version(obj)
    except Exception:   # swallows a 404 from a misconfigured bucket...
        return 0        # ...and "0" happens to mean "load from MongoDB instead"

The object-storage path was dead, and the system was quietly running on a MongoDB fallback that was only ever meant to be a safety net. Nothing was down, so nothing paged. The misconfiguration was completely invisible.

The fix isn't "add object storage back" — it's to stop the except from lying:

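# After: fall back only on the expected error; fail loud on everything else.
from botocore.exceptions import ClientError

def get_latest_model_version() -> int:
    try:
        obj = s3.head_object(Bucket=BUCKET, Key=MODEL_KEY)
        return parse_version(obj)
    except ClientError as exc:
        # head_object reports a missing key as a ClientError with code "404".
        # It does not raise s3.exceptions.NoSuchKey (that is get_object).
        if exc.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise  # auth failures, wrong bucket, and the like propagate
        log.warning("model object missing; using MongoDB fallback", key=MODEL_KEY)
        return 0
    # Wrong endpoint and network errors are not ClientError, so they propagate too.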

A catch-all except that returns a fallback value is one of the most dangerous patterns in production code. It collapses "the expected thing is absent" and "something is fundamentally misconfigured" into the same silent outcome. The first deserves a fallback; the second deserves a page.

The lesson: catch the specific exception you have a plan for. Everything else should be loud. A fallback that works is worse than an error that fires — the fallback hides the bug until something else makes it expensive.
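The behavior is easy to pin down in a test. The sketch below uses a hypothetical `StorageError` as a stand-in for the storage client's exception so it runs without any cloud SDK; the function and constant names are also invented for the example.

```python
class StorageError(Exception):
    """Hypothetical stand-in for a storage client error carrying a status code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


FALLBACK_VERSION = 0  # as in the post: 0 means "load from the fallback store"


def latest_version(head_object) -> int:
    """Return the stored version, falling back only when the object is absent."""
    try:
        return head_object()
    except StorageError as exc:
        if exc.code != "404":
            raise  # anything but "not found" is a misconfiguration: fail loud
        return FALLBACK_VERSION


def object_missing() -> int:
    raise StorageError("404")


def access_denied() -> int:
    raise StorageError("403")
```

`latest_version(object_missing)` returns the fallback value, while `latest_version(access_denied)` raises. A test asserting both cases would have failed against the original catch-all version, which is the point of writing it.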


The cutover — and the incident after it

The cutover itself went the way you want: DNS moved over with both platforms warm, traffic shifted, the old ECS service scaled to zero as a standby, no downtime.

The lesson came a few hours later.

The new service wedged. Throughput didn't error — it collapsed. Log volume fell from thousands of lines per ten minutes to almost nothing and stayed flat. The app wasn't crash-looping; it wasn't OOM-killed; it just stopped doing work. It took a manual redeploy, hours later, to bring it back.

The part worth internalizing: the AI backend's own logs and traces were clean. When an App Platform component hangs, the platform's edge returns 504 Gateway Timeout to callers — but those 5xx responses are generated at the edge and never reach the application. The app has no idea it's failing.

The evidence was entirely in the wrong place. The real signal was in the calling service, which had logged thousands of 504s against the answer-generation endpoint, and in the throughput graph, which showed the work-stoppage cliff. If you only watch a service's own error rate, a clean hang is invisible.
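Because those 504s exist only outside the service, any check that is meant to see them has to run outside it too. A minimal sketch of such an external probe, using only the standard library; the function names, the five-second timeout, and the three-way classification are assumptions for illustration, not what our monitoring actually runs.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses, e.g. a 504 generated at the edge


def probe(url: str, fetch: Callable[[str], int] = http_status) -> str:
    """Classify an endpoint as seen by a caller: 'ok', 'failing', or 'unreachable'."""
    try:
        status = fetch(url)
    except OSError:
        # Timeouts and connection failures (URLError is an OSError subclass).
        return "unreachable"
    return "failing" if status >= 500 else "ok"
```

The `fetch` parameter exists so the classification can be tested without a network. Run on a schedule from somewhere other than the service itself, a probe like this reports "failing" for a hung component even while that component's own logs stay clean.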

The lesson — and this is now baked into our runbook:

  • Monitor from the caller's side. A synthetic check or an upstream service's error rate catches hangs that the hung service itself cannot report.
  • Alert on the absence of work, not just on errors. A throughput floor ("requests per minute < N") catches a silent wedge that an error-rate alert never will.
  • A clean hang and a healthy service look identical from the inside. Decide in advance where you'd look, because the logs won't volunteer it.
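The throughput-floor idea from the second bullet can be sketched as a sliding-window counter. The class name, the window, and the threshold are all assumptions; in practice this logic would more likely live in the monitoring system's alert rules than in application code. It has to be evaluated outside the process being watched, since a wedged process cannot report on itself.

```python
import time
from collections import deque
from typing import Optional


class ThroughputFloor:
    """Flags when fewer than `min_requests` were seen in the last window."""

    def __init__(self, min_requests: int, window_seconds: float):
        self.min_requests = min_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: Optional[float] = None) -> None:
        """Note one completed request."""
        self._timestamps.append(time.monotonic() if now is None else now)

    def breached(self, now: Optional[float] = None) -> bool:
        """Return True when recent throughput is below the floor."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop requests that have aged out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) < self.min_requests


floor = ThroughputFloor(min_requests=3, window_seconds=60)
for t in (0, 10, 20):
    floor.record(now=t)
healthy = floor.breached(now=30)  # three requests inside the window
wedged = floor.breached(now=75)   # only the request at t=20 is still inside
```

The `now` argument is there to make the behavior deterministic in tests. Note that a floor needs a traffic baseline to be meaningful: on a service that is legitimately idle overnight, a fixed threshold will fire falsely.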

Root-causing the hang itself is ongoing. The signature (total work-stoppage, no crash-loop, recovered only by redeploy) points at a deadlock rather than resource exhaustion. But the operational lesson didn't need the root cause.


A migration checklist

If I did this again, this is the list I'd start from:

  1. Build the environment matrix first. Every variable, both platforms, side by side. Treat any difference as a bug until proven intentional.
  2. Test your DB driver against the actual managed instance. Pooling mode, prepared statements, connection limits — local Postgres won't reveal these.
  3. Audit every except on a migrated I/O path. Anything that points at storage, a queue, or a model artifact. Narrow the catch; log the fallback.
  4. Load-test the new platform before the cutover, not after. The single-worker latency gap would have been obvious under load on day one.
  5. Keep the old platform warm as a standby. Scale it to zero, don't delete it. Rollback should be a DNS change, not a redeploy.
  6. Add caller-side and throughput-floor monitoring before you cut over. The failure mode you can't see is the one that will hurt.
  7. Don't migrate what doesn't need to move. Leaving SageMaker training on AWS removed an entire dimension of risk without giving anything up.

Takeaways

The mechanical part of a replatform — the service mapping, the Terraform teardown, the CI/CD rewrite — is the part that looks hard and isn't. The hard part is the behavioral gap between two platforms that claim to offer the same thing, and the failure modes your existing observability was never built to see.

Every real bug here was a visibility failure as much as a technical one. The pooler incompatibility was loud and easy. The missing worker, the dead storage path, and the silent hang were all quiet — and quiet is what costs you. Migrate the monitoring with the same rigor you migrate the code, and a replatform is genuinely routine.


Building or migrating an LLM backend? I'm always happy to compare notes — reach out on LinkedIn.