MIR-926

ACME TLS-ALPN-01 broken, causes rate limit doom loop and cert provisioning failures

Done public
phinze opened Mar 29, 2026, updated Mar 30, 2026

Problem

ACME certificate provisioning fails for domains during DNS migration and cannot recover due to a chain of issues that compound into a rate limit doom loop.

Root Cause: TLS-ALPN-01 is broken

ServeTLSWithController in autotls.go creates a tls.Config without acme.ALPNProto in NextProtos:

tlsConfig := &tls.Config{
    GetCertificate: certProvider.GetCertificate,
    MinVersion:     tls.VersionTLS12,
    // Missing: NextProtos with acme.ALPNProto
}

The autocert.Manager docs explicitly state: "add acme.ALPNProto to NextProtos for tls-alpn-01, or use HTTPHandler for http-01."

Without it, when Let's Encrypt connects with ALPN ["acme-tls/1"] to validate, Go's TLS stack rejects the handshake at the transport level:

http: TLS handshake error from 23.178.112.104:46993: 
  tls: client requested unsupported application protocols (["acme-tls/1"])

The challenge cert that autocert prepared is never served. The authorization fails.
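
The rejection can be reproduced locally without Let's Encrypt. The sketch below is a minimal stand-in, not the Miren code path: it starts a throwaway httptest TLS server, then connects the way a TLS-ALPN-01 validator would, offering only "acme-tls/1". The helper name handshakeAsValidator is hypothetical, and the alpnProto constant is an assumption that mirrors the value of acme.ALPNProto so the example needs only the standard library.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// alpnProto mirrors the value of acme.ALPNProto in golang.org/x/crypto/acme.
const alpnProto = "acme-tls/1"

// handshakeAsValidator (hypothetical helper) starts a throwaway TLS server that
// advertises serverProtos, then connects offering only alpnProto, as a
// TLS-ALPN-01 validator does. It returns the negotiated protocol or the
// client-side handshake error.
func handshakeAsValidator(serverProtos []string) (string, error) {
	srv := httptest.NewUnstartedServer(http.NotFoundHandler())
	srv.Config.ErrorLog = log.New(io.Discard, "", 0) // silence the expected server-side error
	srv.TLS = &tls.Config{NextProtos: serverProtos}
	srv.StartTLS()
	defer srv.Close()

	conn, err := tls.Dial("tcp", srv.Listener.Addr().String(), &tls.Config{
		InsecureSkipVerify: true, // throwaway test certificate only
		NextProtos:         []string{alpnProto},
	})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return conn.ConnectionState().NegotiatedProtocol, nil
}

func main() {
	_, err := handshakeAsValidator([]string{"h2", "http/1.1"})
	fmt.Println("without acme-tls/1:", err)

	proto, err := handshakeAsValidator([]string{"h2", "http/1.1", alpnProto})
	fmt.Println("with acme-tls/1:", proto, err)
}
```

The first call is expected to fail during the handshake, before any certificate callback gets a chance to serve a challenge cert. The second call should negotiate acme-tls/1.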

The Doom Loop

autocert.Manager.supportedChallengeTypes() returns ["tls-alpn-01", "http-01"]. The verifyRFC method creates a new ACME order for each challenge type it attempts. Each failed order counts as a failed authorization against Let's Encrypt's rate limit (5 failures per hostname, per account, per hour).

The sequence per GetCertificate call:

  1. AuthorizeOrder → new order (authorization #1)
  2. Try TLS-ALPN-01 → Let's Encrypt probes port 443 → ALPN rejected → WaitAuthorization eventually fails
  3. continue AuthorizeOrderLoop → AuthorizeOrder again → new order (authorization #2)
  4. Try HTTP-01 → but authorization #2 is rate-limited (429) because we've burned through the budget

Result: each attempt burns 2 authorizations, but only 5 failures are allowed per hour, so the rate limit is hit by the third incoming TLS handshake. Subsequent attempts get 429 at AuthorizeOrder, which may extend the sliding window. The cert can never be obtained.
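
The arithmetic can be checked with a small model. This is a simplified sketch, not Let's Encrypt's actual accounting: it assumes every order fails and consumes exactly one unit of the budget. The function name is hypothetical.

```go
package main

import "fmt"

// attemptsUntilRateLimited models the loop above: every GetCertificate attempt
// opens ordersPerAttempt orders, each of which fails and consumes one unit of
// the failed-authorization budget. It returns the 1-based attempt during which
// the first 429 would be seen, or 0 if ordersPerAttempt is not positive.
func attemptsUntilRateLimited(budget, ordersPerAttempt int) int {
	if ordersPerAttempt < 1 {
		return 0
	}
	failures := 0
	for attempt := 1; ; attempt++ {
		for i := 0; i < ordersPerAttempt; i++ {
			if failures >= budget {
				return attempt // this order is rejected with 429
			}
			failures++
		}
	}
}

func main() {
	fmt.Println(attemptsUntilRateLimited(5, 2)) // prints 3
	fmt.Println(attemptsUntilRateLimited(5, 1)) // prints 6
}
```

With two orders per attempt, the third handshake already sees a 429. With one order per attempt, the budget would last for five failed attempts.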

Additional Issues Found

No backoff on ACME failures (upstream)

autocert.Manager has no failure caching or backoff (upstream has a "TODO: cache error results?" comment at line 291). Every call to GetCertificate that doesn't find a cached cert triggers a full ACME flow. On a site getting regular traffic, this means rapid-fire ACME attempts on every TLS handshake.
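
One possible shape for a failure cache is sketched below. Everything here is hypothetical and not part of autocert: the failureCache type, its wrap method, and the one-minute base and one-hour cap are illustrative assumptions. It wraps any function with the GetCertificate signature and skips the wrapped call while a domain is still backing off.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync"
	"time"
)

type getCertFunc func(*tls.ClientHelloInfo) (*tls.Certificate, error)

type failure struct {
	count   int
	retryAt time.Time
	err     error
}

// failureCache is a hypothetical wrapper; it is not part of autocert.
type failureCache struct {
	mu       sync.Mutex
	now      func() time.Time // injectable clock, so the demo needs no sleeping
	base     time.Duration    // delay after the first failure
	max      time.Duration    // upper bound for the doubled delay
	failures map[string]failure
}

func (c *failureCache) wrap(next getCertFunc) getCertFunc {
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		name := hello.ServerName
		c.mu.Lock()
		f, seen := c.failures[name]
		c.mu.Unlock()
		if seen && c.now().Before(f.retryAt) {
			return nil, f.err // still backing off: skip the ACME flow
		}

		cert, err := next(hello) // concurrent callers are not coalesced here

		c.mu.Lock()
		defer c.mu.Unlock()
		if err == nil {
			delete(c.failures, name)
			return cert, nil
		}
		f.count++
		delay := c.base
		for i := 1; i < f.count && delay < c.max; i++ {
			delay *= 2 // exponential backoff
		}
		if delay > c.max {
			delay = c.max
		}
		f.retryAt, f.err = c.now().Add(delay), err
		c.failures[name] = f
		return nil, err
	}
}

// demoBackoff reports how often the wrapped function ran during the backoff
// window and in total after the window elapsed.
func demoBackoff() (duringBackoff, afterBackoff int) {
	clock := time.Unix(0, 0)
	calls := 0
	failing := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		calls++
		return nil, errors.New("acme: authorization failed")
	}
	c := &failureCache{
		now:      func() time.Time { return clock },
		base:     time.Minute,
		max:      time.Hour,
		failures: map[string]failure{},
	}
	get := c.wrap(failing)
	hello := &tls.ClientHelloInfo{ServerName: "example.com"}
	for i := 0; i < 3; i++ {
		get(hello) // three handshakes in quick succession
	}
	duringBackoff = calls
	clock = clock.Add(time.Minute) // the backoff window has elapsed
	get(hello)
	return duringBackoff, calls
}

func main() {
	fmt.Println(demoBackoff()) // prints 1 2
}
```

In the demo, three back-to-back handshakes trigger a single wrapped call, and a second call happens only after the fake clock passes the retry time.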

5-minute hardcoded timeout blocks TLS handshakes

autocert.Manager.GetCertificate creates its own context.WithTimeout(context.Background(), 5*time.Minute) (line 278). When ACME fails, the first TLS handshake for a domain can block for up to 5 minutes before falling back to the self-signed cert. Users see a hung browser/curl with no feedback.
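
A wrapper along the following lines could bound how long a handshake waits. This is a sketch under assumptions: withHandshakeDeadline is a hypothetical name, the fallback certificate stands in for the self-signed cert, and the wrapped call keeps running in the background after the deadline so that a later handshake can pick up its result from wherever the real getter caches it.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"time"
)

type getCertFunc func(*tls.ClientHelloInfo) (*tls.Certificate, error)

// withHandshakeDeadline (hypothetical) returns fallback if next has not
// produced a certificate within d, or if next fails.
func withHandshakeDeadline(next getCertFunc, fallback *tls.Certificate, d time.Duration) getCertFunc {
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		type result struct {
			cert *tls.Certificate
			err  error
		}
		done := make(chan result, 1) // buffered so the goroutine can finish after a timeout
		go func() {
			cert, err := next(hello)
			done <- result{cert, err}
		}()

		timer := time.NewTimer(d)
		defer timer.Stop()
		select {
		case r := <-done:
			if r.err != nil {
				return fallback, nil
			}
			return r.cert, nil
		case <-timer.C:
			return fallback, nil // do not hold the handshake any longer
		}
	}
}

// demoDeadline reports whether a slow getter yields the fallback and a fast
// getter yields the issued certificate.
func demoDeadline() (slowGotFallback, fastGotIssued bool) {
	fallback, issued := &tls.Certificate{}, &tls.Certificate{}
	hello := &tls.ClientHelloInfo{ServerName: "example.com"}
	slow := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		time.Sleep(500 * time.Millisecond) // stands in for a stuck ACME flow
		return issued, nil
	}
	fast := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		return issued, nil
	}
	c1, _ := withHandshakeDeadline(slow, fallback, 20*time.Millisecond)(hello)
	c2, _ := withHandshakeDeadline(fast, fallback, 2*time.Second)(hello)
	return c1 == fallback, c2 == issued
}

func main() {
	fmt.Println(demoDeadline()) // prints true true
}
```

One caveat with this design: the ClientHelloInfo is handed to a goroutine that can outlive the handshake, so the wrapped getter should rely only on fields such as ServerName and not on the underlying connection.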

No DNS pre-check before ACME attempts

Eager provisioning (triggered by miren route set) immediately attempts ACME without checking if DNS actually points to the cluster. During DNS migration, this guarantees failures that count against the rate limit. A cheap DNS lookup to verify the domain resolves to one of our known IPs would prevent wasting authorization attempts.
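
A pre-check could look roughly like this. The names pointsAtCluster, lookupFunc, and systemLookup are hypothetical, and the addresses are documentation ranges, not real cluster IPs. The sketch is deliberately stricter than "one of our known IPs": it requires every resolved address to be known, on the assumption that mixed records mid-migration could still send the validator to the old host.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/netip"
)

// lookupFunc abstracts DNS resolution so the check can be tested offline.
type lookupFunc func(ctx context.Context, host string) ([]netip.Addr, error)

// systemLookup resolves A and AAAA records with the default resolver.
func systemLookup(ctx context.Context, host string) ([]netip.Addr, error) {
	return net.DefaultResolver.LookupNetIP(ctx, "ip", host)
}

// pointsAtCluster (hypothetical) reports whether every address that host
// resolves to is one of the known cluster addresses.
func pointsAtCluster(ctx context.Context, lookup lookupFunc, host string, known []netip.Addr) (bool, error) {
	addrs, err := lookup(ctx, host)
	if err != nil {
		return false, err
	}
	if len(addrs) == 0 {
		return false, nil
	}
	ours := make(map[netip.Addr]bool, len(known))
	for _, k := range known {
		ours[k.Unmap()] = true // normalize IPv4-mapped IPv6 addresses
	}
	for _, a := range addrs {
		if !ours[a.Unmap()] {
			return false, nil // at least one record still points elsewhere
		}
	}
	return true, nil
}

// demoPrecheck runs the check against canned DNS answers.
func demoPrecheck() (migrated, stale bool) {
	known := []netip.Addr{netip.MustParseAddr("203.0.113.10")}
	answer := func(ip string) lookupFunc {
		return func(context.Context, string) ([]netip.Addr, error) {
			return []netip.Addr{netip.MustParseAddr(ip)}, nil
		}
	}
	ctx := context.Background()
	migrated, _ = pointsAtCluster(ctx, answer("203.0.113.10"), "example.com", known)
	stale, _ = pointsAtCluster(ctx, answer("192.0.2.1"), "example.com", known)
	return migrated, stale
}

func main() {
	fmt.Println(demoPrecheck()) // prints true false
}
```

In real use, systemLookup would be passed as the lookup argument. The local resolver's view can differ from what Let's Encrypt's resolvers see, so a passing check reduces wasted attempts but does not guarantee that validation succeeds.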

Inline path doesn't log success

GetCertificate in autocert_controller.go:173-185 only logs on failure. When a cert is obtained via the inline handshake path, there's no log line, making it impossible to diagnose whether a cert was ever successfully provisioned.
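
A success log could be added with a thin wrapper like the one below. This is a sketch, not the existing wrapper in autocert_controller.go: withSuccessLogging and the log messages are hypothetical. Because a wrapper cannot tell a fresh issuance from a cache hit, it logs only the first success per domain to avoid a log line on every handshake.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"log/slog"
	"strings"
	"sync"
)

type getCertFunc func(*tls.ClientHelloInfo) (*tls.Certificate, error)

// withSuccessLogging (hypothetical) logs failures every time and the first
// success for each domain.
func withSuccessLogging(logger *slog.Logger, next getCertFunc) getCertFunc {
	var logged sync.Map // domain -> struct{}{} once success has been logged
	return func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := next(hello)
		if err != nil {
			logger.Error("certificate lookup failed", "domain", hello.ServerName, "error", err)
			return nil, err
		}
		if _, already := logged.LoadOrStore(hello.ServerName, struct{}{}); !already {
			logger.Info("certificate available via inline path", "domain", hello.ServerName)
		}
		return cert, nil
	}
}

// demoLogging returns how many success lines two successful handshakes for
// the same domain produce.
func demoLogging() int {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil))
	ok := func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		return &tls.Certificate{}, nil
	}
	get := withSuccessLogging(logger, ok)
	hello := &tls.ClientHelloInfo{ServerName: "example.com"}
	get(hello)
	get(hello)
	return strings.Count(buf.String(), "certificate available via inline path")
}

func main() {
	fmt.Println(demoLogging()) // prints 1
}
```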

Observed Impact

Discovered while migrating phinze.com from GitHub Pages to a Hetzner VPS running Miren. The .run.garden subdomains got certs fine (DNS pointed to selkie from the start, HTTP-01 worked). phinze.com entered the doom loop and has been unable to obtain a cert for over an hour.

Fix

The immediate fix is adding acme.ALPNProto to NextProtos in autotls.go:30-33:

tlsConfig := &tls.Config{
    GetCertificate: certProvider.GetCertificate,
    MinVersion:     tls.VersionTLS12,
    NextProtos:     []string{"h2", "http/1.1", acme.ALPNProto},
}

This makes TLS-ALPN-01 actually work, which means:

  • Cert provisioning succeeds on the first attempt (only 1 authorization used, not 2)
  • TLS-ALPN-01 works during DNS migration as soon as DNS propagates (unlike HTTP-01, TLS-ALPN-01 validation connects to port 443 which is already serving)

Follow-up improvements:

  • Add backoff/failure caching around autocert.Manager.GetCertificate calls to prevent rapid-fire ACME attempts
  • Add DNS pre-check before eager provisioning to avoid wasting rate limit budget
  • Add a timeout wrapper so handshakes fall back to self-signed cert after ~10 seconds instead of blocking for 5 minutes
  • Log successful cert provisioning on the inline path

Files

  • components/autotls/autotls.go:30-33 — missing acme.ALPNProto in NextProtos
  • controllers/certificate/autocert_controller.go:173-185 — GetCertificate wrapper, no success logging
  • controllers/certificate/autocert_controller.go:136-143 — eager provisioning, no DNS pre-check
  • Upstream: golang.org/x/crypto@v0.48.0/acme/autocert/autocert.go:278 — hardcoded 5min timeout
  • Upstream: golang.org/x/crypto@v0.48.0/acme/autocert/autocert.go:291 — no error caching