Harden autocert controller against ACME failure modes
We shipped 4266960 to add rate-limit protections to the autocert controller, but a real-world incident exposed several gaps. I was moving my blog to a new domain on a personal miren cluster running an older (pre-fix) build, and hit the exact scenario the fix was meant to prevent: the initial HTTP-01 challenge failed, every subsequent TLS handshake fired a new ACME attempt, and the Let's Encrypt (LE) limit of 5 failed authorizations per hour was exhausted in about 9 minutes.
After upgrading to main, three more issues surfaced. The Reconcile path blocks for up to 5 minutes without a timeout, which also wedges server shutdown because the process can't stop gracefully while it waits. The timeout codepath in GetCertificate doesn't record a failure, so the cooldown never kicks in and every handshake spawns a goroutine that hits LE with a doomed request. And the synthetic ClientHello used for eager provisioning doesn't include cipher suite preferences, so it only provisions an RSA cert while browsers prefer ECDSA. As a result, the site stayed broken even after the eager path reported "certificate provisioned successfully."
What needs to change
The Reconcile path needs a timeout and needs to check the failure cooldown map before attempting ACME, matching what the inline GetCertificate path does. The timeout codepath in GetCertificate needs to store a failure so the cooldown actually suppresses subsequent attempts. And the synthetic ClientHello in Reconcile needs realistic cipher suites so eager provisioning gets the cert type browsers actually want (ECDSA), not just RSA.
There may also be a subtler issue where autocert.Manager serializes concurrent requests per domain internally. When the inline GetCertificate times out, it leaks a goroutine that is still running inside the manager. If enough of these pile up, they can hold internal locks and cause subsequent calls to block even when a valid cert exists on disk. It's worth investigating whether we need to pass a context with a deadline into the manager to properly cancel these.