ReadMe Refactored Outage

Incident Report for ReadMe

Postmortem

What Happened

Beginning Tuesday, May 26, 2026, customers experienced slow loading and 503 errors across ReadMe-hosted docs and the admin dashboard. The outage was intermittent but recurring, and the worst periods fell during business hours, when traffic peaked.

Root Cause

An internal data backup process was generating excessive I/O on our storage layer. Under normal traffic conditions, this additional load was manageable. But when it coincided with peak customer traffic and elevated bot activity, total I/O demand exceeded system capacity, causing cascading request timeouts.

This backup process ran on a recurring schedule, which is why the degradation followed a predictable pattern of spikes throughout each day. Separately, a surge in bot traffic to non-existent pages (404s) amplified the problem, because those requests were not served from cache and each one reached backend storage.
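The capacity arithmetic behind this can be sketched as follows. This is a minimal illustration only: `CAPACITY_IOPS` and every load figure below are hypothetical values chosen to show the effect, not measurements from the incident.

```python
# Hypothetical storage I/O budget in operations per second (illustrative only).
CAPACITY_IOPS = 10_000


def is_overloaded(customer_iops: int, bot_iops: int, backup_iops: int) -> bool:
    """Return True when combined I/O demand exceeds storage capacity."""
    return customer_iops + bot_iops + backup_iops > CAPACITY_IOPS


# Off-peak: the backup's extra load still fits within capacity.
off_peak = is_overloaded(customer_iops=3_000, bot_iops=500, backup_iops=4_000)

# Peak hours plus a bot surge: the same backup now pushes demand past capacity.
peak = is_overloaded(customer_iops=5_500, bot_iops=2_000, backup_iops=4_000)

print(off_peak, peak)  # False True
```

In this toy model the backup load is identical in both cases. Only the surrounding traffic changes, which is consistent with the degradation appearing on a schedule but only hurting at busy times.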

Resolution

Immediate fix: We identified and disabled the backup process causing the excess I/O load. Within three hours, storage utilization returned to normal levels and remained stable.

Additional improvements shipped during the incident:

  • Expanded caching across multiple layers, significantly reducing load on backend storage
  • Hardened 404 handling to serve error pages from cache instead of hitting the backend
  • Implemented rate limiting and IP-based protections against abusive bot traffic
  • Optimized several high-traffic API endpoints to reduce redundant backend calls
  • Added new monitoring and alerting for storage I/O thresholds
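Two of the items above, serving 404s from cache and rate limiting by IP, can be sketched roughly as below. This is a simplified illustration under stated assumptions, not ReadMe's implementation: `NotFoundCache`, `TokenBucket`, `handle`, and all TTL and rate values are hypothetical names and numbers, and a production setup would more likely live in a CDN or reverse proxy than in application code.

```python
import time
from typing import Callable, Dict, Optional


class NotFoundCache:
    """Remembers paths that recently returned 404 so repeats skip the backend."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at: Dict[str, float] = {}

    def is_known_missing(self, path: str) -> bool:
        expires = self._expires_at.get(path)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expires_at[path]  # entry expired; ask the backend again
            return False
        return True

    def remember_missing(self, path: str) -> None:
        self._expires_at[path] = self.clock() + self.ttl


class TokenBucket:
    """Per-client limiter: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(path: str, client_ip: str,
           backend_lookup: Callable[[str], Optional[str]],
           cache: NotFoundCache,
           buckets: Dict[str, TokenBucket],
           new_bucket: Callable[[], TokenBucket]) -> int:
    """Return the HTTP status for a request, touching the backend only if needed."""
    bucket = buckets.get(client_ip)
    if bucket is None:
        bucket = buckets[client_ip] = new_bucket()
    if not bucket.allow():
        return 429  # client exceeded its request budget
    if cache.is_known_missing(path):
        return 404  # answered from cache; no backend I/O
    if backend_lookup(path) is None:
        cache.remember_missing(path)
        return 404
    return 200
```

With this shape, a bot repeatedly requesting the same missing page costs at most one backend lookup per TTL window, and a single noisy IP is capped regardless of which pages it requests.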

Timeline

May 26 - June 1, 2026

Time                    Status         Details
Tue 5/26, 6:39 AM PDT   Investigating  Issue reported
Tue 5/26, 7:26 AM PDT   Monitoring     Fix implemented, monitoring results
Tue 5/26, 8:37 AM PDT   Resolved       Admin hub incident resolved
Tue 5/26, 10:41 AM PDT  Investigating  Slow performance across customer hubs
Tue 5/26, 10:49 AM PDT  Monitoring     Quick fix applied, investigating thorough fix
Tue 5/26, 8:54 PM PDT   Resolved       Customer hub incident resolved
Wed 5/27, 6:51 AM PDT   Investigating  Issue reported
Wed 5/27, 7:52 AM PDT   Identified     Fix in progress
Wed 5/27, 8:57 AM PDT   Monitoring     Slowly recovering
Wed 5/27, 10:17 AM PDT  Update         Updated 6/3: A routine configuration update coincided with the downtime window, which led us to initially identify it as the cause. Further investigation confirmed the two were unrelated. See root cause and resolution above.
Wed 5/27, 1:54 PM PDT   Update         Performance and loading still affected, rolling out fixes
Wed 5/27, 3:37 PM PDT   Resolved       Incident resolved
Thu 5/28, 6:40 AM PDT   Investigating  Issue reported
Thu 5/28, 6:43 AM PDT   Identified     Fix being implemented
Thu 5/28, 7:35 AM PDT   Update         Systems coming back up, working on permanent fix
Thu 5/28, 8:27 AM PDT   Monitoring     Fix implemented, monitoring results
Thu 5/28, 10:53 AM PDT  Update         Reports of degraded performance, actively investigating
Thu 5/28, 1:04 PM PDT   Update         Systems appear stable, rolling out fixes
Thu 5/28, 6:49 PM PDT   Monitoring     Improvements and fixes deployed, continuing to monitor
Fri 5/29, 10:03 AM PDT  Monitoring     Systems stable. Bi-directional sync maintenance 9:50–11:59 AM ET.
Fri 5/29, 11:36 AM PDT  Monitoring     Deploying targeted fixes. Serving all 404s from cache. Degraded read performance.
Sat 5/30                Monitoring     Monitoring continues
Sun 5/31                Monitoring     Weekend monitoring
Mon 6/1, 11:16 AM PDT   Resolved       Root cause identified and resolved. Systems stable since Friday, May 29.

Path Forward

We are using this incident to make lasting improvements to reliability and incident response:

  • Storage capacity and isolation: Restructuring how background processes interact with production storage to eliminate contention under load.
  • Caching and performance: The caching improvements shipped during the incident are permanent. We are continuing to expand cache coverage across additional endpoints and page types.
  • Bot and traffic protection: Strengthening rate limiting and abuse detection to prevent bot traffic from contributing to backend load.
  • Monitoring and alerting: Adding proactive capacity monitoring with earlier thresholds so the team can intervene before customers are affected.
  • Incident response: Improving our internal processes for faster escalation and more frequent status page updates during multi-day incidents.
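As one illustration of the "earlier thresholds" idea in the monitoring item above, the sketch below classifies storage I/O utilization against a warning level set well below the critical level, averaged over a short sliding window so a single spike does not trigger an alert. The class name `StorageIoMonitor`, the threshold values, and the window size are assumptions made for the example, not ReadMe's actual alerting configuration.

```python
from collections import deque
from statistics import mean


class StorageIoMonitor:
    """Classifies storage I/O utilization averaged over a sliding window."""

    def __init__(self, warn: float = 0.70, critical: float = 0.90, window: int = 5):
        if not 0 < warn < critical:
            raise ValueError("expected 0 < warn < critical")
        self.warn = warn
        self.critical = critical
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def observe(self, utilization: float) -> str:
        """Record one sample (0.0 to 1.0) and return 'ok', 'warning', or 'critical'."""
        self.samples.append(utilization)
        average = mean(self.samples)
        if average >= self.critical:
            return "critical"
        if average >= self.warn:
            return "warning"  # early signal, before customers are affected
        return "ok"


monitor = StorageIoMonitor(warn=0.70, critical=0.90, window=3)
for sample in (0.50, 0.80, 0.95, 1.00):
    print(sample, monitor.observe(sample))
```

The gap between the warning and critical levels is what buys response time: the wider it is, the earlier the team hears about rising load, at the cost of more frequent warnings.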

Final Note

During the incident, we posted an update referencing a platform update. That was our initial hypothesis based on timing. Further investigation confirmed it was unrelated. The change in question was a routine, isolated configuration update and had no impact on the outage or any other customers. We should have waited for confirmation before publishing it, and we're tightening our internal process for status page updates as a result.

We know how critical your documentation is to your customers, and this level of disruption is not acceptable. We have already shipped meaningful improvements to prevent recurrence, and the work outlined above will continue through the coming weeks. If you have questions, reach out to your account team or contact support@readme.io.

Posted Jun 03, 2026 - 08:32 PDT

Resolved

This incident has been resolved.
Posted May 27, 2026 - 15:37 PDT

Update

Performance and loading are still affected, and we're in the process of rolling out fixes.
Posted May 27, 2026 - 13:54 PDT

Monitoring

Performance has improved, but we're still investigating the issue.
Posted May 27, 2026 - 11:02 PDT

Update

We are continuing to investigate this issue.
Posted May 27, 2026 - 10:47 PDT

Investigating

We are seeing some slow performance and possibly 500 errors. Investigating.
Posted May 27, 2026 - 10:43 PDT

Update

Updated 6/3: A routine configuration update coincided with the downtime window, which led us to initially identify it as the cause. Further investigation confirmed the two were unrelated. See the postmortem for full details.
Posted May 27, 2026 - 10:17 PDT

Update

We are monitoring the issue.
Posted May 27, 2026 - 10:08 PDT

Monitoring

We're slowly recovering and monitoring the situation.
Posted May 27, 2026 - 08:57 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted May 27, 2026 - 07:52 PDT

Investigating

We are currently investigating this issue.
Posted May 27, 2026 - 06:51 PDT
This incident affected: ReadMe Hubs, ReadMe Knowledge Base, and Admin Dashboard.