Beginning Tuesday, May 26, 2026, customers were experiencing slow loading and 503 errors across ReadMe-hosted docs and the admin dashboard. The outage was intermittent but recurring, with the worst periods hitting during business hours when traffic spiked.
An internal data backup process was generating excessive I/O on our storage layer. Under normal traffic conditions, this additional load was manageable. But when it coincided with peak customer traffic and elevated bot activity, total I/O demand exceeded system capacity, causing cascading request timeouts.
The maintenance process ran on a recurring schedule, which is why the degradation followed a predictable pattern of spikes throughout each day. Separately, a surge in bot traffic to non-existent pages (404s) amplified the problem because those requests were not being served from cache.
Immediate fix: We identified and disabled the maintenance process causing the excess I/O load. Within three hours, storage utilization returned to normal levels and remained stable.
Additional improvements shipped during the incident:
May 26 - June 1, 2026
| Time | Status | Details |
|---|---|---|
| Mon 5/26, 6:39 AM PDT | Investigating | Issue reported |
| Mon 5/26, 7:26 AM PDT | Monitoring | Fix implemented, monitoring results |
| Mon 5/26, 8:37 AM PDT | Resolved | Admin hub incident resolved |
| Mon 5/26, 10:41 AM PDT | Investigating | Slow performance across customer hubs |
| Mon 5/26, 10:49 AM PDT | Monitoring | Quick fix applied, investigating thorough fix |
| Mon 5/26, 8:54 PM PDT | Resolved | Customer hub incident resolved |
| Tue 5/27, 6:51 AM PDT | Investigating | Issue reported |
| Tue 5/27, 7:52 AM PDT | Identified | Fix in progress |
| Tue 5/27, 8:57 AM PDT | Monitoring | Slowly recovering |
| Tue 5/27, 10:17 AM PDT | Update | Updated 6/3: A routine configuration update coincided with the downtime window, which led us to initially identify it as the cause. Further investigation confirmed the two were unrelated. See root cause and resolution above. |
| Tue 5/27, 1:54 PM PDT | Update | Performance and loading still affected, rolling out fixes |
| Tue 5/27, 3:37 PM PDT | Resolved | Incident resolved |
| Wed 5/28, 6:40 AM PDT | Investigating | Issue reported |
| Wed 5/28, 6:43 AM PDT | Identified | Fix being implemented |
| Wed 5/28, 7:35 AM PDT | Update | Systems coming back up, working on permanent fix |
| Wed 5/28, 8:27 AM PDT | Monitoring | Fix implemented, monitoring results |
| Wed 5/28, 10:53 AM PDT | Update | Reports of degraded performance, actively investigating |
| Wed 5/28, 1:04 PM PDT | Update | Systems appear stable, rolling out fixes |
| Wed 5/28, 6:49 PM PDT | Monitoring | Improvements and fixes deployed, continuing to monitor |
| Thu 5/29, 10:03 AM PDT | Monitoring | Systems stable. Bi-directional sync maintenance 9:50–11:59 AM ET. |
| Thu 5/29, 11:36 AM PDT | Monitoring | Deploying targeted fixes. Serving all 404s from cache. Degraded read performance. |
| Fri 5/30 | Monitoring | Monitoring continues |
| Sat 5/31 | Monitoring | Weekend monitoring |
| Mon 6/1, 11:16 AM PDT | Resolved | Root cause identified and resolved. Systems stable since Thursday, May 29. |
We are using this incident to make lasting improvements to reliability and incident response:
During the incident, we posted an update referencing a platform update. That was our initial hypothesis based on timing. Further investigation confirmed it was unrelated. The change in question was a routine, isolated configuration update and had no impact on the outage or any other customers. We should have waited for confirmation before publishing it, and we're tightening our internal process for status page updates as a result.
We know how critical your documentation is to your customers, and this level of disruption is not acceptable. We have already shipped meaningful improvements to prevent recurrence, and the work outlined above will continue through the coming weeks. If you have questions, reach out to your account team or contact support@readme.io.