ReadMe experienced a significant outage on Tuesday, March 26 beginning at 16:06 UTC (9:06am Pacific). This outage affected all ReadMe services, including our management dashboard and the developer hubs we host for our customers.
We recovered the majority of our service by 16:42 UTC (9:42am Pacific), including most access to the Dash and the Hubs. The rest of the service fully recovered at 17:34 UTC (10:34am Pacific).
Although the outage began with one of ReadMe’s service providers, we take full responsibility and we’re truly sorry for the inconvenience to our customers. We’re working through ways to prevent the same issue from happening again and to reduce the impact from similar events in the future.
ReadMe uses a number of third-party service providers to host our Internet-facing services, including our customer-facing dashboard (dash.readme.com) and developer documentation hubs. One of our primary service providers is Render, a web application hosting platform. This outage began when Render experienced a broad range of outages. We’re still learning more about what happened and we will update this document when those details are available.
We have redundant systems running at Render and can handle a partial Render service outage. Further, during a partial outage it’s usually quick to replace affected services on Render. But our infrastructure is not resilient to a full outage of the entire Render platform, which is what happened on the 26th.
Update (April 1, 2024): Render has confirmed that the issue began with an unintended restart of all customer and system workloads on their platform, which was caused by a faulty code change. Render has provided a Root Cause Analysis for their underlying incident. Although the incident was triggered by our service provider, we’re ultimately responsible for our own uptime and we are working on remediations to reduce the scope and severity of this class of incidents.
We host many services on Render including our Node.js web application and our Redis data stores. Redis is an in-memory data store that we use for caches and queues. We don’t use Redis for long-term (persistent) data storage, but many other companies do. Because of the unique challenges of restoring persistent data stores, Render’s managed Redis services took significantly longer to recover.
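To illustrate the distinction (this is a hedged sketch with hypothetical names, not our actual code), a Node.js service might use Redis purely as an expiring cache in front of a persistent database, so losing the Redis instance loses cached values but never the data of record:

```typescript
// Illustrative sketch only -- not ReadMe's actual code. Assumes a
// Node.js service using the "ioredis" client, with a hypothetical
// loadProjectFromDatabase() helper backed by the persistent database.
import Redis from "ioredis";

interface Project {
  id: string;
  name: string;
}

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const CACHE_TTL_SECONDS = 300; // cached entries expire on their own

async function loadProjectFromDatabase(id: string): Promise<Project> {
  return { id, name: "example" }; // stand-in for a real database query
}

async function getProject(id: string): Promise<Project> {
  const cached = await redis.get(`project:${id}`);
  if (cached) {
    return JSON.parse(cached); // cache hit: skip the database entirely
  }

  const project = await loadProjectFromDatabase(id); // source of truth
  await redis.set(
    `project:${id}`,
    JSON.stringify(project),
    "EX",
    CACHE_TTL_SECONDS
  );
  return project;
}
```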
We implemented two temporary workarounds to restore ReadMe service: we removed Redis from the critical path in areas of our service where this was possible, and we launched temporary replacement Redis services until our managed Redis instances were recovered. After the managed Redis service was available and stable, we resumed normal operations on our managed Redis instances.
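The first workaround follows a common pattern: treat the cache as optional, so that when Redis is unreachable a request falls back to the primary database instead of failing. The sketch below illustrates that pattern with hypothetical names and the ioredis client; it is not our production code.

```typescript
// Hedged sketch of taking Redis out of the critical path -- not
// ReadMe's actual code. Names and helpers here are hypothetical.
import Redis from "ioredis";

interface Project {
  id: string;
  name: string;
}

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  maxRetriesPerRequest: 1,   // fail fast rather than retrying indefinitely
  enableOfflineQueue: false, // reject commands immediately while disconnected
});

async function loadProjectFromDatabase(id: string): Promise<Project> {
  return { id, name: "example" }; // stand-in for a real database query
}

async function getProjectResilient(id: string): Promise<Project> {
  try {
    const cached = await redis.get(`project:${id}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    // Redis is down: log and fall through to the database (slower, but correct).
    console.warn("cache read failed, falling back to database", err);
  }

  const project = await loadProjectFromDatabase(id);

  try {
    await redis.set(`project:${id}`, JSON.stringify(project), "EX", 300);
  } catch {
    // Best effort: a failed cache write should never fail the request.
  }

  return project;
}
```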
ReadMe is committed to maintaining a high level of service availability; we sincerely apologize for letting our customers down. We will be holding an internal retrospective later this week to learn from this incident and improve our response to future incidents.
This incident revealed a number of tightly coupled services in our infrastructure: failures in some internal services caused unforeseen problems in other, related services. Among other improvements, we’ll look into ways to decouple those services.
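As one generic example of that kind of decoupling (a hedged sketch of a standard technique, not a commitment to a specific implementation), a caller can wrap a non-essential internal dependency in a timeout with a fallback value, so a slow or failing service cannot stall the services that depend on it:

```typescript
// Generic sketch of isolating a non-essential dependency behind a timeout
// and a fallback value. All names here are hypothetical.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // Whichever settles first wins; a hung dependency resolves to the fallback.
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
    // Swallow a late rejection from the abandoned call so it doesn't surface
    // as an unhandled rejection.
    work.catch(() => {});
  }
}

async function fetchRecentMetrics(): Promise<number[]> {
  return []; // stand-in for a call to a slow or flaky internal service
}

async function renderDashboard(): Promise<void> {
  // The page renders even if the hypothetical metrics service is down.
  const metrics = await withTimeout(fetchRecentMetrics(), 250, [] as number[]);
  console.log(`loaded ${metrics.length} metric points`);
}
```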
This incident alone isn’t enough to reevaluate our relationship with Render, but we continually monitor our partners’ performance against our service targets. If we are unable to meet those targets with our current providers, we will engage additional providers for redundancy or look for replacements, depending on the situation.
Finally, our close relationship with Render gave us accurate technical details during the incident, which helped us move quickly and take corrective action.
ReadMe takes our mission to make the API developer experience magical very seriously. We deeply regret this service outage and are using it as an opportunity to strengthen our processes, provide transparency, and improve our level of service going forward. We apologize for this disruption and thank you for being a valued customer.