Intermittent connection issues causing 500 errors
Incident Report for Chameleon


Name: Data loss due to silent connectivity error

Summary: Our database responsible for queueing specific actions to be processed offline became unreachable for a period of ~30 minutes. Non-queued application behavior continued as normal.

Impact: Events logged to Chameleon during the incident were not processed or stored.

Cause: Our Redis database did not properly fail over after a possible underlying hardware failure within the cluster. The associated alerts were silenced because an underlying bug in Redis 4.x presented them as network errors rather than application errors.

Timing: 33 minutes from 0925 PDT to 0958 PDT

Impact details

During the period listed above, Chameleon did not process or store data including:

  • User properties or events updated or sent via our client-side JS or server-side REST APIs
  • Data from integrations
  • Automatically collected events such as “Started Tour”, “Exited Step” etc.

This means that any Experiences reliant on these events will not have shown. The data here will also not be available for any future Experiences. It also means that Chameleon events will not have been sent to other connected systems during this period.

However, the logging of the state of Chameleon Experiences was not affected. This means that while the analytics on whether a Tour was completed may not be correct, the Tour will not be shown again to a user who is not meant to see it.

Root cause

We believe an underlying hardware component within our Redis cluster failed. Typically this would have resolved itself within the normal failover window of 2-10 seconds. However, in this specific case, a failover did not occur and the unreachable Redis master stayed registered as master. This appears to have happened because of a bug in Redis 4.x Sentinel instances relating to the specific way that the master went offline. The bug essentially prevented the Sentinels from recognizing the underlying hardware failure, so they did not initiate a failover, and requests continued to be routed to the failed database.

This should have created clear errors in our application monitoring, but because these errors were classified as network errors (and not application errors), they were silently swallowed. The result was a successful response to the client without a fully successful completion of the API request.
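The failure mode above can be sketched in a few lines. This is a purely hypothetical illustration (the client stub, key name, and handler are invented, not Chameleon's actual code): when an enqueue wrapper treats connection failures as transient network noise, the caller receives a success response even though nothing was written to the queue.

```python
class ConnectionError(Exception):
    """Stand-in for a Redis client's network-level error."""


class DownRedis:
    """Simulates a master that is still registered but unreachable."""

    def rpush(self, key, value):
        raise ConnectionError("connection refused")


def enqueue_event(client, event):
    """Hypothetical enqueue wrapper illustrating the bug: network
    errors are swallowed as transient noise, so the caller sees a
    200 response even though the event was never queued."""
    try:
        client.rpush("events", event)
        return {"status": 200, "queued": True}
    except ConnectionError:
        # Masked as a "network" error: silently dropped, no alert fired.
        return {"status": 200, "queued": False}


resp = enqueue_event(DownRedis(), '{"name": "Started Tour"}')
# The API responds 200 even though the event was lost.
```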


Once we became aware of this issue, our engineers manually failed over to a new Redis cluster and restored the most recent backup (approximately 1 hour old). This type of switch was possible because this portion of the system is used primarily for queueing, and at the time of the incident the queue had no backlog of work. Once the new Redis cluster was swapped in, the application recovered within 2-3 minutes.

Future prevention

In order to prevent this specific issue from recurring, we are taking the following steps:

  • Re-classify the Redis network error as an application error so that it surfaces as a status code 500 error to API consumers.
  • Add a standby Redis cluster to take over if this issue surfaces again (until Redis 5.x, which resolves the underlying bug, is available).
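The first step above, re-classifying the network error, can be sketched as follows. The names here are assumptions for illustration (Chameleon's actual error classes and handler are not public): a connection failure is re-raised as an application-level error, which the API layer then maps to a 500 response instead of silently swallowing it.

```python
class NetworkError(Exception):
    """Stand-in for the client's low-level connection error."""


class QueueUnavailableError(Exception):
    """Application-level error, surfaced to API consumers as HTTP 500."""


class DownRedis:
    """Simulates an unreachable queue backend."""

    def rpush(self, key, value):
        raise NetworkError("connection refused")


def enqueue_event(client, event):
    """After the fix: a connection failure is re-raised as an
    application error instead of being swallowed."""
    try:
        client.rpush("events", event)
    except NetworkError as exc:
        raise QueueUnavailableError("event queue unreachable") from exc


def handle_request(client, event):
    """Hypothetical API handler: the application error now maps to 500,
    making the failure visible to consumers and to monitoring."""
    try:
        enqueue_event(client, event)
        return 200
    except QueueUnavailableError:
        return 500


status = handle_request(DownRedis(), '{"name": "Exited Step"}')
# → 500 (surfaced instead of silently swallowed)
```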

In addition, to further strengthen our ability to prevent related issues and increase our resilience, over the next few weeks we will investigate possible fail-safe methods of ingesting events and other data into our queueing system (e.g. a short-lived log of all data as it is received, stored in a separate warehouse).
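One possible shape for the fail-safe log mentioned above, purely illustrative and not a committed design (the class and method names are invented, and a real implementation would use a separate warehouse rather than in-process memory): append each raw event to a short-lived log before enqueueing, so anything the queue drops can be replayed afterwards.

```python
import json
import time


class IngestLog:
    """Hypothetical short-lived, append-only log of raw incoming events,
    kept outside the queueing system so a queue outage cannot lose data."""

    def __init__(self):
        # In production this would be a separate warehouse/object store.
        self.entries = []

    def append(self, event):
        """Record the raw payload with a receive timestamp before enqueueing."""
        self.entries.append({"received_at": time.time(),
                             "raw": json.dumps(event)})

    def replay_since(self, ts):
        """Return events received at/after `ts`, for re-ingestion
        once an outage is resolved."""
        return [json.loads(e["raw"])
                for e in self.entries if e["received_at"] >= ts]


log = IngestLog()
log.append({"name": "Started Tour", "uid": "u1"})
# After an outage, replay everything received since the failure began
# (0 here, so the toy example replays all entries).
recovered = log.replay_since(0)
```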

Posted Mar 21, 2020 - 12:24 PDT

All runtime issues have been resolved and our engineers are working to re-process any data/API interactions that were impacted at the time.

Timing: from 0835 PDT to 0958 PDT today
Summary: Our primary Redis instance started to have trouble holding connections: new connections either took too long (timed out) or connected but never became active for reads/writes. We use Redis primarily for queueing data to be added to our primary database and further aggregated "offline" relative to the app servers.
- Two primary uses of this queueing system are storing logged events and publishing Chameleon experiences.
- All published content during this outage was re-processed as it would have been. 😀
- Some events from 0835 PDT to 0925 PDT and most events from 0925 PDT to 0958 PDT were dropped and are not recoverable. 😭
- Updates to user properties and the taking of Chameleon Experiences (i.e. Tours/Surveys) were not directly affected by this incident and were delivered to end-users as they normally would have been. However, the metrics/CSV for this time period will not be accurate -- please get in touch to restore on a case-by-case basis and reference this post on the status page.

Will add more information here as it's available
Posted Mar 13, 2020 - 11:39 PDT
Received this reply back from the Compose team -- "there seems to be an issue with the node [redacted]. We are actively investigating the issue and will provide an update as soon as more information is available."

In the meantime the Chameleon engineers have failed over to a secondary Redis cluster to continue processing data and serve the Chameleon web-app/sidebar in real time. The re-import step will follow once the issue with Compose is resolved.
Posted Mar 13, 2020 - 10:13 PDT
We have failed over to the new Redis database/cluster and are currently processing in real time. We will continue to backfill from the last 1 hour and 6 minutes.
Posted Mar 13, 2020 - 10:08 PDT
Less intermittent than it appeared at the initial assessment -- we are going to fail over to a new Redis database/portal. At this time it appears all data processing is happening normally, but we will have to re-import/aggregate data about tours/surveys/launchers from the past few hours once this issue is resolved.
Posted Mar 13, 2020 - 09:46 PDT
We are seeing errors associated with a Redis database causing intermittent connection failures, resulting in visible 500 errors for people trying to use the Chameleon Editor, log in, or view their dashboard.
Posted Mar 13, 2020 - 09:01 PDT
This incident affected: Sidebar API.