Name: Data loss due to silent connectivity error
Summary: The database responsible for queueing specific actions for offline processing became unreachable for ~30 minutes. Non-queued application behavior continued as normal.
Impact: Events logged to Chameleon during the incident were not processed or stored.
Cause: Our Redis database did not properly fail over after a suspected underlying hardware failure within Compose.com. The associated alerts were silenced because an underlying bug in Redis 4.x surfaced them as network errors rather than application errors.
Timing: 33 minutes from 0925 PDT to 0958 PDT
During the period listed above, Chameleon did not process or store incoming data (e.g. events logged to Chameleon).
This means that any Experiences reliant on these events will not have been shown. This data will also not be available to any future Experiences. It also means that Chameleon events will not have been sent to other connected systems during this period.
However, the logging of the state of Chameleon Experiences was not affected. This means that while the analytics on whether a Tour was completed may be incorrect, a Tour will not be shown again to a user who should not see it.
We believe an underlying hardware component within our Compose.com Redis cluster failed. Typically this would have resolved itself within the normal failover window of 2-10 seconds. In this case, however, a failover did not occur and the unreachable Redis master stayed registered as master. This appears to have been caused by a bug in Redis 4.x Sentinel instances related to the specific way the master went offline: the bug prevented the Sentinels from detecting the underlying hardware failure, so they never initiated a failover. As a result, requests were still routed to the failed database. This should have produced clear errors in our application monitoring, but these errors were masked as network errors (rather than application errors) and silently swallowed. The result was a successful response to the client without a fully successful completion of the API request.
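To illustrate how a masked network error can turn into a false success, here is a minimal Python sketch using the redis-py client. This is not our actual ingestion code; the host, queue key, and function names are hypothetical.

```python
import logging

import redis  # redis-py client

log = logging.getLogger("event_queue")
client = redis.Redis(host="redis-master.internal", port=6379)  # hypothetical host

def enqueue_event(event_json: str) -> bool:
    """Anti-pattern: connectivity failures are treated as transient network
    noise, so the caller gets a success while the event is silently lost."""
    try:
        client.lpush("events", event_json)  # "events" is a hypothetical key
        return True
    except redis.ConnectionError:
        log.debug("redis unreachable, skipping")  # silenced as a network error
        return True  # false success: the API responds 2xx, the event is gone

def enqueue_event_safe(event_json: str) -> bool:
    """Same operation, but connectivity failures surface as application errors."""
    try:
        client.lpush("events", event_json)
        return True
    except redis.ConnectionError:
        log.error("event enqueue failed: redis unreachable")  # triggers alerting
        raise  # let the API return an error instead of a false success
```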
Once we became aware of this issue, our engineers manually failed over to a new Redis cluster and restored the most recent backup (approximately 1 hour old). This type of switch was possible because this portion of the system is used primarily for queueing, and at the time of the incident the queue had no backlog of work. When the new Redis cluster was swapped in, the application recovered within 2-3 minutes.
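For context on why a failover normally resolves within seconds: clients discover the current master through the Sentinel processes rather than a fixed address, so once Sentinels demote a failed master, traffic follows the new one. A minimal redis-py sketch of that discovery path (the Sentinel endpoints and master name are hypothetical):

```python
from redis.sentinel import Sentinel

# Hypothetical Sentinel endpoints for illustration.
sentinel = Sentinel(
    [
        ("sentinel-1.internal", 26379),
        ("sentinel-2.internal", 26379),
        ("sentinel-3.internal", 26379),
    ],
    socket_timeout=0.5,
)

# Clients ask the Sentinels who the current master is, so a completed
# failover normally redirects writes within seconds. In this incident the
# Sentinels never demoted the failed master, so this lookup kept returning
# the unreachable node.
master = sentinel.master_for("events-queue", socket_timeout=0.5)
master.lpush("events", '{"type": "example"}')
```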
In order to prevent this specific issue from recurring, we are taking the following steps:
In addition, to further strengthen our ability to prevent related issues and increase our resilience, over the next few weeks we will investigate possible fail-safe methods of ingesting events and other data into our queueing system (e.g. a short-lived log of all data as it is received, stored in a separate warehouse); a sketch of one such approach follows.
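As one illustration of what such a fail-safe could look like, here is a minimal Python sketch that durably appends each event to a short-lived local log before enqueueing it, so a queue outage costs availability rather than data. The file path, key name, and function are assumptions; a production version would ship this log to separate storage and replay it on recovery.

```python
import json
import os
import time

import redis

client = redis.Redis(host="redis-master.internal", port=6379)  # hypothetical host
WAL_PATH = "/var/log/chameleon/ingest-wal.jsonl"  # hypothetical log location

def ingest_event(event: dict) -> None:
    """Record the event durably before it enters the queue."""
    record = {"received_at": time.time(), "event": event}
    line = json.dumps(record)
    # 1. Append to a short-lived local log first, fsync'd for durability.
    with open(WAL_PATH, "a") as wal:
        wal.write(line + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Then enqueue as usual; if this fails, the event can be replayed
    #    from the log instead of being lost.
    client.lpush("events", line)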