The outage itself was due to an oversight during the upgrade. A unique index for the site database was missing, which caused incorrect balances for some accounts, and permitted others to attempt to place oversized orders or withdraw more coins than they actually had. The community’s IRC channel spotted this problem before any serious damage was done, and the development team swung into rollback mode.
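One plausible way a missing unique index skews balances is by letting the same event be recorded twice. The sketch below illustrates that failure mode with Python's built-in sqlite3 module. The table name, column names, and the duplicate-row mechanism are assumptions made for illustration; the article does not say which database CoinSwap runs or exactly how the missing index corrupted balances.

```python
import sqlite3


def build_db(with_unique_index):
    """Create a toy in-memory ledger (hypothetical schema, not CoinSwap's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger (event_id INTEGER, account TEXT, amount INTEGER)"
    )
    if with_unique_index:
        # The unique index makes the database reject a repeated event_id.
        conn.execute("CREATE UNIQUE INDEX ledger_event_id ON ledger (event_id)")
    return conn


def record(conn, event_id, account, amount):
    """Insert one event; return False if the database rejects it as a duplicate."""
    try:
        conn.execute(
            "INSERT INTO ledger VALUES (?, ?, ?)", (event_id, account, amount)
        )
        return True
    except sqlite3.IntegrityError:
        return False


def balance(conn, account):
    """Sum every recorded amount for the account (0 if there are none)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]
```

With the index in place, recording the same event twice leaves the balance unchanged. Without it, the duplicate row is silently accepted and the account appears to hold twice as much.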
The architecture of CoinSwap has some unusual defensive features that made the recovery possible. The mis-indexed database is a record of events and feeds the customer-facing web interface, but the actual transactions occur in ‘the engine’, a construct that exists only in memory. Childress, sounding a bit ragged after his long weekend, cheered up considerably on this point: “We have the best logging, we can literally replay that content and rebuild the state of the system from scratch. We didn’t have to go that far in this case, but we did manually review the types of transactions that would have been affected by the database problem.”
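The replay idea Childress describes can be sketched in a few lines: keep an ordered log of events and rebuild the in-memory state by applying them again from the start. This is a minimal illustration, not CoinSwap's engine. The `Event` fields, the two event kinds, and the `replay` function are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Event:
    """One logged event (hypothetical shape, for illustration only)."""
    seq: int          # position in the log
    account: str
    kind: str         # "deposit" or "withdraw"
    amount: Decimal


def replay(events):
    """Rebuild account balances from scratch by applying events in log order."""
    balances = defaultdict(Decimal)  # Decimal() is zero
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.kind == "deposit":
            balances[ev.account] += ev.amount
        elif ev.kind == "withdraw":
            # Refuse to rebuild a state in which an account is overdrawn.
            if balances[ev.account] < ev.amount:
                raise ValueError(f"event {ev.seq} overdraws {ev.account}")
            balances[ev.account] -= ev.amount
        else:
            raise ValueError(f"event {ev.seq} has unknown kind {ev.kind!r}")
    return dict(balances)
```

Because the state is derived entirely from the log, a corrupted view such as the mis-indexed database can be discarded and regenerated, and a replay that raises an error points at the first event that does not add up.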
CoinSwap is known for the light, quick feel of the site. A worldwide customer base eager to get back in, together with the volume of transactions due to Paycoin, led them to upgrade their frontend to two 16-core machines with 120 gigs of memory. The database lives on a 32-core system with 240 gigs of memory, and a dozen other machines hidden from the public eye make up the rest of the system. When asked if they used Cloudflare, Childress reported that DDoS had never been an issue for them and that they were happy with the combination of Nginx features and Amazon’s scalable cloud.
Childress and the rest of the team were confident they were ready to go back into service around 18:00 Eastern on Sunday night. They have many Australian customers and enough in Russia that they have a translator on staff. Conversations with the community, many of whom had work the next day, led them to push the restart back to 10:00 Eastern on Monday and to cancel all pending orders. This was felt to be the fairest way to climb down from the awkward post-outage position.
Incident Response PR Timeline
An outage like this is a public relations nightmare for an entity that wants to be trusted, not just with cryptocoins, but also with the timely execution of transactions. CoinSwap addressed its customers’ concerns with a flow of timestamped updates. Key events in the incident response timeline were:
- Immediately dispelling rumors of the service being a scam by posting the owner’s LinkedIn account
- Verifying that the problem was purely internal, dispelling rumors of an intrusion
- Realizing a full audit would be required, and communicating the timeline to customers
- Delaying the restart to Monday morning so as not to disadvantage European users
Disclosure: Author was using CoinSwap when he noticed the outage.
Images from Shutterstock.
Last modified: March 4, 2021 4:42 PM