
1 - 13 of 13 Posts

·
Premium Member
Joined
·
22,000 Posts
Discussion Starter · #1 ·
Hi everyone,

As you may have noticed, the site was down for roughly 3.5 hours this afternoon. This post explains why, what the results were, and what we're doing to prevent it from happening again.

I'll start with the summary of news:
  • The database table that suffered a failure was the one that stores private messages.
  • 252 private messages were irreversibly lost.
  • These 252 messages were sent between 7:40am and 2:16pm EST today, January 24th. Any message sent during this time is not in the database.
There were several causes for the above failure, most notably a two-tier failure in our backup system. In the name of full disclosure, here is what transpired:

We make backups transactionally - that is, every time an action is taken on the database the results of that action are immediately pushed to local backups. Two programs are continuously monitoring these on-site backups. The programs are designed to run synchronously with one another, one shuffling data to off-site storage in Amazon's network and the other moving it to dedicated backup storage within the same datacenter as the production servers. These processes eventually worked their way out of sync, and as a result, a copy of the local transaction data was being deleted from the production system before it was fully moved onto the on-site and off-site backups. Integrity of backups is routinely checked, but this problem developed between the most recent integrity check and today.
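To illustrate the ordering problem described above, here is a minimal Python sketch. It is not the actual backup software; the function names, the file-per-segment layout, and the use of plain directories as stand-ins for the on-site and off-site targets are all assumptions made for the example. The point it shows is that the local copy is deleted only after every destination holds a verified copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ship_segment(segment: Path, destinations: list[Path]) -> bool:
    """Copy one transaction segment to every destination, then delete it.

    Hypothetical helper: the local file is removed only if every copy
    succeeded and matches the source checksum. On any failure the local
    file is left in place so a later run can retry.
    """
    expected = sha256_of(segment)
    for dest_dir in destinations:
        target = dest_dir / segment.name
        try:
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(segment, target)
        except OSError:
            return False  # destination unavailable: keep the local file
        if sha256_of(target) != expected:
            return False  # incomplete or corrupt copy: keep the local file
    segment.unlink()  # safe only now that all destinations are verified
    return True
```

With this structure a single routine owns the delete step, so two independent movers drifting out of step cannot remove data that one of them has not finished copying.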

The problem described above was discovered as a result of a failure during a maintenance session this morning. Typically such a failure would be easy to recover from with the transactional backups, but when it was determined that they were not available, the site was immediately taken offline so we could diagnose further and minimize the amount of new data introduced to the system while we could not say for sure it was being backed up. Restoration work began with the most recent copy of data we had, which was last night's complete dump. Incremental backups are made on a constant basis, while complete database dumps are taken nightly.

The only affected database table was the one that stores private messages - only private messages suffered loss; all other data has been verified as intact. The only messages affected were those sent between last night's complete dump and the discovery of the sync problem this morning - in total, 252 private messages.

Next steps from here include more frequent verification of incremental backups, and redevelopment of the programs monitoring the internal data shuffle of transactional backups to prevent sync-related issues from causing data loss. This event was a multi-stage failure. Any single part of it would have been recoverable, but the combination of these failures leaves us short data, as mentioned above. We're terribly sorry for that - it's our first data loss in over 7 years, and we're laying out plans to make it our last.
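One simple form that verification of incremental backups could take is a gap check. The sketch below is an illustration only and assumes each backed-up transaction segment carries an increasing sequence number, which the post does not state; `missing_segments` is a hypothetical name.

```python
def missing_segments(backed_up: set[int], first: int, last: int) -> list[int]:
    """Return the sequence numbers in [first, last] absent from the backup set."""
    return [n for n in range(first, last + 1) if n not in backed_up]


# Example: segments 3 and 6 never reached the backup target.
gaps = missing_segments({1, 2, 4, 5}, first=1, last=6)
print(gaps)  # [3, 6]
```

Running a check like this often shortens the window in which a sync problem can go unnoticed.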

If we can answer any specific questions, please do let us know, and again - our most sincere apologies for this. It could have been much worse, but that does not take away from the fact that it shouldn't have happened at all.

Thanks,
Chipp
 

·
Premium Member
Joined
·
14,173 Posts
Ouch, that's no good, but it could have been much worse, and I have seen worse, even on relatively large forums. Thanks for the clarification, Chipp
 

·
Moderator
Joined
·
13,100 Posts
Glad it's fixed, we need more random facts though
 

·
Premium Member
Joined
·
65,162 Posts
You guys archive in addition to backups, right?
 