Wednesday, March 6, 2013

Database Server Crash (2013-03-05)

Earlier today the master database server crashed.  I don't yet know the cause, but once it came back online it took several hours to recover.  The slave server remained online during this time and is in good health.

I've learned a few things from this experience:

  • I need to prune a number of tables in the database as this could greatly reduce the amount of data being stored and replicated.
  • Debian's default of checking all tables whenever MySQL restarts can be disabled.  When tables have many millions of rows, even a fast table check can become a bottleneck: it prevents new rows from being inserted into those tables, which in turn ties up the available database connections.
  • I need to consider upgrading the database hardware.
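For the pruning mentioned in the first point, one common approach is to delete old rows in small batches, so that no single statement holds a table for long or sends one huge change to the slave.  The following is only a minimal sketch of that idea, not the WunderCounter's actual code.  It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name (hits), the column name (created_at) and the helper prune_old_rows are all hypothetical.

```python
import sqlite3


def prune_old_rows(conn, table, ts_column, cutoff, batch_size=1000):
    """Delete rows older than `cutoff` in small batches; return rows removed."""
    # Table and column names cannot be bound as parameters, so they are
    # interpolated here; only pass trusted, hard-coded identifiers.
    sql = (
        f"DELETE FROM {table} WHERE rowid IN ("
        f"SELECT rowid FROM {table} WHERE {ts_column} < ? LIMIT ?)"
    )
    total = 0
    while True:
        cur = conn.execute(sql, (cutoff, batch_size))
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Demo with a hypothetical "hits" table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (id INTEGER PRIMARY KEY, created_at TEXT)")
rows = [(f"2010-01-{d:02d}",) for d in range(1, 29)] + [("2013-03-05",)] * 5
conn.executemany("INSERT INTO hits (created_at) VALUES (?)", rows)
conn.commit()

removed = prune_old_rows(conn, "hits", "created_at", "2012-01-01", batch_size=10)
remaining = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(removed, remaining)  # prints "28 5"
```

The subquery form is used here because it works in SQLite.  MySQL supports a LIMIT clause directly on a single-table DELETE, so against a real MySQL server the statement would more likely be written as DELETE ... WHERE ... LIMIT n, and the batch size would need tuning against the real tables.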

So, things are back to normal for now.  There were some outages this evening (March 5), but they weren't extreme and a lot of data continued to be tracked in the meantime.  This kind of crash probably hasn't happened in the last 3 years, so it took me by surprise, but at least I'm not up all night trying to recover from a disaster.

I apologize for the downtime.  I hate it at least as much as you do.  Having said that, the WunderCounter's uptime is actually very, very good in general.  There are very few outages and I generally just trust that it's going to be available, since it has historically performed so solidly.  However, I'll be implementing the lessons learned from today so that I'll be able to minimize downtime even more going forward.