Friday, November 23, 2012

Alway keep a backup. Of everything.

Let me tell you a story of failures and backups and pain.

So last night I finished a bunch of changes to Solomon Says.  After the regular load of testing (that lasts 15 minutes and includes opening a bunch of pages on Firefox and Chrome), I uploaded the changes and tried to bring the server back up. Everything exploded in my face at about the same time. The only reason we are still in business is that I had backups. In decreasing order of importance, the following backups saved the day:
  1. Database
  2. Code/Configurations
  3. Images
So pretty much everything :)

At this point, a note on the deployment process is in order. Here’s how it goes:
  1. Stop python fcgi process and nginx service.
  2. Delete the production code.
  3. Run the DB migration script.
  4. Upload the entire code from my laptop to the production location.
  5. Start python and nginx
#2, #3, and #4 didn’t go too well.

#2 – My dev. environment is Windows, but production is on LINUX.  So there’s a bunch of stuff related to path handling (‘/’ vs ‘\’ etc.) that I change just for development. This is automatically handled in production by using a different configuration file. Alas, I ran the delete for #2 from one level higher in the directory structure. Boom goes the config. And on bringing the server up, I get a load of ‘access permission denied’ errors. I spent a half hour analyzing the arcane debug messages, then give up and restore the entire code base file by file and change by change.

#3 – I missed selecting a couple of ‘where’ conditions when running the migration script. Result – 2 of the main table got randomly changed. Considering how crappy the day had been so far, I realized it only on restarting the server. So bring the server down again, restore the DB to its previous avatar from the backup, and run the migration with extra precaution.

#4 – My development copy did not have quite a few of the images related to the newer reviews I had posted. And since I had deleted the production data in #2, the server started throwing ‘Suspicious Operation’ exception (What the hell is that? It should have said ‘File not found” or something). In view of the blunders I had made for #2 and #3, I assumed a mistake in the new configuration I had created and spent another hour debugging, then gave up and copied over the image folders from the back up to production.
 
All told, something that should have taken 15 minutes took 4 hours.
 
Lesson learnt. Always keep a backup. Of everything.