When it’s a disaster recovery test.
The DR test I mentioned previously (Checked your Disaster Recovery plans recently?) didn’t go quite as smoothly as expected. Filestore, load balancing and SQL all failed over to the secondary site as expected with no problems. A quick sp_who2 showed sessions from the app servers into the correct databases using the correct users, so everything was good then?
Well, all the underlying services looked good, but the web access components weren’t looking healthy. OK, that may be an understatement: they were deader than a Norwegian Blue pining for its fjords.
But we had a ‘tested’ DR plan, so what happened?
Well, it turned out that we hadn’t had a ‘disaster’. Due to other work we didn’t fail the application servers over to the secondary site. So this meant they didn’t get a restart, which was an unwritten assumption in the DR plan.
But the applications were meant to follow the DB connections across, weren’t they? And we had connections to the DBs at the secondary site, didn’t we? Turns out there’d also been an upgrade that had split the service DB connections from the web DB connections, which had previously been tied together. So the connections to the secondary DB were from the services, but the web connections hadn’t failed over as they’d not refreshed themselves. One reboot and we were back up and running, which, had we had a full disaster, we’d have got as a freebie.
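With hindsight, the sp_who2 check only proved that something from the app servers had connected, not that every component had. A minimal sketch of a stricter check is below. It assumes you have copied the Login, HostName, DBName and ProgramName columns from sp_who2 into rows, and that each component identifies itself with its own program name. The component names, logins and hosts here are made up for illustration and are not from our environment.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Session(NamedTuple):
    """One row of session data, e.g. copied from sp_who2 output."""
    login: str
    host: str
    database: str
    program: str


def missing_components(
    sessions: Iterable[Session],
    expected: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the expected (program, database) pairs that have no session."""
    seen = {(s.program, s.database) for s in sessions}
    return sorted(set(expected) - seen)


# Hypothetical sessions seen on the secondary site after failover.
sessions = [
    Session("svc_app", "APP01", "AppDB", "AppService"),
    Session("svc_app", "APP02", "AppDB", "AppService"),
]

# Hypothetical list of every component that should have reconnected.
expected = [("AppService", "AppDB"), ("WebFrontEnd", "AppDB")]

print(missing_components(sessions, expected))
# → [('WebFrontEnd', 'AppDB')]
```

An empty list would mean every expected component has at least one session on the secondary. Anything left in the list is a component that hasn’t followed the databases across, which is exactly what we missed with the web connections.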
The upshot of this is that we had