Disaster Recovery Basics: RPO – Recovery Point Objective

When writing a Business Continuity plan, one of the first metrics you need to nail down is your Recovery Point Objective, or RPO. This seemingly simple metric can have a major influence on how you need to design your application infrastructure.

For disaster recovery, your RPO is the point in time before the disaster that you want to be able to recover to. Another way of looking at it is as the maximum amount of data you are willing to lose in a disaster.

So an RPO of 15 minutes would mean that in the event of a major incident, you are willing to accept losing up to 15 minutes' worth of data. Depending on when the disaster occurs, you may recover to within 4 minutes of it, or lose the full 15 minutes.

Surely every company wants a recovery point objective of 0, don't they? Well yes, that would be the golden RPO. But achieving it is a huge task that has massive implications for the architecture and infrastructure behind the application. And an RPO for one system may not fit another system if they vary massively in workload or size.

Taking our example 15-minute RPO above for a SQL Server-based application, this could be achieved with the following:

  • Take transaction log backups every 15 minutes
    • We can now recover to within 15 minutes.
  • Write the backups to a remote server
    • This ensures we still have the backups if we lose the entire SQL Server box.
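As a sketch, the log backup step above might look like this on SQL Server. The database name and UNC path are placeholders, and in practice you would schedule this as an Agent job every 15 minutes with a timestamped file name:

```sql
-- Hypothetical 15-minute log backup to a remote share.
-- 'SalesDB' and the UNC path are placeholders; a real job would
-- generate a unique file name per backup rather than overwriting.
BACKUP LOG [SalesDB]
TO DISK = N'\\backupserver\logbackups\SalesDB\SalesDB_log.trn'
WITH INIT;
```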


So far, so simple. Now, what if the Business wants an RPO of zero (i.e. no loss of any committed data)? Is that doable? Yes, but it's going to cost you:

  • 2 SQL Server instances running in synchronous database mirroring (high-safety) mode
    • So that's a doubling of physical hardware costs, and a near doubling of licensing costs as well.
    • Plus your system is now more complicated to administer (compared to a single box).
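For illustration, high-safety mode is set on an existing mirroring session like this. This is a sketch only: the mirroring partnership must already be established, and the database name is a placeholder:

```sql
-- Switch an existing database mirroring session to synchronous
-- (high-safety) mode. 'SalesDB' is a placeholder database name,
-- and the mirroring partnership must already exist.
ALTER DATABASE SalesDB SET PARTNER SAFETY FULL;
```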

And that’s assuming your disk systems at either end, and the network connection between your servers, can keep up with your transaction throughput. If they can’t, then you will be looking at beefing them up as well, with all the associated costs that go along with that.

You can now see that the RPO you set can have a huge impact on the cost and complexity of designing a new solution, and that forcing an unrealistic recovery point objective onto an existing solution can lead to painful restructuring and redesign. This is yet another reason why a Disaster Recovery plan should be written during the design phase of a new project rather than as an afterthought.

Stuart Moore

Nottingham based IT professional with over 15 years in the industry. Specialising in SQL Server, Infrastructure design, Disaster Recovery and Service Continuity management. Now happily probing the options the cloud provides

When not in front of a computer, is most likely to be found on the saddle of his bike


Database slow to open after Instance restart or failover due to too many VLFs in transaction log.

An occasional client hosts a pretty busy ½TB Dynamics CRM database on a SQL Server cluster. The database has been in place for a while, and everyone’s happy with the performance, so on the surface everything seems fine. And, by the way, they don’t have a full-time DBA.

Due to a hardware failure they had an instance failover on the cluster. All the databases in the instance popped back up nice and quickly, but company_mscrm was stuck IN RECOVERY. As this is the core database for Dynamics CRM, their system was down. Thinking it was a size problem, they decided to leave it a bit longer. After 10 minutes they decided it was ‘stuck’ and restarted the instance. Same problem.

Now people were getting worried, so they started a restore onto a secondary server, and called us.

On connecting to their ailing server a quick sp_who2 showed a number of processes executing command DB STARTUP, so something was still operating on the database.
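If you prefer the DMVs to sp_who2, a rough equivalent might look like this (a sketch; column availability varies a little by version):

```sql
-- Sessions currently running crash recovery, with what they are waiting on.
-- During recovery these show up with the command 'DB STARTUP'.
SELECT session_id, command, wait_type, wait_time, percent_complete
FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP';
```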

Using the scripts provided by Microsoft SQL Server Customer Support – http://blogs.msdn.com/b/psssql/archive/2010/12/29/tracking-database-recovery-progress-using-information-from-dmv.aspx – I started collecting execution stats so we could see what the processes were doing (or not doing, as the case might be). By the way, I tend to adjust the capture interval to 5 seconds to start with, so I can get a feel for things without swamping myself with data.

From the tbl_recovery_tracking table:

SELECT total_elapsed_time, wait_type, reads, writes, cpu_time
FROM [dbo].[tbl_recovery_tracking]
WHERE session_id = 23


total_elapsed_time wait_type reads writes cpu_time
762441 IO_COMPLETION 109966 0 2418
767445 IO_COMPLETION 110811 0 2418
772448 IO_COMPLETION 111726 0 2449
777451 IO_COMPLETION 112753 0 2480
782453 IO_COMPLETION 113625 0 2480
787457 IO_COMPLETION 114565 0 2496
792460 IO_COMPLETION 115527 0 2527
797462 IO_COMPLETION 116303 0 2574
802465 IO_COMPLETION 117106 0 2589
807469 IO_COMPLETION 117880 0 2589
812471 IO_COMPLETION 118499 0 2589

So, much as you’d expect, we see lots and lots of reads. Then all of a sudden the reads stop and we start seeing a bigger increase in cpu_time:

total_elapsed_time wait_type reads writes cpu_time
2103341 IO_COMPLETION 317957 0 7129
2108344 IO_COMPLETION 318662 0 7144
2113346 IO_COMPLETION 319523 0 7207
2118350 NULL 320016 0 9672
2123353 NULL 320016 0 14523
2128355 NULL 320016 0 19390
2133359 NULL 320016 0 24398

So nearly there you might think. Not a bit of it. This carried on for a long time. In fact twice as long as it had taken to complete the reads:

total_elapsed_time wait_type reads writes cpu_time
6787241 NULL 320016 0 4456636
6792244 NULL 320016 0 4461316
6797247 NULL 320016 0 4465856
6802251 NULL 320016 0 4470505
6807262 NULL 320016 0 4475341
6812265 NULL 320016 0 4480177
6817267 NULL 320016 0 4484950

So, bearing in mind that total_elapsed_time is measured in milliseconds, the database recovery consisted of 35 minutes of reading data and then 78 minutes of processing. Which is a lot of time, when the recovery stats in the error log are this small:
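To put numbers on that, using the elapsed-time samples above (the read phase ended at roughly 2,113,346 ms, and recovery finished at around 6,817,267 ms):

```sql
-- Back-of-envelope conversion of the sampled elapsed times (ms) to minutes
SELECT 2113346 / 60000.0 AS read_phase_minutes,            -- ~35 minutes of reads
       (6817267 - 2113346) / 60000.0 AS processing_minutes; -- ~78 minutes of processing
```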

LogDate ProcessInfo Text
18/04/2013 08:50 spid23s 16 transactions rolled forward in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s 0 transactions rolled back in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery is writing a checkpoint in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery completed for database company_mscrm (database ID 8) in 2 second(s) (analysis 720 ms, redo 698 ms, undo 196 ms.) This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery is complete. This is an informational message only. No user action is required.

And the restore on the secondary box? It hadn’t gotten anywhere.

This all pointed to a ridiculous number of VLFs (virtual log files) in their transaction log. Lo and behold, a quick DBCC LOGINFO in the offending database revealed a shockingly high count of ~28,000 VLFs in a 70GB transaction log file. That’s a new record for me.

So to fix this, we agreed that they’d shrink the log file and recreate it at a more realistic size, lay off shrinking it in future, and set autogrowth to a sensible value for the occasions when it’s needed, by running:

USE company_mscrm;
GO
-- Back up the transaction log to 'flush' the active log out
BACKUP LOG company_mscrm TO DISK = N'\\backup\share\company_mscrm\backup.trn'
-- Shrink the log file down as far as possible
DBCC SHRINKFILE ('company_mscrm_log', 0, TRUNCATEONLY)
DBCC SHRINKFILE ('company_mscrm_log', 0)
-- Recreate it at the size we determined, with a sensible fixed autogrowth
ALTER DATABASE company_mscrm MODIFY FILE (NAME = 'company_mscrm_log', SIZE = 70GB, FILEGROWTH = 2GB)

First we back up the transaction log to ‘flush’ everything out. Then we shrink the log file as much as we can, and resize it to 70GB, which was the size we determined would cope with pretty much all of their usage. Finally we set a sensible autogrowth increment for the few occasions when the transaction log file isn’t big enough.

A simple solution to what was quite a major problem. The lesson to take away is that just because your production database appears to be running perfectly there can still be issues under the surface just waiting to catch you out. There are a number of tools and scripts available that allow you to monitor your log files and their VLF counts.
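As one example: on modern versions of SQL Server (2016 SP2 and later) you can query the VLF count directly; on older versions, such as the one in this post, you would count the rows returned by DBCC LOGINFO instead:

```sql
-- VLF count for a database.
-- sys.dm_db_log_info is only available from SQL Server 2016 SP2 onwards;
-- the database name is taken from this post's example.
SELECT DB_NAME(database_id) AS database_name, COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID('company_mscrm'))
GROUP BY database_id;
```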

If you need any help setting them up, or require advice on sizing your transaction logs, then Leaf Node are happy to help – please get in touch.


Backups – where it all starts

Dear DBAs of the world,

We need to talk about backups. Ever wonder why every DBA blog for any RDBMS product bangs on about backups? (ok, you NoSQL guys can shut up at this point).

Because if you’ve got good, and I mean good


When is a disaster not a disaster?

When it’s a disaster recovery test.

The DR test I mentioned previously (Checked your Disaster Recovery plans recently?) didn’t go quite as smoothly as expected. Filestore, load balancing and SQL all failed over to the secondary site as expected with no problems. A quick sp_who2 showed sessions from the apps servers into the correct databases using the correct users, so everything was good then?

Well, all the underlying services looked good, but the web access components weren’t looking healthy. OK, that may be an understatement: they were deader than a Norwegian Blue pining for its fjords.

But we had a ‘tested’ DR plan, so what happened?

Well, it turned out that we hadn’t had a ‘disaster’. Due to other work, we didn’t fail the application servers over to the secondary site. This meant they didn’t get a restart, which was an unwritten assumption in the DR plan.

But the applications were meant to follow the DB connections across, weren’t they? And we had connections to the DBs at the secondary site, didn’t we? It turns out there’d also been an upgrade that split the service DB connections from the web DB connections, which had previously been tied together. So the connections to the secondary DB were from the services, but the web connections hadn’t failed over, as they’d not refreshed themselves. One reboot and we were back up and running – a reboot which, had it been a full disaster, we’d have got as a freebie.

The upshot of this is that we had


Checked your Disaster Recovery plans recently?

One of my first jobs in 2012 is to perform some Disaster Recovery testing for a production Service to prove to the Business that it will fail over the way they want it to.
