Database slow to open after an instance restart or failover due to too many VLFs in the transaction log.

An occasional client hosts a pretty busy ½ TB Dynamics CRM database on a SQL Server cluster. The database has been in place for a while, everyone’s happy with the performance, and on the surface everything seems fine. And, by the way, they don’t have a full-time DBA.

Due to a hardware failure they had an instance failover on the cluster. All the databases in the instance popped back up nice and quickly, but company_mscrm was stuck IN RECOVERY. As this is the core database for Dynamics CRM, their system was down. Thinking it was a size problem, they decided to leave it a bit longer. After 10 minutes they decided it was ‘stuck’ and restarted the instance. Same problem.

Now people were getting worried, so they started a restore onto a secondary server, and called us.

On connecting to their ailing server, a quick sp_who2 showed a number of processes executing the command DB STARTUP, so something was still operating on the database.
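If you’d rather not pick through sp_who2’s output, the same sessions can be pulled straight from sys.dm_exec_requests. A minimal query along these lines (session_id values and wait types will obviously vary):

```sql
-- During crash recovery the command column for the sessions
-- working on the recovering database shows DB STARTUP
SELECT session_id, database_id, command, wait_type,
       reads, writes, cpu_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP';
```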

Using the scripts provided by Microsoft SQL Server Customer Support – http://blogs.msdn.com/b/psssql/archive/2010/12/29/tracking-database-recovery-progress-using-information-from-dmv.aspx – I started collecting execution stats so we could see what the processes were doing (or not doing, as the case might be). By the way, I tend to start with a capture interval of 5 seconds, so I can get a feel for what’s happening without swamping myself with data.

From the tbl_recovery_tracking table:

SELECT total_elapsed_time, wait_type, reads, writes, cpu_time
FROM [dbo].[tbl_recovery_tracking]
WHERE session_id = 23

total_elapsed_time wait_type reads writes cpu_time
762441 IO_COMPLETION 109966 0 2418
767445 IO_COMPLETION 110811 0 2418
772448 IO_COMPLETION 111726 0 2449
777451 IO_COMPLETION 112753 0 2480
782453 IO_COMPLETION 113625 0 2480
787457 IO_COMPLETION 114565 0 2496
792460 IO_COMPLETION 115527 0 2527
797462 IO_COMPLETION 116303 0 2574
802465 IO_COMPLETION 117106 0 2589
807469 IO_COMPLETION 117880 0 2589
812471 IO_COMPLETION 118499 0 2589

So, much as you’d expect, we see lots and lots of reads. Then all of a sudden the reads stop and we start seeing a bigger increase in cpu_time:

total_elapsed_time wait_type reads writes cpu_time
2103341 IO_COMPLETION 317957 0 7129
2108344 IO_COMPLETION 318662 0 7144
2113346 IO_COMPLETION 319523 0 7207
2118350 NULL 320016 0 9672
2123353 NULL 320016 0 14523
2128355 NULL 320016 0 19390
2133359 NULL 320016 0 24398

So, nearly there you might think. Not a bit of it. This carried on for a long time; in fact, twice as long as it had taken to complete the reads:

total_elapsed_time wait_type reads writes cpu_time
6787241 NULL 320016 0 4456636
6792244 NULL 320016 0 4461316
6797247 NULL 320016 0 4465856
6802251 NULL 320016 0 4470505
6807262 NULL 320016 0 4475341
6812265 NULL 320016 0 4480177
6817267 NULL 320016 0 4484950

So, bearing in mind that total_elapsed_time is measured in milliseconds, the database recovery consisted of roughly 35 minutes of reading data and then 78 minutes of processing. Which is a lot of time, when the recovery stats in the error log are this small:

LogDate ProcessInfo Text
18/04/2013 08:50 spid23s 16 transactions rolled forward in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s 0 transactions rolled back in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery is writing a checkpoint in database ‘company_mscrm’ (8). This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery completed for database company_mscrm (database ID 8) in 2 second(s) (analysis 720 ms, redo 698 ms, undo 196 ms.) This is an informational message only. No user action is required.
18/04/2013 08:50 spid13s Recovery is complete. This is an informational message only. No user action is required.

And the restore on the secondary box? Hadn’t gotten anywhere.

This all pointed to a ridiculous number of VLFs in their transaction log. Lo and behold, a quick DBCC LOGINFO in the offending database revealed a shockingly high count of ~28,000 VLFs in a 70GB transaction log file. That’s a new record for me.
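If you want the count rather than eyeballing the output, DBCC LOGINFO returns one row per VLF, so you can dump it into a temp table and count the rows. A sketch (the column list below matches SQL Server 2012; on earlier versions drop the RecoveryUnitId column):

```sql
-- DBCC LOGINFO emits one row per VLF, so COUNT(*) gives the VLF count
CREATE TABLE #loginfo (
    RecoveryUnitId int,        -- SQL Server 2012 onwards only
    FileId int,
    FileSize bigint,
    StartOffset bigint,
    FSeqNo int,
    Status int,
    Parity tinyint,
    CreateLSN numeric(38)
);
INSERT INTO #loginfo EXEC ('DBCC LOGINFO');
SELECT COUNT(*) AS vlf_count FROM #loginfo;
DROP TABLE #loginfo;
```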

So to fix this we agreed that they’d shrink the log file and recreate it at a more realistic size, lay off shrinking it in future, and set autogrowth to a sensible value for the occasions it’s needed, by running:

USE company_mscrm
GO
-- Flush the active log out to a backup so the shrink can reclaim as much as possible
BACKUP LOG company_mscrm TO DISK = N'\\backup\share\company_mscrm\backup.trn'
GO
-- Shrink the log file down as far as it will go
DBCC SHRINKFILE ('company_mscrm_log', 0, TRUNCATEONLY)
DBCC SHRINKFILE ('company_mscrm_log', 0)
GO
-- Recreate the log at a realistic size, with a sensible autogrowth increment
ALTER DATABASE company_mscrm MODIFY FILE (NAME = 'company_mscrm_log', SIZE = 70GB, FILEGROWTH = 2GB)
GO

First we back up the transaction log to ‘flush’ everything out. Then we shrink the log file as much as we can. Then we resize it to 70GB, which was the size we determined would cope with pretty much all of their usage. And then we set a sensible autogrowth size for the few occasions when the transaction log file isn’t big enough.
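For context on why this matters (these are the allocation rules in SQL Server versions up to 2012; they changed in 2014): every time the log file grows, a fixed number of new VLFs is carved out based on the size of that growth, roughly 4 VLFs for growths under 64MB, 8 for growths from 64MB up to 1GB, and 16 for anything larger. So a single 70GB allocation plus the odd 2GB autogrowth keeps the count in the tens, where thousands of tiny autogrowths had produced ~28,000. A quick sanity check after the resize:

```sql
-- One row per VLF comes back; under the pre-2014 rules a single 70GB
-- log yields 16 VLFs, and each subsequent 2GB autogrowth adds another 16
USE company_mscrm
GO
DBCC LOGINFO
GO
```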

A simple solution to what was quite a major problem. The lesson to take away is that just because your production database appears to be running perfectly there can still be issues under the surface just waiting to catch you out. There are a number of tools and scripts available that allow you to monitor your log files and their VLF counts.
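As a starting point, here is a rough sketch of such a check run against every database on the instance, using the undocumented (but widely used) sp_MSforeachdb procedure; the DBCC LOGINFO column list again assumes SQL Server 2012:

```sql
-- Collect the VLF count for every database on the instance
CREATE TABLE #vlf_counts (database_name sysname, vlf_count int);

EXEC sp_MSforeachdb N'USE [?];
    CREATE TABLE #loginfo (RecoveryUnitId int, FileId int, FileSize bigint,
        StartOffset bigint, FSeqNo int, Status int, Parity tinyint,
        CreateLSN numeric(38));
    INSERT INTO #loginfo EXEC (''DBCC LOGINFO'');
    INSERT INTO #vlf_counts SELECT DB_NAME(), COUNT(*) FROM #loginfo;
    DROP TABLE #loginfo;';

-- Anything in the thousands is a candidate for the shrink-and-resize treatment
SELECT database_name, vlf_count
FROM #vlf_counts
ORDER BY vlf_count DESC;

DROP TABLE #vlf_counts;
```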

If you need any help setting them up, or require advice on sizing your transaction logs, then Leaf Node are happy to help, please get in touch.

Stuart Moore

Nottingham-based IT professional with over 15 years in the industry, specialising in SQL Server, infrastructure design, disaster recovery and service continuity management. Now happily probing the options the cloud provides.

When not in front of a computer, is most likely to be found on the saddle of his bike
