An occasional client of ours hosts a pretty busy 0.5 TB Dynamics CRM database on a SQL Server cluster. The database has been in place for a while and everyone's happy with the performance, so on the surface everything seems fine. And btw, they don't have a full-time DBA.
Due to a hardware failure they had an instance failover on the cluster. All the DBs in the instance popped back up nice and quickly, but company_mscrm was stuck IN RECOVERY. As this is the core database for Dynamics CRM, their system was down. Thinking it was a size problem, they decided to leave it a bit longer. After 10 minutes they decided it was 'stuck' and restarted the instance. Same problem.
Now people were getting worried, so they started a restore onto a secondary server, and called us.
On connecting to their ailing server, a quick sp_who2 showed a number of processes executing the command DB STARTUP, so something was still operating on the database.
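You can get the same information sp_who2 surfaces, with less noise, straight from the DMVs. A minimal sketch, filtering on the DB STARTUP command (on some versions percent_complete is populated for recovery, which is a handy bonus):

```sql
-- Sessions currently running database recovery
SELECT session_id, command, status, wait_type, percent_complete
FROM sys.dm_exec_requests
WHERE command = N'DB STARTUP';
```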
Using the scripts provided by Microsoft SQL Server Customer Support – http://blogs.msdn.com/b/psssql/archive/2010/12/29/tracking-database-recovery-progress-using-information-from-dmv.aspx – I started collecting execution stats so we could see what the processes were doing (or not doing, as the case might be). Btw, I tend to start with a capture interval of 5 seconds, so I can get a feel for what's happening without swamping myself with data.
SELECT total_elapsed_time, wait_type, reads, writes, cpu_time
FROM [dbo].[tbl_recovery_tracking]
WHERE session_id = 23
Much as you'd expect, we see lots and lots of reads. Then, all of a sudden, the reads stop and we start seeing a bigger increase in writes:
So nearly there, you might think. Not a bit of it. This carried on for a long time. In fact, twice as long as it had taken to complete the reads:
So bearing in mind that total_elapsed_time is measured in milliseconds, the database recovery consisted of 35 minutes of reading data followed by 78 minutes of processing. Which is a lot of time when the recovery stats in the error log look this small:
18/04/2013 08:50  spid23s  16 transactions rolled forward in database 'company_mscrm' (8). This is an informational message only. No user action is required.
18/04/2013 08:50  spid13s  0 transactions rolled back in database 'company_mscrm' (8). This is an informational message only. No user action is required.
18/04/2013 08:50  spid13s  Recovery is writing a checkpoint in database 'company_mscrm' (8). This is an informational message only. No user action is required.
18/04/2013 08:50  spid13s  Recovery completed for database company_mscrm (database ID 8) in 2 second(s) (analysis 720 ms, redo 698 ms, undo 196 ms.) This is an informational message only. No user action is required.
18/04/2013 08:50  spid13s  Recovery is complete. This is an informational message only. No user action is required.
And the restore on the secondary box? Hadn't gotten anywhere.
This all pointed to a ridiculous number of VLFs in their transaction log. Lo and behold, a quick DBCC LOGINFO in the offending database revealed a shockingly high count of ~28,000 VLFs in a 70 GB transaction log file. That's a new record for me.
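If you want an actual count rather than eyeballing the output, you can capture DBCC LOGINFO into a temp table. A rough sketch, assuming the SQL Server 2012+ result-set shape (earlier versions don't have the RecoveryUnitId column, so drop it there):

```sql
-- Capture DBCC LOGINFO output; one row per VLF
CREATE TABLE #vlfs (
    RecoveryUnitId INT, FileId INT, FileSize BIGINT, StartOffset BIGINT,
    FSeqNo BIGINT, [Status] INT, Parity INT, CreateLSN NUMERIC(38,0));

INSERT INTO #vlfs EXEC ('DBCC LOGINFO');

SELECT COUNT(*) AS vlf_count FROM #vlfs;

DROP TABLE #vlfs;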
So to fix this we agreed that they'd shrink the log file and recreate it at a more realistic size, lay off shrinking it in future, and set autogrowth to a sensible value for the occasions when it's needed, by running:
USE company_mscrm
GO
BACKUP LOG company_mscrm TO DISK = N'\\backup\share\company_mscrm\backup.trn'
GO
DBCC SHRINKFILE ('company_mscrm_log', 0, TRUNCATEONLY)
DBCC SHRINKFILE ('company_mscrm_log', 0)
GO
ALTER DATABASE company_mscrm
MODIFY FILE (NAME = 'company_mscrm_log', SIZE = 70GB, FILEGROWTH = 2GB)
GO
First we back up the transaction log to 'flush' everything out. Then we shrink the log file as much as we can, and resize it to 70GB, which was the size we determined would cope with pretty much all of their usage. Because the log is grown back in one large operation rather than thousands of tiny autogrowths, it ends up with only a handful of VLFs. And then we set a sensible autogrowth size for the few occasions when the transaction log file isn't big enough.
A simple solution to what was quite a major problem. The lesson to take away is that just because your production database appears to be running perfectly, there can still be issues under the surface waiting to catch you out. There are a number of tools and scripts available that let you monitor your log files and their VLF counts.
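As a starting point for that monitoring, the DBCC LOGINFO trick above can be looped across every database. A hedged sketch using sp_MSforeachdb, which is undocumented but widely used for exactly this kind of check (the temp-table layout again assumes SQL Server 2012+):

```sql
CREATE TABLE #vlfs (
    RecoveryUnitId INT, FileId INT, FileSize BIGINT, StartOffset BIGINT,
    FSeqNo BIGINT, [Status] INT, Parity INT, CreateLSN NUMERIC(38,0));
CREATE TABLE #vlf_counts (database_name SYSNAME, vlf_count INT);

-- '?' is replaced with each database name in turn
EXEC sp_MSforeachdb '
    INSERT INTO #vlfs EXEC (''DBCC LOGINFO ([?])'');
    INSERT INTO #vlf_counts SELECT ''?'', COUNT(*) FROM #vlfs;
    TRUNCATE TABLE #vlfs;';

-- Worst offenders first
SELECT * FROM #vlf_counts ORDER BY vlf_count DESC;
```

Anything with a VLF count in the thousands is a candidate for the shrink-and-resize treatment above.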
If you need any help setting them up, or want advice on sizing your transaction logs, then Leaf Node are happy to help; please get in touch.