Sunday, December 4, 2011

Analyzing Vendor Recommendations

Recently, I received some feedback from a vendor regarding changes I made to our customer's space-constrained application server, which hosts two user databases in a SQL Server 2005 instance.  Let's look at some of these recommendations, shall we?  These responses have more value for the vendor as learning points and thus weren't directly communicated with my customer.  Rather the discussion with my customer revolves around application ownership and recoverability, and how those items align with their overall business needs.

This is primarily in defense of my original actions, but brings up some important misconceptions around SQL Server, provides documentation for those myths, and discusses how to redress the issues so that the customer is served.

Vendor Recommendations
I realize it has been a while since I had setup my maintenance plan to backup the user databases and transaction logs, so I went in to see how they were doing.  But I see they are not there anymore.
Yes, I did change the existing maintenance plans.  They didn't cover system databases, which are key for recoverability.  
I also had a weekly maintenance plan to shrink the user databases.
The application server is space constrained, so for performance's sake, the original Shrink Database item was removed from the plan and replaced with T-SQL code for targeting log files only for DBCC SHRINKFILE commands.  Joined with regular log backups, this would allow the OS to retain more free space for regular operation but still let us enjoy point-in-time recovery for user databases.  For a long term solution, this system needs more storage provisioned, period.
The alerts you have setup have thresholds too low.  They don't take into account a database in "full" mode.  
The metrics gathered were from PerfMon stats, namely \\SERVER\MSSQL$INSTNACE\Database\Percent Log Used >= 90.  This metric is valid for any recovery model and can be easily illustrated by the well known DBCC SQLPERF(LOGSPACE) query, which will show us both the percentage used and the size of the log itself.  This metric matters in any case, and is logged to the Windows event log as an Error, not a Warning; its that important.
And you should never get an alert on the tempdb system database. 
A full transaction log, regardless of which database it is for, will halt the writing of data.  Things that might be missing from this systems configuration is a service account with the SE_MANAGE_VOLUME_NAME permission to allow for instant file initialization when auto-growth kicks in frequently to address the utilization issue, further propagating the alerts.
I went out and looked at the physical size of these databases and log sizes.  They are very small compared to normal operation.  The alert settings need to be changed.
The database size isn't the issue in question (thought with Shrink Database commands, there will be fragmentation issues), but with a properly scheduled BACKUP LOG in conjuction with auto-growth enabled, a database log file will balance out with the volume of transactions hitting the database.  Another item to consider is the number of VLFs since there's a history of growing and shrinking logs.
I am changing the user databases to "simple" recovery model for now.  Probably need to turn off the hourly transaction log backup for now.  
Yes, if you're changing recovery models, you're going to cause those explicitly typed BACKUP LOG jobs to fail.
I have run the shrink database on the user databases and on tempdb.
I'm glad database size is a concern, but we should look at sp_spaceused to see if we just plain need more space or if we're potentially over-indexed.  And shrinking tempdb is documented to cause corruption.
I also see that the SQL backups are only for three days.  Are these being swapped out so that there are actually more days being stored somewhere else (offsite maybe)?.
Looking at the larger picture, there is a VSS aware 3rd party backup system in place.  SQL Server backups exist for the purpose of near-line recoverablity and point-in-time recoverability.  Remember, we're space constrained, and you've just implimented a Shrink Database to address it, so we've only got so much space for backup retention. 
Please do not change any general or options properties on the user databases.  I will maintain these.  Just let me know if you need something changed as some items can have repercussions on the functionallity of applications.
It looks like you've got this under control.  The fact that compatibility_level is set to 80 and page_verify is set to CHECKSUM has me concerned.  Which do we really need?
The biggest point here is that user applications are two completely separate systems sharing the SQL server.  They really need separate backup and maintenance plans as the data is used and maintained differently.
Taking a holistic view of the application and server together, these backup plans may need crash consistency between databases.  We've already shown that there are some unique issues with the compatilbity_level and "other general settings". 
The SQL server is two service packs behind.  SQL Server 2005 should be on SP4.  
SQL Server 2005, even at Service Pack 4 with be out of maintenance from Microsoft as of 12/2011.  And this doesn't jive with the compatibility_level request from earlier.  If I'm looking to future proof my application and move to SQL Server 2012, I won't be able to attach my database, let alone restore it.

How can we be productive about these recommendations as consultant generalists and DBAs?


Get the stake holders into a stand-up meeting.  Get the overall goals for the application written down.  What's our time to recovery needs?  What's our point in time recovery needs?  Who's responsible for what part of a disaster recovery?  How about performance?  What's the life-cycle of this application?


Once the questions are answered, don't be afraid to let the 3rd party vendor handle the application the way your customer wants.  They've elected to utilize this vendor.  But be cautious, and take precautions.  In this example, I'll have to let the customer know we'll still be getting alerts and inform them of the ramifications of ignoring or permanently suppressing those notifications.


Put your plan into action.  And, of course, backup, backup, backup.  Then test, test, test!  There's a ton of great information in the links above, so go do your homework so that you can act on knowledge, not assumptions.

No comments:

Post a Comment