All Your SQL Server: Management

Showing posts with label Management. Show all posts

Thursday, May 17, 2018

A Reasonable DR

I wracked my brains over Disaster Recovery concepts for several months before I reached a viable plan. The key was when the light bulb went on that you must plan for two distinct types of disasters (call them major and catastrophic). You need to plan for both, but quite possibly you can be a little more lenient on yourself for the catastrophies. I'm not going to get too far into the weeds with my specific implementation as your's will likely be different. How you recover from each disaster really depends on the perceptions of your customers (what is their range of allowable excuses) and your employer's ownership contingencies (what is their insurance like and what are their long-term objectives). Things that you read in the "trades" about recovery time objective and recovery point objective (RTO and RPO) are still relevant, but they are different under each of the scenarios.

A "major" is when the halon failed and a fire in your server room melted your NAS. Maybe some other equipment survived, and all the hardware outside the server room is just fine. Under this scenario both your customers and your management will extend you a fair amount of sympathy but they still expect some kind of reasonable recovery efforts. It widely varies by circumstance, but in my case they would expect that after a couple of days our systems would be up and running again, and at worse we might have to enter a half day's data over.

A "catastrophic" is when Hurricane Jeff rolls in from the gulf, your whole building is flooded, half the state is without power for two weeks, trees are down, roads are impassible, and the national guard is called out to maintain order and provide drinking water. In my case I suspect management could actually be called in to decide whether to walk away from the business by declaring a total insurance loss. If they did decide to continue, they could probably convince our customers that a month long gap in service was totally justified (of course we'd waive a couple months of bills out of courtesy).

You have to plan for recovery from a major disaster with the resources at your near immediate disposal. It helps to have both onsite and offsite copies of your software and your data; onsite copies are typically faster to access and should be counted on as the primary recovery source. Offsite resources might be compromised by the disaster (such as if the fire took down your switch hardware servicing your cloud provider). Offsite resources however may provide some resilience should you not end up with a complete set of onsite backups after the disaster.

You have to plan for catastrophic recovery from solely the offsite resources. You may need to procure new hardware in a new location and your old location might be permanently inaccessible.

Once you've got these concepts firmly in mind and know for each the RTO and RPO, you can work backward to determine your backup strategy. It doesn't make sense to backup your log files locally on a frequency greater than your major disaster RPO, and to perform offsite database log copies more frequently than your catastrophic RPO.

Planning to meet the Recovery Time Objective is rather more nuanced, and likely takes actual practice to discover. How long does it take you to restore a copy of each of your databases? How long to restore the subsequent differentials and log backups? Until you try it you won't know for sure. Also consider that your charges are usually blessed with running on the fastest IO hardware; in a disaster or catastrophe you may well be restoring to considerably slower boxes.

There now, that was fun, wasn't it!

Thursday, April 12, 2018

Architecture

When you're the DBA in a shop undergoing vigorous software development, one of the knottier challenges is how to specify which production SQL instances to order, as well as how to allocate your databases across those instances. The decisions in this process fall under the general rubric of "data architecture". Several design aspects interplay to pull your decision-making into conflicting directions. The big four are:

- cost
- execution efficiency
- ease of ongoing instance management
- security

You won't be able to provide a clearcut recommendation nor is there a process that magically defines a "best" answer. Also the parameters change over time, as new projects create new databases, and advances in hardware change the optimal choices. You need to weigh the pros and cons of all the options and come up with a "gut feel" configuration that appears to provide the best balanced approach.

Let's delve into each of the important factors to arrive at a happy medium. This process plays out similar to a game of ping-pong, where you smack the instance counts back and forth between high and low until you settle into a reasonable compromise.

1) Cost: Fairly early on you need to plan for the full operational workload, with the features of the database engine that you will require. "Enterprise" versions of SQL Server, although considerably more expensive, often provide the crucial functionality that a large implementation will require, such as data compression, advanced encryption, and a more thorough utilization of modern hardware and cloud-based blended resources.

Some of your databases may not require this complete functionality however: many vendor-supplied packages posses less severe operational requirements and work fine in a "sandboxed" engine. Often you can host "reporting" databases on simpler engines as well.

For future flexibility it's helpful to have a minimum of three instances, one of which is Enterprise. Normally you would purchase more instances depending upon the additional factors discussed below. So we've started by ping-ponging the low end to three. Yes you could throw hundreds of databases on each instance, but this certainly isn't optimal.

2) Execution efficiency: The multiple databases that reside on an instance naturally compete for resources: CPU threads, IO throughput, disk space, and tempDB all experience a shared impact that can lead to resource contention. Each individual database would prefer to have the instance all to itself, but if you follow this route it quickly becomes prohibitively expensive.

Besides, outside of small vendor supplied systems, most software needs to facilitate reads, updates, and data movement across various company databases; having databases with heavy non-batch cross references on the same server provides faster operations as it reduces the network traffic.

Therefore to establish a top bound on the number of instances, group your databases by non-batched cross reference activity. You can also group databases onto an instance by other similar attributes, such as common departmental access, or parallel functionality (eMail related databases clump together, reporting databases clump together, etc). Avoid putting large databases on the same server though if you anticipate it would cause high contention of CPU threads, IO throughput, disk space, or tempDB.

3) Ease of ongoing instance management: calculating the last section likely gave you a dozen or so instances (or more). Now it's time to think about the operational overhead of maintaining all those instances. For one thing, you will need to occasionally down an instance to apply a service pack or patch rollup. You also will build and maintain the automated backup and index optimization plans on each server. Some time in the future you may be even consider upgrading each engine to a higher release version. Once in a while you want to review the SQL error logs on each box.

Clearly then "ease of management" pushes you toward a smaller, further consolidated set of instances. Some helpful guides to further shrink your instances include whether workloads are similar (reporting, batch or OLTP), and if large jobs that you anticipate might be running can be scheduled to avoid stepping upon one another (and avoid stepping upon your scheduled maintenance plans).

4) Security: Now that you've narrowed down and consolidated your instances as much as possible, one final slight expansion might be appropriate to accommodate specialized security considerations. For example, most businesses are especially sensitive about protecting their financial data from prying eyes, so it is often a sensible choice to host the finance ERP on its own server.

In some settings the customer data may warrant exceptional auditing, protection, or encryption safeguards, so it's not unusual for that to be segmented off to the side as well. Many times externally supported vendor-supplied systems may also best be positioned on their own server, not only to more clearly separate update and maintenance responsibilities but also to delineate a security environment for the VARs with offsite access.

It may feel vaguely unsatisfying to heuristically arrive at a solution to such an important decision. Yet after you've ordered, setup, configured instances and then allocated the databases you have a brief window of a year or so where you still have the flexibility to shift things around (before the subsequent effort to reconfigure becomes too overwhelming). Once lots of new production software gets implemented with connection strings set to various locations, the architecture gets fairly tightly "baked in".

Friday, February 16, 2018

The Best Tool

DBAs have all sorts of scripts and tools for managing SQL Server databases. In day to day operations you are down in the weeds with query plans and performance statistics and various monitors. You pop up a level and daily check the database backups and size of the log files. Every once in a while you lift on up to the "instance" level and check out the server logs or disk throughput.

Every other month or so I escape all the way out to the top and review once again what has turned out to be my best tool for managing the shop overall: an Excel spreadsheet. You see new databases seem to materialize on various instances all of the time, and developers even have a habit of spinning up new (poorly configured) instances all by themselves as well. So this spreadsheet helps me keep track of the whole mess and figure out where my DBA love needs to focus next.

Let me go into some detail about what the layout and each column signifies, and why they are important. Each instance has its own section; in the header adjacent to the instance name I show the SQL version, the processor count and speed, and the memory available on the server. Next to that is its backup schedule with an indication of frequency and level of manual intervention. This row of information comes in handy both when the CIO wants to suggest a different configuration on the VM boxes, and when normal backups run out of space and I need to review when they /should/ have happened.

Next comes a subheader row with column labels, and then a row (or more) for each database on that instance. The column next to the database name has the size of the data storage in KB. This is useful as every few months when I return to update the spreadsheet, I can readily tell which databases are most actively growing. Next comes the recovery model, Simple or Full (more about this in this previous post).

Next the location of the mdf database file; if the database has secondary filegroups then I show the location of the secondary files below this (with as many rows as necessary to cover all the secondaries). Then the location of the log file and the location of the backups. Note that to make the document more legible I abbreviate the often long path references with a double-letter pseudo disk designation, that is described in a legend at the bottom of each section.

Listing the file locations serves a couple of useful purposes. When you are first assembling this spreadsheet the discovery and documentation process helps you avoid the high risk of having all the files on a single physical drive (a notoriously dangerous arrangement). Later, when reviewing disk latency, you may find the spreadsheet useful in avoiding disk bottlenecks. Finally in the dreaded clutches of disaster recovery, you will know where your lifesaving logs and backups are located.

Next comes a column I have labeled Write Coupling. This and the next column, Software Volatility, relate to how you design your recovery models. For each database I provide a soft subjective rating from very low to very high of how tightly the dB writes are coupled to intricate processes, and how much the underlying update software is changing (more on this in the later post about recovery models).

Next follows a column labeled SME, for subject matter expert. Typically one of the lead developers at your company knows the most about this particular database, and if you need to migrate tables or take the database offline for maintenance then this is the person with whom you will get clearance. When you are first cobbling together this spreadsheet it will be smart to hold a few meetings with the developers to nail this down.

Finally a column for Notes that mostly alerts me to tasks and technical debt that needs to be worked down, or any special circumstances that excuse a database from normal maintenance processes.

You can find all sorts of costly tools to assist you in managing an array of databases, but far and away the most useful and cost effective is this simple spreadsheet.

Tuesday, January 16, 2018

Table Cleanup

For large databases actively under development, it's not unusual to accrue a fairly large collection of useless and abandoned tables. Here's a query you can run to find them... in this example my cutoff is a year. The polite thing to do is notify the staff of your intention to archive the large tables (if they so indicate) before you delete them. Script out the definitions of all of those you plan to delete anyhow, just in case somebody suddenly misses them.

DECLARE @DAYSunTouched INT
SET @DAYSunTouched = 365

;WITH UnUsedTables (TableName , TotalRowCount, CreatedDate , LastModifiedDate )
AS (
SELECT DBTable.name AS TableName
     ,PS.row_count AS TotalRowCount
     ,DBTable.create_date AS CreatedDate
     ,DBTable.modify_date AS LastModifiedDate
FROM sys.all_objects DBTable
     JOIN sys.dm_db_partition_stats PS ON OBJECT_NAME(PS.object_id)=DBTable.name
WHERE DBTable.type ='U'
     AND DBTable.modify_date <= (SELECT sqlserver_start_time FROM sys.dm_os_sys_info)
     and DBTable.modify_date < (getdate() - @DAYSunTouched)
     AND NOT EXISTS (SELECT OBJECT_ID
              FROM sys.dm_db_index_usage_stats
              WHERE OBJECT_ID = DBTable.object_id )
    )
SELECT DISTINCT TableName , TotalRowCount, CreatedDate , LastModifiedDate
FROM UnUsedTables
ORDER BY TotalRowCount desc

Tuesday, August 15, 2017

All Them Logs

Oh yes those dear transaction logs: can't live with them, can't live without them. Understanding how transaction logs work and how to manage them is the key to having a happy DBA life. The transaction log is basically a copy of all SQL statements run against your database, (except for things like bulk inserts in Bulk Logged mode, or inserts made when traceflag 610 is on). In a disaster on a database set on Full Recovery you can obtain a fully functioning database by first restoring the last complete backup, and then restoring any subsequent log backups. You may still of course have to manually run any bulk imports executed after the last dB backup, but that's seldom an issue.

The nice thing about logs is they backup really fast (they are after all just a bunch of SQL). I can back up a log for a day's worth of activity on a heavy production database in about 15 minutes. The nasty things about logs is they restore rather slowly (as the engine has to re-execute all that SQL). My 15 minute backup takes a good hour and a half to restore. The other nasty thing about log files is, if you go with Instant File Initialization and autogrow, then you need to manage their size by shrinking them. If the device hosting your log files for a database runs out of space, then that database won't execute any more transactions. No bueno.

Hence it's a good idea to add a customized alert to RedGate monitor when your log file expands beyond a certain size. Also I use a little trick to "flip" between different copies of the log file: right click on your database, choose properties, and then files. Add a couple files of type "log" but leave their size small and turn off autogrowth. Now when your primary log file gets too large, turn autogrowth on one of the small files, and off on your primary. After the next log backup you should be able to shrink the primary log file down to nearly nothing (well if there aren't long pending transactions still out there). In this manner you keep occasionally swapping between active log files as their size dictates.

Another nifty trick is to allocate an extra spare log file to a different physical device (but leave it small without growth). That way in an emergency where your disk drive hosting the logs fills up, you can just click into properties and turn autogrowth on your spare log file. Once that's done, backup and shrink the primary logs.

Of course if you don't want to hassle with the logs, don't need point-in-time recovery, and don't need them for log-shipping to a synchronized database, just leave your database on Simple recovery. You can then shrink the logs at your own leisure (usually after a full dB backup) and you won't have to worry about backing up the logs themselves.

Tuesday, January 17, 2017

That Old Cleanup

Once you're consolidating information into your database tracking worksheet, one thing you'll quickly discover is that your shop has muckloads of old data lying about. Some are completely obsolete databases, some of them are cranky old former incarnations of databases that are now active elsewhere, duplicate test or QA versions, or ghosts of playthings that former employees forgot to delete. If you examine the directories where databases are housed you will even find remnant mdf and ldf files that somebody detached (but never deleted). Inside of existing active reporting databases you will find large obsolete tables, former versions of tables, and old historical data in the actively referenced tables.

Cleanup and archiving is a royal mess, but it's the first step in organizing your storage and then reallocating filegroups appropriately. Here are some general principles and guidelines to help resolve this headache.

The first principle is to err on the side of caution. Before you permanently remove any database take it offline for a couple of weeks to verify that no unanticipated processes are affected. If you have low-cost compressible storage available (say on an old file server) then zip the zombie databases there before you delete them.

If you encounter a database with lots of old date-versioned tables, the best trick I've found is to download a tool along the lines of TOAD. With their database explorer you can easily highlight rows, ctrl-c for copy, and then paste the resulting table names into an Excel spreadsheet. Sort out the tables you've targeted for deletion and run it past the staff to make sure they are indeed obsolete. Then you can prepend "drop table" in front of each row in Excel and copy and paste the rows back into a SQL editing window. Presto. (Keep a current backup just in case).

Archiving historical data from live tables is the most work. Depending upon the version of SQL you are running (Standard or Enterprise) you may tackle this with archiving or partioning, but in any case the last few steps typically require weekend or off-hours work to down the database. In the Standard environment my general approach is first to archive, then to shrink.

You can run the first step, archiving, while everyone is blissfully plugging away at work. You may choose to keep the archive in a new table in a separate database, or you may export it to a text file for zipping to your low cost compressible storage. To avoid having too severe an impact on production I like performing the archival copy (or export) with SSIS, running on a view that selects perhaps a half million rows at a time.

Once you've got your archive you're ready for the more consumptive task (on the weekend with production stopped). Identify whereabouts on the clustered index you will be "cutting" the table. Select top(0) from the old table into a new working table. Make sure the new target table has the correct cluster index defined and any auto-increment Identity column with a seed. Now set identity insert on for the new working target and use "Select Into" with a tablock hint to load the data you are saving into the new working table. Rescript the original table's non-cluster indices for a later step, remove any parent-child table constraints, and then you can drop the original table. Is your heart beating in your ears yet? Now rename the new work table back to the original table name and rebuild the non-cluster indices. If you'd like you can recreate the parent-child referential integrity. There, that wasn't so bad now, was it?

Of course if you're on Enterprise edition you can use partitioned tables. So much easier.

Wednesday, September 14, 2016

Recovery Full or Simple

Once you've got your databases identified and organized into a spreadsheet and accomplished the preliminary cleanup, the next step to getting your act together is to determine the recovery model on each database. Well, maybe some basic instance tuning, but we can leave that for another blog post.

You only get two choices for the recovery model, full or simple, but I still can't believe how much difficulty folks have when choosing between the two. One thing you'll notice right off the bat is software developers are your worst resource for intelligently making this choice. You need to ping them for critical information to measure certain volatility features, but after that the choice really should be up to you. You're the one that will be managing disk space, backups, and restores.

There are only two key questions you need to ask yourself to make a final determination on what recovery model to choose for each database: 1) would it kill me if I was asked to restore this database to a precisely specific point in time and could only get it to the closest day? 2) Am I willing to manage log file backups and shrinks on this puppy?

If your answer to both questions is yes, then you need the full recovery model. Otherwise, use the simple model. But getting to the answer of these two questions is really the bulk of what I'd like to chat about.

Despite all the confusing things you will read about the recovery model, the gist of the difference between full and simple is the ability to perform "point in time" recovery. On the simple model you can restore whatever backup you want to your database, then any differential you ran, and you'll get the state of the database naturally from when you made that last differential backup. On the full model, you can additionally restore all or portions of the log file backup(s) to reflect the transactions that happened after your last differential.

Therein lies the rub: on simple recovery your log files don't grow much, but on full they can get quite large. On databases set to full recovery you will need to periodically shrink the log files, and of course you need to add them to your maintenance plan to back them up as well.

This is where that spreadsheet of databases is helpful. If a production database exhibits a highly combined level of Write Coupling and Software Volatility then you'll likely need to bite the bullet and set it on full recovery model. Very seldom will I set a development, QA, or staging database to this model though. A better way to manage high software volatility on a development database is to make sure your developers are using good quality source control. If a production database is written to at random times by all sorts of batch and OLTP systems or your developers are constantly rolling out production upgrades, then you'd best stick with full recovery.

Tuesday, August 23, 2016

Auto Trap

My employer is fairly liberal in allowing reasonably competent folks direct access to SSMS for querying into our databases. This approach encourages folks to learn the structure of the data for themselves, and to answer most of the ad hoc work they might otherwise need to investigate. The disadvantage though is occasionally an employee of lesser experience will execute a couple of queries to bog the system down. As a DBA I need awareness of the clumsier executions without constantly dropping everything to review each and every medium-grade RedGate monitor alert. The halfway elegant solution I came up with looks like this:

DECLARE @sess varchar(7)
DECLARE @emMsg varchar(270)
DECLARE @sqlRun varchar(200)

set @sess = NULL

select TOP(1) @sess = ql.session_id, @sqlRun = ql.query_text from
(
select r.session_id
   ,substring(qt.text,r.statement_start_offset/2,
   (case when r.statement_end_offset = -1
   then len(convert(nvarchar(max), qt.text)) * 2
   else r.statement_end_offset end - r.statement_start_offset)/2)
   as query_text, r.cpu_time, r.total_elapsed_time
   , r.reads, r.logical_reads
from sys.dm_exec_requests as r
cross apply sys.dm_exec_sql_text(sql_handle) as qt
inner join sys.dm_exec_sessions as es on r.session_id = es.session_id
where es.is_user_process = 1
) as ql
where ql.query_text not like '%where%'
and ql.query_text not like '%top%'
and ql.logical_reads > 200000

IF @sess is not null

BEGIN

set @emMsg = 'Session id ' + @sess + ' Running ' + @sqlRun

EXEC msdb.dbo.sp_send_dbmail
    @profile_name='SqlEntNotifications',
    @recipients = 'jdchapman@mycompany.com',
    @subject = 'Unbounded Query Alert',
    @body = @emMsg
END

I created a SQL Server Agent job that runs this every ten minutes. This sends me an eMail when something is spinning up a lot of read activity, but it also filters out queries to just those missing a Where clause or a Top restriction.

It's not perfect enough that I might automatically kill SPIDs; sometimes it sends along an eMail from a long-running production job -- actually a good thing, as it alerts me to places I might have to trace or optimize later. It does tend to help me keep the SQL instance generally out of trouble though.

Wednesday, July 6, 2016

Delta Blues

I don't know what your DBA life is like, but at my place at any given time only one gigantic database garners the majority of the development activity. I'm not directly involved in the development -- they don't fetch me into the design meetings nor copy me on design specs -- but I would still like to be aware of what changes are transpiring in the database (if anything to have some idea where to look when something suddenly goes awry). The quiet way to do this is to "delta" the database, comparing the current version to the database from say a week ago. To pull this off neatly (without having another copy of the gigantic database sitting around glomping up disk space) you need a couple of tricks and a good comparison tool.

First you should make a baseline "empty" copy of your production database -- a full copy without any data in the tables. From SSMS right click on your database, go to tasks, and then script database. When you get to the wizard page that specifies an output file location, after you type in a savefile name click on the Advanced button. Scroll down to where it has Create USE Statement and uncheck it. Also make sure the Script Triggers box is checked.

Now that you've got the whole thing scripted, create a new database with rather small file allocations (yeah you need to mimic the filegroup structures too) and while attached to your new database open up that sql query you just saved. Double check to assure you're attached to your new database and run your script. Boom, a new empty baseline database. Note though sometimes the sequence SSMS originally creates the script violates some dependencies between objects: you may have to run the same script over again so it actually creates everything.

A week of work goes by and now it's time to run a comparison. If you don't already have a commercial product then you will need to download a database comparison tool; several exist but I recommend Open dBdiff. It's especially nice since it comes with the source code to allow you to easily make small tweaks to change colors, button locations, and whatnot.

Once you run the comparison you will need to examine the change in each of the elements that differ. By clicking the checkbox in front of those that are significant, the tool builds the script to "true up" the versions. When you finish your review click on the Synchronized Script tab, and then the saveAs button to store the delta script to a file folder.

The next week attach to your previous-version database, run the script that was the delta from last week, and then open your comparison tool and delta your production database again. Now you have the changes for this week. If you do this every week, then after a while you have a whole history of changes. Working from the original fully scripted baseline database you can execute successive deltas to bring the empty database up to any particular point in time (well, within a week or so).

Monday, January 18, 2016

All The Columns

It doesn't happen to me frequently, but once in a while a tool returns a bunch of SQL that I need to investigate, yet when I look I can't recognize what database it's running against. When your tool doesn't give you a clue, you can use this handy SQL statement to return a list of all of the columns, in all of the databases on your entire instance:

DECLARE @SQL varchar(max)
SET @SQL=''

SELECT @SQL=@SQL+'UNION
select
'''+d.name+'.''+sh.name+''.''+o.name,c.name,c.column_id
from '+d.name+'.sys.columns c
inner join sys.objects o on c.object_id=o.object_id
INNER JOIN sys.schemas sh on o.schema_id=sh.schema_id
'
FROM sys.databases d
SELECT @SQL=RIGHT(@SQL,LEN(@SQL)-5)+'order by 1,3'

EXEC (@SQL)

Pretty handy, eh?