Monday, December 19, 2016

Security Checkup

Sure, you maintain the security on your instances, setting up logins and database mappings. You probably also assign folks to a couple of database roles you maintain to restrict access to certain critical tables. Still, employees come and go, and even though your company likely follows a termination procedure that loops you in to disable logins, security tasks still fall through the cracks.

Help is on the way. Run this:
EXEC xp_logininfo
Do you see any ex-employees listed with admin rights?  Oops. Next check the members of the groups that are listed:
EXEC xp_logininfo @acctname =  '[group_name]', @option = 'members'
Maybe you still have a little more cleanup work to perform, eh?
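When you do find a stale login, the cleanup itself is short. A sketch, with a made-up login name:

```sql
-- Hypothetical example: strip sysadmin from a departed employee's
-- login, then drop the login entirely
ALTER SERVER ROLE sysadmin DROP MEMBER [CORP\jdoe]
DROP LOGIN [CORP\jdoe]
```

If the login still owns databases, objects, or Agent jobs, DROP LOGIN will complain, so disable it first and reassign ownership at your leisure.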

Here's another way to see the active sysadmins:
SELECT l.loginname
FROM sys.syslogins l
JOIN sys.server_principals ss ON l.[name] = ss.[name]
WHERE ss.is_disabled = 0
  AND l.sysadmin = 1
Finally, within a single database, this query is useful for reviewing folks' rights:
;WITH RT AS (SELECT   name AS RoleType, principal_id
          FROM      sys.database_principals
          WHERE     (type_desc = 'DATABASE_ROLE'))
SELECT    P.name AS UserName, RT.RoleType
FROM      RT INNER JOIN
          sys.database_role_members AS RM ON RT.principal_id = RM.role_principal_id INNER JOIN
          sys.database_principals AS P ON P.principal_id = RM.member_principal_id

Tuesday, November 15, 2016

Improving Bad Queries

The canned reports available from SSMS are great for starting to identify the worst of the Bad Queries. If you right-click on an instance, then go to Reports, Standard Reports, you can see the top queries by total IO and by CPU (these are two separate reports). Be aware, however, that these are based upon the queries currently in the plan cache (they aren't an all-time worst list). One thing that's a bit frustrating is you can't copy and paste directly from these reports. A little hidden trick though: right-click on the report and choose Export to Excel. Now you've got some queries to optimize.

Another way to obtain this same information is to examine RedGate monitor; if you click to see the detail of an alert and then scroll down a bit, you can click on the Top-10 queries. This shows essentially the same thing as the SSMS report, except it's a snapshot from the time when the alert occurred. A great practice to follow is to view the RedGate graphs of disk reads over the last 24 hours, and then look for instances of high activity. Find an alert triggered around the same time and look at its detail for Top-10 queries: now you have more to fix.

Once you've identified a handful of troublemakers it's time to put each one under the microscope. One way of course is to examine the query plan, but first I like to paste the SQL for the query into a new-query window, and then highlight the whole thing, right click, and view in designer. Look at joins by moving the table boxes around. Is there a way to arrange them such that none of the lines representing the joins cross?

Once you've done that, look and see if there's a path around the joins taking you in a circle (SQL doesn't like to resolve such things). See if any of the joins show an f(x) symbol, another troublesome indicator. Now look at the tables out at the far reaches of the diagram, linked by a single join to the mass of tables. Do these extremities have any columns or filters selected? If not then you don't need the table in the query (but double check the SQL first before you delete it -- sometimes the table is in a Case clause). If no columns are selected and the filter is a foreign key, can you instead just specify its column value in the where clause on the table it joins?
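To make that last point concrete, here's a sketch of the rewrite with made-up table names: a lookup table joined solely to filter on its foreign key can usually be dropped from the query entirely.

```sql
-- Before: OrderStatus is joined only to filter on its key column
SELECT o.OrderID, o.OrderDate
FROM Orders o
INNER JOIN OrderStatus s ON o.StatusID = s.StatusID
WHERE s.StatusID = 3

-- After: no columns from OrderStatus are selected, so filter directly
SELECT o.OrderID, o.OrderDate
FROM Orders o
WHERE o.StatusID = 3
```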

Now examine the label grid toward the bottom of the designer. Are the filter criteria aligned properly in adjacent columns? Sometimes a developer will confuse the precedence of AND, OR, and parentheses (returning way more rows than they intended) and this will become apparent when looking at how the filters are set up. After all that it's time to now look at the query plan (fodder for a whole 'nother post).

Saturday, October 15, 2016

Advanced Query Inspection

Pretty much every DBA who gets a breather from their daily maintenance would like to spend some time tracking down those particularly gnarly SQL queries that are hosing their system. This post will give you a strategy for both how to identify them and then how to clean them up. First the Giant Disclaimer: it is not the DBA's job to write application level SQL. Even if you have the knowledge and skills to do so, once you cross the line from support to development you will find it difficult to allocate sufficient time to your actual DBA duties. Stick with being the operational DBA, and when you find queries that are poorly optimized (or just plain wrong) throw them over the wall for a developer to fix. As part of that process you may certainly suggest what needs to be changed and how to do so (in fact you rather have an obligation to provide this guidance) but don't check the stuff out of source control and make the changes yourself.

Once I'm done with the canned bad-query tools mentioned in this previous post, I'm ready to dig deeper for the bad boys by running this DMV query:

SELECT  TOP (20) execution_count AS [Number of Executions],
    total_worker_time / execution_count AS [Average CPU Time],
    total_elapsed_time / execution_count AS [Average Elapsed Time],
    (SELECT SUBSTRING([text], sys.dm_exec_query_stats.statement_start_offset / 2 + 1,
            (CASE WHEN statement_end_offset = -1
                  THEN LEN(CONVERT(nvarchar(MAX), [text])) * 2
                  ELSE statement_end_offset END
             - sys.dm_exec_query_stats.statement_start_offset) / 2) AS Expr1
     FROM sys.dm_exec_sql_text(sys.dm_exec_query_stats.sql_handle) AS dm_exec_sql_text_1) AS query_text,
    total_rows, total_logical_writes, total_logical_reads, total_physical_reads, qp.query_plan
FROM  sys.dm_exec_query_stats
CROSS APPLY sys.dm_exec_query_plan(sys.dm_exec_query_stats.plan_handle) qp
WHERE   (execution_count > 20) AND (last_execution_time > GETDATE() - 14)
ORDER BY [Average CPU Time] DESC

Note that this query has a couple of parameters you can play with to achieve different results. As presented above it shows the top 20 queries, by average CPU, that have run more than 20 times since the last SQL Server restart and at least once within the last two weeks. Sometimes it will become evident (from some of the other tools mentioned in an earlier post) that you are really looking for a query that has run more than a hundred thousand times, so adjust the selection accordingly. Another one of my favorite tricks: after the development team rolls out a large upgrade, wait a couple of days, reduce the date selection to show the last 3 days, change the execution-count filter to "< 100", and catch the newer instances of bad code. Anyway, you get the idea. This is a powerful tool for keeping a close eye on what most needs improving.

Wednesday, September 14, 2016

Recovery Full or Simple

Once you've got your databases identified and organized into a spreadsheet and accomplished the preliminary cleanup, the next step to getting your act together is to determine the recovery model on each database. Well, maybe some basic instance tuning, but we can leave that for another blog post.

You only get two choices for the recovery model, full or simple, but I still can't believe how much difficulty folks have when choosing between the two. One thing you'll notice right off the bat is software developers are your worst resource for intelligently making this choice. You need to ping them for critical information to measure certain volatility features, but after that the choice really should be up to you. You're the one that will be managing disk space, backups, and restores.

There are only two key questions you need to ask yourself to make a final determination on the recovery model for each database: 1) Would it kill me if I were asked to restore this database to a precise point in time and could only get it to the nearest day? 2) Am I willing to manage log-file backups and shrinks on this puppy?

If your answer to both questions is yes, then you need the full recovery model. Otherwise, use the simple model. But getting to the answer of these two questions is really the bulk of what I'd like to chat about.
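For reference, once you've answered those two questions the setting itself is a one-liner either way (the database name here is illustrative):

```sql
ALTER DATABASE SalesDB SET RECOVERY FULL
-- or
ALTER DATABASE SalesDB SET RECOVERY SIMPLE
```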

Despite all the confusing things you will read about the recovery model, the gist of the difference between full and simple is the ability to perform "point in time" recovery. On the simple model you can restore whatever backup you want to your database, then any differential you ran, and you'll get the state of the database naturally from when you made that last differential backup. On the full model, you can additionally restore all or portions of the log file backup(s) to reflect the transactions that happened after your last differential.

Therein lies the rub: on simple recovery your log files don't grow much, but on full they can get quite large. On databases set to full recovery you will need to periodically shrink the log files, and of course you need to add them to your maintenance plan to back them up as well.
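Those extra chores look roughly like this, assuming a database named SalesDB and a log file with the logical name SalesDB_log (check sys.database_files for yours):

```sql
-- the recurring log backup you must schedule under full recovery
BACKUP LOG SalesDB TO DISK = N'\\myBackupSrv\sqlBackups\SalesDB_log.trn'

-- the occasional shrink after a runaway transaction bloats the file
DBCC SHRINKFILE (SalesDB_log, 1024)  -- target size in MB
```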

This is where that spreadsheet of databases is helpful. If a production database scores high on both write coupling and software volatility then you'll likely need to bite the bullet and set it to the full recovery model. Very seldom will I set a development, QA, or staging database to this model though. A better way to manage high software volatility on a development database is to make sure your developers are using good quality source control. But if a production database is written to at random times by all sorts of batch and OLTP systems, or your developers are constantly rolling out production upgrades, then you'd best stick with full recovery.

Tuesday, August 23, 2016

Auto Trap

My employer is fairly liberal in allowing reasonably competent folks direct access to SSMS for querying our databases. This approach encourages folks to learn the structure of the data for themselves, and to answer most of their ad hoc questions on their own. The disadvantage is that occasionally a less experienced employee will execute a query or two that bogs the system down. As a DBA I need awareness of these clumsier executions without constantly dropping everything to review each and every medium-grade RedGate monitor alert.  The halfway elegant solution I came up with looks like this:
DECLARE @sess varchar(7)
DECLARE @emMsg varchar(270)
DECLARE @sqlRun varchar(200)

set @sess = NULL

select TOP(1) @sess = ql.session_id, @sqlRun = ql.query_text from
(select r.session_id,
   substring(qt.text, r.statement_start_offset/2 + 1,
   (case when r.statement_end_offset = -1
   then len(convert(nvarchar(max), qt.text)) * 2
   else r.statement_end_offset end - r.statement_start_offset)/2)
   as query_text, r.cpu_time, r.total_elapsed_time
   , r.reads, r.logical_reads
 from sys.dm_exec_requests as r
 cross apply sys.dm_exec_sql_text(sql_handle) as qt
 inner join sys.dm_exec_sessions as es on r.session_id = es.session_id
 where es.is_user_process = 1
 ) as ql
 where ql.query_text not like '%where%'
 and ql.query_text not like '%top%'
 and ql.logical_reads > 200000

 IF @sess is not null
 BEGIN
    set @emMsg = 'Session id ' + @sess + ' Running ' + @sqlRun

    EXEC msdb.dbo.sp_send_dbmail
       @recipients = '',   -- your address here
       @subject = 'Unbounded Query Alert',
       @body = @emMsg
 END
I created a SQL Server Agent job that runs this every ten minutes. It sends me an eMail when something is spinning up a lot of read activity, but filters the candidates down to just those queries missing a Where clause or a Top restriction.

It's not perfect enough that I might automatically kill SPIDs; sometimes it sends along an eMail from a long-running production job -- actually a good thing, as it alerts me to places I might have to trace or optimize later. It does tend to help me keep the SQL instance generally out of trouble though.

Thursday, August 4, 2016

Table Cleanup

For large databases actively under development, it's not unusual to accrue a fairly large collection of useless, abandoned tables. Here's a query you can run to find them... in this example my cutoff is a year.  The polite thing to do is notify the staff of your intention to delete the tables, and archive the large ones first if anyone asks. Script out the definitions of everything you plan to delete anyhow, just in case somebody suddenly misses them.
DECLARE @DAYSunTouched int
SET @DAYSunTouched = 365

;WITH UnUsedTables (TableName , TotalRowCount, CreatedDate , LastModifiedDate )
AS (
  SELECT AS TableName
     ,PS.row_count AS TotalRowCount
     ,DBTable.create_date AS CreatedDate
     ,DBTable.modify_date AS LastModifiedDate
  FROM sys.all_objects  DBTable
     JOIN sys.dm_db_partition_stats PS ON DBTable.object_id = PS.object_id
  WHERE DBTable.type ='U'
     AND DBTable.modify_date <= (SELECT sqlserver_start_time FROM sys.dm_os_sys_info)
     and DBTable.modify_date < (getdate() - @DAYSunTouched)
     AND NOT EXISTS (SELECT 1
              FROM sys.dm_db_index_usage_stats
              WHERE OBJECT_ID = DBTable.object_id ) )
SELECT DISTINCT TableName , TotalRowCount, CreatedDate , LastModifiedDate
FROM UnUsedTables
ORDER BY TotalRowCount desc
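For the archive-then-delete step, a minimal sketch (the schema and table names are made up; note SELECT INTO preserves the data but not indexes, constraints, or triggers, which is exactly why you script out the definitions first):

```sql
-- copy the rows off to an archive schema, then drop the original
SELECT * INTO Archive.OldLookupTable FROM dbo.OldLookupTable
DROP TABLE dbo.OldLookupTable
```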

Friday, July 29, 2016

Protecting Key Tables

In most production systems you've got a half to a couple dozen tables that contain configuration or system-wide critical information. Things like State abbreviation tables, printer routing tables, user rights tables, hierarchies, workflow parameters, flag mapping key-value pair descriptions, et cetera. Generally you want to lock these tables down pretty tightly so that you don't allow somebody to accidentally update or delete a row, without some kind of more general awareness of what is happening. The way to do this is with a Database Role. At first this concept may seem confusing so I'll try to tease out the particulars in a way that slowly reveals its sense.

You already are familiar with the Instance level Server Roles, things like sysadmin and public. When you click on the Security folder off of the instance level, and then the folder under that called Server Roles, you can see all of the Instance level roles.

You also have seen the Database level Roles, for example after you add a new user login and then set up their access to the databases. You are familiar with db_datareader, db_datawriter, db_ddladmin, db_owner, et cetera. If you want to see these roles from the top down (rather than from the Logins mapping level) then click on a database, expand the Security folder under that database, and then click on the Roles folder, and then double-click the Database Roles folder.

This is where you add a Role to protect the tables within a database. Right click and select New database Role.  Give it a name (I like to use the name KeyTablesReadOnly). In a moment we'll revisit the remainder of this setup, but first a side digression. The point I'm making here though is that data protection roles belong in the database level of Roles, not in the Instance level.

How do you identify what tables to protect? A good clue on a fairly mature production database is to sort the tables in the Object Explorer Details pane by the column titled Data Space Used. Those that are small in size (and non-zero) likely contain the key configuration and mapping information. If you're using something like "Toad" you can select and paste these table names into a spreadsheet (which comes in handy shortly).

You probably will need to confirm these table names with the Development Manager, so send that spreadsheet off for approval before proceeding. When you get confirmation, go back and double-click on that KeyTablesReadOnly database Role. The properties pane has two areas: "General" is for assigning the role to specific users, and "Securables" is for setting the protection level on objects within the database. On the tables you want to protect, check the Deny column next to Delete, Insert, and Update, and then go ahead and add all the users (though you may leave a couple of users with full rights if they are specifically responsible for maintaining these tables).

If somebody needs to do an update, then the fast way to adjust them is to go to the Instance security folder, click on Logins, then their User mapping, highlight the database affected, and then remove them from the KeyTablesReadOnly role.
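If you prefer scripts to clicking, the same role can be set up in T-SQL; the table and login names below are illustrative:

```sql
CREATE ROLE KeyTablesReadOnly

-- deny the destructive verbs on each key table
DENY DELETE, INSERT, UPDATE ON dbo.StateAbbreviations TO KeyTablesReadOnly
DENY DELETE, INSERT, UPDATE ON dbo.WorkflowParameters TO KeyTablesReadOnly

-- enroll (or later remove) a user
ALTER ROLE KeyTablesReadOnly ADD MEMBER [CORP\jdoe]
-- ALTER ROLE KeyTablesReadOnly DROP MEMBER [CORP\jdoe]
```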

Wednesday, July 6, 2016

Delta Blues

I don't know what your DBA life is like, but at my place at any given time only one gigantic database garners the majority of the development activity. I'm not directly involved in the development -- they don't fetch me into the design meetings nor copy me on design specs -- but I would still like to be aware of what changes are transpiring in the database (if anything to have some idea where to look when something suddenly goes awry). The quiet way to do this is to "delta" the database, comparing the current version to the database from say a week ago. To pull this off neatly (without having another copy of the gigantic database sitting around glomping up disk space) you need a couple of tricks and a good comparison tool.

First you should make a baseline "empty" copy of your production database -- a full copy without any data in the tables. From SSMS right click on your database, go to tasks, and then script database. When you get to the wizard page that specifies an output file location, after you type in a savefile name click on the Advanced button. Scroll down to where it has Create USE Statement and uncheck it. Also make sure the Script Triggers box is checked.

Now that you've got the whole thing scripted, create a new database with rather small file allocations (yes, you need to mimic the filegroup structure too), and while attached to your new database open up that SQL query you just saved. Double-check that you're attached to your new database and run your script. Boom, a new empty baseline database. Note, though, that sometimes the order in which SSMS scripts the objects violates dependencies between them: you may have to run the same script a second time so it actually creates everything.

A week of work goes by and now it's time to run a comparison. If you don't already have a commercial product then you will need to download a database comparison tool; several exist but I recommend Open dBdiff. It's especially nice since it comes with the source code to allow you to easily make small tweaks to change colors, button locations, and whatnot.

Once you run the comparison you will need to examine the change in each of the elements that differ. By clicking the checkbox in front of those that are significant, the tool builds the script to "true up" the versions. When you finish your review click on the Synchronized Script tab, and then the saveAs button to store the delta script to a file folder.

The next week attach to your previous-version database, run the script that was the delta from last week, and then open your comparison tool and delta your production database again. Now you have the changes for this week. If you do this every week, then after a while you have a whole history of changes. Working from the original fully scripted baseline database you can execute successive deltas to bring the empty database up to any particular point in time (well, within a week or so).

Tuesday, June 21, 2016

Memory Used

In an earlier post I mentioned some general guidelines for optimizing each instance of the SQL engine. Once you've got the memory set up it is quite interesting however to observe how the engine is actually using it. Run these two queries; they both show pretty much the same thing but you may find one more useful than the other.
SELECT (CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State',
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
COUNT (*) AS 'Page Count'
FROM sys.dm_os_buffer_descriptors
GROUP BY [database_id], [is_modified]
ORDER BY [Page Count], [is_modified];

SELECT count(*)AS cached_pages_count, (count(*) * 8)/1024 As Mbytes,  db_name(database_id)
FROM sys.dm_os_buffer_descriptors
GROUP BY db_name(database_id) ,database_id
ORDER BY cached_pages_count DESC;
The amount of memory used naturally varies depending upon which queries are running at any particular time, but by monitoring this when you are under "memory pressure" you can easily identify the culprit database. This can then guide you on whether it makes sense to move that database to a different SQL instance, or perhaps devote a deeper inspection to the queries running against it.

To that end, you can run this script:
SELECT        MG.session_id, t.[text], MG.requested_memory_kb,
    MG.granted_memory_kb, MG.used_memory_kb
FROM            sys.dm_exec_query_memory_grants MG
CROSS APPLY sys.dm_exec_sql_text(MG.sql_handle) t

This shows you the current memory grants for any running queries. If you need greater detail down at the table and index level, you can run this script:
select count(*) as cached_pages_count, as objectname, as indexname, obj.index_id as indexid
from sys.dm_os_buffer_descriptors as bd
    inner join (
SELECT     p.object_id AS objectid, object_name(p.object_id) AS name, p.index_id, au.allocation_unit_id
FROM       sys.allocation_units au INNER JOIN
sys.partitions p ON au.container_id = p.hobt_id AND (au.type = 1 OR au.type = 3)
        union all
SELECT     p.object_id AS objectid, object_name(p.object_id) AS name, p.index_id, au.allocation_unit_id
FROM       sys.allocation_units au INNER JOIN
sys.partitions p ON au.container_id = p.partition_id AND au.type = 2
    ) as obj
on bd.allocation_unit_id = obj.allocation_unit_id
left outer join sys.indexes ind on  obj.objectid = ind.object_id and  obj.index_id = ind.index_id
where bd.database_id = db_id() and bd.page_type in ('DATA_PAGE', 'INDEX_PAGE')
group by,, obj.index_id
order by cached_pages_count desc

Monday, June 6, 2016

Backup Restore

More than anything, a DBA provides a company a solid underpinning confidence that you are "protecting" the databases. Between periodic maintenance chores and the scheduled backups, you are supposed to make sure that come hell or high water the company's data will always be safe, reliable, and recoverable. In fact you can pretty safely assume that should you under any circumstances "lose" a database, you will also lose your job.

Setting up and then keeping track of the backups is a large part of this chore. Once you've determined whether to set each dB on Full or Simple recovery, it's time to create a maintenance plan on each instance. For some peculiar reason known only to Microsoft, I've only been able to consistently do this RDP'd into the server with SSMS fired up connected to (local). I like to keep my backup jobs separate from all the other maintenance, so when I run the maintenance Wizard I select just the three options of Log Backups, Database Backups, and Backup Cleanup. That's not the order in which the checkboxes appear, but when you arrange them, sequence them in that order.

Needless to say, you don't place your backup files on the same device where you host your log or mdf files. One lesser-known useful fact, however: although you can't select it by clicking through treeviews, you can manually type a path to a UNC share (\\myBackupSrv\sqlBackups, or even \\\myStuff) ahead of your backup file name. You can even address remote shares in the cloud this way.
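As a sketch, a backup aimed at such a share looks like this (the database and share names are illustrative):

```sql
BACKUP DATABASE SalesDB
TO DISK = N'\\myBackupSrv\sqlBackups\SalesDB_full.bak'
WITH COMPRESSION, STATS = 10
```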

Once your plans are in place, you need to verify they are working properly. Things get more complicated once you backup to a cloud vendor: not only do you have your cloud backup routines but for safety you should keep a local backup copy occasionally as well. Also if you are hosting gigantic databases supporting OLTP then chances are it makes more sense to run daily differential backups instead of full backups (with maybe a weekly or monthly full depending on how much the dB grows and changes). After a while therefore you can end up with a fairly complicated blended backup strategy.

If you're like me then, you'll want to open up your SSMS every morning and examine the Explorer Details window. Add the column "Last Backup" and sort it, and right away you know what needs some attention.  Also you will find this query helpful:

USE msdb
SELECT        b.backup_set_id,, b.user_name, b.database_creation_date, b.backup_start_date, b.backup_finish_date, b.type, b.backup_size, b.database_name,
                         b.server_name, b.machine_name, b.recovery_model, b.is_snapshot, b.is_readonly,
                         b.is_single_user, b.is_damaged, f.physical_drive, f.physical_name, f.state,
                         f.state_desc, f.is_present, b.is_copy_only, m.physical_device_name
FROM            backupset AS b INNER JOIN
                         backupfile AS f ON f.backup_set_id = b.backup_set_id INNER JOIN
                         backupmediafamily AS m ON b.media_set_id = m.media_set_id
WHERE        (f.physical_name LIKE '%.mdf%') AND (b.backup_start_date > GETDATE() - 14)

Often I'll add into the where-clause a specific b.database_name if I'm looking specifically for the backups for just one database. Note the results of this query show not only the dB backups but also the log, differential, and VM snapshots.  It's useful to recognize that when you do a Restore through the user-interface, these are the tables the system references to reach the correct restore file suggestions.

In some environments, as when you have another team moving reporting databases around, it's helpful to see the inverse condition (when were the restores performed?):

USE msdb
SELECT  DBRestored = destination_database_name ,
        RestoreDate = restore_date ,
        SourceDB = b.database_name ,
        SourceFile = physical_name ,
        BackupDate = backup_start_date
FROM    RestoreHistory h
        INNER JOIN BackupSet b ON h.backup_set_id = b.backup_set_id
        INNER JOIN BackupFile f ON f.backup_set_id = b.backup_set_id
where physical_name like '%.mdf%'     
ORDER BY RestoreDate desc

What's great about this query is it shows the name of what backupset was used to restore each database and the date that backup got created.

Naturally (or maybe I should phrase that as "unfortunately") if you yourself are asked to do a restore, a good percentage of time it will be under duress with the boss breathing down your neck. Hence I always keep this template by my side to pull up in times of emergency:

-- 1) restore one database over another
RESTORE DATABASE ThisBizBI
FROM  DISK = N'\\bkupStore\sql backups\DEV-ThisBizDB\ThisBizBI\ThisBizBI_backup_2014_08_21_040010_8176618.bak'
WITH  REPLACE, STATS = 10


/* ------------------------ */


-- 2) restore over a live production dB (kicking off the current users),
--    leaving it in standby
ALTER DATABASE ThisBizBIDW SET SINGLE_USER WITH ROLLBACK IMMEDIATE
RESTORE DATABASE ThisBizBIDW
FROM  DISK = N'\\bkupStore\sql backups\OLTP2SRV\ThisBizBIDW\ThisBizBIDW_backup_2014_09_08_040001_0041150.bak'
WITH  STANDBY = N'ThisBizBIDW_undo.dat',  -- undo file name is illustrative
STATS = 10


/* ------------------------ */

-- 3) a logfile restore
RESTORE LOG ThisBizOSDev
FROM  DISK = N'\\bkupStore\sql backups\OLTP2SRV\ThisBizOSDev\ThisBizOSDev_backup_2015_05_10_010007_0679267.trn'
WITH  NORECOVERY, STATS = 10

This template shows three different variations of a restore; I seldom use any one of them straight vanilla but may modify or copy a line or two from one to the other to achieve the results that I desire. The first is a restore of one database over another, the second example is a restore of a live production dB (kicking off the current users in the process) and leaving the database in "standby" mode, and the third example is a logfile restore. If you haven't yet, do read up on restoring a database with NoRecovery or Standby -- these are the options to use if you need to restore more than a single backup file to achieve currency.
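The general shape of a multi-file restore, with made-up file names: every file except the last is restored WITH NORECOVERY, and the final log restore can stop at an exact moment.

```sql
RESTORE DATABASE SalesDB
  FROM DISK = N'\\bkupStore\SalesDB_full.bak' WITH NORECOVERY, REPLACE
RESTORE DATABASE SalesDB
  FROM DISK = N'\\bkupStore\SalesDB_diff.bak' WITH NORECOVERY
RESTORE LOG SalesDB
  FROM DISK = N'\\bkupStore\SalesDB_log1.trn'
  WITH STOPAT = '2016-06-06 14:30:00', RECOVERY
```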

Monday, May 23, 2016

DMVs for Current Activity

Aside from the three external tools I mentioned in this previous post, I also keep a pocketful of queries and stored procs that I run to periodically monitor performance. I'm afraid I can't take credit for any of these: some I've picked up from reading various SQL books or online forums, others were written by quite skilled folks and you can download them from their website. Often times I modify them slightly to give me a better understanding of what I'm actually looking for or to hone them to show a more concentrated set of issues.

I have a separate group of DMV queries I use more for database tuning and analysis, but those in this post are the ones I'm constantly running to address alerted and critical real-time issues.

For a really fast (and low overhead) look at what is happening right now on an instance I use this variant of who2:

CREATE TABLE #sp_who2 (SPID INT,Status VARCHAR(255),
      Login  VARCHAR(255),HostName  VARCHAR(255),
      BlkBy  VARCHAR(255),DBName  VARCHAR(255),
      Command VARCHAR(255),CPUTime INT,
      DiskIO INT,LastBatch VARCHAR(255),
      ProgramName VARCHAR(255),SPID2 INT, REQUESTID INT)
INSERT INTO #sp_who2 EXEC sp_who2

SELECT     distinct SPID, Status, Login, HostName, BlkBy, DBName, Command, LastBatch, ProgramName
FROM        #sp_who2
where Command not in ('AWAITING COMMAND', 'TASK MANAGER', 'BRKR TASK')
and DBName is not null
and SPID <> @@SPID

drop table #sp_who2
The best thing about the results of this query is a quick glance at the BlkBy column tells me right off the bat who is blocking whom. If I get more concerned and don't mind the overhead of incurring a little extra CPU I run these two:

exec sp_whoisactive

select r.session_id,
   substring(qt.text, r.statement_start_offset/2 + 1,
   (case when r.statement_end_offset = -1
   then len(convert(nvarchar(max), qt.text)) * 2
   else r.statement_end_offset end - r.statement_start_offset)/2)
   as query_text, qt.dbid, qt.objectid, r.cpu_time, r.total_elapsed_time
   , r.reads, r.writes, r.logical_reads, r.scheduler_id
 from sys.dm_exec_requests as r
 cross apply sys.dm_exec_sql_text(sql_handle) as qt
 inner join sys.dm_exec_sessions as es on r.session_id = es.session_id
 where es.is_user_process = 1
 order by r.cpu_time desc
The first of this set, WhoIsActive, is an invaluable freeware stored proc by Adam Machanic: a quick click on the XML query column pops up a new window showing you the precise SQL that is executing. By all means, if you find this utility useful I'd recommend making a donation to this guy on his website.

If my various external monitors seem to suggest IO bottlenecks then I like to run this query:

SELECT        DB_NAME(dbid) AS ConnectedToDB, hostname, program_name,
    loginame, cpu, physical_io, memusage, login_time, last_batch, status
FROM            sysprocesses
WHERE        (status <> 'background') AND (program_name NOT LIKE 'SQLAgent%') AND (cpu > 0)
ORDER BY physical_io DESC
This shows the currently connected IO hogs. You can then see if they have any long running queries with the WhoIsActive.  If instead they are running millions of little queries with an ETL tool or some other application, then you will need to run sp_sqltrace, another work of art by a serious SQL developer that you can download for free.

Finally if the CPU is getting pegged chances are either it's tons of recompiles, or a missing index.  This will show  the recompiles:

SELECT        sys.dm_exec_cached_plans.refcounts, sys.dm_exec_cached_plans.usecounts, sys.dm_exec_cached_plans.cacheobjtype, sys.dm_exec_cached_plans.objtype,
                          left(sys.dm_exec_sql_text.text, 50) as SQLpartial, sys.dm_exec_sql_text.text as SQLstmt
into #tempCompiles                         
FROM            sys.dm_exec_cached_plans CROSS APPLY sys.dm_exec_sql_text(sys.dm_exec_cached_plans.plan_handle)
WHERE sys.dm_exec_sql_text.text LIKE '%SELECT %'
AND sys.dm_exec_sql_text.text NOT LIKE '%backup%'
and usecounts = 1

SELECT        SUM(refcounts) AS howMany, SQLpartial
into #tempCompileTotals
FROM            #tempCompiles
GROUP BY SQLpartial

SELECT        SUM(CT.howMany) AS howManyCompiles, MIN(CD.SQLstmt) AS sqlExample
FROM            #tempCompiles AS CD INNER JOIN
                         #tempCompileTotals AS CT ON CD.SQLpartial = CT.SQLpartial
GROUP BY CD.SQLpartial
ORDER BY howManyCompiles DESC

DROP table #tempCompiles
DROP table #tempCompileTotals

If you have excessive recompiles then the best bet is to chat with the responsible developer and have them generate less dynamic SQL, perhaps by storing the feeding parameters in static tables. As a last resort you can set the Forced Parameterization option, but this has some offsetting drawbacks.
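If you do go the last-resort route, note the option is set per database rather than per instance; a sketch, with the database name invented for illustration:

ALTER DATABASE SalesDb SET PARAMETERIZATION FORCED

-- and to put it back to the default behavior:
ALTER DATABASE SalesDb SET PARAMETERIZATION SIMPLE

The main offsetting drawback is that every literal becomes a parameter, so queries that genuinely need different plans for different values can end up sharing one bad plan.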

Here's the query I like to use for missing indices:

SELECT mid.statement, ROUND(migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact / 100, 0) AS imPactor,
       migs.unique_compiles, migs.user_seeks, migs.user_scans, migs.last_user_seek, migs.last_user_scan,
       migs.avg_total_user_cost, migs.avg_user_impact, mig.index_group_handle, mig.index_handle,
       mid.object_id, mid.equality_columns, mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs INNER JOIN
     sys.dm_db_missing_index_groups AS mig ON migs.group_handle = mig.index_group_handle INNER JOIN
     sys.dm_db_missing_index_details AS mid ON mig.index_handle = mid.index_handle
WHERE mid.database_id = DB_ID()
AND migs.last_user_seek > GETDATE() - 14
ORDER BY imPactor DESC

You don't have to build all the missing indices, but after some use you'll be able to eyeball those that stand out above a critical threshold. Also note that if this suggests tons of "included" columns then the issue with your table is larger than just a missing index -- likely you need to split the table widthwise, or check whether most of the columns should be marked "sparse."
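Turning one of these suggestions into DDL is mechanical: equality columns first, then inequality columns, with the suggested included columns in the INCLUDE clause. A sketch, with the table and column names invented for illustration:

CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_OrderDate
ON dbo.Orders (CustomerId, OrderDate)
INCLUDE (OrderTotal)
WITH (ONLINE = ON)  -- Enterprise edition only; drop this clause otherwise

Do give the column order a sanity check against real queries first; the DMVs suggest, they don't design.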

Friday, May 6, 2016

IO Stalls and Latency

One of the more useful optimizations you can run on a database instance, if you are unfortunate enough to be on old-fashioned spinning media (as opposed to SSDs), is to split up files to reduce the number of stalls. These can be either database or log files, for either user or system databases, reads or writes. In an earlier post I gave a couple of queries you could run to get a quick gross overview, but now let's get further into the weeds and fine-tune some of those slower files in detail. Run these queries:
select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_read / vfs.io_stall_read_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_reads, vfs.num_of_bytes_read, vfs.io_stall_read_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_read_ms > 0
order by 3

select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_written / vfs.io_stall_write_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_writes, vfs.num_of_bytes_written, vfs.io_stall_write_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_write_ms > 0
order by 3
This returns two result sets: the first shows problem stalls on reads, the second problem stalls on writes. Do pay attention to the columns showing the number of bytes read and written; it's likely a waste of time to optimize the smaller files. Now that you know generally who your problem children are, you have a couple of avenues of recourse. If the stalls are happening in a log file, move it to a faster device (adding more log files rarely helps, since SQL Server writes the log sequentially). If the latency is in a table (a file of type "rows") then you need to dig a little deeper; run this query while connected to the database that concerns you.
SELECT TableName = OBJECT_NAME(s.object_id),
       Reads = SUM(user_seeks + user_scans + user_lookups), Writes = SUM(user_updates)
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i ON s.object_id = i.object_id
AND i.index_id = s.index_id
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
AND s.database_id = DB_ID()
GROUP BY OBJECT_NAME(s.object_id)
ORDER BY Reads DESC
Of course change the Order By clause depending on whether you are researching stalls on reads or on writes. Likely the top couple of tables in this result set should be spun off to their own filegroup with multiple files (or possibly their clustered index needs reorganizing -- check their fragmentation).
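Spinning a hot table off to its own filegroup can be done by rebuilding its clustered index onto the new filegroup; a sketch, with the database, file paths, and table names all invented for illustration:

ALTER DATABASE SalesDb ADD FILEGROUP HotTables
ALTER DATABASE SalesDb ADD FILE
    (NAME = HotTables1, FILENAME = 'E:\Data\SalesDb_Hot1.ndf', SIZE = 10GB),
    (NAME = HotTables2, FILENAME = 'F:\Data\SalesDb_Hot2.ndf', SIZE = 10GB)
TO FILEGROUP HotTables

-- Rebuild the existing clustered index in place, landing it on the new filegroup:
CREATE UNIQUE CLUSTERED INDEX PK_OrderEvents ON dbo.OrderEvents (OrderEventId)
WITH (DROP_EXISTING = ON) ON HotTables

Putting the two files on separate physical devices is what actually buys you the reduced stalls.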

Saturday, April 23, 2016

Cascade Blockage

Sometimes the answer to why a SQL process gets blocked relates to overall network dataflow. In these cases you may not be able to tease out the issue until some event stresses your server past a critical threshold.  A real world example will help to elucidate this.

We fulfill our eMail touchpoints through a service hosted by Rackspace. They communicate clicks, bounces, and opens back to a webservice on one of our servers; that webservice connects to a couple of databases on internal SQL servers. This is a fairly typical modern client-vendor arrangement. The webservice constantly runs around 90 transactions per second through the SQL database, a perfectly reasonable load.

One day a developer opened up a query that locked the eMail transaction table for a couple of minutes. The webservice began throwing timeout messages to our internal notifications queue. Even after the developer closed their session the timeout notifications continued from the webservice for a couple more hours.

When I pulled up sp_who2 there were hundreds of connections to SQL from the webservice, where normally during the day I might glimpse one or two. After a couple of hours the problem abated. When I awoke the next morning my subconscious had pieced together the problem, along with this nice metaphor.

Suppose you're driving along the 101 freeway in light traffic and way up ahead you see some taillights glisten. In anticipation you take your foot off the gas and let the car slow on its own. Nearby traffic does the same, so there is a brief slowdown that dissipates quickly as everyone resumes normal speed. Next scenario: as you're driving along the 101 in light traffic the fog starts to thicken, so you slow down. Suddenly, up ahead, taillights! You switch to the brake, slow down, and then once you're past the distraction resume your slightly reduced speed. This time a modest slowdown persists at the original location.

Now take this scenario: on the 101 in very heavy traffic, the fog thickening. The car in front of you slows, you brake, the car behind you doesn't have much room so they brake hard, and the car behind that hits them. Accident. Now that traffic has stopped a chain reaction of accidents keeps trickling back. One brief constriction, and now the blockage will be there for many hours.

When I examined the queries that were executing from the webservice I noticed they did a Select top(1) * of the row in the event table matching the key for the transaction. Unfortunately one of the fields in that row was an XML copy of the actual eMail we sent, usually around 10 MB in size (you know, graphics and all). At the typical rate of 90 transactions per second this was just about the right size to squeeze through the 10 gigabit switch between the two servers.

As soon as somebody fired off another large query on the same table though, the switch temporarily got overwhelmed, and transactions that would clear in real time suddenly were starting to queue up. Once multiple transactions were running against the same table the extra load began slowing their execution. A constriction in heavy traffic.

The solution was to change the webservice to use a view that excluded the XML column. But frankly, if the problem had never happened, I would never have specifically looked for it, as it was just below the radar.
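The shape of that fix, sketched with hypothetical names (the real table and columns differed), is just a view listing every column except the XML payload:

CREATE VIEW dbo.vwEmailEventSlim
AS
SELECT EventId, EventType, EventDate, RecipientId  -- everything except the 10 MB XML body
FROM dbo.EmailEvent

The webservice's Select top(1) * then points at the view instead of the base table, and the switch never sees those 10 MB payloads again.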

Saturday, April 9, 2016

Deleting Duplicate Rows

Frequent deletions of rows in a large table can be a fairly large IO load on SQL Server. Not only do the rows get deleted, but indices get updated, foreign key constraints (might) get checked, and clustered indices get fragmented. This can be an especially tricky job in an ETL that regularly bulk loads data from an outside vendor. I've seen this technique used before:
with v as
(select *, row_number() over (partition by DataField1,
DataField2, DataField3 order by DataField1) AS dupRows
from myInputTable)  -- note: NOLOCK is not allowed on the target of a DELETE
delete from v
where dupRows > 1
... but this is quite nasty to execute on a large table. A much better strategy is to add two additional columns to your bulk-loaded table: a DuplicateFlag (bit) and a rowCheckSum (int). Also make sure your table has a clustered index on an identity column. Now populate rowCheckSum with the SQL CHECKSUM function over any fields that are logically significant (but not fields like the identity column or modified-date columns). Create an index on the table covering the identity, DuplicateFlag, rowCheckSum, and the natural key.
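The setup can be sketched like this, reusing the DataField1-3 and idColumn names from the snippets in this post as stand-ins for your real columns (a persisted computed column is one option for keeping rowCheckSum maintained for free, rather than populating it by hand):

ALTER TABLE myInputTable ADD DuplicateFlag bit NULL
ALTER TABLE myInputTable ADD rowCheckSum AS CHECKSUM(DataField1, DataField2, DataField3) PERSISTED

CREATE NONCLUSTERED INDEX IX_myInputTable_Dedupe
ON myInputTable (rowCheckSum, NaturalKey)
INCLUDE (idColumn, DuplicateFlag)

Keep in mind CHECKSUM can collide, which is why the dedupe update also compares the natural key rather than trusting the checksum alone.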

Now your "deletion" actually just sets the DuplicateFlag, like so:
WITH a AS (SELECT MIN(idColumn) AS MinClusterKey, rowCheckSum,
     NaturalKey, COUNT(*) AS howMany
     FROM myInputTable AS v2
     WHERE DuplicateFlag IS NULL
     GROUP BY rowCheckSum, NaturalKey
     HAVING COUNT(*) > 1)
UPDATE v
SET DuplicateFlag = 1
FROM myInputTable AS v INNER JOIN a
ON v.idColumn <> a.MinClusterKey AND v.NaturalKey = a.NaturalKey
AND v.rowCheckSum = a.rowCheckSum
This is so much faster. Note that this example saves the earliest instance of each duplicate; if you'd rather keep the latest, change the MIN to a MAX. You can then occasionally run a simple cleanup by deleting all rows with DuplicateFlag set to 1 (if we're talking tons of rows it's faster to drop FK constraints and indices before the delete, then rebuild them after). Of course any queries that use this table downstream should select only rows with a NULL DuplicateFlag.
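For more routine volumes, where dropping the constraints isn't worth it, deleting the flagged rows in batches keeps the transaction log and lock footprint manageable; a sketch:

WHILE 1 = 1
BEGIN
    DELETE TOP (50000) FROM myInputTable
    WHERE DuplicateFlag = 1

    IF @@ROWCOUNT = 0 BREAK
END

The 50000 batch size is arbitrary; tune it to what your log and your users will tolerate.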

Friday, March 11, 2016

Tempdb Allocation

One of the more important tasks to take care of when you settle into maintaining a new instance is the proper allocation of files to tempdb. How large to make them, and how many to create, really depends upon how your developers and their applications use tempdb. I've seen various posts that presume to have some sort of algorithm to help you determine this, but I've become somewhat wary of relying on an outside wild arse guess as an authoritative one-shot setup.

You really face a handful of issues with tempdb. The first is that it's shared by all the processes touching all of the databases on a particular instance. This means that even though you can spread your databases across different files (even across different physical devices) to avoid contention, they will always, unavoidably, contend for tempdb. The second issue is that tempdb only shrinks on a SQL service restart (or naturally on a server reboot). The third issue is that various software packages (and developers) have their own ways of using tempdb: OLTP is normally not too hectic on the thing, but some ETL processes and OLAP aggregation processes can be nasty. Then you always get the developer who joins ten tables with three of them doing full clustered index scans, and tempdb balloons to hold all the temporary data.

For say a terabyte of total databases on an instance, I like to initially set up three or four tempdb data files of ten to forty gigabytes each. After the instance has been running for a few weeks, see if you have any IO stalls on tempdb:
SELECT DatabaseName, DatabaseFile_Type
,CAST(1.0 * IO_Read / (1024 * 1024) AS DECIMAL(12, 2)) AS IO_Read_MB
,CAST(1.0 * IO_Write / (1024 * 1024) AS DECIMAL(12, 2)) AS IO_Write_MB
,CAST(1.0 * Total_IO / (1024 * 1024) AS DECIMAL(12, 2)) AS IO_TOTAL_MB
,CAST(IO_STALL / 1000. AS DECIMAL(12, 2)) AS IO_STALL_Seconds
FROM (
 SELECT DB_NAME(VFS.database_id) AS DatabaseName
 ,CASE WHEN smf.type = 1 THEN 'LOG_FILE' ELSE 'DATA_FILE' END AS DatabaseFile_Type
 ,SUM(VFS.num_of_bytes_written) AS IO_Write
 ,SUM(VFS.num_of_bytes_read) AS IO_Read
 ,SUM(VFS.num_of_bytes_read + VFS.num_of_bytes_written) AS Total_IO
 ,SUM(VFS.io_stall) AS IO_STALL
 FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS VFS
 JOIN sys.master_files AS smf
  ON VFS.database_id = smf.database_id AND VFS.file_id = smf.file_id
 GROUP BY DB_NAME(VFS.database_id), smf.type
) AS agg
ORDER BY IO_STALL_Seconds DESC
If tempdb shows higher stalls than your other databases, you need to increase the number of files preallocated to it. Also check how large it has grown (easiest to see by opening Windows Explorer and looking at the file sizes). If the files are a whole lot larger than what you initially allocated, then adjust your initial allocations up to something like the midpoint of the difference.
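Adding or resizing the files is straightforward; a sketch, with the logical file names, path, and sizes as placeholders:

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 20GB)
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 20GB, FILEGROWTH = 1GB)

Keep the data files the same size so the proportional-fill algorithm spreads the load evenly across them; growth takes effect immediately, while shrinking, as noted above, effectively waits for a service restart.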

Every few months you should go back and review your tempdb sizes and stalls -- developers will have installed new software or placed further strain on tempdb with additional user databases. Also use a monitoring tool like RedGate's to determine which queries are writing heavily to tempdb: you may be surprised to find they were constructed incorrectly.

Tuesday, February 16, 2016

Who's Got What

Every once in a while it helps to stand back and review everyone's security privileges down at the database level. I've found this query to be exceedingly helpful in ferreting out the roles that have been assigned:

;WITH RT AS
(SELECT name AS RoleType, principal_id
FROM sys.database_principals
WHERE type_desc = 'DATABASE_ROLE')
SELECT RT.RoleType, P.name AS MemberName
FROM RT INNER JOIN sys.database_role_members AS RM
ON RT.principal_id = RM.role_principal_id INNER JOIN
sys.database_principals AS P
ON P.principal_id = RM.member_principal_id
ORDER BY RT.RoleType, MemberName

You need to run this either with a USE statement or while connected to the specific database (run it for each database you are reviewing). Sometimes it's necessary to perform a deep dive on a specific individual, in which case you'll want to run this little puppy:

execute as user = 'OURDOMAIN\chkusername'
select * from fn_my_permissions(null, 'DATABASE') -- Leave as the literal
order by subentity_name, permission_name
revert  -- drop back to your own security context
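The same function can zoom in on a single securable; a sketch, with the table name invented for illustration:

execute as user = 'OURDOMAIN\chkusername'
select * from fn_my_permissions('dbo.Payroll', 'OBJECT')
order by subentity_name, permission_name
revert

Handy for confirming exactly what a user can do to one sensitive table before you field the inevitable audit question.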

Monday, January 18, 2016

All The Columns

It doesn't happen to me frequently, but once in a while a tool returns a bunch of SQL that I need to investigate, yet I can't tell which database it's running against. When your tool doesn't give you a clue, you can use this handy SQL statement to return a list of all of the columns, in all of the databases, on your entire instance:

DECLARE @SQL varchar(max)

-- Build one SELECT per database (the column list here is my reconstruction;
-- this assumes every database on the instance is online and accessible):
SELECT @SQL = ISNULL(@SQL, '') + 'union
select ''' + d.name + ''' as DbName, sh.name as SchemaName, o.name as TableName, c.name as ColumnName
from [' + d.name + '].sys.columns c
    inner join [' + d.name + '].sys.objects o on c.object_id=o.object_id
    INNER JOIN [' + d.name + '].sys.schemas sh on o.schema_id=sh.schema_id
'
FROM sys.databases d
SELECT @SQL=RIGHT(@SQL,LEN(@SQL)-5)+' order by 1,3'  -- strip the leading "union"
EXEC (@SQL)

Pretty handy, eh?