Saturday, April 23, 2016

Cascade Blockage

Sometimes the answer to why a SQL process gets blocked relates to overall network dataflow. In these cases you may not be able to tease out the issue until some event stresses your server past a critical threshold. A real-world example will help to elucidate this.

We fulfill our eMail touchpoints through a service hosted by Rackspace. They communicate clicks, bounces, and opens back to a webservice on one of our servers; that webservice connects to a couple of databases on internal SQL servers. This is a fairly typical modern client-vendor arrangement. The webservice constantly runs around 90 transactions per second through the SQL database, a perfectly reasonable load.

One day a developer opened up a query that locked the eMail transaction table for a couple of minutes. The webservice began throwing timeout messages to our internal notifications queue. Even after the developer closed their session, the timeout notifications from the webservice continued for a couple more hours.

When I pulled up sp_who2 there were hundreds of connections to SQL from the webservice, where normally during the day I might glimpse one or two. After a couple of hours the problem abated. When I awoke the next morning my subconscious had pieced together the problem, along with this nice metaphor.

Suppose you're driving along on the 101 freeway with light traffic and way up ahead you see some taillights glisten. In anticipation you remove your foot from the gas and let the car slow down on its own. Nearby traffic does the same, and so there is a brief slowing that passes quickly as you resume your normal speed. Next scenario: as you're driving along the 101 in light traffic the fog starts to thicken, so you slow down. Suddenly, up ahead, taillights! You switch to the brake, slow down, and then once you're past the distraction resume your slightly slower speed. The slowing is brief for any one driver, but a pocket of restricted speed persists at the original location.

Now take this scenario: on the 101 in very heavy traffic, the fog thickening. The car in front of you slows, you brake, the car behind you doesn't have much room so they brake hard, and the car behind that hits them. Accident. Now that traffic has stopped, a chain reaction of accidents keeps trickling back. One brief constriction, and now the blockage will be there for many hours.

When I examined the queries that were executing from the webservice I noticed they did a SELECT TOP(1) * of the row in the event table matching the key for the transaction. Unfortunately one of the fields in that row was an XML copy of the actual eMail we sent, usually around 10 MB in size (you know, graphics and all). At the typical rate of 90 transactions per second this was just about the right size to squeeze through the 10 gigabit switch between the two servers (90 × 10 MB ≈ 900 MB/s, or a bit over 7 gigabits per second).
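For illustration, the pattern looked roughly like this (the table and column names here are hypothetical, since the real schema isn't shown); every event lookup dragged the whole row, XML payload included, across the wire:

-- Hypothetical sketch of the problematic pattern: SELECT TOP(1) *
-- pulls every column, including the ~10 MB XML copy of the eMail.
SELECT TOP(1) *
FROM dbo.EmailTransactionEvent
WHERE TransactionKey = @TransactionKey;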

As soon as somebody fired off another large query on the same table, though, the switch was temporarily overwhelmed, and transactions that would normally clear in real time started to queue up. Once multiple transactions were running against the same table, the extra load slowed their execution further. A constriction in heavy traffic.

The solution was to change the webservice to use a view that excluded the XML column. But frankly, if the problem had never happened, I would never have specifically looked for it, as it was just below the radar.
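A minimal sketch of that kind of view, again with hypothetical names: list only the columns the webservice actually needs and leave the heavy XML column out.

-- Hypothetical sketch: expose only the lightweight columns,
-- leaving the ~10 MB XML body column out of the view.
CREATE VIEW dbo.vw_EmailTransactionEvent_Slim
AS
SELECT TransactionKey, EventType, EventDate, RecipientAddress
FROM dbo.EmailTransactionEvent;

The webservice then runs its SELECT TOP(1) against the view instead of the base table, and only a few hundred bytes per transaction cross the switch.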

Saturday, April 9, 2016

Deleting Duplicate Rows

Frequent deletions of rows in a large table can put a fairly large IO load on SQL Server. Not only do the rows get deleted, but indices get updated, foreign key constraints (might) get checked, and clustered indices get fragmented. This can be an especially tricky job in an ETL that regularly bulk loads data from an outside vendor. I've seen this technique used before:
WITH v AS
    (SELECT *, ROW_NUMBER() OVER (PARTITION BY DataField1,
        DataField2, DataField3 ORDER BY DataField1) AS dupRows
     FROM myInputTable WITH (NOLOCK))
DELETE FROM v
WHERE dupRows > 1
... but this is quite nasty to execute on a large table. A much better strategy is to add two additional columns to your bulk-loaded table: a DuplicateFlag (bit) and a rowCheckSum (int). Also make sure your table has a clustered index on an Identity column. Now populate rowCheckSum with the SQL CHECKSUM function over any fields that are logically significant (but not fields like the Identity column or modified-date columns). Create an index on the table that includes the Identity column, DuplicateFlag, rowCheckSum, and the natural key; a sketch of that setup is below.
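Here is a minimal sketch of that setup, assuming hypothetical column names (DataField1-3, NaturalKey, idColumn) since the real schema isn't given:

-- Hypothetical sketch: add the helper columns to the bulk-loaded table.
ALTER TABLE myInputTable ADD DuplicateFlag bit NULL;
ALTER TABLE myInputTable ADD rowCheckSum int NULL;

-- Populate the checksum over the logically significant fields only
-- (skip the Identity column and any modified-date columns).
UPDATE myInputTable
SET rowCheckSum = CHECKSUM(DataField1, DataField2, DataField3)
WHERE rowCheckSum IS NULL;

-- Supporting index covering the natural key, checksum, flag, and Identity column.
CREATE NONCLUSTERED INDEX IX_myInputTable_Dedup
ON myInputTable (NaturalKey, rowCheckSum, DuplicateFlag)
INCLUDE (idColumn);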

Now your "deletion" actually just sets the DuplicateFlag, thusly:
WITH a AS (SELECT MIN(idColumn) AS MinClusterKey, rowCheckSum,
        NaturalKey, COUNT(*) AS howMany
    FROM myInputTable
    WHERE DuplicateFlag IS NULL
    GROUP BY rowCheckSum, NaturalKey
    HAVING COUNT(*) > 1)
UPDATE v
SET DuplicateFlag = 1
FROM myInputTable AS v INNER JOIN a
    ON v.idColumn <> a.MinClusterKey AND v.NaturalKey = a.NaturalKey
    AND v.rowCheckSum = a.rowCheckSum
This is so much faster. Note that this example saves the earliest instance of the duplicate; if you'd rather save the latest, change the MIN to a MAX. You can then occasionally run a simple cleanup by deleting all rows with the DuplicateFlag set to 1 (if we're talking tons of rows it's faster to drop the FK constraints and indices before the delete, then rebuild them after). Of course any queries that use this table downstream should select only rows with a NULL DuplicateFlag.
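A minimal sketch of that periodic cleanup, batched so the transaction log doesn't balloon (the batch size here is an assumption, not from the original process):

-- Hypothetical sketch: purge flagged duplicates in batches.
DECLARE @rowsDeleted int = 1;
WHILE @rowsDeleted > 0
BEGIN
    DELETE TOP (50000) FROM myInputTable
    WHERE DuplicateFlag = 1;
    SET @rowsDeleted = @@ROWCOUNT;
END;

-- Downstream queries read only the surviving rows.
SELECT DataField1, DataField2, DataField3
FROM myInputTable
WHERE DuplicateFlag IS NULL;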