All Your SQL Server: Deleting Duplicate Rows

Frequent deletions to rows in a large table can be a fairly large IO load on SQL server. Not only do the rows get deleted, but indices get updated, foreign key constraints (might) get checked, and cluster indices get fragmented. This can especially be a tricky job in an ETL that regularly bulk loads data from an outside vendor. I've seen this technique used before:

with v as
(select *, row_number() over (partition by DataField1,
DataField2, DataField3 Order by DataField1) AS dupRows
from myInputTable(nolock))
delete from v
where dupRows > 1

... but this is quite nasty to execute on a large table. A much better strategy is to add two additional columns to your bulk loaded table: a DulplicateFlag (bit) and a rowCheckSum (int). Also make sure your table has an Identity column cluster index. Now populate the rowCheckSum with the SQL checksum function over any fields that are logically significant (but not fields like the Identity column or modified-date columns). Create an index on the table to include the Identity, DulplicateFlag, rowCheckSum, and the natural key.

Now your "deletion" actually just sets the DulplicateFlag, thusly:

WITH a AS (SELECT MIN(idColumn) AS MinClusterKey, rowCheckSum,
NaturalKey, COUNT(*) AS howMany
FROM myInputTable AS v2
WHERE (DulplicateFlag IS NULL)
GROUP BY rowCheckSum, NaturalKey
HAVING (COUNT(*) > 1))
Update myInputTable
set v.DulplicateFlag = 1
FROM myInputTable AS v INNER JOIN a
ON v.idColumn <> a.MinClusterKey AND v.NaturalKey = a.NaturalKey
AND v.rowCheckSum = a.rowCheckSum

This is so much faster. Note that this example is saving the earliest instance of the duplicate; if you'd rather save the latest change the MIN to a MAX. You can then occasionally run a simple cleanup by deleting all rows with the DulplicateFlag set to 1 (if we're talking tons of rows it's faster to drop FK constraints and indices before the delete, then rebuild them after). Of course any queries that use this table downstream should select those with a NULL DulplicateFlag.

Saturday, April 9, 2016

Deleting Duplicate Rows

No comments:

Post a Comment