with v as
(select *, row_number() over (partition by DataField1,
DataField2, DataField3 order by DataField1) AS dupRows
from myInputTable (nolock))
delete from v
where dupRows > 1

But this is quite nasty to execute on a large table. A much better strategy is to add two extra columns to your bulk-loaded table: a DuplicateFlag (bit) and a rowCheckSum (int). Also make sure the table has an Identity column with a clustered index on it. Populate rowCheckSum with the SQL CHECKSUM function over the fields that are logically significant (but not fields like the Identity column or modified-date columns), and create an index on the table that covers the Identity column, DuplicateFlag, rowCheckSum, and the natural key.
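Just to make that setup concrete, something along these lines should do it. This is only a sketch: the table and column names (myInputTable, idColumn, NaturalKey, DataField1 through DataField3) are placeholders carried over from the queries in this answer, the index name is made up, and I'm assuming DataField1-3 are your logically significant fields and idColumn is the Identity column with the clustered index.

-- Add the two helper columns.
ALTER TABLE myInputTable ADD DuplicateFlag bit NULL, rowCheckSum int NULL;

-- Populate the checksum from the logically significant fields only
-- (skip the Identity column and modified-date style columns).
UPDATE myInputTable
SET rowCheckSum = CHECKSUM(DataField1, DataField2, DataField3);

-- Index covering the Identity column, DuplicateFlag, rowCheckSum, and the natural key,
-- so the duplicate-flagging query below doesn't have to touch the base rows.
CREATE NONCLUSTERED INDEX IX_myInputTable_Dedup
ON myInputTable (rowCheckSum, NaturalKey)
INCLUDE (idColumn, DuplicateFlag);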
Now your "deletion" actually just sets the DuplicateFlag, like this:
WITH a AS (SELECT MIN(idColumn) AS MinClusterKey, rowCheckSum,
NaturalKey, COUNT(*) AS howMany
FROM myInputTable AS v2
WHERE (DuplicateFlag IS NULL)
GROUP BY rowCheckSum, NaturalKey
HAVING (COUNT(*) > 1))
UPDATE v
SET v.DuplicateFlag = 1
FROM myInputTable AS v INNER JOIN a
ON v.idColumn <> a.MinClusterKey AND v.NaturalKey = a.NaturalKey
AND v.rowCheckSum = a.rowCheckSum

This is so much faster. Note that this example keeps the earliest instance of each duplicate; if you'd rather keep the latest, change the MIN to a MAX. You can then occasionally run a simple cleanup by deleting all rows with DuplicateFlag set to 1 (if we're talking tons of rows, it's faster to drop the FK constraints and indexes before the delete, then rebuild them afterwards). Of course, any queries that use this table downstream should select only the rows with a NULL DuplicateFlag.
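For what it's worth, the occasional cleanup and the downstream filter are nothing fancy, roughly like this (same placeholder names as above):

-- Periodic physical cleanup of flagged rows.
-- On very large tables, drop FK constraints and nonclustered indexes first,
-- run the delete, then rebuild them.
DELETE FROM myInputTable
WHERE DuplicateFlag = 1;

-- Downstream queries should only see the surviving rows.
SELECT idColumn, NaturalKey, DataField1, DataField2, DataField3
FROM myInputTable
WHERE DuplicateFlag IS NULL;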