Looking for suspicious data

When you receive data in large quantities on a regular basis, it is easy to miss problems. These problems can occur for numerous reasons. Here are some examples…

  • Data may be imported twice
  • Decimal points may be in the wrong place
  • The data may be corrupted by a change in the way the customer sends it to you

Customer data also changes over time. Generally this change is slow, with peaks and troughs arising for seasonal and other reasons; occasionally, however, large disparities appear due to promotions, closures for holidays or refurbishment, and so on.

Fixing problem data is not hard once you know where it is: the hard part is finding it in the first place.

Example:

Imagine you have data coming in each month from your many clients. The data is stored by customer number and product number with a sales quantity and value. For simplicity of demonstration, let's look at a single customer and product first.

You would not be surprised to see a pattern like this:

Month       Sales
January     30
February    40
March       50
April       50
May         50
June        50
July        45
August      40
September   50
October     50
November    60
December    70

Here we can see that there is a fair spread of data but nothing that can’t be explained easily. We often expect January to be a poor month and December to be a bumper month with a bit of a lull in the summer time.

If we were to analyse the year we would see a total of 585 sales averaging 48.75 sales per month, which all looks good.

So what would happen if we saw 150 sales in April? The yearly analysis would change to 685 total sales and an average monthly sales figure of 57.08333333 but would we notice?

In a table of thousands or millions of values, this would probably be lost, and we might therefore just accept the incorrect figures. (Remember we are only looking at one customer and one product here, so it is easy to see the anomaly.)

Month       Sales
January     30
February    40
March       50
April       150
May         50
June        50
July        45
August      40
September   50
October     50
November    60
December    70

Let’s look at it on a graph and see if the error is obvious…

[Graph: monthly sales for the year, with the unusually high April value clearly standing out. Graphically and in isolation, bad data sticks out like a sore thumb.]

Yes it is. So how do we make sure this data is brought to our attention? Obviously we cannot draw a graph like this for every customer and product. As I mentioned earlier, we probably have a table containing a large number of clients, possibly an even larger number of products, and a period that might be daily, weekly, monthly or something else.

The size of the data is

Number of Customers x Number of Products x Number of Periods
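
With the fictional test data generated later in this post (10 customers, 10 products and 12 monthly periods) that is 10 x 10 x 12 = 1,200 possible combinations; real data sets are usually far bigger.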

To find anomalies we need to know the average for each data break, and also the standard deviation (spread) around that average.

A standard what?

A standard deviation is a measure of how far the values are spread out around the average for a given data break. For data that roughly follows a normal (bell-shaped) distribution, three bands are usually quoted: 1 standard deviation covers 34.1% of the values either side of the average (68.2% of the sample or population in total); 2 SDs account for roughly another 13.6% either side (95.4% of the entire sample); and 3 SDs add roughly another 2.1% either side (about 99.7% of the total).

In other words, 1 standard deviation covers 68.2% of the data, which is 34.1% either side of the average on a perfect bell curve.
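
For reference, the population standard deviation used throughout this post (and computed by SQL Server's STDEVP function) is the square root of the average squared difference from the mean:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad\qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$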

Gobbledygook! How is this helping?

If we can calculate where the majority of the data is, then we can see where the anomalies are.

This is probably best illustrated with an example

Take the previous data set including the erroneous data in April.

The average for this data is 57.08333333 with a standard deviation of 29.61266152.

This means that 68.2% of all the data appears within 29.6 sales (1 Standard deviation) either side of the average value of 57 sales.

i.e. 68.2% of the values are between 27.47067181 and 86.69599485

If we look at 3 SDs then we are covering roughly 99.7% of all the values. This is between

(57.08333333 - (3 x 29.61266152)) AND (57.08333333 + (3 x 29.61266152))

Or in round figures, most of the valid data is between -32 and +146

Importantly, the 150 sales in April do not sit between -32 and +146!
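
If you want to check these figures yourself, here is a quick sketch using a table value constructor with the same twelve values (SQL Server 2008 and later):

SELECT
    AVG(1.0 * Quantity)                        AS Average,            -- 57.083333
    STDEVP(Quantity)                           AS StandardDeviation,  -- 29.6126...
    AVG(1.0 * Quantity) - 3 * STDEVP(Quantity) AS LowerBound3SD,      -- roughly -32
    AVG(1.0 * Quantity) + 3 * STDEVP(Quantity) AS UpperBound3SD       -- roughly +146
FROM (VALUES (30),(40),(50),(150),(50),(50),(45),(40),(50),(50),(60),(70)) AS t(Quantity)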

So now, if we aggregate the entire data set by data break (Customer and Product), calculating the average sales and the standard deviation for each, we can compare each individual sales value against the average for its data break (Customer and Product) to get a view of how far from the norm that sales figure is. The higher the number, the further the data is from the norm.
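
Expressed as a formula, the score calculated for each row (the script later in this post calls it SDsAwayFromAverage) is:

$$\text{SDsAwayFromAverage} = \left\lceil \frac{\left|\text{Quantity} - \text{Average}_{\text{Customer,Product}}\right|}{\text{StandardDeviation}_{\text{Customer,Product}}} \right\rceil$$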

By applying a filter to this formula over the entire result set, we can restrict the results to just the highest numbers of standard deviations, and we will see only the rows where the data is furthest from the norm. We should then analyse these customers, products and periods to see why they are unusual. It would be a good idea to add a flag to mark data that is valid but unusual for later processing; valid but unusual data is where the story is, and that is what people want to know. Where the data is not valid we simply need to fix it: there is no story there!
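
One simple way to record that flag, once the Sales table from the next section exists, is an extra bit column (a sketch; the column name IsUnusualButValid is my own, not part of the original data model):

ALTER TABLE dbo.Sales
    ADD IsUnusualButValid bit NOT NULL
        CONSTRAINT DF_Sales_IsUnusualButValid DEFAULT (0)  -- every row starts as "not unusual"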

Doing this in SQL Server

SQL Server 2008 R2 contains two standard deviation functions: STDEV is used when we are calculating a standard deviation from a sample of the result set, but if we are looking at the entire population then we use STDEVP.

As we are looking at the entire set of data we will use the STDEVP function here.
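
As a quick illustration of the difference, using the same twelve values as the worked example above (STDEV divides by N - 1, so it always comes out slightly larger than STDEVP):

SELECT
    STDEV(Quantity)  AS SampleStandardDeviation,     -- divides by N - 1
    STDEVP(Quantity) AS PopulationStandardDeviation  -- divides by N
FROM (VALUES (30),(40),(50),(150),(50),(50),(45),(40),(50),(50),(60),(70)) AS t(Quantity)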

First make a Sales table


CREATE TABLE dbo.Sales(
    DateKey date NOT NULL,
    Customer int NOT NULL,
    Product int NOT NULL,
    Quantity int NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED
    (
        DateKey ASC,
        Customer ASC,
        Product ASC
    )
)

Now populate this with some fictional data

I used this script.

You'll get primary key violations, but just ignore them and run the script several times.


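-- Each statement inserts one row for a month in 2011 with a random customer (0-9),
-- product (0-9) and quantity (0-99); re-running the batch adds more random rows,
-- and any duplicate Date/Customer/Product combinations are rejected by the primary key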
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 January 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 February 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 March 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 April 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 May 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 June 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 July 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 August 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 September 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 October 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 November 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
INSERT INTO dbo.Sales(DateKey,Customer,Product,Quantity)
VALUES ('1 December 2011',convert(int,(RAND() * 10)),convert(int,(RAND() * 10)),convert(int,(RAND() * 100)))
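
If you would rather not re-run the batch by hand, SSMS will repeat it for you if you put a GO with a count after the last INSERT (a sketch, assuming you are running the script from SSMS):

-- Repeat the whole batch above 50 times; a primary key violation only fails
-- that individual INSERT, so the remaining rows still go in
GO 50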

Now you should have a table with loads of random test data in it.

Let’s have a quick look at that data


SELECT
    Product
    ,DateKey
    ,[0],[1],[2],[3],[4],[5],[6],[7],[8],[9]  -- the random customer numbers run from 0 to 9
FROM dbo.Sales
PIVOT
(
    SUM (Quantity)
    FOR Customer IN ([0],[1],[2],[3],[4],[5],[6],[7],[8],[9])
) AS pvt

Edit a random value (or two) in the table so that it is ridiculous.

E.g. Change the Quantity to 1000
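
For example, something along these lines (the customer, product and date here are only placeholders; pick a row that actually exists in your table):

UPDATE dbo.Sales
SET Quantity = 1000
WHERE Customer = 3            -- hypothetical values; use a row from your own data
  AND Product = 5
  AND DateKey = '1 April 2011'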

You know where this change was made but we are going to use the standard deviation script to find it.

Here’s the script.


;WITH cte AS
(
    -- Get the row data together with the average and standard deviation
    -- (1.0 * Quantity forces a decimal average instead of a truncated integer one)
    SELECT
        Customer
        ,Product
        ,DateKey
        ,Quantity
        ,AVG(1.0 * Quantity) OVER (PARTITION BY Customer, Product) AS Average
        ,STDEVP(Quantity) OVER (PARTITION BY Customer, Product) AS StandardDeviation
    FROM dbo.Sales
)
-- Now let's look at how wide the variance is in standard deviations
SELECT
    Customer
    ,Product
    ,DateKey
    ,Quantity
    ,StandardDeviation
    ,CASE
        WHEN cte.StandardDeviation = 0 THEN 0
        ELSE CEILING(ABS(Quantity - Average) / StandardDeviation)
     END AS SDsAwayFromAverage
FROM cte

Look for the highest [SDsAwayFromAverage] value. That’s your card! (magic trick?)
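
If you would rather apply the filter mentioned earlier than scan the output by eye, one option (a sketch along the same lines as the script above) is to keep only rows that are three or more standard deviations away from their Customer/Product average and sort the worst offenders to the top:

;WITH cte AS
(
    SELECT
        Customer
        ,Product
        ,DateKey
        ,Quantity
        ,AVG(1.0 * Quantity) OVER (PARTITION BY Customer, Product) AS Average
        ,STDEVP(Quantity) OVER (PARTITION BY Customer, Product) AS StandardDeviation
    FROM dbo.Sales
)
SELECT
    Customer
    ,Product
    ,DateKey
    ,Quantity
    ,CEILING(ABS(Quantity - Average) / StandardDeviation) AS SDsAwayFromAverage
FROM cte
WHERE StandardDeviation > 0                             -- avoid divide-by-zero for constant groups
  AND ABS(Quantity - Average) >= 3 * StandardDeviation  -- only rows 3+ SDs from their group average
ORDER BY ABS(Quantity - Average) / StandardDeviation DESC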

If you use this method to repeatedly look at the highest values for [SDsAwayFromAverage] and then analyse why that data is the way it is, you will end up with better data, and some interesting stories about the anomalies.

Hope this helps

Please feel free to comment.

Dave
