Sample Size question

December 18th, 2015 at 3:51:15 PM permalink
AZDuffman
Member since: Oct 24, 2012
Threads: 135
Posts: 18136
I have a question on sample sizes. If I were testing for errors in a population of 64,000 subjects, what sample size would be needed to be confident of no errors?

Background: My employer has pulled customer info for regulatory reporting (banking is NOT de-regulated!) for the feds. Somehow they pulled the data into an Excel spreadsheet for the various needs. We are looking at random files, not truly randomized but close enough for government work, and checking the fields against what is in the other system. We are doing it all of next week. So far no errors on my end.

What would the sample size of examined files need to be to get n% confidence of, say, no errors or 99% correct, or whatever the percentage would be? I know the numbers go way up to get that marginal extra confidence.
The President is a fink.
December 18th, 2015 at 5:50:54 PM permalink
terapined
Member since: Aug 6, 2014
Threads: 73
Posts: 11786
Should post this over at WOV, more math heads over there
Sometimes we live no particular way but our own - Grateful Dead "Eyes of the World"
December 18th, 2015 at 7:42:12 PM permalink
Fleastiff
Member since: Oct 27, 2012
Threads: 62
Posts: 7831
There are many situations wherein an auditor would be highly suspicious of NO errors being detected... sort of like the real world ain't never that perfect.
December 20th, 2015 at 9:05:19 AM permalink
TheCesspit
Member since: Oct 24, 2012
Threads: 23
Posts: 1929
Quote: AZDuffman
I have a question on sample sizes. If I were testing for errors in a population of 64,000 subjects, what sample size would be needed to be confident of no errors?

Background: My employer has pulled customer info for regulatory reporting (banking is NOT de-regulated!) for the feds. Somehow they pulled the data into an Excel spreadsheet for the various needs. We are looking at random files, not truly randomized but close enough for government work, and checking the fields against what is in the other system. We are doing it all of next week. So far no errors on my end.

What would the sample size of examined files need to be to get n% confidence of, say, no errors or 99% correct, or whatever the percentage would be? I know the numbers go way up to get that marginal extra confidence.


Depends on the likelihood of an error.

You could model it by hypothesizing a per-file chance of finding an error. Make up a number. Then work out the chance that you pull X files and find no errors. Change your error chance and recalculate.

E.g., say you guess the chance of pulling a file and finding an error is 1%. You check 100 files. What's the chance of finding zero errors in 100 files?

-> A 99/100 chance of a clean file each time, raised to the power of 100: (99/100)^100 ≈ 36%. So even with a 1% error rate, there's a 36% chance you could check 100 files and find no errors. Which is probably not good enough: if audited, there's a 64% chance they'd find an error.

However, if you think the chance of an error is 50% per file, the chance of no errors in 100 files is vanishingly small. I am sure there's a function to work out, at 99% confidence, that after 100 clean files your error rate is x% or less. It probably uses the natural log.

A 5% error rate would give a 0.6% chance of running through 100 files clean.

In short, without a good idea of your error rate, it's probably a lot of files to get to a level of confidence that your error rate is low (does 1 in 1,000 feel about right? At that rate you'd still have a 90% chance of doing 100 files clean).

Of course, if you find a single error after 500 files, you can estimate your error rate at 1 in 500 and go from there to see how reasonable it is.
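
A minimal sketch of this arithmetic in Python (the per-file error rates are assumed inputs to experiment with, as above; the final comment is the log-based shortcut alluded to):

import math

def p_all_clean(n, p):
    # Chance of zero errors in n files if each file independently has error rate p
    return (1.0 - p) ** n

def files_needed(p, confidence=0.99):
    # Smallest n such that a process with per-file error rate p would have
    # shown at least one error with probability >= confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

for p in (0.01, 0.05, 0.001):
    print(f"error rate {p:.1%}: P(100 clean files) = {p_all_clean(100, p):.1%}")
# -> 1.0%: 36.6%   5.0%: 0.6%   0.1%: 90.5% (the figures quoted above)

print(files_needed(0.001))  # ~4,603 clean files to be 99% sure the rate is below 1 in 1,000
# Rule of three: after n clean files, an approximate 95% upper bound on the
# error rate is 3/n, from solving (1-p)^n = 0.05, i.e. p ≈ -ln(0.05)/n ≈ 3/n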
It is said that your life flashes before your eyes just before you die.... it's called Life
December 20th, 2015 at 9:12:26 AM permalink
Wizard
Administrator
Member since: Oct 23, 2012
Threads: 239
Posts: 6095
Quote: TheCesspit
Depends on the likelihood of an error.


Exactly.

Here is a way you might phrase the question to get an answer.

There are two populations of cans of beans. In one population every can is good. In the second, one can in 1,000 is contaminated. You do not know which population has the contaminated cans. How many cans would you need to test from either population to be 99.9% sure it is the uncontaminated population, assuming every can tested was negative?

I get 7,596.1. Since you can't test a partial can, you would need 7,597 to get past 99.9%.
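
A sketch of one way to set that up in Python, assuming equal prior odds on the two populations (an assumption; the post doesn't state priors). The required count is sensitive to exactly how "99.9% sure" is formalized, so slightly different setups land on slightly different numbers:

import math

def posterior_clean(n, p=1/1000, prior_clean=0.5):
    # P(you have the all-good population | n cans tested, all negative)
    like_clean = 1.0                # the good population never fails a test
    like_contam = (1.0 - p) ** n    # chance the contaminated population passes n tests
    num = prior_clean * like_clean
    return num / (num + (1.0 - prior_clean) * like_contam)

# Smallest n with posterior >= 99.9% under these particular assumptions:
n = math.ceil(math.log(0.001 / 0.999) / math.log(1 - 1/1000))
print(n, posterior_clean(n))  # about 6,900 cans in this formulation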
Knowledge is Good -- Emil Faber
December 21st, 2015 at 1:16:49 AM permalink
Pacomartin
Member since: Oct 24, 2012
Threads: 1068
Posts: 12569
Quote: AZDuffman
I have a question on sample sizes. If I was testing for errors on a sample with 64,000 subjects, what would be the size needed to be confident of no errors?

Well, the term "no errors" doesn't even exist in statistics. The two most common terms are confidence interval and confidence limits.

Confidence interval
Term used in parametric statistics to specify the margin of error associated with a particular survey estimate for a given level of significance. For example, 95% confidence interval denotes the range of values surrounding the survey estimate within which there is a 95% probability that the true population value will lie. Depending on the level of certainty required, higher or lower probability values may be used to specify the confidence interval.
Confidence limits
The lower and upper boundaries/values of a confidence interval, that is, the values which define the range of a confidence interval.

The problem is that for most executives who are not statisticians, their eyes will glaze over, and they will order you to check an additional 5,000 data entries just to be more confident.

-------------------------------
Are you looking at numerical data or names and addresses? Do you have your own spreadsheet to compare with the government spreadsheet?

If you do, then it is often sufficient to compare some simple metrics from your data to the government's data. Minimum, Maximum, Mean, and Median are the simplest.
But it helps to print out a table of percentiles, i.e. =PERCENTILE(A1:A64000, p) where p = 1%, 2%, 3%, ..., 100%. The percentile table will give you an idea of whether there are errors that cancel each other out.

For instance, I have a dataset of 365*24 = 8,760 points that represent the number of watts used by my house electrical system every hour for the last year. I can set out a percentile table with the percentiles spaced more closely above 90%, where the data is more important.

Load on home electrical system

min 580 Watts
5% 1,210
10% 1,380
15% 1,500
20% 1,610
25% 1,730
30% 1,840
35% 1,960
40% 2,070
45% 2,190
50% 2,250 median
55% 2,360
60% 2,530
avg. 2,560
65% 2,650
70% 2,820
75% 3,050
80% 3,280
85% 3,630
90% 4,200
95% 5,010
96% 5,240
97% 5,590
98% 6,153
99% 7,140
Max 10,310 Watts

100 A service: 12,000 Watts


This is of interest because a 100 Amp service can handle peak loads of 120*100 = 12,000 watts. I am curious not only how often that happens, but how often it even gets in the ballpark (where I would have to reset the main circuit breaker). It seems unlikely, as I exceed 7,140 watts only 1% of the time. But naturally an electrician would tell you to put in a 200 Amp service to be "safe and secure".

But in your case, if I were comparing two data sets and such a percentile table was nearly identical for both, I would be pretty confident there were few errors. An average, max, and min alone provide very low confidence that the data sets are nearly identical.
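
For comparing two extracts this way, a quick sketch (assuming each field can be dumped to a single-column text file; the file names here are hypothetical):

import numpy as np

ours = np.loadtxt("source_system.txt")       # hypothetical export from the source system
theirs = np.loadtxt("fed_spreadsheet.txt")   # hypothetical export of the reported field

# Every 5 points, plus denser coverage at the top end as suggested above
qs = sorted(set(list(range(0, 101, 5)) + [96, 97, 98, 99]))
for q in qs:
    a, b = np.percentile(ours, q), np.percentile(theirs, q)
    flag = "" if np.isclose(a, b) else "  <-- differs"
    print(f"{q:3d}%  {a:12,.0f}  {b:12,.0f}{flag}")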
December 21st, 2015 at 3:34:23 AM permalink
AZDuffman
Member since: Oct 24, 2012
Threads: 135
Posts: 18136
Quote: Pacomartin

Are you looking at numerical data or names and addresses? Do you have your own spreadsheet to compare with the government spreadsheet?


Thanks for all the replies so far; sorry I have not replied sooner, but I just had a killer work weekend. I would share more, but then I read what Face is dealing with, and mine seems not quite as bad. But I still slept most of yesterday.

Anyway, here is some more info. When you apply for a mortgage, you give, or the bank enters, your name, address, race, etc. on the application. So for example, on "race" it might look like this:

C=White
B=Black
A=Asian
H=Hispanic
0=did not answer

So the person taking the app enters one of the said values, let's say the most common, "C." Now it is time to report, and the feds want the records. The feds do not want to look at our files; they want a report. So someone has made a bot to look at each application in our file. The bot sees "C," takes said "C," and puts it in an Excel spreadsheet.

My job all week is to look at the spreadsheet and see that the "C" there matches the "C" in the app and has not become, say, an "H" or even a non-answer such as "Z."

That is all it is about. "Errors" are not possible in the normal sense, as we are not digging to see if, say, the name "Corleone" has an "A" in the box. All we care about is that they match. What they are worried about is whether the poor bot gets confused. One woman keeps finding one kind of error, so I told her to report it, as that is what they want: to see if, for example, "De Lallo" with a space in it gives a null value, and whether other names with spaces also give a null value, so they can go back and fix that part.

For our part it is easy but tedious. It is doubtful I will ever meet the person running it face to face. I am just wondering: if I look at 200 of these and see they all have no errors, how confident can I be that none will have an error?
The President is a fink.
December 21st, 2015 at 4:05:54 AM permalink
TheCesspit
Member since: Oct 24, 2012
Threads: 23
Posts: 1929
God, that's an awful way to test a bot. There are far better ways, including using a known dataset, writing the expected results for it, running the bot on it, and checking that the output matches expectations.

And a very sorry old system in general.

One way to check the spreadsheet for bad/unexpected data is to do a simple =COUNTIF(...) statement in Excel. Count the number of rows. Count the number of C, B, A, H, and 0 values; the sum should equal the number of rows. Etc. That saves checking for bad values by hand.

Rinse and repeat for each field type.
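
A Python version of the same tally, assuming the spreadsheet is saved off as a CSV (the file and column names here are hypothetical):

import csv
from collections import Counter

ALLOWED = {"C", "B", "A", "H", "0"}

with open("fed_report.csv", newline="") as f:    # hypothetical file name
    rows = list(csv.DictReader(f))

counts = Counter(row["race"] for row in rows)    # hypothetical column name
unexpected = {v: n for v, n in counts.items() if v not in ALLOWED}
print("rows:", len(rows))
print("counts:", dict(counts))
print("unexpected values:", unexpected)  # nulls, stray "Z"s, blanks, etc.
# If the allowed counts don't sum to the row count, something slipped through.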

200 is nothing like enough to be confident of no errors, especially as the errors a script generates are not going to be purely random. As you say, names with two parts might cause an error, and your sample might have none of those. Dates can throw up all sorts of fun errors. Etc., etc.

Having the bot/script writer do some QA on it would be far more effective :)
It is said that your life flashes before your eyes just before you die.... it's called Life
December 21st, 2015 at 9:29:04 AM permalink
Pacomartin
Member since: Oct 24, 2012
Threads: 1068
Posts: 12569
Quote: TheCesspit
Especially as the errors a script generates are not going to be purely random.


Cesspit is correct: since you are really talking about transcription errors here, no likelihood estimate or confidence interval applies.

So in theory someone may have entered L for "Latino," and you don't know whether the person who programmed the 'bot correctly dealt with all the exceptions.
December 21st, 2015 at 4:36:28 PM permalink
AZDuffman
Member since: Oct 24, 2012
Threads: 135
Posts: 18136
Quote: Pacomartin
Cesspit is correct: since you are really talking about transcription errors here, no likelihood estimate or confidence interval applies.

So in theory someone may have entered L for "Latino," and you don't know whether the person who programmed the 'bot correctly dealt with all the exceptions.


I don't think there will be any possible exceptions. The system that takes the app would not allow, say, "M" instead of "L." IOW, the universe of possible answers is fixed, as there is no free-form entry.

I've done over 300 of these with zero errors. This is what you end up doing when you stay in school.
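
(By the arithmetic upthread: a bot with a 1-in-100 per-row error rate would pass 300 clean checks only 0.99^300 ≈ 5% of the time, so 300 clean files roughly rules out error rates above 1% at the 95% level, though it says nothing about rare, systematic errors like the space-in-a-name case.)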
The President is a fink.