I follow @simplystats on Twitter, and on March 1 they had a post that linked to an article in Foreign Policy about a guy who has the coolest job in applied stats. He works here.

The original piece described a quick way to estimate the number of human rights violations using a technique first devised for counting fish in a pond. The gist of it is this: catch and release fish over two days. Tag the fish caught on the first day. Count each day’s catch and the number of tagged fish caught again on the second day. That is the overlap. To estimate the number of fish in the pond, multiply the two days’ catches and divide the product by the overlap.
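Written as a formula, the recipe looks like the sketch below. The symbols are my own shorthand rather than anything from the article: $n_1$ and $n_2$ are the two days' catches, $m$ is the overlap, and $\hat{N}$ is the estimated number of fish. As far as I know, this is usually called the Lincoln-Petersen estimator.

$$
\hat{N} = \frac{n_1 \, n_2}{m}
$$

The reasoning is that the share of tagged fish in the second day's catch, $m / n_2$, should be about the same as the share of tagged fish in the whole pond, $n_1 / N$. Solving for $N$ gives the formula above.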

I had a data set of insurance claims loaded in Stata as I was reading, with observations uniquely identified by a variable named claim_id.

I decided to use it as the model of a pond with as many fish in it as observations in my data set, so I wrote a little fishing program. It takes one argument: a round upper bound on the number of fish I might catch in a day. I’ll call it n. It can be 100, or it can be 1,000. Here:

    // try MSE
    capture prog drop guessObservations
    program guessObservations
        args n // upper bound of a day's catch.
        qui {
            // each day's catch is a random whole number from 0 to n-1
            local day1fishcount=int(runiform()*`n')
            local day2fishcount=int(runiform()*`n')
            // draw each day's catch and save the IDs of the fish caught
            forvalues i=1/2 {
                preserve
                tempfile day`i'fishlist
                sample `day`i'fishcount', count
                keep claim_id
                save "`day`i'fishlist'", replace
                restore
            }
            // the overlap is the number of fish that appear in both catches
            preserve
            drop _all
            use "`day1fishlist'"
            merge 1:1 claim_id using "`day2fishlist'"
            count if _merge==3
            local overlap=r(N)
            restore
            // estimate: product of the catches, divided by the overlap if there is one
            local totalfish=`day1fishcount'*`day2fishcount'
            if `overlap'>0 {
                local totalfish=`totalfish'/`overlap'
            }
            // the true number of fish in the pond
            count
            local truect=r(N)
        }
        local fmt _col(30) %10.0fc
        di ""
        di "Fish caught on day 1:" `fmt' `day1fishcount'
        di "Fish caught on day 2:" `fmt' `day2fishcount'
        di "Overlap:"              `fmt' `overlap'
        di "Estimate:"             `fmt' `totalfish'
        di "True count:"           `fmt' `truect'
    end


My data set has some 150,000 observations. Choosing a small n, say guessObservations 100, sets me up for an overlap of zero, and in that case the two catches multiplied together won’t even come close to the true size of the population: each catch is at most 99 fish, so the product is at most 9,801. This is a technique for counting hungry fish in a small pond, not in an ocean. The daily catch has to be large enough relative to the total that you get some decent overlap.

Setting n=1,000 keeps it small enough relative to the total population that it’s still possible to have zero overlap, but n is now large enough to overshoot wildly in that case. If I catch 900 fish each day with zero overlap, I will guess that there are 810,000 fish there. However, an overlap as small as 5 will get me pretty close to the true population.
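As a quick check of that arithmetic, using the same hypothetical catch of 900 fish on each day:

$$
900 \times 900 = 810{,}000, \qquad \frac{900 \times 900}{5} = 162{,}000
$$

With an overlap of 5 the estimate comes to 162,000, which is in the neighborhood of the roughly 150,000 observations actually in the data set.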

Setting n=10,000 performs much better. I may still have a day when the fish won’t bite, and get this:

    . guessObservations 10000

    Fish caught on day 1:                49
    Fish caught on day 2:             4,182
    Overlap:                              3
    Estimate:                        68,306
    True count:                     157,638


But with any luck, I will get something like this:

    . guessObservations 10000

    Fish caught on day 1:             9,662
    Fish caught on day 2:             3,220
    Overlap:                            220
    Estimate:                       141,417
    True count:                     157,638
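To double-check the Estimate lines in the two runs above without doing any random fishing, here is a minimal sketch of a helper that only does the arithmetic. The program name lpEstimate and its three arguments are made up for this illustration; they are not part of guessObservations.

```stata
// Hypothetical helper, for illustration only: not part of the original program.
// Arguments: day 1 catch, day 2 catch, and the overlap between the two.
capture program drop lpEstimate
program lpEstimate
    args day1 day2 overlap
    if `overlap'>0 {
        // product of the two catches divided by the overlap
        display %10.0fc `day1'*`day2'/`overlap'
    }
    else {
        display "Overlap is zero, so the ratio cannot be computed."
    }
end

lpEstimate 49 4182 3
lpEstimate 9662 3220 220
```

The two calls use the catches and overlaps from the runs above, so they should print the same numbers as the two Estimate lines, 68,306 and 141,417.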


The larger n, the larger the overlap is likely to be, and the better the precision. That makes sense: in the limit, each day’s catch is the whole population and so is the overlap, and the true number times itself divided by itself will yield the true number.
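One way to see why the overlap grows with n, offered as a rough sketch rather than a full analysis: if the two catches are independent random samples drawn without replacement from a pond of $N$ fish, the expected overlap is given by the expression below. The symbols $n_1$, $n_2$, $m$ and $N$ are my own notation for the two catches, the overlap and the true population.

$$
\mathrm{E}[m] = \frac{n_1 \, n_2}{N}
$$

Taking $N = 157{,}638$ and a typical catch of about $n/2$ on each day, this gives an expected overlap of roughly 0.016 for n=100, about 1.6 for n=1,000, and about 159 for n=10,000. That matches what the runs above suggest: an overlap of zero is the norm for small n, and only the largest n reliably produces an overlap worth dividing by.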

But does n have to be very large relative to the size of the population? And does my guess, or the uncertainty surrounding it, depend on what probability distribution I assume for the daily catch? Next time I’ll be doing some simulations.