I recently worked on a project where I had to model groundwater salinity as an indirect function of population growth. The idea is that more people will draw more fresh water from the aquifer; other things equal, saline water will be displace it. I had to do this for sixteen counties in Southern Florida and my data -- on population, salinity, and water withdrawals, by year and by county -- had some gaps here and there.

A search on how to best fill them had to start with the Statalist archive. I quickly found this simple solution by Scott Merryman, which solved the core of the problem. But then things got complicated. First, I had several variables of interest and I had no reason to expect that they all would have a linear time trend. Second, some of my variables -- such as measured groundwater salinity -- showed pretty unbelievable outliers.

So I needed some kind of wrapper that would accommodate different variables, trends of different orders, and dummy variables to account for any differences between counties or groups of counties. And I needed a quick and easy way to deal with the outliers. The wrapper is below:

// #### higher-order predictions of anything against time. 
// takes three arguments: 
// 1. order, numeric= 1,2,3,4,5,etc. for linear, squared, 
//    cubed, etc. trend
// 2. lhs, string, the name of the left-hand side variable: 
//    salinity, usage, population, etc.
// 3. dummy, string -- name of the group identifier on the 
//    right-hand side, e.g., county fips code.
capture prog drop getValueHat
program getValueHat

args order lhs dummy

levelsof `dummy', local(countem)
local checkit: list sizeof countem
// if `dummy' is not degenerated to one value, then use xi: regress
if `checkit'>1 {
   local regressors i.`dummy'*year1
   forvalues i=1/`order' {
      gen year`i'=year^`i'
      if `i'>1 {
         local regressors `regressors' i.`dummy'*year`i'
      }
   }
   xi: regress `lhs' `regressors'
}
// otherwise, just regress
else {
   local regressors year1
   forvalues i=1/`order' {
      gen year`i'=year^`i'
      if `i'>1 {
         local regressors `regressors' year`i'
      }
   }
   regress `lhs' `regressors'   
}   
predict `lhs'_hat
replace `lhs'_hat=max(0,`lhs'_hat)
if `checkit'>1 {
   drop _I*
}
forvalues i=1/`order' {
   drop year`i'
}

end

Now I can use the same code for different variables with different functional forms, which is nice. For example, if I want to fill gaps in salinity at the county level with a quadratic trend, this program call will get me the right salinity_hat:

getValueHat 2 salinity county

Dealing with the outliers was a far simpler matter: Nick Cox's winsor command -- findit winsor if you don't have it installed. The complete solution to my problem of gaps and outliers is:

winsor salinity, h(3) gen(x)
getValueHat 2 x county
rename x_hat salinity_hat
drop x

If, like me, you've never Winsorized before, here's the Wikipedia entry on the procedure.