In my previous post I cautioned against looping across observations, then showed how to do it anyway, using the example of reshaping a list from long to wide. A reader suggested, not unreasonably, that one might want to use reshape for that. He followed up with a code example, with the caveat that he did not know whether it would be any faster. This brings me to the topic of benchmarking.

I seldom compare the speed of execution of alternative solutions to the same problem. It's done all the time in general-purpose programming, but in run-of-the-mill statistics and data management it is not a pressing concern. You want to write clear, reproducible code. How long the code takes to run matters less than how easy it is to follow and replicate, because it typically doesn't have to run more than once: you write your paper, send it to the publisher, and you're done.

But I don't publish for a living. Instead, I write code that does have to run over and over again, so it's about time that I put some thought into how to measure its performance. If you already have a favorite way of doing that, I'd like to hear about it. Below is my attempt: a comparison of my initial solution (looping across observations) and Phil's (using reshape and a couple of other clever Stata functions) for a data set of 1,000,000 observations.

clear
set mem 100m

set obs 1000000
gen x=uniform()

// using the egen function seq()
capture prog drop phil
prog def phil

local myvar `1'
count
local n=r(N)
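// i numbers each pair of consecutive rows: 1,1,2,2,...
egen i=seq(), from(1) to(`n') block(2)
// j marks position within the pair: 1 for the first row, 2 for the second
gen j=mod(_n-1,2)+1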
reshape wide `myvar', i(i) j(j)
destring `myvar'1, replace

end

// looping across observations
capture prog drop gabi
prog def gabi

local myvar `1'
count
local obs=r(N)/2
gen var2=.

forvalues i=1/`obs' {
  local there=`i'*2
  local here=`there'-1
  replace var2=`myvar'[`there'] in `here'
}

end

// speed comparison
foreach k in phil gabi {
  preserve
  di c(current_time) // check the clock
  di "`k''s solution"
  quietly `k' x
  di c(current_time) // check again
  restore
}

The idea is to compare the time posted on screen before and after running each program. On my machine (Dell Latitude D600, Intel Core 2 Duo, 2.0 GHz, 2 GB of RAM) I found this:

11:26:20
phil's solution
11:26:29
11:26:29
gabi's solution
11:26:44

.
end of do-file

Clearly, reshape beats looping across observations: 9 seconds vs. 15.
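
One limitation of this approach is that c(current_time) only resolves to whole seconds, so short runs would be hard to tell apart. A minimal sketch of a finer-grained variant, assuming the programs phil and gabi and the variable x from above are still in memory, could use Stata's built-in timer command instead. The timer numbers and the local macro t are arbitrary choices for illustration, not part of the original comparison.

```stata
// Sketch only: same comparison, timed with Stata's timer command.
// Assumes phil, gabi and the data set with x are already defined above.
timer clear                      // reset all timers
local t = 0
foreach k in phil gabi {
  local ++t                      // timer 1 for phil, timer 2 for gabi
  preserve
  timer on `t'                   // start the stopwatch
  quietly `k' x
  timer off `t'                  // stop the stopwatch
  restore
  quietly timer list `t'         // stores elapsed seconds in r(t1), r(t2)
  di "`k''s solution: " r(t`t') " seconds"
}
```

Because timers accumulate across repeated on/off calls, the same loop could be wrapped in an outer loop to average over several runs, which would make the comparison less sensitive to whatever else the machine is doing.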