Dummy variables

There are two straightforward ways to turn a string variable into a set of dummies -- also known as indicator variables -- in Stata. One is an option of the tab command:
tab stringvar, gen(dummy)
The other makes use of the fact that you seldom need dummies for their own sake. Usually you want to use them in some sort of regression model. The xi: prefix for estimation commands turns string variables into dummies automatically, as in
xi: regress y x i.stringvar
Both are described in detail here, and both work well when each value of your string variable maps to exactly one dummy. That, however, is not always the case. Think of a data set where you have a string variable named "color" which is equal to "red" for the first observation, "blue" for the second and "yellow, blue" for the third. You would want the dummy "color_red" to be equal to 1 in the first observation; the dummy "color_blue" to be 1 in the second and the third; and a separate dummy, "color_yellow", to be equal to 1 in the third observation.
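To make the problem concrete, here is a minimal sketch using a three-observation toy data set that mirrors the color example (the data and the stub name color_ are made up for illustration). It shows what the plain tab approach does with the comma-delimited value:

```stata
* Toy data, assumed for illustration only
clear
input str20 color
"red"
"blue"
"yellow, blue"
end

* tab creates one dummy per distinct string, so "yellow, blue"
* becomes a category of its own instead of flagging yellow and blue
tab color, gen(color_)
describe color_*
```

The variable labels shown by describe tell you which string each dummy stands for. The list "yellow, blue" gets its own dummy, which is not what you want here.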

I just ran across such a data set today. It had characteristics for a few hundred lottery games. The color variable described the ticket colors. A few other string variables could also contain comma-delimited lists. Moreover, the comma-delimited lists could include values that did not show up on their own in any other observation (like "yellow" in the example above).

So I thought I'd write a program that could deal with all of that without any visual inspection or case-by-case manual labor on my part. I wanted it to be applicable to any string variable in this situation. My suggestion is below:

// ##### getDummies -- turns string to dummies. Takes one argument:
// ##### `1' -- string, the name of the variable of interest.
capture prog drop getDummies
prog def getDummies

local stringvar `1'
quietly count
local fullset=r(N)
quietly count if !regexm(`stringvar',",")
local uniques=r(N) // cases where `stringvar' is not a list

if `fullset'!=`uniques' {
    quietly {
        tab `stringvar' if regexm(`stringvar',",")
        levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
        preserve
        tempfile `stringvar'_lists
        keep `stringvar'
        keep if regexm(`stringvar',",")
        duplicates drop
        split `stringvar', p(",")
        save "``stringvar'_lists'", replace
        restore
        describe `stringvar'* using "``stringvar'_lists'", varlist // note(1)
        local `stringvar'_stubs=r(varlist)
        split `stringvar', p(",")
        local stubs: list sizeof `stringvar'_stubs
        forvalues i=2/`stubs' {
            local stub: word `i' of ``stringvar'_stubs'
            replace `stub'=trim(`stub') // note(2)
            levelsof `stub' if `stub'!="", local(extras)
            local tags: list tags | extras
        }
        local tags: list sort tags
    }
}
else {
    di "for each value of `stringvar' there corresponds one dummy variable"
    capture drop __*
    quietly levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
    local `stringvar'_stubs `stringvar'
}
capture drop __*

local stubnum: list sizeof `stringvar'_stubs
local tagnum: list sizeof tags

quietly {
    forvalues i=1/`tagnum' {
        local thistag: word `i' of `tags'
        local thistag: list clean thistag
        gen byte _`stringvar'_`i'=0
        forvalues j=1/`stubnum' {
            capture drop __*
            local thisstub: word `j' of ``stringvar'_stubs'
            replace _`stringvar'_`i'=1 if `thisstub'=="`thistag'"
        }
    }
}
drop `stringvar'*

// this section is for listing stuff on screen and in the log
local `stringvar'_stubs: list `stringvar'_stubs-stringvar
local stubs: list sizeof `stringvar'_stubs
di ""
di "total number of games: `fullset'"
di "number of games where `stringvar' is not a list: `uniques'"
di "unique values of `stringvar': `tagnum'"
if `stubs'>0 {
    di "where `stringvar' is a list, it is this long at most: `stubs'"
}
di ""
forvalues i=1/`tagnum' {
    local thistag: word `i' of `tags'
    local thistag: list clean thistag
    di "_`stringvar'_`i' is for `stringvar' == `thistag'"
}

end

That's it. This program collects all the possible values that your stringvar can take, whether inside comma-delimited lists or by themselves, and produces accurate dummies that are equal to one every time such a value is encountered, whether by itself or in a list, and regardless of its position in the list. With your data set in memory, you simply call
getDummies color
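As a quick check, here is a hedged sketch of a full run on the three-observation color example. It assumes getDummies has already been defined in the current session, and the toy data are made up for illustration:

```stata
* Toy data, assumed for illustration only
clear
input str20 color
"red"
"blue"
"yellow, blue"
end

* Requires the getDummies program to be defined first
getDummies color

* The original string variable is dropped by the program;
* the dummies are named _color_1, _color_2, ...
list _color_*
```

The program prints which numbered dummy stands for which color. If it behaves as described, the dummy for blue should be 1 in the second and third observations, and the dummy for yellow should be 1 only in the third.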
Now, I don't post programs unless they contain something I just learned in the process of writing them. Today's such thing is in line 24 of the program, next to the comment "note(1)". It turns out -- if you call help describe -- that there are two kinds of describe: one for data in memory, another for data stored in a file. The latter comes with a different set of options. One of them is varlist. It stores the names of the variables in r(varlist). I chose to preserve/restore and create the tiny temporary file "``stringvar'_lists'" so I could apply describe using to it and get a variable list saved in the local ``stringvar'_stubs'. I'm using it later on.
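For readers who have not used this form of describe, a minimal stand-alone sketch follows. The auto data set and the tempfile name cars are stand-ins chosen for illustration and have nothing to do with the lottery data:

```stata
* Save some data to a temporary file, then clear memory
sysuse auto, clear
tempfile cars
save "`cars'"
clear

* describe the file on disk; the varlist option stores the
* variable names in r(varlist) without loading the data
describe using "`cars'", varlist
display "`r(varlist)'"
```

The same pattern, restricted to `stringvar'*, is how getDummies learns which pieces split created.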

This may look like a lot of work, and it is, but it's all up-front. You do it once, and if it works now it works forever. The marginal cost of creating dummies out of any number of such variables is zero from here on out.

Update (February 4, 2009): the first version of getDummies had a bug. The line marked with the comment "note(2)" was missing. As a result, the program produced more dummies than it should have. To use my color example, without this line getDummies will produce two separate dummies for the color "blue": one for the case where color was equal to "blue" strictly, and another for the case where color contained the string " blue". Notice the leading blank space. You want to trim() it.
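The leading blank is easy to see in isolation. This is a small sketch on a made-up single observation, not part of the program itself:

```stata
* Toy data, assumed for illustration only
clear
input str20 color
"yellow, blue"
end

* Splitting on the comma leaves the blank after it in the second piece
split color, p(",")
display "[" color2[1] "]"

* trim() removes leading and trailing blanks
replace color2 = trim(color2)
display "[" color2[1] "]"
```

The brackets make the blank visible before trimming and show it is gone afterwards.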