Dealing with unhelpful .csv files

The easiest way to get your .csv (comma-separated values) text files into Stata is this:
insheet using myfile.csv, comma names
This assumes that your .csv file is well-behaved: names are on the first row, and any commas really are field separators.

This is sometimes not the case. Recently I got a set of .csv files that were likely exported from Microsoft Excel and had four main problems. The first was that the first row was empty except for the first cell, which had some gibberish in it; the variable names were on the second row. Second, some of these names were things like "$100 & UP". Third, some of the files had an empty las variable, as if MS Excel had tacked an extra spreadsheet column onto the .csv file. Finally, all the variables entered Stata as string, because Stata, seeing alphanumeric characters all over the second row, reasonably assumed that there were no numeric variables in this file.

So I wrote a program that would fix these four problems in any set of .csv files that shared any of them, regardless of the actual number of variables, which were string and which were numeric, and what crazy characters were in the original variable names.

It went like this:
capture prog drop fixVarlist prog def fixVarlist// get rid of last var if empty unab oldvars: _all quietly describe local varnum=r(k) local last: word `varnum' of `oldvars' capture confirm byte var `last' if _rc==0 { count if `last'!=. if r(N)==0 { drop `last' unab oldvars: _all quietly describe local varnum=r(k) } } // collect names of vars from first row and fix them as needed forvalues i=1/`varnum' { local oldvar: word `i' of `oldvars' // capture value from first row local newvar=`oldvar' in 1 // turn $ signs to USD local newvar=regexr("`newvar'","^\\$","USD") // clean up blank spaces while regexm("`newvar'"," ") { local newvar=regexr("`newvar'"," ","") } // some more cleanup // -- these changes apply to only a subset of the vars if regexm("`newvar'","&UP") { local newvar=regexr("`newvar'","&UP","andUp") } // -- these apply to all vars local newvar=regexr("`newvar'","w/o","Without") local newvar=regexr("`newvar'","w/","With") // collect clean var names into a list local newvars `newvars' `newvar' } di "there are `varnum' valid variables in this file" di "they are: `newvars'"// now rename vars drop in 1 local newvarnum: list sizeof newvars forvalues i=1/`newvarnum' { local oldvar: word `i' of `oldvars' local newvar: word `i' of `newvars' rename `oldvar' `newvar' } // now de-string any numeric vars unab newvars: _all local newvarnum: list sizeof newvars quietly count local check=r(N) forvalues i=1/`newvarnum' { local newvar: word `i' of `newvars' gen x=real(`newvar') quietly count if x==. if r(N)==`check' { di "`newvar' is not numeric" } else { destring `newvar', replace force } drop x } end
You need to fix blank spaces inside that while loop because Stata's regexr() command would operate only on the first blank space it found, and leave the rest alone. You check whether there are any blank spaces left to fix with regexm(), which returns 1 if there are, and 0 otherwise.

Also, you need to escape the dollar sign, twice as it turns out, because in regular expressions $ by itself is the end-of-string marker. If you didn't escape it with a backslash (OK, two in this case) then regexr() would tack the string "USD" as a suffix onto all your variable names. Oh, and there's the ^ sign: in regular expressions it's the beginning-of-string character. Essentially, ^\\$ means "look for dollar signs at the beginning of this string".

Other than that, there's nothing exotic about this program. Its nice feature is that it keeps track of how many variables you have (by using the unab command to read in the unabbreviated variable list and by filling recursively the local `newvars', another Stata feature not commonly used) and it does its own counting. Features like these aren't mandatory, but they make your code more reusable.