I am reading An Introduction to Stata Programming, by Christopher Baum.

In Section 5.2 he suggests a nice do-file method for validating your data: use pairs of list and assert. For example, suppose you know that a variable v should have no missing values. If it indeed has none, then assert !missing(v) runs without error. If it has some, you want to know where they are: list if missing(v). Put the list line before the assert line in a do-file. If there are problems, Stata first shows you where they are and only then exits with an error, which alerts you that something is wrong with your data:

sysuse auto, clear
list if missing(make)
assert !missing(make)
list if missing(rep78)
assert !missing(rep78)
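
When the same check applies to several variables, the pairs can also be generated in a loop. This is only a minimal sketch, assuming the variables to check are make and rep78 as above; the loop macro name v is my own choice:

```stata
sysuse auto, clear
* For each variable: show the offending rows first, then let assert stop the run
foreach v of varlist make rep78 {
    list if missing(`v')
    assert !missing(`v')
}
```

In auto.dta, make has no missing values, so the first pass through the loop is silent. rep78 does have missing values, so the second pass lists those observations and then assert stops the do-file with an error.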

This is neat because you can always edit this do-file with a new pair of list/assert lines as needed. But, as the author mentions, sometimes a summarize is plenty helpful too, especially if you remember that it accepts a list of variables as an argument. You could do this for example:

sysuse auto, clear
sum make rep78

Wait. That didn't work too well, because summarize tells you nothing about make: for a string variable it reports zero observations, as if every value were missing. You would have known that make is a string variable if you had run describe before sum.
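
A short sketch of that order of operations, using the same two variables. The count line is my own addition, as one way to check a string variable for missing (empty) values when summarize cannot help:

```stata
sysuse auto, clear
* Check storage types first: make is a string (str18), rep78 is numeric (int)
describe make rep78
* summarize is informative only for the numeric variable
summarize rep78
* For the string variable, count the empty values instead
count if missing(make)
```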

So, I guess, before you do any kind of data validation, describe is a good first step; you might also like codebook; I don't. I find it too wordy. But it does do the job of giving you information about the whole data set.
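
If the full codebook output is more than you want, its compact option may be a reasonable middle ground: as far as I can tell it prints one line per variable instead of a block per variable. A minimal sketch:

```stata
sysuse auto, clear
* One summary line per variable for the whole dataset
codebook, compact
* The same, restricted to the variables of interest
codebook make rep78, compact
```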

One alternative to wordy output, when you have a specific question about more than one variable, is to use little custom programs for data checks. One such example is countIfMissing, shown in my previous post. Another, inspired by Christopher Baum's use of summarize, might be sumIfNumeric:

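capture program drop sumIfNumeric
program sumIfNumeric

   * Expand _all into the full list of variable names
   unab fullset: _all
   local numset
   * Keep only the variables that pass the numeric check
   foreach varble in `fullset' {
      capture confirm numeric variable `varble'
      if _rc==0 {
         local numset `numset' `varble'
      }
   }
   * Summarize the numeric variables, if there are any
   local check: list sizeof numset
   if `check'>0 {
      sum `numset'
   }
   else {
      di "No numeric variables found in this dataset."
   }
end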
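
For comparison, Stata's built-in ds command can do the same selection without a loop. The following is a sketch of an alternative, not a replacement for sumIfNumeric; it assumes that ds with the has(type numeric) option leaves the matching variable names in r(varlist), and it reuses the local name numset only for illustration:

```stata
sysuse auto, clear
* List the numeric variables quietly; their names are returned in r(varlist)
quietly ds, has(type numeric)
local numset `r(varlist)'
if "`numset'" != "" {
    summarize `numset'
}
else {
    display "No numeric variables found in this dataset."
}
```

The custom program is still handy if you want a single command to type at the prompt; the ds version is shorter when the check lives inside a do-file.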


This might be useful when your data set comes with a bunch of variables -- some numeric, some string. Though describe will tell them apart easily enough, you may not care to list them explicitly. The usage is straightforward:

sysuse auto, clear