Advanced Database Tricks for Social Scientists – Querying a SQL Database in Stata

This is a quick guide to setting up a connection to directly query an ODBC-compliant database from a do file in Stata.

NOTE: As the title suggests, this guide assumes relatively advanced knowledge of both relational databases (e.g. Oracle, SQL Server, MySQL, etc) and Stata.  Viewer discretion advised.

OBJECTIVE: Use a Stata do file to directly query a live ODBC-compliant, relational, T-SQL database and load the results as the current dataset in Stata.  Do this using a connection string so no OS-specific named ODBC datasources are required.

[su_divider top=”no” size=”1″]

Step 1 – Make sure you have a working connection to your database server of choice (and have the appropriate permissions and access)

Step 2 – Assemble the following information to use in creating a connection string.

Step 3 – Use the following code in your Stata do file

*** LOAD DATA ***
// clear any data already loaded in memory

//database connection
local db = "DRIVER={YOURDRIVERNAMEHERE};SERVER=yourservernameoripaddress;DATABASE=databasename;UID=uid;PWD=pwd;"

// SQL SELECT statement to retrieve data
local sql = "SELECT * FROM YourTable"

// Load data via ODBC (using connection string)
odbc load, exec("`sql'") connectionstring(“`db'") clear

//destring any vars that came in as strings but should be numeric
destring, replace


[su_divider top=”no” size=”1″]

The driver attribute in the connection string is the full name of the driver as your OS sees it when installed. For example, when I connect to MySQL, I use the following

DRIVER={MySQL ODBC 5.3 Unicode Driver}

You can find the driver name by checking your ODBC drivers using a command line (WIN/OSX/Linux) or with a GUI through the ODBC settings in your control panel (WIN) or by installing the iODBC GUI (OSX/Linux) available at:

Coming Soon: In a former life, I was a software engineer, IT project manager, and database developer. I built software and worked with enterprise-level database tools on a daily basis for about a dozen years before getting my PhD and becoming a full-time researcher and teacher in the field of Management and Entrepreneurship.  When my fellow academics learn this part of my history, they often ask me for advice in learning how to better manage the data they use for social science research.  So, I’ve decided to post a series of tutorials (not sure whether they’ll be articles, videos, or both) giving an introduction to how an ordinary social scientist can use the kinds of high-horsepower tools that software and database engineers use every day.  Stay Tuned.

Stats Tip: Understanding Normal Distributions (and “Husky-Sized” Tomatoes)

Today’s stats tip comes to you courtesy of… My amazing friend, Ingrid Gruett, who has joined the ranks of us Gen-Xers going back to grad school for a career change.  Ingrid got her bachelor of music degree as a classmate of mine at the Wheaton Conservatory of Music and is an extremely talented clarinetist.  She is now back in school to begin a new career in music therapy.

Ingrid is taking a stats course and passed along an interesting exam question she saw this week which I thought was a particularly good one for digesting the concept of normally distributed data and how we think about what data are unusual.  I always find that explaining a concept helps me understand it better and perhaps this will be helpful to some new stats students out there (or others who teach them.) For you seasoned liars, damned liars, and statisticians out there, this will be old hat, but for me it was helpful to have an opportunity to explain this concept in a way that might make sense to someone who is new to statistics.

Here’s the question Ingrid passed along…

Jody, a statistics major, grows tomatoes in her spare time.  She keeps a record of the weight of each tomato she grows.  One tomato is 2 standard deviations heavier than the mean weight.  Assume a Normal model is appropriate.  What percentile is it in?

  1. 99.7
  2. 95
  3. 97.5
  4. 68
  5. None of the above

To answer this question, I offer this crudely drawn figure and some steps to guide your thinking.TomatoDistributionSketch

STEP 1. CONSIDER THE TOMATOES OF THE FIELD  – Think of all the tomatoes coming out of the garden.  When we assume their weights are normally distributed, that means that all of them must fall inside the bell curve of our distribution somewhere. A handful of the tomatoes will be especially big (like the one in your problem) and will be way out under the right tail of the curve.  About the same number will be especially small, so these will be way out under the left tail of the curve.  Most will be about average (really close the the mean), which is why there’s so much more area under the middle bump of the curve than there is way out in the tails.

STEP 2. WHICH TOMATOES ARE INSIDE 2 STANDARD DEVIATIONS AND WHICH ARE OUTSIDE? – If you look at the poorly drawn picture  you can work out that 95% of all the tomatoes are your typical tomatoes (as we said above) and aren’t so unusually big or small that they are more than 2 standard deviations away from the mean, so they would be under the big middle bump of the curve.  That means that all the freakishly big AND freakishly small tomatoes are outside of that big middle bump, under the TWO tails. So, if 95% are inside that 2SD window, 5% are outside in the TWO tails…

STEP 3. WHICH FREAKISH TOMATOES ARE YOU TALKING ABOUT? – This is the key step… in this case you are ONLY looking at a freakishly BIG tomato way out on the right tail of the distribution.  As we’ve already said, though, if we assume our garden grows a crop of tomatoes that is normally distributed in terms of tomato size, there must also be freakishly SMALL tomatoes way out on the left tail (these are pitiful, green, sour tomatoes that deserve our pity).  Even though we don’t care about those little guys, they are still part of the 5% that are outside the 2SD window… in fact, they make up exactly HALF of that 5%, meaning there is 2.5% in that tail and 2.5% in other tail.  So… that means that ONLY 2.5% of ALL the tomatoes are 2SD+ BIGGER than average… meaning, 97.5% are smaller than the monster tomato you are looking at in this problem.  Hence, it is in the 97.5th percentile.

Thanks, Ingrid!  I have no doubt a great many hurting people will benefit from your work as a music therapist. Keep at it!

Image Credit – “Healthy Red Tomatoes are Wet and Organic”  by, used under CC BY 2 / Modified from original

Teaching Tools – Trivia Game

Fellow teachers of undergraduates, how often to you face this question on the first day of the semester?  Am I better off…

  1. …spending the entire first class session reading the syllabus to your students with great gravity and conviction in the misguided hope that they will pay some minimal level of attention to the information therein?
  2. …handing them a copy of the syllabus without further ado knowing nothing you say/do will make any difference anyway, then brace for a semester of conflict?

I have tried option A with the results more experienced teachers might expect (i.e. a wasted class period and bored students) and had been planning to go with Option B next time I teach. However, a more experienced colleague mentioned that she had given her students a syllabus quiz on the first day that seemed to force students to review and ask questions about course content and policies.  My best teaching buddy, Chris Sutter and I talked about this idea and thought “Hey, why not do a syllabus game instead?”  We are both planning to try it in our courses this coming Fall using a Jeopardy-style game (with a shameless Skinnerian infusion of candy to keep things moving along.)

I developed a basic, easily customizable powerpoint template to run this game in the classroom.  I think it’s a little better than the other templates/apps I’ve seen via Google.  More importantly, I designed it to be both visually attractive AND universally compatible with all versions of PowerPoint on all platforms (Chris and I use different operating systems and a shared classroom computer for presentations so we didn’t want to have any issues with missing fonts or the like.)

Turns out, the template I made was only a few tweaks and a page of documentation away from being useful for pretty much anyone wanting to build and run a trivia game on any topic.  So, I’ve decided to push it out on ye olde interwebs for anyone to download and use as they see fit.

You can grab a copy at:

If you end up using/improving this, drop me a line (via twitter, facebook, email, whathaveyou) and tell me about it.