Stata Toiler: 2010

Friday, September 10, 2010

Are you the/a “marginal” ________ ? (PART 1)

In microeconomic theory, the concept of the marginal is unavoidable. In fact, it is the cornerstone of “modern” microeconomic theory (well, it depends on how you cast your individuals and players…).

In the simplest terms, for me anyway, the “marginal” is the one that tilts the status quo. It is a dividing line: it makes you pursue a specific task, it tells you if something is a “go” or not, it tells you to STOP!

Are you the “marginal” player? The “marginal” consultant….. the one that makes the difference. The relevant one. The cat in Mrs. Lovett's pie?

In running regressions, we are oftentimes interested in the marginal effects of specific explanatory variables. The explanatory variable can be anything, from a continuous variable (say income) to a switch variable.

What is the effect of a 100-unit increase in income on Y?

What is the effect of introducing a “pill” or a policy on utilization?

In the standard linear regression model, the estimated betas are usually the “marginal” effects (I'll show examples in another blog entry where this is not the case). If you are running a logit or probit model, you are oftentimes interested not only in the direction of an effect, but in its magnitude as well (say, the effect of introducing a policy on the probability of pursuing a certain action).

In Stata 10, some of the commands are

mfx compute --> for logit models

mfx, predict(pu0) --> for fixed or random effects logit models (xtlogit)

(there are variations of mfx depending on the model you are running, say an ologit or mlogit).

These two commands give you the effect of each explanatory variable on the probability. HOWEVER, note that Stata evaluates these effects at the MEAN values of your explanatory variables. Say x1 is a dummy variable for “male” and 20% of your regression sample are males: the effects will be evaluated at x1=0.2, a value that no actual respondent has (you have to make basic manual computations to get the predicted probability for males or for non-males).

If you are lazy and do not want to bother with computations to get the predicted probability, you can use…

mfx, predict(p) at(male=0) --> this evaluates everything with male set to 0 instead of its mean of 0.2; use at(male=1) if you want the probability for a male respondent.
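To make the commands above concrete, here is a minimal sketch for Stata 10. The variable names (visit, income, age, male, hhid) are made up for illustration and are not from any real data set.

```stata
* Hypothetical variables: visit (0/1 outcome), income, age, male (0/1 dummy)
logit visit income age male

* Marginal effects with every regressor held at its sample mean (the default)
mfx compute

* Same model, but with male fixed at 1 and the other regressors at their means
mfx compute, at(male=1)

* Random-effects logit, assuming hhid identifies the panel unit
xtlogit visit income age male, i(hhid) re

* Effects on the probability of a positive outcome,
* assuming the random effect is zero
mfx compute, predict(pu0)
```

Comparing the output of the second and third commands shows how much the reported effects move once the dummy is pinned to an actual value rather than its mean.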

A complication exists if one of the explanatory variables is an interaction of two other variables, or if you are dealing with squared explanatory variables. Obviously, the “mfx compute” command WILL NOT give you the correct marginal effects (say of age if there is age^2)….for the simple reason that Stata WILL NOT BE ABLE to recognize a variable called age_square as a transformation of another variable. Unless explicitly specified, Stata will simply treat age_square as an additional variable. (MORE ON THIS.....LATER)
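One way around the squared-term problem is to apply the chain rule by hand. The sketch below does this for a logit, where the derivative of the probability with respect to age is P*(1-P)*(b_age + 2*b_age_square*age). The variable names are hypothetical, and note that it averages the effect over the sample instead of evaluating it at the means the way mfx does, so the numbers will not match mfx output exactly.

```stata
* Hypothetical variables: visit (0/1 outcome), age, male (0/1 dummy)
gen age_square = age^2
logit visit age age_square male

* Predicted probability for each observation in the estimation sample
predict phat if e(sample), pr

* Chain rule: dP/d(age) = P*(1-P)*(b_age + 2*b_age_square*age)
gen me_age = phat*(1-phat)*(_b[age] + 2*_b[age_square]*age)

* Average of the observation-level marginal effects of age
summarize me_age
```

This gives a point estimate only; it does not produce a standard error.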

PS: for those using STATA 11, there is now a faster command, margins. Click here to learn more.
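For readers on Stata 11 or later, a rough sketch of the margins route looks like this. It relies on factor-variable notation, so Stata knows that the squared term belongs to age; the variable names are again hypothetical.

```stata
* Stata 11 or later. Hypothetical variables: visit, income, age, male
* c.age##c.age enters age and age squared; i.male marks male as a dummy
logit visit income c.age##c.age i.male

* Average marginal effects of age and male
margins, dydx(age male)

* Marginal effects evaluated at the means, closer in spirit to mfx
margins, dydx(age male) atmeans
```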

Thursday, September 2, 2010

OUT OF (re)SHAPE: long to wide

I am NOT a fan of the reshape command. If there are other ways, I avoid it like the plague.

Say you have a data set with members as unit of observation. Each observation has a tag, identifying its household and its “count” in the household.

Say, the problem is transforming the data set into one where the unit of observation is the household.

If I am interested in only a few household characteristics, I WOULD RATHER NOT USE the reshape command.

I would rather use the egen command, with the following technique:

sort hhid

by hhid: egen aveincome=mean(income)

by hhid: gen count=sum(1)

drop if count~=1

keep hhid aveincome

DONE! NO MORE FUSS!

The problem with reshape (and this is just long to wide, mind you) is that you end up with as many income variables as there are members in the largest family. And you will be forced to include in the reshape command variables you are actually not interested in (since these variables may not be constant within the ids).

But oh, all right, here is a sample of the darn RESHAPE command.
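A minimal long-to-wide sketch, assuming a roster with the hypothetical variables hhid (household id), member (the member's count within the household: 1, 2, 3, ...), income and age:

```stata
* Keep only the identifiers and the variables to be spread out;
* anything else that varies within hhid would make reshape complain
keep hhid member income age

* One row per household; income1, income2, ... and age1, age2, ...
* are created, up to the size of the largest household
reshape wide income age, i(hhid) j(member)

* Check the result
describe
```

Households with fewer members than the largest one simply get missing values in the extra income and age columns.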

Wednesday, September 1, 2010

Let Me Count the Ways, 1 2 3

Somebody asked me this morning if it is possible to generate a variable in an EXISTING stata data set containing counts, from 1 to n.

Yes, it's possible. The command is:

gen varname=sum(1)

You can also have a variable containing the series 2, 4, 6, …, 2n. The command is

gen varname=sum(2)

suppose, you do not want to start with 1, then

gen varname=sum(1) + 1

Suppose you have a roster data set with family members as observations and with specific family ids tagging the members of a family, and say you want to order the members, with count=1 for the youngest…Try running the commands

sort family_id age

by family_id: gen varname=sum(1)

Try running the gen ____=sum(1) command in a blank data set and see what you will get. The answer isn't surprising :)
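The same counters can also be built with Stata's built-in _n (the observation number in the current sort order) and _N (the number of observations in the group). A short sketch, with hypothetical variable names:

```stata
* Running count over the whole data set: 1, 2, 3, ...
gen counter = _n

* The even series 2, 4, 6, ...
gen evens = 2*_n

* Order members within each family, youngest first
sort family_id age
by family_id: gen member_order = _n

* Family size, repeated on every member's row
by family_id: gen family_size = _N
```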

YOUR FRIENDLY AVON GIRL: AN RDS TECHNIQUE

A few months ago, I handled data sets on intravenous drug users and men having sex with men (MSMs).

The sampling design used for data collection was RDS, or respondent-driven sampling. Basically, for each site, some “seeds” or respondents were recruited to participate in the survey. Then, these primary seeds were asked to recruit additional respondents. And so on…(think of the method used by AVON to sell cosmetics)…

The IDs of the respondents reflect their “position” in the entire recruitment process.

For example, given the following IDs

Id=1 > person is a first seed

Id=2 > person is a first seed

Id=12 > person is a seed recruited by 1

Id=21 > person is a seed recruited by 2

Id=123 > person is a seed recruited by person 12 (who was recruited by id 1)

I needed to perform an xtreg with grouping based on the seeds. I therefore had to cut the IDs at, say, the second, third, or fourth level, and then do the labeling based on the new IDs generated. To do this, I used the substr() function.

Refer to statadaily.wordpress.com for the command. Click here. Thanks Mitch!!!
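As a rough illustration of the idea (not the exact do-file referred to above), the sketch below assumes that id is numeric and that every level of the recruitment chain takes exactly one digit, as in the examples. The outcome y and the regressors x1 and x2 are hypothetical.

```stata
* Turn the numeric id into a string so it can be cut
gen str12 id_str = string(id, "%12.0f")

* First digit = the original seed that started the chain
gen str1 seed = substr(id_str, 1, 1)

* First two digits = the branch at the second level
gen str2 branch2 = substr(id_str, 1, 2)

* xtreg needs a numeric group variable; encode also attaches labels
encode seed, gen(seed_grp)

* Random-effects regression grouped by original seed
xtreg y x1 x2, i(seed_grp) re
```

If a seed can have ten or more recruits, one digit per level no longer works and the IDs would need a separator or fixed-width padding before cutting.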

Tuesday, August 31, 2010

DROP DOWN menus in Excel

A really swell thing that I have been doing is making my own encoding file in Excel.

To avoid Major Major data cleaning after encoding (say, recoding NCR and Metro Manila as NCR), I simply use a drop-down menu in Excel.

Suppose I have two variables, Region and Province. Making a drop-down menu for Region is easy.

However, for Province, can the menu “change” depending on the answer for Region? Say, if the Region is A, the provinces in the menu will be C, D, E, and if the Region is B, the provinces will be X, Y, Z? Yes!

Click here for a video demo on drop down menus.

Monday, August 30, 2010

Respondent-Driven Sampling (RDS)

How do you do analysis or quantitative studies on hidden populations, that is, populations for which a sampling frame cannot be constructed? Say Intravenous Drug Users (IDUs), or Men Having Sex with Men (Yes Martha, in most countries, MSMs are still hidden!).

You can do RESPONDENT-DRIVEN SAMPLING (RDS).

and then....with the data from an RDS study, how do you then analyze the data? What are the complications of using data from RDS studies?

Click here to learn more!

COLLAPSE to avoid fatigue

A few months ago, an officemate ran the following regressions:

Y1 = a + bx1 + cx2 + dx3 + …….. + nxn

Y2 = a + bx1 + cx2 + dx3 + …….. + nxn

.....

Y5 = a + bx1 + cx2 + dx3 + …….. + nxn

There were around 5 dependent variables and 7 explanatory variables (which include age and income). The data set also covered observations coming from more than 10 cities. Within a city, there are two types of observation, say registered and free-lance (oh, that is a dead giveaway of what the topic is :) ).

Then I was asked to do the following (and to be submitted within 15 hours!)

1. Mean values of Y1, Y2, and Y3 by explanatory variables, meaning…..

-Say one of the explanatory variables is sex. Means of Y1, Y2, Y3 for male and for female. I remember the regression models had categorical variables as well, so Means of Y1, Y2, Y3 for ALL the categories!

2. Oh, and note that since there are continuous explanatory variables, age and income….

The means have to be computed for each age and income quintile group, meaning…..

Means of Y1, Y2,Y3 for quintile1 of age, quintile2, quintile3….and the same style for income and ALL THE OTHER EXPLANATORY VARIABLES

3. Oh, and the quintiles of the continuous explanatory variables SHOULD BE REFLECTIVE of location distributions and not the entire data set. Meaning, the quintiles should be generated PER location.

4. And one last note, the table should be PER LOCATION and PER TYPE OF OBSERVATION (registered versus free lance).

So MANY demands! Girl, you could finish the entire Lord of the Rings trilogy, special features included, and I would probably still not be done. And it was needed the next day!

I can do the basic table commands: tabstat, tab, sum. But come on, the conditions alone (setting the “if”) are already prone to mistakes……So...

I performed a COLLAPSE command!

Click here for a sample. Enjoy.
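The general shape of the approach can be sketched as follows. This is an illustration under assumed variable names (y1 y2 y3 for the outcomes, age, city for location, regtype for registered versus free-lance), not the original do-file.

```stata
* Step 1: age quintiles computed separately within each city
egen cell = group(city)
quietly summarize cell
local ncell = r(max)

gen age_q = .
forvalues c = 1/`ncell' {
    * quintiles of age using only the observations of city `c'
    xtile tmp_q = age if cell == `c', nquantiles(5)
    replace age_q = tmp_q if cell == `c'
    drop tmp_q
}

* Step 2: one row per city x type x age quintile, holding the means
preserve
collapse (mean) y1 y2 y3, by(city regtype age_q)
list, sepby(city regtype)
restore
```

The same two steps can be repeated for income and for each categorical explanatory variable, swapping the variable inside by(). Because collapse replaces the data in memory, preserve and restore bring the original data set back afterwards.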

Sunday, August 29, 2010

Play that Funky Music!

One of my friends, Aiken, mentioned, “Jay, you were the FIRST person I saw with an iPod at UPSE…”

Really, I still contest this claim... But I will not deny that I have one of the most extensive (as in MAJOR) music collections at UPSE, if not the entire university (hehehehehe, I buy my music from iTunes, thank you!). Before, I had time to listen to music depending on my mood. Now, I listen to music depending on the work :)

Some examples.

1. Reviewing a questionnaire, drawing (Yes, I draw first) the structure of the data set.

Recommendation: Philip Glass, Songs and Poems for Solo Cello

2. Generating, renaming

Recommendation: Newton Faulkner, Hands Built By Robot

3. More generating, more renaming :)

Recommendation: Incubus, Light Grenade

4. Egen, reshaping

Recommendation: ANY Kronos Quartet CD + Recommendation 1

5. Regressions, logits, mfx compute

RecommendationS: (this one takes a long time) Joni Mitchell, Bob Dylan, Alexi Murdoch, and

RAY LaMONTAGNE!

6. A certain data set called Female Sex Workers and MSM (just search for what this is)

Recommendations: Barang!

Starting Here, Starting Now: Some basic stuff + Gen(e)gen

My first encounter with Stata ten years ago (guys, I am carbon dating myself here) was a very colorful one, mostly in RED. So irritating.

I really did not know how to manipulate a data set, the right syntax, that Stata is case-sensitive, and most importantly, that I should ALWAYS log (though I still usually do not keep log files). My first impression was that Stata is so tiring to use.

All that was going for me was my knowledge of Excel. I pictured the variables in my mind using Excel. I visualized the structure of the data set in “Excel terms”. At one point, I constructed a data set in Excel and then ran the regression commands in Stata (a No! No!).

Then, seven years ago, in my first random experiment project, I instinctively picked up a few things. For instance, in the question…

“What if I am adding two variables, and say some of the observation has no values in one of the mentioned variables, should I….

1. Replace the missing values with zero, and then simply

gen variablename=variable1+variable2

….. or I do the following:

2. gen variablename=variable1+variable2

replace variablename=variable1 if variable2==.

replace variablename=variable2 if variable1==. “

My guiding rule then is NEVER, ever change the base variable. Always generate a new one. So I opted for option 2 above……

Now I know better (I think), I use egen!

Don’t want to explain, would rather show. (Hopefully, I would always be inspired to do sample files and do-files for you guys! Now I am inspired since I just saw a nice play).

So here is the link. Read the do-file first. Change the location in the do-file (research on your own why this has to be done :) ). And of course, you know the drill.
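For readers without the linked do-file, the egen version of the sum can be sketched like this, using the same placeholder names as in the question above. One difference from option 2 is worth flagging: rowtotal() treats missing values as zero, so a row where BOTH variables are missing gets 0 rather than missing unless you reset it yourself.

```stata
* Sum across variables, treating missing values as zero
egen variablename = rowtotal(variable1 variable2)

* Count how many of the two inputs are missing in each row
egen nmiss = rowmiss(variable1 variable2)

* Optional: keep the total missing when there was nothing to add
replace variablename = . if nmiss == 2
```

The base variables are never touched, which keeps the old guiding rule intact.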

PS: Hopefully, fewer REDs this time....

Come to my window...

So how do we start...Well first, an introduction...

Probably, just like some of you, I have experienced toiling till the wee hours of the morning on manipulating data files while drinking endless cups of coffee. Doing it while listening to some favorite music? Instead of coffee, drinking alcohol while working (this can be a very bad idea). Or your style might be working, while downloading "free" music and movies (I do not officially admit this).

Anyhow, at one point, WE TOILED! Getting to know the data, watching out for skips, complaining about too many errors..We toiled, using Excel, Stata, SAS. It gets to be too frustrating sometimes, but admit it or not....

YOU GET A KICK out of watching your Stata do-files run smoothly, like the "perfect" stream of characters in the MATRIX movie.

YOU CURSE whenever a "bump" is experienced along a do-file. But once the bump is fixed, you are A-OK again....ONLY TO REALIZE AFTERWARDS THAT YOU DID NOT SAVE THE DO-FILE OR MAKE A LOG FILE.

If you have experienced or are currently experiencing either of the two things above, good for you! But I think this blog is NOT for you since you should be busy looking at your Stata or Excel files right now :) . If not, you are probably not yet working hard enough!

This blog, STATA TOILER, is for people like me...those who like to:

1. Look at data sets.

2. Arrange data sets and complain.

3. Create do-files.

4. Discover new commands which hopefully will make things faster.

5. And have to hone their analytical and data manipulation skills to be continuously EMPLOYED!

But seriously, this blog is really FOR ME!

1. To gather thoughts from other stata toilers on how best to approach STATA problems and Excel programs;

2. To release tension when I need a break from all these do-files;

3. basically, to keep track of my files :)

This BLOG will include things related to data manipulation, either directly or indirectly. I will probably post things DIRECTLY related to Stata, stuff like do-files, Excel tips, commands, etc., AND things not really related to data manipulation, like...

1. How productivity in working increases after BUYING something impractical :)

2. What music works best when toiling away till the morning, or the next night.

3. The simple joys of having coffee while working

4. The importance of a PRIVATE SPACE

At times, this blog can be motherly :), encouraging, but there will be BITCHY turns ahead.

So. THERE!
