Stata Toiler

Friday, September 10, 2010

Are you the/a “marginal” ________ ? (PART 1)

In microeconomic theory, the concept of the marginal is unavoidable. In fact, it is the cornerstone of “modern” microeconomic theory (well, it depends on how you cast your individuals and players…).

In the simplest terms, for me anyway, the “marginal” is the one that tilts the status quo. It is a dividing line: it makes you pursue a specific task, it tells you if something is a “go” or not, it tells you to STOP!

Are you the “marginal” player? The “marginal” consultant….. the one that makes the difference. The relevant one. The cat in Mrs. Lovett's pie?

In running regressions, we are oftentimes interested in the marginal effects of specific explanatory variables. The explanatory variable can be anything, from a continuous variable (say income) to a switch variable.

What is the effect of a 100-unit increase in income on Y?

What is the effect of introducing a “pill” or a policy on utilization?

In the standard linear regression model, the estimated betas are usually the “marginal” effects (I'll show examples where they are not in another blog entry). If you are running a logit or probit model, however, the betas are not themselves the marginal effects, and you are oftentimes interested not only in the direction but in the magnitude as well (say, the effect of introducing a policy on the probability of pursuing a certain action).

In Stata 10, some of the commands are

mfx compute --> for logit models

mfx, predict(pu0) --> for fixed or random effects logit models (xtlogit)

(there are variations of mfx depending on the model you are running, say an ologit or mlogit).

These two commands give you the effect of the explanatory variables on the predicted probability. HOWEVER, note that Stata evaluates everything at the MEAN values of your explanatory variables. Say x1 is a dummy variable for “male” and 20% of your regression sample is male: the mfx will be computed at x1=0.2 (you have to make basic manual computations to get the predicted probability for an actual male or female).

If you are lazy and do not want to bother with computations to get the predicted probability, you can use…

mfx, predict(p) at(male=0) --> this evaluates everything at male=0 instead of at the sample mean; use at(male=1) if you want the probability when the respondent is male.
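To see the difference between evaluating at the means and at a fixed value of a dummy, here is a minimal sketch. It assumes the nlsw88 example data set that ships with Stata, and the model itself (union on age, wage, and married) is made up purely for illustration; it is not the regression discussed in this post.

```stata
* Example data shipped with Stata; the model is illustrative only
sysuse nlsw88, clear

* Logit of union membership; married is a 0/1 dummy
logit union age wage married

* Marginal effects evaluated at the sample means of age, wage, and married
mfx compute

* Same, but with married fixed at 0 and the other variables at their means
mfx compute, at(married=0)
```

Compare the predicted probability reported at the top of each mfx table: the first is for a "person" with married equal to its sample mean, the second for someone who is not married.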

A complication exists if one of the explanatory variables is an interaction of two other variables, or if you are dealing with squared explanatory variables. Obviously, the “mfx compute” command WILL NOT give you the correct marginal effects (say of age if there is age^2)… for the simple reason that Stata WILL NOT BE ABLE to recognize a variable called age_square as a transformation of another variable. Unless explicitly told otherwise, Stata will simply treat age_square as an additional variable. (MORE ON THIS… LATER)

PS: for those using Stata 11, there is now a faster command, margins. Click here to learn more.
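As a rough illustration of why margins helps with the squared-term problem, here is a sketch that assumes Stata 11 or later and again uses the nlsw88 example data with a made-up model. The factor-variable notation c.age##c.age is what tells Stata that the squared term belongs to age.

```stata
* Example data shipped with Stata; the model is illustrative only
sysuse nlsw88, clear

* c.age##c.age enters age and age squared and links the two terms
* i.married marks married as a categorical (0/1) variable
logit union c.age##c.age wage i.married

* Marginal effects at the means, the same evaluation point mfx uses
margins, dydx(age married) atmeans

* Predicted probabilities for married=0 and married=1, others at their means
margins married, atmeans
```

Because the squared term is declared through the factor-variable notation rather than generated by hand, the reported effect of age accounts for both the linear and the squared term.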

Thursday, September 2, 2010

OUT OF (re)SHAPE: long to wide

I am NOT a fan of the reshape command. If there are other ways, I really avoid it like the plague.

Say you have a data set with members as the unit of observation. Each observation has a tag, identifying its household and its “count” in the household.

Say the problem is transforming the data set into one where the unit of observation is the household.

If I am interested in only a few household characteristics, I WOULD RATHER NOT USE the reshape command.

I would rather use the egen command with the following technique:

sort hhid

by hhid: egen aveincome=mean(income)

by hhid: gen count=sum(1)

drop if count~=1

keep hhid aveincome

DONE! NO MORE FUSS!

The problem with reshape (and this is just long to wide, mind you): you end up with as many income variables as there are members in the largest family. And you will be forced to include in the reshape command variables you are not actually interested in (since these vars may not be constant within the ids).

But oh, all right, here is a sample of the darn RESHAPE command.
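The original sample is not reproduced in this text, so what follows is only a minimal sketch with made-up data. The variable names hhid and income follow the post; member (the “count” within the household) and all the values are assumptions.

```stata
* Made-up member-level data: one row per household member
clear
input hhid member income
1 1 100
1 2 200
2 1 150
2 2 50
2 3 75
end

* Long to wide: one row per household, one income variable per member
reshape wide income, i(hhid) j(member)

list
```

Household 1 has only two members, so its income3 comes out missing. That is exactly the clutter complained about above.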

Wednesday, September 1, 2010

Let Me Count the Ways, 1 2 3

Somebody asked me this morning if it is possible to generate a variable in an EXISTING Stata data set containing counts, from 1 to n.

Yes, it's possible. The command is:

gen varname=sum(1)

You can also have a variable containing the series 2, 4, 6, …, 2n. The command is

gen varname=sum(2)

Suppose you do not want to start with 1 (say you want to start with 2); then

gen varname=sum(1) + 1

Suppose you have a roster data set with family members as observations and with family ids tagging the members of each family, and say you want to order the members, with count=1 for the youngest… Try running the commands

sort family_id age

by family_id: gen varname=sum(1)

Try running the gen ____=sum(1) command in a blank data set and see what you get. The answer isn't surprising :)
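For comparison, the same counters can be built with Stata's built-in observation counters _n and _N. This is just an alternative sketch on a made-up roster: family_id and age follow the post, while the values and the new variable names are assumptions.

```stata
* Toy roster; the values are made up
clear
input family_id age
1 40
1 12
2 35
2 8
2 3
end

* Running count over the whole data set, same result as gen varname=sum(1)
gen obs_no = _n

* Order members within each family, youngest first
bysort family_id (age): gen member_no = _n

* Family size, repeated on every member's row
by family_id: gen family_size = _N

list, sepby(family_id)
```

bysort with age in parentheses sorts by family_id and age in one step, so the separate sort line is not needed.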

YOUR FRIENDLY AVON GIRL: AN RDS TECHNIQUE

A few months ago, I handled data sets on intravenous drug users and men having sex with men (MSMs).

The sampling design used for data collection was RDS, or respondent-driven sampling. Basically, for each site, some “seeds” or respondents were recruited to participate in the survey. Then, these primary seeds were asked to recruit additional respondents. And so on… (think of the method used by AVON to sell cosmetics)…

The IDs of the respondents reflect their “position” in the entire recruitment process.

For example, given the following IDs

Id=1 > person is a first seed

Id=2 > person is a first seed

Id=12 > person is a seed recruited by 1

Id=21 > person is a seed recruited by 2

Id=123 > person is a seed recruited by person 12 (who was recruited by id 1)

I needed to perform an xtreg with grouping based on the seeds. I then had to cut the IDs at, say, the second, third, or fourth level, and then do the labeling based on the new IDs generated. To do this, I used the substr() function.

Refer to statadaily.wordpress.com for the command. Click here. Thanks Mitch!!!
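The linked command is not reproduced here, so the following is only a sketch of the idea. It assumes the ID is stored as a string; the IDs are the ones listed above, and the new variable names are made up.

```stata
* The example IDs from above, stored as strings
clear
input str4 id
"1"
"2"
"12"
"21"
"123"
end

* First character = the first seed that started the chain
gen seed = substr(id, 1, 1)

* First two characters = the chain cut at the second level
gen level2 = substr(id, 1, 2)

* Numeric group id, usable as the grouping variable in xtreg
egen seed_group = group(seed)

list
```

If the ID is stored as a number instead, it has to be converted to a string first (for example with tostring) before substr() can cut it.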

Tuesday, August 31, 2010

DROP DOWN menus in Excel

A really swell thing that I have been doing is making my own encoding file in Excel.

To avoid Major Major data cleaning after encoding (say, recoding NCR and Metro Manila as NCR), I simply use a drop-down menu in Excel.

Suppose I have two variables, Region and Province. Making a drop-down menu for Region is easy.

However, for Province, can the menu “change” depending on the answer on Region? Say, if the Region is A, the provinces in the menu will be C, D, E, and if the Region is B, the provinces will be X, Y, Z? Yes!

Click here for a video demo on drop down menus.

Monday, August 30, 2010

Respondent-Driven Sampling (RDS)

How do you do analysis or quantitative studies on hidden populations, that is, populations for which a sampling frame cannot be constructed? Say, Intravenous Drug Users (IDUs) or Men Having Sex with Men (yes Martha, in most countries, MSMs are still hidden!).

You can do RESPONDENT-DRIVEN SAMPLING (RDS).

And then... with the data from an RDS study, how do you analyze them? What are the complications of using data from RDS studies?

Click here to learn more!

COLLAPSE to avoid fatigue

A few months ago, an officemate ran the following regressions:

Y1 = a + bx1 + cx2 + dx3 + …….. + nxn

Y2 = a + bx1 + cx2 + dx3 + …….. + nxn

.....

Y5 = a + bx1 + cx2 + dx3 + …….. + nxn

There were around 5 dependent variables and 7 explanatory variables (which included age and income). The data set also covered observations coming from more than 10 cities. Within a city, there were two types of observation, say registered and free-lance (oh, that is a dead giveaway of what the topic is :) ).

Then I was asked to do the following (and to be submitted within 15 hours!)

1. Mean values of Y1, Y2, and Y3 by explanatory variables, meaning…..

-Say one of the explanatory variables is sex. Means of Y1, Y2, Y3 for male and for female. I remember the regression models had categorical variables as well, so Means of Y1, Y2, Y3 for ALL the categories!

2. Oh, and note that since there are continuous explanatory variables, age and income….

The means have to be computed for each age and income quintile group, meaning…..

Means of Y1, Y2,Y3 for quintile1 of age, quintile2, quintile3….and the same style for income and ALL THE OTHER EXPLANATORY VARIABLES

3. Oh, and the quintiles of the continuous explanatory variables SHOULD BE REFLECTIVE of location distributions and not the entire data set. Meaning, the quintiles should be generated PER location.

4. And one last note, the table should be PER LOCATION and PER TYPE OF OBSERVATION (registered versus free-lance).

So MANY demands! Girl, the whole Lord of the Rings trilogy would be over, special features included, and I would probably still not be done. And it was needed the very next day!

I can do the basic table commands: tabstat, tab, sum. But come on, the conditionalities alone (setting the “if”) are already prone to mistakes… So...

I performed a COLLAPSE command!

Click here for a sample. Enjoy.
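The linked sample is not reproduced here. As a rough sketch of the approach, assume the nlsw88 example data shipped with Stata, with stand-in variables: south plays the role of location, collgrad the type of observation, and wage, hours, and tenure the Ys. None of these are the actual variables from the office data set.

```stata
* Example data shipped with Stata; all variables are stand-ins
sysuse nlsw88, clear

* Age quintiles computed separately within each location
gen agequint = .
levelsof south, local(locations)
foreach loc of local locations {
    xtile q = age if south == `loc', nq(5)
    replace agequint = q if south == `loc'
    drop q
}

* One row per location x type x age quintile, holding the means of the Ys
collapse (mean) wage hours tenure, by(south collgrad agequint)

list, sepby(south collgrad)
```

The same collapse line, with the by() list changed, would give the tables for income quintiles or for any categorical explanatory variable, with no “if” conditions to get wrong.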