Friday, March 29, 2013

Lowess or Lois?

Not Lois --- as in Lane --- but Lowess (or Loess):  the fitting of a smooth curve to a scatter plot of data.  Lowess (LOcally WEighted Scatter plot Smoothing) "fits polynomials (usually linear or quadratic) to local subsets of the data, using weighted least squares so as to pay less attention to distant points" (Oxford Dictionary of Statistics, entry for loess).  In other words, for each observation $(x_i, y_i)$, a separate linear regression is fitted to the nearby observations, with each point weighted so that the farther its $x$ is from $x_i$, the less it counts (Statistical Modeling for Biomedical Researchers, Dupont).  The strength of lowess smoothing is that it is local:  the curve follows the data rather than a single global functional form, so it can reveal trends that one straight line would hide.

Most lowess smoothing involves just two variables, a y-variable and a single x-variable, but the methodology has been extended to multiple x-variables and can be run in Stata with the -mlowess- command.  According to the help file, "mlowess computes lowess smooths of yvar on all predictors in xvarlist simultaneously; that is, each smooth is adjusted for the others."  The authors of mlowess, however, caution that multiple-variable lowess smoothing should be reserved primarily for exploratory graphics, not inferential model fitting.
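To make the "separate regression for each point" idea concrete, here is a rough sketch of a single local fit done by hand.  It is a simplified illustration, not a replica of what -lowess- does internally:  the target price (x0), the neighbourhood rule (the nearest 80% of observations), and the variable names dist and w are all my own assumptions, and Stata picks its neighbourhoods a bit differently.  The tricube weight, however, is the weighting function that -lowess- uses by default.

```stata
// Illustration only: one local weighted regression at a single target point.
// x0, bw, dist and w are hypothetical choices made for this sketch.
sysuse auto, clear

local x0 = 6000          // hypothetical price at which to smooth
local bw = 0.8           // share of observations treated as "nearby"

// distance of every car's price from the target price
generate double dist = abs(price - `x0')
sort dist

// the k-th smallest distance sets the width of the neighbourhood
local k = floor(`bw' * _N)
local h = dist[`k']

// tricube weights: close to 1 near x0, falling to 0 at distance h
generate double w = cond(dist < `h', (1 - (dist/`h')^3)^3, 0)

// weighted least squares using only the neighbourhood
regress mpg price [aweight = w] if w > 0

// smoothed value of mpg at the target price
display "local fit at price = `x0': " _b[_cons] + _b[price]*`x0'
```

Repeating this for every observed price and connecting the fitted values gives, roughly, the kind of curve that -lowess- draws.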

Consider the Stata system data set, auto.dta, and the variables mpg, price, weight, length, and gear ratio.  In the bivariate case of mpg against price, there is a hint of a slight quadratic relationship, but this could be due to a couple of outlying observations.  The ordinary least squares regression line is included for comparison and illustrates the mild departure from the linear model for the cheapest and most expensive vehicles.

In the multiple-variable case (controlling for weight, length, and gear ratio), the mild non-linear curve between mpg and price is dampened, although the inexpensive vehicle with very high gas mileage appears to be largely responsible for the departure from linearity on the left side of the graph.  The remaining three graphs --- mpg versus weight, length, and gear ratio, respectively --- all reveal mild non-linear associations even after controlling for the other variables.

The Stata code used to generate the graphs above follows:

capture log close
log using lowess-01, replace

// program:
// task:  lowess demo
// project:  n/a
// author:    cjt
// born on date:  20130329

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
// read in auto
sysuse auto

// #2
// bivariate lowess smoother between mpg (yvar) and price w/ fitted values line
lowess mpg price, addplot(lfit mpg price) xtick(2000(2000)18000) xlabel(2000(2000)18000) ///
scheme(sj) legend(row(1)) title("Lowess Smoother and Fitted Regression Line")

* **save graph
gr save mpgXprice-01, replace
* **export graph for inclusion into blog
gr export mpgXprice-01.png, replace

// #3
// multiple lowess smoother
// (mlowess is user-written; if needed, install it first with:  ssc install mlowess)
mlowess mpg price weight length gear_ratio, scatter(msymbol(o)) scheme(sj)

* **export graph
gr export mpgXall-01.png, replace

log close

Lowess smoothing, although not the most rigorous or sophisticated statistical technique, is great for exploratory analysis and can help reveal relationships between variables that might otherwise go unnoticed.
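For exploratory work it can also help to see how much the curve depends on the bandwidth.  The sketch below overlays two smooths of mpg on price:  one with a narrower bandwidth of 0.4 (an arbitrary choice of mine) and one with Stata's default of 0.8.  The variable names mpg_bw04 and mpg_bw08 are made up for the illustration.

```stata
// Illustration only: compare two lowess bandwidths on the same scatter plot.
sysuse auto, clear

// narrower bandwidth: a wigglier curve that tracks local features
lowess mpg price, bwidth(0.4) generate(mpg_bw04) nograph

// Stata's default bandwidth
lowess mpg price, bwidth(0.8) generate(mpg_bw08) nograph

// overlay both smooths on the raw data
twoway (scatter mpg price, msymbol(oh)) ///
       (line mpg_bw04 price, sort)      ///
       (line mpg_bw08 price, sort),     ///
       legend(order(2 "bwidth(0.4)" 3 "bwidth(0.8)") row(1)) ///
       title("Lowess with two bandwidths")
```

If the two curves tell the same story, the pattern probably isn't an artifact of the smoothing choice.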

Friday, March 22, 2013

Stata and $\LaTeX$: Descriptive Statistics, Part 2

When I wrote a previous post about outputting descriptive statistics from Stata to $\LaTeX$, I didn't expect to follow it with a sequel, but as I gained more experience outputting statistics, I realized I'd need something more flexible and more powerful.  Enter -estpost-.  I didn't realize just how flexible, powerful, and relatively easy to use -estpost- is until I started playing around with it.  The author, Ben Jann, ought to receive a place in the Stata user-programmer hall of fame for this contribution alone.  Anyway, outputting descriptive statistics can be accomplished by prefixing either -summarize- or -tabstat- with -estpost-.  I prefer -tabstat-.  Once the statistics are generated, -esttab- writes them, along with the accompanying table code, to a $\LaTeX$ file.  It's pretty simple really.  Below is my Stata code using the ever-pervasive system dataset, auto.dta.

// #1
// read in auto data
// (estpost and esttab come with the user-written estout package; if needed:  ssc install estout)
sysuse auto

// #2
// univariate summary statistics
estpost tabstat price mpg trunk weight length turn displacement gear_ratio, ///
statistics(N min max p50 mean sd) columns(statistics)
* latex...
esttab . using stats.tex, replace cells("count min max p50 mean(fmt(a3)) sd(fmt(a3))") ///
title("Univariate Summary Statistics for auto.dta") label nomtitles noobs width(\hsize)

// #3
// stratified by foreign
estpost tabstat price mpg trunk weight length turn displacement gear_ratio, ///
by(foreign) statistics(N min max p50 mean sd) columns(statistics) nototal
* latex...
esttab . using stats.tex, append cells("count min max p50 mean(fmt(a3)) sd(fmt(a3))") ///
title("Summary Statistics for auto.dta stratified by foreign") label nomtitles noobs width(\hsize)

The -esttab- commands write a $\LaTeX$ file (I called it stats.tex above), which I then open in my editor (WinEdt), declare the document class, add the necessary packages to my preamble, and compile.  The output is pasted below.  The more I use -estpost-, the more I love it.
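For comparison, here is a minimal sketch of the -summarize- route mentioned earlier.  The shorter variable list and the file name stats-summarize.tex are just choices for the illustration, and it assumes the estout package is already installed.

```stata
// Illustration only: the same idea with -estpost summarize- instead of -tabstat-.
sysuse auto, clear

// post N, mean, sd, min and max for a few variables
estpost summarize price mpg weight length

// write the posted statistics to a LaTeX table (hypothetical file name)
esttab . using stats-summarize.tex, replace ///
    cells("count mean(fmt(a3)) sd(fmt(a3)) min max") ///
    label nomtitles nonumbers noobs
```

Plain -summarize- doesn't give a median unless the detail option is added, which is one reason -tabstat-, where I can list exactly the statistics I want, remains my preference.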

Monday, March 11, 2013

How Not to Fail a PhD: Advice Better Late Than Never

I came across a blog by an Australian academic while searching for information about how to build a poster presentation using $\LaTeX$, and while browsing his blog, I got distracted by one of his more popular posts, "How to fail a PhD".  I certainly don't want to fail out of my program and, being the semi-paranoid student I am, I read it.  The blogger, Dr. Rob J. Hyndman, created his list after reading a similar list by another blogger, Matt Might, and both are spot on.  Dr. Hyndman ranks "Wait for your supervisor to tell you what to do" as number one and, based on my experience, I couldn't agree more.  I took some initiative in the early stages of my research but it wasn't enough.  It took a while before I realized my adviser wasn't going to hand me a tidy, ready-to-be-answered research question, and when I finally did realize it, I started making some progress.  Another point of failure Dr. Hyndman identifies is "Aim too high".  I already know I'm prone to perfectionism, so it's a constant battle with myself to just plow forward irrespective of whether I consider something "perfect".  My adviser emphasized early on that the Ph.D. isn't the crowning masterpiece, especially if one remains in academia, so don't treat it as such.  The Ph.D. is supposed to demonstrate research ability.

A few of the "don'ts" on Dr. Might's list also resonated with me.  The first, "Focus on grades or coursework", was problematic for me because I viewed the Ph.D. program linearly:  two years of coursework, comprehensive exam, dissertation proposal, dissertation analysis, write-up, then defense.  I'm in the write-up stage now, but I think that if I'd adopted a more holistic view from the beginning, I might have identified a topic and research question earlier.  It also doesn't help matters that I switched from biostatistics to epidemiology two-and-a-half years into the program, setting me back a year.  Oh well.  Another one is "Treat Ph.D. school like school or work":  according to Dr. Might, the Ph.D. is all-consuming, and those who don't pony up the requisite devotion and obedience take 7+ years to finish or wind up ABD.  Based on how long I've been at this (six years), I wonder what Dr. Might would say about my devotion.  The last one, "Miss the real milestones", is a real risk in my program since we are given two options for the Ph.D.:  a "European" style version comprising three publication-quality papers or the "traditional" version comprising the standard five chapters.  I opted for the "traditional" version, so the requirement to publish --- even just one paper --- is absent and, based on Dr. Might's criteria, I'd hardly be Ph.D.-worthy.  (I should note, however, that even though publication isn't a requirement for graduation with the five-chapter dissertation, we are strongly encouraged to write a manuscript from our dissertation and try to get it published.)

I wish I'd come upon this advice a few years ago but, as the saying goes, better late than never.