## Friday, March 29, 2013

### Lowess or Lois?

Not Lois, as in Lane, but Lowess (or Loess): the fitting of a smooth curve to a scatter plot of data. Lowess (LOcally WEighted Scatterplot Smoothing) "fits polynomials (usually linear or quadratic) to local subsets of the data, using weighted least squares so as to pay less attention to distant points" (Oxford Dictionary of Statistics, entry for loess). In other words, a separate linear regression is fitted for each observation $(x_i, y_i)$ using nearby observations, with each point weighted so that the farther its $x$ value lies from $x_i$, the less it counts (Statistical Modeling for Biomedical Researchers, Dupont). The strength of lowess is that it is a locally based smoother: it follows the data and can therefore reveal trends in it.

Most lowess smoothing involves just two variables, a y-variable and a single x-variable, but the method has been extended to multiple x-variables and can be run in Stata with the -mlowess- command. According to the help file, "mlowess computes lowess smooths of yvar on all predictors in xvarlist simultaneously; that is, each smooth is adjusted for the others." The authors of mlowess caution, however, that multiple-variable lowess smoothing should be reserved primarily for exploratory graphics, not inferential model fitting.
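To make the weighting idea concrete, here is a sketch of a single local linear fit, assuming the tricube weight function (the default in Stata's -lowess-, which the noweight option switches off). The symbols $h_i$, $w_{ij}$, $\beta_0$, and $\beta_1$ are notation introduced here for illustration, not taken from the sources quoted above; $h_i$ stands for the distance from $x_i$ to the edge of its local subset.

$$
w_{ij} =
\begin{cases}
\left(1 - \left|\dfrac{x_j - x_i}{h_i}\right|^3\right)^3 & \text{if } |x_j - x_i| < h_i \\
0 & \text{otherwise}
\end{cases}
$$

$$
(\hat{\beta}_{0i}, \hat{\beta}_{1i}) = \arg\min_{\beta_0,\, \beta_1} \sum_j w_{ij}\,(y_j - \beta_0 - \beta_1 x_j)^2,
\qquad
\hat{y}_i = \hat{\beta}_{0i} + \hat{\beta}_{1i}\, x_i
$$

The smoothed value at $x_i$ is the fitted value $\hat{y}_i$, and repeating the fit at every $x_i$ traces out the curve. Points close to $x_i$ receive weights near one, while points near the edge of the local subset receive weights near zero.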

Consider the Stata system data set, auto.dta, and the variables mpg, price, weight, length, and gear ratio (gear_ratio). In the bivariate case between mpg and price, there is some hint of a slight quadratic relationship, but this could be due to a couple of outlying observations. Including the ordinary least squares regression line illustrates the mild departure from the linear model for the cheapest and most expensive vehicles.
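One informal way to see how much of that apparent curvature depends on the amount of smoothing is to redraw the smooth at more than one bandwidth. The sketch below is an illustration rather than part of the original analysis: the variable names mpg_bw04 and mpg_bw08 and the choice of 0.4 as the narrower bandwidth are assumptions made here; 0.8 is Stata's default for -lowess-.

```stata
sysuse auto, clear

// store the smoothed values without drawing the default graph
// bwidth(0.8) is the lowess default; 0.4 is an arbitrary narrower span
lowess mpg price, bwidth(0.4) generate(mpg_bw04) nograph
lowess mpg price, bwidth(0.8) generate(mpg_bw08) nograph

// overlay both smooths on the scatter plot; sort orders the line plots by price
twoway (scatter mpg price, msymbol(o))   ///
       (line mpg_bw04 price, sort)       ///
       (line mpg_bw08 price, sort),      ///
       legend(order(2 "bwidth(0.4)" 3 "bwidth(0.8)") row(1)) ///
       scheme(sj) title("Lowess Smooths at Two Bandwidths")
```

A narrower bandwidth lets the curve chase individual points more closely, so bends that appear only at the smaller value deserve more skepticism than bends that persist at both.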

In the multiple-variable case (controlling for weight, length, and gear ratio), the mild non-linear curve between mpg and price is dampened, although the inexpensive vehicle with very high gas mileage appears to be largely responsible for the departure from linearity on the left side of the graph. The remaining three graphs (mpg versus weight, length, and gear ratio, respectively) all reveal mild non-linear associations even after controlling for the other variables.

The Stata code used to generate the graphs above follows:

capture log close
log using lowess-01, replace
// timestamp the log using Stata's built-in c() values
display c(current_date) " " c(current_time)

// program:  lowess-01.do
// project:  n/a
// author:    cjt
// born on date:  20130329

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
sysuse auto

// #2
// bivariate lowess smoother between mpg (yvar) and price w/ fitted values line
lowess mpg price, addplot(lfit mpg price) xtick(2000(2000)18000) xlabel(2000(2000)18000) ///
scheme(sj) legend(row(1)) title("Lowess Smoother and Fitted Regression Line")

// save graph
gr save mpgXprice-01, replace
// export graph for inclusion in blog
gr export mpgXprice-01.png, replace

// #3
// multiple lowess smoother
// mlowess is user-written; install it once with:  ssc install mlowess
mlowess mpg price weight length gear_ratio, scatter(msymbol(o)) scheme(sj)

// export graph
gr export mpgXall-01.png, replace

log close
exit

Lowess smoothing, although not the most rigorous or sophisticated statistical technique, is great for exploratory analysis and can help reveal relationships between variables that might otherwise have gone unnoticed.