Saturday, December 28, 2013

"Keep The Change" Tipping Guide

The author behind Waiter Rant --- Steve Dublanica --- followed up his successful first book with a book about tipping, Keep The Change:  A Clueless Tipper's Quest to Become the Guru of the Gratuity.  Although the book is an entertaining and informative treatise on tipping, I thought it would have benefited from a matrix presenting who should be tipped and how much.  Dublanica includes two appendices listing (a) who gets tipped what during the holidays and (b) who gets tipped at a wedding, but he didn't collate all the tipped professions in a single place for easy reference.  Since I, too, have suffered from tip anxiety on more than one occasion, I decided to create a matrix listing all the tipped professions Dublanica discusses and the expected tip amounts.  Where appropriate, I also included any relevant comments (mostly his, but occasionally mine).

It should be emphasized that the expected tips are amounts that those in the industry regard as appropriate (and expected).  Clearly there is a difference between what is expected and what is received.   The amounts below reflect what is expected. 

Friday, December 27, 2013

Type I and Type II Errors: Lay Explanation

I'm reading a book titled Merchants of Doubt:  How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming by Naomi Oreskes and Erik M. Conway and I thought their explanation of Type I and Type II errors was particularly clear.  Although somewhat long, what they wrote is re-presented below (pp. 156-7):
The 95 percent confidence level is a social convention, a value judgement.  And the value it reflects is one that says that the worst mistake a scientist can make is to fool herself:  to think an effect is real when it is not.  Statisticians call this a type 1 error.  You can think of it as being gullible, naive, or having undue faith in your own ideas.  To avoid it, scientists place the burden of proof on the person claiming a cause and effect.  But there's another kind of error -- type 2 -- where you miss effects that are really there.  You can think of that as being excessively skeptical or overly cautious.  Conventional statistics is set up to be skeptical and avoid type 1 errors.  The 95 percent confidence standard means that there is only 1 chance in 20 that you believe something that isn't true.  That is a very high bar.  It reflects a scientific worldview in which skepticism is a virtue, credulity is not.  As one web site puts it, "A type 1 error is often considered to be more serious, and therefore more important to avoid, than a type 2 error."  In fact, some statisticians claim that type 2 errors aren't really errors at all, just missed opportunities.  
Is a type 1 error more serious than a type 2?  Maybe yes, maybe no.  It depends on your point of view.  The fear of type 1 errors asks us to play dumb.  That makes sense when we really don't know what's going on in the world -- as in the early stages of a scientific investigation.  This preference also makes sense in a court of law, where we presume innocence to protect citizens from oppressive governments and overzealous prosecutors.  However, when applied to evaluating environmental hazards, the fear of gullibility can make us excessively skeptical and insufficiently cautious.  It places the burden of proof on the victim -- rather than, for example, the manufacturer of a harmful product -- and we may fail to protect some people who are really getting hurt.
There are many, many statistical texts that provide mathematical and symbolic definitions of Type I and Type II errors but when a lay book nicely articulates the definitions, it is worth noting (and remembering).
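
To make the distinction concrete, here's a quick simulation sketch (mine, in Python, using only the standard library --- not from the book): if there truly is no effect, a two-sided test run at the 95 percent confidence level should raise a false alarm --- a Type I error --- in roughly 1 out of 20 studies.

```python
import random

# Simulate repeated studies of a truly null effect and count how often a
# two-sided test at the 95 percent confidence level "detects" an effect
# anyway (a Type I error).  Illustrative sketch only.

def simulate_type1(n_studies=2000, n=50, seed=1):
    random.seed(seed)
    false_alarms = 0
    for _ in range(n_studies):
        # draw a sample from a null population (mean 0, sd 1)
        xs = [random.gauss(0, 1) for _ in range(n)]
        mean = sum(xs) / n
        sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
        z = mean / (sd / n ** 0.5)
        # |z| > 1.96 is the usual two-sided 95 percent criterion
        if abs(z) > 1.96:
            false_alarms += 1
    return false_alarms / n_studies

rate = simulate_type1()   # hovers near 0.05, i.e., about 1 chance in 20
```

A Type II error would be the mirror image: draw the samples from a population where the effect is real and count how often the test fails to flag it.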

Wednesday, November 27, 2013

Create Fake Data: SAS vs. Stata

There are a lot of resources in both SAS and Stata for accessing fabricated (or publicly available) data shipped with the software program (e.g. -sysuse- in Stata) but it isn't immediately obvious how to create fake data from scratch.  I'm not sure if this is because doing so is largely unnecessary due to the availability of _actual_ data but I figured it would be useful to know how to create a fictional dataset on the fly if, for instance, I wanted a break from using the program datasets or if none of them were suitable for my needs.  

In SAS, a single DATA step can generate several variables then output them to a SAS dataset.  A cursory Google search turned up a paper from a SAS Users Group Meeting by Andrew J. L. Cary discussing creation of data in SAS that was quite informative.  I adapted some of his code and eventually coded the block below:

data fiction (drop=seed);
  *set seed so results can be replicated;
  seed = 20131126;
  array sitelist [5] $10 _temporary_
    ('WashDC', 'Princeton', 'Chicago', 'Cambridge', 'Oakland');
  do id = 1 to 150;
    site = sitelist[rantbl(seed,.3,.4,.1,.1)];
    gender = int(ranuni(seed)+0.5);
    weight = 175 + rannor(seed)*30;
    output;
  end;
run;

In Stata, a paper by Maarten Buis proved quite helpful in guiding my coding.  Unlike in SAS, where everything can be accomplished in a single discrete DATA step, Stata requires several distinct steps that begin with, most importantly, setting the number of observations in the _fake_ dataset.  Each of the variables is then created with a series of -generate- statements.

In both SAS and Stata, I set the seed so that within each program the results (e.g. frequencies, summary statistics) could be replicated if run at a later time.

// Task #1
* **create observations;
set obs 150

// Task #2
* **create variables via series of -gen- commands;
set seed 20131126

* **using uniform distribution for random draws;
gen rand = uniform()

* **site-region (8 site-regions);
gen siteregion = cond(rand < .15, 1, ///
                 cond(rand < .30, 2, ///
                 cond(rand < .40, 3, ///
                 cond(rand < .55, 4, ///
                 cond(rand < .75, 5, ///
                 cond(rand < .90, 6, ///
                 cond(rand < .95, 7, 8)))))))

* **site (5 sites);
gen site = cond(rand < .3, 1, ///
           cond(rand < .7, 2, ///
           cond(rand < .8, 3, ///
           cond(rand < .9, 4, 5))))

* **gender (2 genders);
gen gender = rand < 0.5

* **weight (continuous:  mean 175 and sd 30);
gen weight = rnormal(175,30)

// Task #3
* **assign value labels;
label define sitereg 1 "Pacific" 2 "Mountain" 3 "Mid-West" 4 "South" 5 "Mid-Atlantic" ///
  6 "NorthEast" 7 "North" 8 "West"
label values siteregion sitereg

label define site 1 "WashDC" 2 "Princeton" 3 "Chicago" 4 "Cambridge" 5 "Oakland"
label values site site
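
For comparison only (my sketch, not part of either program's workflow), the same fictional dataset can be generated in plain Python; the variable names, probabilities, and distributions mirror the SAS and Stata code above:

```python
import random

# Fake dataset sketch mirroring the SAS/Stata examples above:
# 150 ids, a site drawn with unequal probabilities, a 0/1 gender,
# and a normally distributed weight (mean 175, sd 30).
random.seed(20131126)  # fixed seed so results replicate across runs

sites = ['WashDC', 'Princeton', 'Chicago', 'Cambridge', 'Oakland']
site_probs = [0.3, 0.4, 0.1, 0.1, 0.1]

rows = []
for i in range(1, 151):
    rows.append({
        'id': i,
        'site': random.choices(sites, weights=site_probs)[0],
        'gender': random.randint(0, 1),
        'weight': random.gauss(175, 30),
    })
```

As in the SAS and Stata versions, fixing the seed up front is what makes frequencies and summary statistics reproducible on a later run.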

Tuesday, November 12, 2013

Counting and Listing Duplicates

Identifying and listing duplicates is such a crucial data management task that you'd think there would be resources all over the web showing how to do it but, strangely, I didn't find that to necessarily be the case.   For a task I'm currently working on, I had to create a counter variable that counted the number of records with duplicate values for two ID variables and as an afterthought, I also wanted to list the duplicates in their entirety.  That is, I didn't want just the duplicate records output, I also wanted the "parent" record such that in the listing you could see the parent (counter = 1) as well as the duplicates (counter = 2...k).  Creating the counter variable was straightforward enough (the ATS UCLA stat resource has a good tutorial) but listing of the duplicates wasn't as obvious.  For that task I relied on Ron Cody's "Cody's Data Cleaning Techniques Using SAS Software".  Here is some generic code:

* **create counter;
proc sort data=dsin; by id1 id2; run;

data dsin;
  set dsin;
  by id1 id2;
  if first.id1 or first.id2 then counter = 0;
  counter + 1;
run;

* **write duplicates to dataset then print;
proc sort data=dsin; by id1 id2; run;

data dsin_dups;
  set dsin;
  by id1 id2;
  if first.id2 AND last.id2 then delete;
run;

proc print data=dsin_dups noobs;
  var counter id1 id2 tx_description;
  title "Duplicate Records";
run;
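
The same parent-plus-duplicates logic can be expressed outside of SAS.  Here is a pure-Python sketch of the idea (mine, with hypothetical sample records --- not from Cody's book): number the records within each (id1, id2) group, then keep only the groups containing more than one record.

```python
from collections import defaultdict

# Hypothetical sample records; in the SAS code above these would come
# from the dsin dataset.
records = [
    {'id1': 1, 'id2': 'A', 'tx_description': 'aspirin'},
    {'id1': 2, 'id2': 'B', 'tx_description': 'ibuprofen'},
    {'id1': 1, 'id2': 'A', 'tx_description': 'aspirin refill'},
]

# Sort by the two ID variables (the analogue of PROC SORT).
records.sort(key=lambda r: (r['id1'], str(r['id2'])))

# Number the records within each (id1, id2) group (the counter variable).
counts = defaultdict(int)
for rec in records:
    key = (rec['id1'], rec['id2'])
    counts[key] += 1
    rec['counter'] = counts[key]

# Keep only groups with more than one record: the "parent" (counter 1)
# plus its duplicates (counter 2..k).
dups = [r for r in records if counts[(r['id1'], r['id2'])] > 1]
```

The listing thus shows the parent record alongside its duplicates, rather than the duplicates alone.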


Friday, November 1, 2013

F5, F6, and F7 in SAS

I've started working in SAS much more now that I'm reporting to an office job every day.  (Not that full-time work on a dissertation isn't a job, but I digress...)  Anyway, I don't like to switch from the keyboard to the mouse to click on a different window while working in SAS --- I usually Ctrl+Tab between the windows but if you have multiple programming (Editor) windows open or you don't want to tab through the Explorer or Results windows then this can get cumbersome.  A quick Google search of SAS shortcut keys took me to a SAS help page where I learned these gems:

F5:  Directly tabs to the programming (Editor) window(s)
F6:  Directly tabs to the Log window
F7:  Directly tabs to the Output window.

I suppose moving the hands from the keyboard to the mouse isn't that much work but if you're doing it dozens, hundreds even, of times a day, the effort (and annoyance) can add up.

Monday, July 15, 2013

Pitztal Gletscher 95k: DNF

The start of ~3,300 feet of climbing over 3 miles
Brutal.  The Pitztal Gletscher 95k is definitely the hardest ultra I've ever run.  The day started out promising enough --- beautiful weather, uneventful check-in, and glorious scenery --- but would unceremoniously end approximately 14 hours and 57 kilometers (~35 miles) later when I failed to make the first time cut-off.  I would have expected my response to be one of disappointment, perhaps even anger, but when the aid station volunteer announced in broken English, "Your race is finished", I was relieved.  

The original course was supposed to traverse the Pitztal Glacier but the day before the race the organizers re-routed the course due to safety concerns over too much snow.  The original race featured ~7,100 meters (~23,300 feet) of elevation gain but the modified course was slightly easier:  6,423 meters (~21,070 feet) of elevation gain.  (I was under the mistaken impression prior to the race that the number provided on the race website was gross elevation gain/loss --- oh how mistaken I was!)  At the race briefing the evening before, the race directors mentioned two time cut-offs, 13 hours to reach 57k and 16 hours to reach 68k (~42 miles), and I remember thinking, "Thirteen hours for ~35 miles?  Shouldn't be too much of a problem."  I also found it odd that the briefer repeatedly mentioned that if you fail to make the cut-off(s) or drop for other reasons, you'd still be eligible for some sort of finisher's medal but I didn't dwell too much on his comments since I didn't think they'd really apply to me.

This race, like most in Europe, is "semi-autonomous" so you have to schlep a lot of *mandatory* gear along with you.  Up until the day before, the race organizers were requiring all runners to carry micro-spikes with them for the glacier traverse and the pack I was originally going to use (Nathan Endurance Vest) wasn't large enough to haul everything I'd need so I purchased a larger vest:  the UltrAspire Omega Hydration Vest.  Turns out I may have been able to use the Nathan pack when the organizers nixed the micro-spikes but with everything else I had to carry, I was glad I used the larger vest.  The required gear included insulating layers (top & bottom), rain pants (WTF?!), jacket, emergency blanket, whistle, basic first aid supplies, functioning cell phone, two liters of water, warm hat & gloves, 500+ calories of food, your own cup, course map (provided), money, and ID.  I also carried sunglasses, TUMS, some Perpetuem powder, and my iPod.  The pack was, no surprise, relatively heavy but I was pleased with how well it distributed the weight.  The ultra running scene in Europe is, of course, different from the US and although I may not be too keen on all the gear you have to carry (rain pants?!?), I understand.  A high-alpine environment can be terribly unpredictable and I suspect the race organizers have to have some modicum of confidence that the runners can get themselves out of a jam.
Lisa --- crew extraordinaire --- and me
There were 70 runners officially registered but according to the race results, only 55 toed the start line.  The attrition in this race was astounding:  only 20 or so made the 13 hour cut-off and just 11(!) made it to the finish line.  I suspect some of those who made the first cut-off but didn't finish were pulled when they failed to make the 16 hour cut-off, but it's unclear from the results.  Nevertheless, a race where only 20% of the starters actually finish is a damn difficult race.  Perhaps too difficult.
His day was much more relaxed than mine
With the exception of such legendary races as the Barkley Marathons (actually a 100 mile race) and Hardrock, I can't think of any races (especially ones I've run) where the majority of runners don't even make it two-thirds of the distance and just a fifth finish.  This was my first DNF (Did Not Finish) and, funny thing, I don't feel too disappointed in myself because in spite of not finishing, I felt like I ran a relatively strong race and probably could have gone the entire distance if the cut-offs weren't so unforgiving.

Just a few minutes 'til the start!
When we set off at 5AM, we immediately started climbing (~600 meters) toward Rifflsee, rounded the lake, gently ascended a ridge, then traversed a ridge toward Taschachhaus.  There were three aid stations in the first 20 miles but all of them were largely inaccessible to spectators so I wouldn't see Lisa and her father until the start/finish area (Mandarfen) at ~mile 21.  Aside from a steep 1,640' ascent over 2.5 miles beginning at around mile 17, I was moving pretty well.  I fell off course once (minor detour) and thought I had fallen off a second time but when three other runners caught me they assured me I was headed in the right direction.  When I arrived in Mandarfen, I refilled my bladder, drank some Coke, ate some cucumber and pretzels, emptied the pebbles and sand out my shoes and socks, chatted with Lisa and her father for a few minutes, then set off for the slog to the highest point on the course:  Braunschweiger Hutte (2,727 meters, ~8,950 feet).  The ascent began gently enough but it quickly became obscenely steep with some sections characterized as "Klettersteig" (fixed rope route) --- there were chains and cables that you could hold on to as you snaked your way up the rock face.  I've run some races featuring steep sections but no race I've ever run featured terrain quite like this one.  Near the summit, there was a small snowfield that would have been easier with trekking poles but since I'd never run with poles before, I wasn't about to start now.  At the summit, I hung out for a bit, ate several orange slices, enjoyed the sunshine, and took in the view.  The descent, although marginally faster than the ascent, still featured sections (especially those with cables or chains on steep drop-offs) where I was moving pretty slowly.  The round trip to Braunschweiger Hutte was only ~11 km or so but it took me about three hours.  
Somewhere along the ascent to Braunschweiger Hutte.  Mandarfen is waaaay down there! 
When I arrived back at Mandarfen, I hadn't yet had a chance to remove my pack when the race director came over and told me (first in German, then in English) that the 13 hour cut-off had been extended to 13.5 hours (6:30PM) and that I should quickly gather what I needed and set off.  This was about 3PM and, somewhat naively, I thought 3.5 hours would be plenty of time to cover the next 11km.  At the time, I thought his advice was misplaced since I was, after all, just behind the second woman and well ahead of 50% of the field.  The possibility of not making the cut-off --- and of as many as half to two-thirds of the runners not making the cut-off --- hadn't yet set in.  I gathered up some cucumber and orange slices then headed out with Lisa alongside for about a half kilometer.  After a gentle decline to the base of the next steep climb, I caught the runner in front of me, sheepishly asked if he spoke English ("Yes, of course" is always the answer), then asked him if the race director had also told him to hurry up and haul ass otherwise he might not make the cut-off.  This runner (I never caught his name or number) replied yes, then told me he was the only local running the race (he was from Pitztal) and that the cut-offs were way too early.  He was pretty confident we wouldn't make the cut-off since we had ~1,000+ meters to climb and 1,000+ meters of descent on overgrown, rooty, grassy, wet trail still ahead of us.  But we soldiered on.  I stayed with him for a bit but he soon overtook me then disappeared up the trail.  Similar to the ascent up to Braunschweiger Hutte, the ascent up the Plangerossalpe was insanely steep but unlike Braunschweiger, there wasn't any "Klettersteig" --- it was just steep, exposed scree.  At one point I looked down at my pace as reported on my Garmin and was discouraged, yet amused, to be moving at a glacial 57+ minutes per mile.
The route paralleled the stream way down below
Ridiculously steep.  Since there weren't any cables or chains to grab onto, I clung to the mountainside as I walked.
When I finally crested the summit ridge, I thought it would be a relatively uneventful descent but that expectation was dashed as soon as I reached a fairly large snow field and had to glissade down it.  For the most part, I don't have much problem glissading and quite enjoy it but in one particularly treacherous section, a small snowfield traversed a steep section of trail where one misplaced step would have sent me sliding down the snow toward a rock field below.  It's a good thing no one was in front of me because had they looked into my eyes, they certainly would have seen terror.  I was relieved when I finally descended below the snow line but what I lost in snow I gained in wet grass and rooty, overgrown trail.  The race director made some comment during the race briefing about this section of the course being "nature trail" but it didn't become obvious what that meant until I was slowly picking my way down the trail and trying not to roll an ankle or fall into the many muddy stream crossings.   About two-thirds the way down it became obvious there was no way I'd make the time cut-off so I stopped, snapped photos, and walked it into the 57km control point.  Lisa and her father were waiting for me and I'm not sure why Lisa thought I would be angry about missing the cut-off but I wasn't.  Quite the contrary, in fact.  It had become obvious I would DNF several kilometers back and although I felt like I had a strong run, the parameters of the race were too unforgiving for all but the front runners.  After the aid station volunteer checked me out of the race (administratively), I ran/walked the remaining six kilometers back to Mandarfen for, weirdly, a finish-line finish with all the pomp that accompanies an official finish.  Strange, but, I suppose, better than no recognition at all.  In the end, I ran(?) 
63km (~39 miles) and prior to my Garmin battery dying about two-thirds of the way down Plangerossalpe, I had climbed 14,000 feet and descended ~13,050 feet (both underestimates since I hadn't yet reached the 57km check-point when my Garmin died and I still had 6km and ~650 feet of climbing to go en route back to Mandarfen).  My final DNF time was 14h:47m:31s.

I have mixed feelings about this race.  Would I run it again?  Not if the parameters of the course were the same.  The cut-offs, especially the first one, were too early.  I'm no race director but it seems reasonable to expect the majority of runners who start the race to make the time cut-offs and the majority of those runners to finish (barring extreme circumstances, e.g. horribly inclement weather).  I knew the race was going to be difficult but I grossly underestimated just how difficult it would be, especially since I had read the quoted elevation gain as combined gain and loss.  Big difference.  If the race course remains as it is, then they either need to start the race earlier (3AM?) or characterize this race as an "advanced" race (similar to Hardrock) and only admit runners who have successfully completed events of similar difficulty.  Most runners should have a fighting chance of finishing the race they start --- it may not be pretty --- but a fighting chance, nonetheless.  There wasn't much of a chance for the middle-of-the-packer in this race.  My complaints about the unforgiving cut-offs aside, the race was well-organized, the aid station staffers helpful and enthusiastic, and the location of the race (Pitztal) exceptional.
The Rifflesee and surrounding mountains.  The scenery and weather were top-notch.

Friday, July 5, 2013

Critical Statistical Skills

The gents over at Simply Statistics re-posted what they consider the five most critical concepts/skills every statistician should possess and although this list could probably be debated ad infinitum, I think their list is a solid start.  (I've also wondered what constitutes a competent and qualified statistician/data analyst versus a merely adequate one but I never codified the traits with a list.)  Their list, admittedly general, and limited to five, is thus (pasted verbatim from their post):
  1. The ability to manipulate/organize/work with data on computers - whether it is with excel, R, SAS, or Stata, to be a statistician you have to be able to work with data.
  2. A knowledge of exploratory data analysis - how to make plots, how to discover patterns with visualizations, how to explore assumptions
  3. Scientific/contextual knowledge - at least enough to be able to abstract and formulate problems. This is what separates statisticians from mathematicians.
  4. Skills to distinguish true from false patterns - whether with p-values, posterior probabilities, meaningful summary statistics, cross-validation or any other means.
  5. The ability to communicate results to people without math skills - a key component of being a statistician is knowing how to explain math/plots/analyses.
I heartily agree with #1 although I would consider any statistician who routinely uses MS Excel for their analyses and graphics as either lazy or marginally incompetent.  Numbers two and four are also important and I think the longer you practice statistics and the more problems you encounter, the better you get at these skills.  Number three could be distilled down even further into theoretical versus applied statisticians, where the theoretical statistician toils away in academia teaching graduate-level mathematical statistics and the applied statistician engages in dirty data collection, cleanses that data, then outputs descriptive and inferential statistics.  The last point, communication, is a skill that is often given short shrift but one that, as the folks at Simply Statistics agree, shouldn't be overlooked.

It's easy to lose sight of the broad skill set a statistician must possess to remain effective and competent, especially as one specializes, so it's nice to be occasionally reminded.  

Tuesday, June 25, 2013

Zugspitz Basetrail 35.9k: A quasi-ultramarathon

View of the Zugspitz from our hotel room
The Zugspitz Basetrail is the shortest among three races held on the same day and although technically not ultra-distance, it sure felt like one.  The Basetrail is 35.9km (~22.3 miles) and began in Mittenwald, then wound its way around the Zugspitz (the highest point in Germany) until arriving in Grainau.  (None of the three races --- the Ultratrail, Supertrail, or Basetrail --- climbed the summit of the Zugspitz.)  When Lisa and I arrived on Friday evening to check in, gather the race schwag, and wander around, I was struck by how lean and sporty everyone looked.  Ultrarunners are, of course, lean and sporty but the western Europeans who spend a lot of time ambling around the Alps are especially fit.  I'm a slender guy and among all the races I've run in the States, I've never felt on the larger side.  It was weird.

This event is a Salomon-sponsored event so Salomon products, especially shoes, abounded.  If I had to guess, I'd estimate that 80% of the participants were running in Salomon shoes with a substantial majority also using Salomon hydration packs.  Salomon has definitely cornered the market in western Europe.  Their products are relatively expensive but if some of the corporate revenue makes its way back to the runners via generous sponsorship of events then, I suppose, the premium paid for S-Lab shoes might be worth it...but I digress.
Trying to civilize trail runners
The race began at 11am and rather than take the bus from Grainau to Mittenwald, Lisa and I took the train and arrived in Grainau at about 10:30am (this was super convenient for us since we stayed in Garmisch and getting to the train in Garmisch was easier than driving to Grainau to take the bus).  When we arrived, we surveyed the scene, stood around, used the restroom (no peeing outside!), then checked in.  I haven't run an ultra in the US for a few years but among the three I've run in Europe, two of the three featured mandatory gear checks.  Even though the distance I was running wasn't an ultra-distance, each runner was obligated to carry at least 1 liter of water, a phone, course map, hat & gloves, emergency blanket, whistle, and a raincoat.  You'd think you could slip into the start corral without notice but in order to enter, you had to pass through the gear check so ignoring the gear list wasn't really an option.  I've commented on this issue before so I won't bother again but suffice to say, the European ultras fancy themselves "semi-self sufficient" so it stands to reason you'd have to haul around much more gear than in (most) American ultras.
Lisa and me at the start
The weekend prior I had run up and around Schneeberg (the easternmost Alp in Austria) in order to log some decent vertical and although I felt good at the time, my quads were spent for a few days after.  The Basetrail features a respectable amount of elevation gain and loss (~6,207' of ascent, ~6,755' of descent) so I knew whatever residual soreness remained from the previous weekend might make for a longer-than-necessary day.

The course itself was mostly fire road and single track (aside from the first and last miles on pavement).  The single track leading up to the third aid station on the Zugspitz (Talstation Langenfelder) was woody, rooty, and steep enough to reduce virtually all runners to a hike but not so steep you felt like you were rock scrambling.  On this section, if I approached a runner from behind they usually stepped aside to let me pass but about a half-kilometer from the crest of the ridge, I came up behind a couple moving at a decent clip, albeit slower than me (I caught them after all!).  They wouldn't step aside even though I was practically stepping on the heels of the runner in front of me.  I'm surprised he couldn't feel my breath on the back of his neck.  I was mildly annoyed, especially since I expected these two (German?) runners to hike like they drive on the autobahn:  drive on the right unless passing and if you are in the left lane and another car approaches from behind, immediately change lanes.  When we finally crested the ridge, I raced ahead, unable to contain my annoyance.   After a few more kilometers of ascent on fire roads, the long descent to the finish (6-7 miles) began.  And what a beautiful, fast, technical descent it was!  I was moving at a respectable speed but I was astonished at how fast some of the runners (especially the females!) were descending...they flew past me as if I were shuffling behind a walker.  Very humbling.  
A few meters from the finish line!
I crossed the finish line in 5h:22m:43s.  Unlike my time at the Worthersee 57k, I felt like my time and place at this race were more respectable.  Among the approximately 300 male finishers, I finished in 105th place and if both the males and females are considered together, I finished 117th among ~425 finishers.  Not quite the top quartile but well within the top tertile.  All in all, I was pleased with my run, especially since I regarded this race as a training run for the Pitztal Gletscher 95k in mid-July --- a race featuring considerably more vertical, more distance, and more chances for things to go either spectacularly well or miserably bad.  Based on this race, I'm optimistic about the former.
Happiness at the finish  

Monday, April 15, 2013

Retirement Averted: The 2013 Vienna City Half-Marathon

In the week leading up to the Vienna City Half-Marathon I was nursing a sore and tender left calf and was unsure how the half-marathon would play out.  I joked that depending on how I felt during the race, I may (or may not) retire from running.  Up until a few days ago, I hadn't run in over a week and I was starting to worry I may have sustained a season-derailing injury.  Thus the drama and hyperbole.  (Lisa wasn't nearly as amused as I was by my repeated announcements to retire but she played along.)

We arrived at the starting area w/ a friend of ours and his wife, relaxed for a bit, then we parted ways w/ our friend and Lisa and I made our way to the rear of the start corral.  Since I wasn't sure how I'd feel from the outset I figured I'd start conservatively then speed up, calf tightness permitting.  The first few kilometers felt good and I felt marginally confident my calf wouldn't act up so Lisa and I picked up the pace a bit.  This race is crowded and it feels even more so in the rear of the pack since a somewhat faster pace means constant slowing, accelerating, and weaving in and out of runners to avoid colliding with someone.  The throngs of runners never really spread out and 13.1 miles later, we crossed the finish line in a mediocre 2:05:21.  I feel a bit ambivalent about my time but, on the bright side, my calf didn't act up during the race and, fortunately, retirement can wait 'til another day.

Friday, April 12, 2013

R: Apparently Pretty Hot

I receive email updates from Revolution Analytics and the most recent email contained a white paper (R is Hot) written by the vice president of marketing, David Smith, about how R is the hottest game in town for statistical analysis and modeling.   No matter what one's software preferences are, this is an interesting read.

Wednesday, April 10, 2013

Student Business Card with $\LaTeX$

I recently gave a poster presentation at a research conference sponsored by the university I attend and among the resources I consulted prior to attendance, many of them suggested the presenter have business cards available during the conference.  I don't have any corporate cards and my program doesn't provide cards for Ph.D. students so I decided to create my own.  There are a few services that can print cards on the cheap and ship them direct but this wasn't viable since I live overseas and I needed the cards relatively quickly.  I decided I'd create the cards myself then have 50 or so printed at a local print shop.  Enter $\LaTeX$.  (I could have, of course, used MS Word but I suspected that what would have begun as a simple exercise would have quickly morphed into a formatting nightmare.  Thanks but no thanks.)  

I used the bizcard package along with the marvosym and url packages to create a simple, crisp, and clean card containing my name, university, email address, phone number, and blog address.  Below is a fictional business card for the man behind Student's t-test, William Sealy Gosset.   

And the $\LaTeX$ code:
% %% cjt
% %% 20130410
% %% creation of example (fictional) business card using bizcard package.

\documentclass{article}
\usepackage[frame]{bizcard}  % options are none (no marks), flat (non-invasive tick marks), and frame (fully framed cards)
\usepackage{marvosym}        % \Letter, \Telefon, and \ComputerMouse symbols
\usepackage{url}

\begin{document}

\begin{bizcard}
  \put(19,38){\makebox(50,5){\Large\bfseries William Sealy Gosset}}
  \put(19,32){\makebox(50,5){\large ``Student''}}
  \put(19,27){\makebox(50,5){Guinness Brewery}}
  \put(19,23){\makebox(50,5){Dublin, Ireland}}
  \put(7,14){\makebox(10,4)[tl]{\Letter \enspace \emph{}}}
  \put(7,10){\makebox(10,4)[tl]{\Telefon \enspace +353 012 3456789}}
  \put(7,6){\makebox(10,4)[tl]{\ComputerMouse \enspace \url{}}}
\end{bizcard}

\end{document}



Friday, March 29, 2013

Lowess or Lois?

Not Lois --- as in Lane --- but Lowess (or Loess):  the fitting of a smooth curve to a scatter plot of data.  Lowess (LOcally WEighted Scatter plot Smoothing) "fits polynomials (usually linear or quadratic) to local subsets of the data, using weighted least squares so as to pay less attention to distant points" (Oxford Dictionary of Statistics, entry for loess).  In other words, each observation $(x_i, y_i)$ is fitted by a separate linear regression based on nearby observations, with the points weighted such that the further away $x$ is from $x_i$, the less it is weighted (Statistical Modeling for Biomedical Researchers, Dupont).  The strength of lowess smoothing is that it can reveal trends in the data and is a locally based smoother:  it follows the data.  Most lowess smoothing involves just two variables, a y-variable and a single x-variable, but the methodology has been extended to multiple x-variables and can be executed in Stata with the -mlowess- command.  According to the help file, "mlowess computes lowess smooths of yvar on all predictors in xvarlist simultaneously; that is, each smooth is adjusted for the others."  The authors of -mlowess-, however, caution that multiple-variable lowess smoothing should be reserved primarily for exploratory graphics, not inferential model fitting.
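The local weighting idea can be sketched directly in Stata.  The snippet below is an illustration only, not what -lowess- does internally:  the tricube kernel, the focal point x0, and the bandwidth h are all my own assumptions for demonstration purposes.

```stata
* **sketch of the local weighting idea behind lowess (illustrative only)
* **assumptions:  tricube kernel, arbitrary focal point x0 and bandwidth h
sysuse auto, clear

local x0 = 6000                       // focal point on the price axis
quietly summarize price
local h = 0.4*(r(max) - r(min))       // illustrative bandwidth, not -lowess-'s rule

* tricube weight:  w = (1 - |d|^3)^3 for |d| < 1, 0 otherwise, where d = (x - x0)/h
generate double w = (1 - (abs(price - `x0')/`h')^3)^3
replace w = 0 if abs(price - `x0') >= `h'

* weighted regression near x0; its prediction at x0 is the smoothed value there
regress mpg price [aweight=w]
```

Sliding the focal point across the range of price and collecting the fitted value at each point traces out the lowess curve.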

Consider the Stata system data set, auto.dta, and the variables mpg, price, weight, length, and gear ratio.  In the bivariate case between mpg and price, there is some hint of a slight quadratic relationship but this could be due to a couple of outlying observations.  The inclusion of the ordinary least squares fitted regression line illustrates the mild departure from the linear model for the cheapest and most expensive vehicles.  

In the multiple-variable case (controlling for weight, length, and gear ratio), the mild non-linear curve between mpg and price is dampened, although the inexpensive vehicle with super high gas mileage appears to be largely responsible for the departure from linearity on the left side of the graph.  The remaining three graphs --- mpg versus weight, length, and gear ratio, respectively --- all reveal mild non-linear associations even after controlling for the other variables. 

The Stata code used to generate the graphs above follows:

capture log close
log using lowess-01, replace

// program:
// task:  lowess demo
// project:  n/a
// author:    cjt
// born on date:  20130329

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
// read in auto
sysuse auto

// #2
// bivariate lowess smoother between mpg (yvar) and price w/ fitted values line
lowess mpg price, addplot(lfit mpg price) xtick(2000(2000)18000) xlabel(2000(2000)18000) ///
scheme(sj) legend(row(1)) title("Lowess Smoother and Fitted Regression Line")

* **save graph
gr save mpgXprice-01, replace
* **export graph for inclusion into blog
gr export mpgXprice-01.png, replace

// #3
// multiple lowess smoother
mlowess mpg price weight length gear_ratio, scatter(msymbol(o)) scheme(sj)

* **export graph
gr export mpgXall-01.png, replace

log close

Lowess smoothing, although not the most rigorous or complicated of statistical techniques, is great for exploratory analysis and can help reveal relationships between variables that might otherwise go unnoticed.

Friday, March 22, 2013

Stata and $\LaTeX$: Descriptive Statistics, Part 2

When I wrote a previous post about outputting descriptive statistics from Stata to $\LaTeX$, I didn't expect to follow it with a sequel, but as I gained more experience outputting statistics, I realized I'd need something more flexible and more powerful.  Enter -estpost-.  I didn't realize just how flexible, powerful, and relatively easy -estpost- is to use until I started playing around with it.   The author, Ben Jann, ought to receive a place in the Stata user-programmer hall of fame for this contribution alone.  Anyway, outputting descriptive statistics can be accomplished with either -summarize- or -tabstat- following -estpost-.  I prefer -tabstat-.  Once the statistics are generated, the output is then dumped into a $\LaTeX$ file with the accompanying code via -esttab-.  It's pretty simple really.  Below is my Stata code using the ever-pervasive system dataset, auto.dta.  

// #1
// read in auto data
sysuse auto

// #2
// univariate summary statistics
estpost tabstat price mpg trunk weight length turn displacement gear_ratio, ///
statistics(N min max p50 mean sd) columns(statistics)
* latex...
esttab . using stats.tex, replace cells("count min max p50 mean(fmt(a3)) sd(fmt(a3))") ///
title("Univariate Summary Statistics for auto.dta") label nomtitles noobs width(\hsize)

// #3
// stratified by foreign
estpost tabstat price mpg trunk weight length turn displacement gear_ratio, ///
by(foreign) statistics(N min max p50 mean sd) columns(statistics) nototal
* latex...
esttab . using stats.tex, append cells("count min max p50 mean(fmt(a3)) sd(fmt(a3))") ///
title("Summary Statistics for auto.dta stratified by foreign") label nomtitles noobs width(\hsize)

-esttab- outputs a $\LaTeX$ file (I called it stats.tex above), which I then open in my editor (WinEdt), declare the document class, add the necessary packages to my preamble, then run.  The output is pasted below.  The more I use -estpost-, the more I love it. 
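For what it's worth, the wrapper document can be minimal.  Below is a sketch; whether extra packages are needed depends on the -esttab- options used (as far as I can tell, none are required for the options shown above).

```latex
\documentclass{article}

\begin{document}

% stats.tex is the file written by -esttab-
\input{stats.tex}

\end{document}
```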

Monday, March 11, 2013

How Not to Fail a PhD: Advice Better Late Than Never

I came across a blog by an Australian academic while searching for information about how to build a poster presentation using $\LaTeX$ and, while browsing his blog, I got distracted by one of his more popular posts, "How to fail a PhD".  I certainly don't want to fail out of my program and, being the semi-paranoid student I am, I read it.  The blogger, Dr. Rob J. Hyndman, created his list after reading a similar list by another blogger, Matt Might, and both are spot on.  Dr. Hyndman ranks "Wait for your supervisor to tell you what to do" as number one and, based on my experience, I couldn't agree more.  I took some initiative in the early stages of my research but it wasn't enough.  It took a while before I realized my adviser wasn't going to hand me a tidy, ready-to-be-answered research question, and when I finally did, I started making some progress.  Another point of failure Dr. Hyndman identifies is "Aim too high".  I already know I'm prone to perfectionism so it's a constant battle with myself to just plow forward irrespective of whether I consider something "perfect".  My adviser emphasized early on that the Ph.D. isn't the crowning masterpiece, especially if one remains in academia, so don't treat it as such.  The Ph.D. is supposed to demonstrate research ability.  

A few of the "don'ts" on Dr. Might's list also resonated with me.  The first, "Focus on grades or coursework", was problematic for me because I viewed the Ph.D. program linearly:  two years of coursework, comprehensive exam, dissertation proposal, dissertation analysis, write-up, then defense.  I'm in the write-up stage now but I think that if I'd adopted a more holistic view from the beginning, I may have identified a topic and research question earlier.  It also doesn't help matters that I switched from biostatistics to epidemiology two-and-a-half years into the program, setting me back a year.  Oh well.  Another one, "Treat Ph.D. school like school or work":  according to Dr. Might, the Ph.D. is all-consuming, and those who don't pony up the requisite devotion and obedience take 7+ years to finish or wind up ABD.  Based on how long I've been at this (six years), I wonder what Dr. Might would say about my devotion.  The last one, "Miss the real milestones", is a real risk in my program since we are given two options for the Ph.D.:  a "European" style version comprising three publication-quality papers or the "traditional" version comprising the standard five chapters.  I opted for the "traditional" version so the requirement to publish --- even just one paper --- is absent and, by Dr. Might's criteria, I'd hardly be Ph.D.-worthy.  (I should note, however, that even though publication isn't a requirement for graduation with the five-chapter dissertation, we are strongly encouraged to write a manuscript from our dissertation and try to get it published.)  

I wish I'd come upon this advice a few years ago but, as the saying goes, better late than never.

Monday, February 4, 2013

Stata and $\LaTeX$: Descriptive Statistics

For as long as I've been statisticulating, I've sought to (seamlessly) export results from the statistical software program into a file-format readable (and usable) by non-statisticians.  In my earliest days --- an undergraduate econometrics course using SAS --- I just copied-and-pasted the results from the SAS output window into a MS Word document, fixed the formatting and spacing, then saved the Word file and left it at that.  This method was dreadfully inefficient and prone to error but something I came to view as a necessary (if not wholly enjoyable) part of the statistical analysis process.  Fortunately, the days of copy-and-paste are getting further and further behind us.

SAS has developed a rich and expansive Output Delivery System (ODS) that can route virtually any SAS output into an RTF or PDF file.  I used SAS ODS a fair amount in my last job and don't recall having any major beefs with it.  With Stata, however, there isn't any corporate-developed output delivery system that sends results into, say, MS Word or Adobe Acrobat.  You can create a "log file" in Stata that logs all your output (sans graphs) and commands into a Stata-proprietary format (.SMCL) or into a text file (.TXT), but since this log also includes the commands and comments used to generate the results, it isn't ideal for sending to a non-statistician.  In spite of this limitation, for a time I converted my .SMCL files into HTML documents and sent the HTML document (liberally commented to make it somewhat self-explanatory) when results needed to be circulated.  This was acceptable but not completely ideal.  It wasn't until I started using $\LaTeX$ that outputting results directly to PDF started to make more sense.  Ideally, though, I wanted to write and create $\LaTeX$ code directly from within Stata, with the code written directly to a text or $\LaTeX$ file.  Although I'm still working through the best way to do this, I think what I have so far is a decent start.  

First, I open a $\LaTeX$ file (WinEdt actually) and include everything (e.g. preamble) up until the first \section{...} statement.  

\usepackage{parskip}            % vertical space between paragraphs instead of indentation
\setlength{\parindent}{10mm}    % paragraph indentation (parskip otherwise sets this to 0)

In my Stata .do file, I macro out a text file that will collect the soon-to-be created $\LaTeX$ code via a -local- statement; then, with each call of the text file, I use a series of -file open-, -file write-, and -file close- commands to open, write to, and close it.  The Stata results are grabbed and formatted for inclusion in a $\LaTeX$ file using various user-written commands, Ian Watson's -tabout- being the one I've primarily used for descriptive statistics. 

For example,

* **macro out text file to collect all LaTeX code and comments
local stats `"`"C:\Documents and Settings\stats.txt"'"'

file open stats using `stats', write replace text
file write stats "\section{Descriptive Statistics}" _n
file write stats "Statistics that follow are for the N=100 sample." _n(2)
file close stats

* **rank --- frequency distribution
file open stats using `stats', write append text
file write stats "Academic Rank"
file close stats

tabout rank using `stats', append oneway cells(freq col cum) format(0 1) ///
clab(No. Col_% Cum_%) style(tex) bt font(bold) topf(top.tex) botf(bot.tex) topstr(14cm) botstr(.)

* **index score, overall and by rank --- summary statistics
file open stats using `stats', write append text
file write stats "Index Score:  Overall and by Academic Rank"
file close stats

quietly oneway h_pre95 rank
local p = trim("`: display %9.4f (Ftail(`r(df_m)', `r(df_r)', `r(F)')) '")

tabout rank using `stats', append sum oneway ///
cells(N h_pre95 min h_pre95 max h_pre95 median h_pre95 mean h_pre95 sd h_pre95) ///
format(0 2) clab(.) style(tex) bt font(bold) topf(top.tex) botf(bot.tex) ///
topstr(14cm) botstr(One-way ANOVA, p = `p')

A couple of comments on the above snippet of code.  First, the initial instance of -file open- contains replace as an option whereas the later -file open- statements use append.  Second, the first -tabout- produces a one-way frequency distribution and the second -tabout- produces select summary statistics of a continuous variable stratified by a categorical variable.  Frustratingly, I haven't figured out how to generate a non-stratified table of summary statistics, although I'm sure there is a simple and straightforward means of doing so.  For more detailed and helpful explanations of the features and capabilities of -tabout-, see Ian Watson's help documentation. 
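One possibility I haven't tested for the non-stratified table:  give -tabout- a constant grouping variable so the entire sample falls into a single category.  The variable all below is hypothetical and the call simply mirrors the stratified one above, so treat this as a sketch rather than a verified solution.

```stata
* **possible (untested) workaround for a non-stratified summary table:
* **tabulate over a constant so every observation falls in one group
generate byte all = 1
label variable all "All observations"

tabout all using `stats', append sum oneway ///
cells(N h_pre95 min h_pre95 max h_pre95 median h_pre95 mean h_pre95 sd h_pre95) ///
format(0 2) clab(.) style(tex) bt font(bold) topf(top.tex) botf(bot.tex) ///
topstr(14cm) botstr(.)
```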

Given the number and variety of (exceptionally smart) Stata users out there, I suspect there are many methods of varying elegance that have been devised for exporting results to a non-Stata format.  This is one.  And one likely to evolve with my experience and needs. 

Thursday, January 31, 2013

No, I'm Not

About a month ago, I wondered whether I'm An Asshole.   Turns out --- at least by the definition and standards outlined by Aaron James in "Assholes:  A Theory" --- I'm not (phew!).  In this breezy philosophical treatise, James introduces the asshole as "not just another annoying person but a deeply bothersome person --- bothersome enough to trigger feelings of powerlessness, fear, or rage."  James then goes on to explicitly theorize that "a person counts as an asshole when, and only when, he systematically allows himself to enjoy special advantages in interpersonal relations out of an entrenched sense of entitlement that immunizes him against the complaints of other people."  This theory can then be broken down into three parts with respect to interpersonal relations:  
  1. The asshole "allows himself to enjoy special advantages and does so systematically";
  2. The asshole "does this out of an entrenched sense of entitlement"; and
  3. The asshole "is immunized by his sense of entitlement against the complaints of other people."  
This asshole-ishness must also (1) be a stable character trait, (2) impose only small or moderate material costs upon others, and (3) nevertheless make the person morally repugnant.  

James sprinkles examples of asshole-ish behavior throughout:  the asshole, unlike the non-asshole, doesn't view his birthday as the one day on which he is special; "the asshole's birthday comes every day."  And the asshole doesn't restrain his veiled criticisms, insinuating questions, or awkward allusions to topics best kept away from polite company.  But what makes a person an asshole versus a schmuck, jerk, d-bag, or just oblivious to social norms?  According to James, "a proper asshole always has an underlying sense of moral entitlement.  We may have to look deep within his soul to find it, but it is there."  So there you have it:  the crux of being an asshole is moral entitlement.  

But how do you know if you are an asshole?  Unfortunately, James doesn't provide much in the way of a decision tree, but an initial test of asshole-ishness is a simple one.  Ask yourself the question:  Am I an asshole?  If you would be willing to call yourself an asshole, you are most likely not, in fact, an asshole; and if you are even remotely worried by the revelation that you are an asshole, then you are, again, most likely not an asshole.  A sense of shame about being characterized as an asshole usually means you are not an asshole.  If, however, the thought of being characterized as an asshole delights you, gives you no pause, or you dismiss the possibility with a huffy "whatever", then you are most likely an asshole. 

James spends the remainder of the book dissecting the asshole, the different types of assholes, and the implications of an asshole culture on modern society and our capitalist economy.   But what about the question motivating this post:  Am I an asshole?

I think I can confidently assert that I am not, in fact, an asshole.  I concede that, like most people, I may occasionally lapse into asshole-ish territory, but that doesn't make me an unqualified, unadulterated asshole.  My wife and I had a somewhat heated discussion about what it means to be an asshole and she argued that just because James articulated a definition of an asshole doesn't mean his definition is right.  Unlike James, my wife didn't think moral entitlement was the cornerstone of being an asshole.  The complete absence of empathy and sympathy ranked higher in her definition.  In a sense, they are both right.  When I look inward, I don't see any moral entitlement --- I'm pretty easy going, I usually defer to others' preferences, and I get uncomfortable receiving special treatment on my birthday, so the prospect of special treatment every day is horrifying --- but if lack of empathy and sympathy are the measures of asshole-ishness then, yes, there have been a couple of instances of asshole-ishness.  These rare episodes, however, don't make me an unqualified asshole since, according to James, a rare lapse isn't a stable character trait.  The conclusion?  I'm pretty certain I'm not an asshole.  

Friday, January 25, 2013

Statistics: A 15 Minute Primer

My insanely productive and brilliant friend called me a couple of days ago and mentioned that he was giving a 15 minute presentation on statistics to the medical residents in his division and he wanted to know if I had any old presentations lying around.  Unfortunately, my external hard drive crashed several months ago and I lost several hundred files --- slide sets, PDFs, course handouts, etc. --- that would have been perfect for what he needed, so I was unable to give him an already-assembled presentation.  As I was lamenting the loss (and slow recovery) of so many of my files, I started playing around with $\LaTeX$, leafing through a few of my statistics books, then ended up putting together a short (and superficial) presentation on statistics.  This presentation is quite soft on the statistics --- rigorous it certainly isn't! --- but I was relatively pleased with the Beamer theme I used, Warsaw.  Assembling this presentation probably wasn't the best use of my time but once I got started I was loath to abandon it, especially since it began as an exercise in using Beamer but evolved into trying to reduce a discipline as rich, rewarding, and vast as statistics into a mere 15 minutes.  This is my feeble attempt.  

Thursday, January 3, 2013

Writer's Diet Test

I'm reading Helen Sword's "Stylish Academic Writing" --- writing, especially mine, can always be improved --- and on page 60, Sword suggests visiting a website that tests your writing for its "fitness level".  This diagnostic tool is free and examines your writing sample with respect to verb usage, stodgy nouns (nominalizations), prepositions, adjectives/adverbs, and the prevalence of it/this/that/there.  An overall "fitness rating" is returned as well as ratings for each grammatical category.  The test can be accessed directly here.  I input the content of my previous blog post, "Am I?", into the test and was pleasantly surprised by the overall result:  Fit & trim (a screenshot of the result is pasted below).  I'm not so arrogant as to believe that all my writing will be as fit, and, as the website disclaimer states, grammatically fit writing doesn't necessarily translate to stylish or interesting writing.  Limitations of the test aside, though, it is an easy and fun way to diagnose your writing.

Wednesday, January 2, 2013

Am I?

Am I an asshole?  

The question has been wearing on me for a couple of days.  On New Year's Eve, I had a drunken conversation with a friend of mine and at one point I jokingly wondered aloud if I was an asshole, to which she replied:  "You know I love you, Clint, but sometimes you can be an asshole."  Wow.  I wasn't really expecting that kind of no-hesitation response but, okay, I guess that clears up any lingering doubt.  

But what constitutes asshole-ish behavior, exactly?  Occasionally, I'm guilty of breaching the asshole boundary.  I can say some insensitive things and, if I'm feeling randy, I can be almost ruthless (i.e., an offensive asshole) when talking religion with someone.  A couple of instances come to mind.  The first occurred about two weeks ago at the airport.  My wife and I were pulling into the arrivals terminal to park curbside and, en route to the designated area, a taxicab driver blocked our lane (they have a separate lane) then proceeded to load a few passengers with little regard for the fact that he was breaking protocol and blocking the flow of traffic.  It really irritated me so I inched up close enough to his rear bumper to make loading his trunk difficult.  He glared at me, pointed at my bumper, then gestured for me to back off.  I gestured back, put the car in reverse, and gave him the space he needed but made it clear while doing so (via wild hand waving) that he needed to get the hell out of the way.  At the time, my behavior seemed appropriate and justified --- the cabbie was breaking the rules so it seemed prudent to let him know --- but was what he did really that big of a deal?  And did it justify acting like an asshole?  In the grand scheme of things, a few minutes spent waiting is trivial.  But then again, what about principle?  Some people will take advantage of nearly any situation, cheat at the first opportunity, and ruthlessly exploit loopholes --- where does society draw the line?  Do you need someone to occasionally emerge as the asshole and let them know, in no uncertain terms, that they need to queue at the back of the taxicab line and not obstruct traffic?  Depending on my mood, I can vigorously defend each position.  

The second instance occurred over dinner about a week ago.  The conversation turned to religion after I made a couple of snide remarks about Catholicism and the taking of communion ("snack time!") that then dovetailed into a few earnest questions I had about Catholicism.  After my dinner mate tried to answer my questions, she posed a few of her own:  What was my beef with Mormonism?  Could I, in a non-biased way, summarize and describe the tenets of Mormonism?  What was my religious background?  Do I believe in anything now?  (I'm not Mormon but grew up in Mormon suburbia and have read a fair amount about Mormonism, so she seemed to think I was qualified to educate her on the basics of the Mormon faith.)  I realize religion and faith are very sensitive and personal topics --- a point I felt I emphasized repeatedly and diplomatically --- but I couldn't help but point out that religious moderates can't conveniently ignore and condemn the religious zealots because acknowledgement and respect of one (some?) faith(s) necessitates acknowledgement and respect of all faiths.  I insisted that my objection to religious faith isn't how religious people choose to explain the inexplicable or the way they spend their Sundays --- I really don't care --- but that a lot of religions are making further inroads into the public sphere by trying to erode the barrier between church and state.  After much back-and-forth, the conversation eventually wound down to an uncomfortable and implied mutual respect for differing viewpoints.  I like having conversations like this and am genuinely interested in understanding viewpoints different from mine, and I didn't think anything I said or did during the discussion would qualify as asshole-ish, but I left the dinner table feeling somewhat asshole-ish.  Is asserting one's view, especially if it may offend another person's religious sensibilities, characteristically asshole-ish?  I'm not sure.  
But I intend to investigate and reflect on what it means to be an asshole and whether I am one. 

I just downloaded Aaron James' "Assholes:  A Theory" onto my Kindle and I'm hopeful that in spite of the occasional asshole-ish lapse, I'm not, according to James's rubric, an unadulterated, unapologetic asshole.