Thursday, July 5, 2012

Going the Distance: Plotting Cumulative Time

I recently ran an ultramarathon, the Trail Peñalara 60k, in Navacerrada, Spain (about 30 miles outside of Madrid) --- you can read my race report here --- and although neither the running nor the writing up of the experience has anything to do with the pursuit of my Ph.D., the graphical presentation of the checkpoint times seemed rigorous and interesting enough to label as a "PhD" blog post.  (And since I was working on this instead of the power calculation for my dissertation, I figured I may as well legitimize the time spent and add a blog post about it.)

This race, unlike any other I've run in the United States, required that each runner wear an electronic chip bracelet that had to be scanned at the start, the finish, and seven control points in between.  I can only speculate as to why our race chips were scanned as many times as they were --- to deter cheating? for data collection? --- but no matter, the race organizers posted the data on their website and I thought it would be interesting to identify the fastest, slowest, middle, and average times for each aid station/checkpoint (including the finish) and then compare those times to my times at the respective points.  Based on the finishing times --- I was 39th among 147 finishers --- I knew I was faster than the middle (median) and average (mean) runners, but I was curious as to how my times stacked up at the intermediate checkpoints.

In short, I imported the data into Stata, identified the fastest (minimum), slowest (maximum), middle (median), and average (mean) times, and created a new dataset containing these summary statistics.  I then merged my times into that summary-statistic dataset, so the resulting dataset contained nine observations and six variables.  The first variable in this dataset is the aid station/checkpoint, and the remaining five variables contain the fastest/slowest/average/middle/my times recorded at each checkpoint.  The summary dataset is presented below:

+-----------------------------------------------------------------------------+
|                   as     min_as     max_as    mean_as     mdn_as     cjt_as |
|-----------------------------------------------------------------------------|
|    Start (Rascafria)   00:00:00   00:00:00   00:00:00   00:00:00   00:00:00 |
|     El Reventon Pass   01:02:12   02:23:08   01:44:04   01:43:26   01:31:44 |
|             Penalara   01:58:58   04:44:30   03:18:15   03:14:29   02:54:09 |
|            La Granja   02:49:44   07:02:42   05:00:20   04:56:00   04:25:27 |
|     Casa de la Pesca   04:17:27   10:16:42   07:26:50   07:18:47   06:45:11 |
|             Fuenfria   05:00:54   11:41:58   08:34:15   08:26:47   07:41:40 |
|     Navacerrada Pass   05:40:14   12:53:42   09:38:06   09:27:47   08:41:14 |
|          La Barranca   06:23:54   14:37:36   10:48:46   10:35:31   09:39:31 |
| Finish (Navacerrada)   06:45:43   15:34:39   11:29:04   11:15:00   10:15:21 |
+-----------------------------------------------------------------------------+
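One detail worth flagging before the code: Stata's clock() function returns time as milliseconds, which is why the y-axis ticks in the graph command below run in multiples of 7,200,000 (i.e., two hours). For example:

```stata
display clock("02:00:00", "hms")              // 7200000 = 2 * 3600 * 1000 milliseconds
display %tcHH:MM:SS clock("10:15:21", "hms")  // prints 10:15:21, my finishing time
```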

The code I used from start to finish is presented below.

capture log close
log using tp60k_graph, replace

// program:  tp60k_graph.do
// task:  graph times from TP60k
// project:  drivel
// author:    cjt
// born on date:  20120705

// #0
// program setup

version 11.2
clear all
macro drop _all
set more off

// #1
// insheet .csv
insheet using "C:\Documents and Settings\cjt\Desktop\TP60k\TP60k_times.csv", comma

// #2
// convert string time variables to numeric variables
foreach var of varlist as1_reventon as2_penalara as3_lagranja as4_cpesca as5_fuenfria ///
as6_navac as7_barranca finishtm {
    gen double `var'_temp = clock(`var', "hms")
    drop `var'
    rename `var'_temp `var'
    format `var' %tcHH:MM:SS
}
*end;

gen as0_start = 0
format as0_start %tcHH:MM:SS

// #3
// -save- entire dataset
save tp60k, replace

// #4
// extract my times from the dataset
keep if place==39
keep as* finishtm
* **prefix time variables w/ my initials, cjt
rename as0_start cjt_as0
rename as1_reventon cjt_as1
rename as2_penalara cjt_as2
rename as3_lagranja cjt_as3
rename as4_cpesca cjt_as4
rename as5_fuenfria cjt_as5
rename as6_navac cjt_as6
rename as7_barranca cjt_as7
rename finishtm cjt_as8
* **save dataset for later merge
save tp60k_cjt, replace

// #5
// recall earlier, fuller dataset
use tp60k, clear

// #6
// -collapse- data for graph
collapse (min) min_as0=as0_start min_as1=as1_reventon min_as2=as2_penalara min_as3=as3_lagranja ///
min_as4=as4_cpesca min_as5=as5_fuenfria min_as6=as6_navac min_as7=as7_barranca ///
min_as8=finishtm ///
(max) max_as0=as0_start max_as1=as1_reventon max_as2=as2_penalara max_as3=as3_lagranja ///
max_as4=as4_cpesca max_as5=as5_fuenfria max_as6=as6_navac max_as7=as7_barranca ///
max_as8=finishtm ///
(mean) mean_as0=as0_start mean_as1=as1_reventon mean_as2=as2_penalara mean_as3=as3_lagranja ///
mean_as4=as4_cpesca mean_as5=as5_fuenfria mean_as6=as6_navac mean_as7=as7_barranca ///
mean_as8=finishtm ///
(median) mdn_as0=as0_start mdn_as1=as1_reventon mdn_as2=as2_penalara mdn_as3=as3_lagranja ///
mdn_as4=as4_cpesca mdn_as5=as5_fuenfria mdn_as6=as6_navac mdn_as7=as7_barranca ///
mdn_as8=finishtm

// #7
// -merge- in the cjt data
merge 1:1 _n using tp60k_cjt
drop _merge

// #8
// -reshape- from wide to long
gen index = 1
reshape long min_as max_as mean_as mdn_as cjt_as, i(index) j(as)
drop index

// #9
// -label- values of "as" variable
label define aid 0 "Start (Rascafria)" 1 "El Reventon Pass" 2 "Penalara" 3 "La Granja" ///
4 "Casa de la Pesca" 5 "Fuenfria" 6 "Navacerrada Pass" 7 "La Barranca" 8 "Finish (Navacerrada)"
label values as aid

// #10
// generate -graph-
line min_as as || line max_as as || line mean_as as || line mdn_as as || line cjt_as as, ///
ytick(0(7200000)57600000) ylabel(0 "0h" 7200000 "2h" 14400000 "4h" 21600000 "6h" 28800000 "8h" ///
36000000 "10h" 43200000 "12h" 50400000 "14h" 57600000 "16h", angle(horizontal)) ///
ytitle("Cumulative Time") xtitle("Aid Stations/Check Points") xtick(0(1)8) xlabel(0(1)8, ///
valuelabel angle(45)) legend(col(1) pos(3) subtitle("Times") lab(1 "Minimum") lab(2 "Maximum") ///
lab(3 "Mean") lab(4 "Median") lab(5 "My Time")) title("Trail Penalara 60k") subtitle("Control Post Times")
// #11
// -export- graph
gr export tp60k_graph.png, replace

log close
exit

The resulting graph:
As is evident from the graph, the median and mean run times were nearly the same over the course of the race, suggesting a non-skewed distribution of times.  My times were only slightly better than the mean and median times up until Casa de la Pesca, but I then managed to increase my lead, as seen in the widening gap between my line and the mean/median lines after that checkpoint.
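A quick way to check that non-skewness claim numerically --- a sketch, assuming the tp60k.dta file saved in step #3 above --- is to compare the mean and median finish times directly:

```stata
use tp60k, clear
summarize finishtm, detail            // detail output includes skewness and percentiles
display "mean:   " %tcHH:MM:SS r(mean)
display "median: " %tcHH:MM:SS r(p50)
```

With mean and median finish times within about fifteen minutes of each other over an eleven-plus-hour race, the two lines are nearly indistinguishable at the scale of the graph.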