Grading Age Grading: Evaluating Methods for Handicapping Competitive Runners

Introduction

Several methods have been proposed for age handicapping distance runners (5K through Marathon). The purpose of handicapping is two-fold:  it allows runners of different ages to compete on a “level playing field”, and it gives us some insight into the aging process itself.

In this article, I give letter grades (A, B, C, D, and F) to four different handicapping methods based on the accuracy of each when it is applied to runners competing in eleven different venues representing the local, state, and world competitive levels. As you will hopefully agree, I have taken care to ensure that the grading system is both fair and objective. Three of these methods are separately applicable to both males and females and one (YALE_AG) has been developed using data for males only.  Consequently, with 11 venues and 2 genders, three of these methods receive a letter grade in 22 different courses, and one method receives a letter grade in 11 courses.  Finally, after grades have been calculated for all of the courses, the overall GPA is calculated for each of the four methods.

For those desiring a more conventional approach to evaluating these four models, I also fit each model to the eleven venue test data and calculate generalized Coefficients of Determination (R2) for each model across the various venues. As you will see, there is excellent agreement between the letter grade and the Coefficient of Determination.

The four methods and links to references and principal authors are as follows:

Although they use different approaches, the first three of these methods are all based on the single age world records maintained by the Association of Road Racing Statisticians at arrs.run.  The 2017 factors for the YALE_AG method are given for ages 40 and above.  However, the authors recommend an adjustment to extend the age range back to 35 years.  This article uses the recommended adjustment.

The fourth method, UD_5KH, is unique in that it is based on the maximal oxygen uptake during treadmill tests using subjects of different ages. Originally it was developed for only the 5K but, more recently, it has been extended for distances up to the Marathon.  In addition to age, UD_5KH also includes an adjustment for bodyweight.  In this evaluation, the CDC 10th percentile for bodyweight by age was used for UD_5KH since this provided a slightly better fit than did higher percentiles and since competitive runners typically are leaner than the general populations.  Additional observations on each of these methods are discussed later in this article.

There are many age handicapping or “age grading” calculators available on-line; however, virtually all are derivatives of one of the four methods evaluated here (particularly of WMA_AG or its earlier versions sometimes denoted with the acronym “WAVA”).

Test Data: World, State, and Local Competition

The accuracy of the four methods is evaluated by measuring their performance against eleven different test venues representing a wide range of competitive abilities. For each gender, these eleven venues were split into 3 distinct classes of competitors representing the world, state, and local competitive levels.

Male and female single age speeds between the ages of 35 and 85 were obtained for each venue.   (Some venues did not have single age data that went up to age 85, in which case I used the highest age available. Only data for runners, defined as an average speed above 4.5 mph, were used.)

To ensure that the speeds were equitable among the various ages within each venue, speeds were adjusted for the size of the relevant population as described in Age Handicapping Competitive Runners, Part 1: Quantifying the Population Effect.  The data were also smoothed as described in the appendix to Age Handicapping Competitive Runners, Part 2: Tables for Speed Handicaps.

The test venues were as follows:

  1. The World single age records were obtained from the Association of Road Racing Statisticians, arrs.run, as follows:
  •       World Single Age Records for the Marathon
  •       World Single Age Records for the Half Marathon
  •       World Single Age Records for the 10K
  •       World Single Age Records for the 5K
  2. State Single Age Records were obtained or linked from StateRunningRecords.com for the following four venues:
  •       Marathon (AL,MO,MN,MS,NH,OK,TN)
  •       Half Marathon (AL,MO,MN,MS,NH,OK,TN)
  •       10K (AL,MO,MN,MS,NH,OK,TN)
  •       5K (AL,AZ,MO,MN,MS,NH,OK,TN)
  3. Local 5-year Age Group Winners in the 5K were obtained from data summarized in Racing Among the Ages. Based on race size, there were three Local venues:
  •     Very Large 5K races (302 local races, each with between 1000 and 8641 finishers)
  •     Large 5K races (356 local races, each with between 500 and 999 finishers)
  •     Medium 5K races (313 local races, each with between 300 and 499 finishers)

[Note that the AZ state records were available for only the 5K due to a broken link at the time these data were compiled. More information about the local races can be obtained at Median Times for Top Finishers in 5K Road Races.]

I did not include local races with fewer than 300 finishers because of concerns that too many of the age groups would be won by non-competitive individuals (i.e. recreational walkers and runners) due to small age group sizes and the absence of a competitive runner in some groups.

Final GPA for Each Handicapping Method

Before we get too far into the weeds, here are the final Grade Point Averages (based on a 4 point scale) and overall letter grades for each of the four methods:

  •    YALE_AG:   1.27 D+
  •    WMA_AG:   2.45 C+
  •    BDR_AH:       3.64 A-
  •    UD_5KH:       0.00 F

Later I will provide more detail on the methodology to calculate grades and will break down the competitive levels where each method did or did not do well. I also discuss the technical reasons for the performances of each.

Lest the proponents of the WMA_AG methodology take some comfort in the “gentleman’s C+” received by their candidate, I point out here that the WMA_AG method received an “F” in three required courses. Also, our valedictorian, BDR_AH, will graduate with only an “A-” average, indicating that there may still be room for improvement.

Coefficients of Determination, R2

Because the observed speeds for all of the single year data points within each of the eleven venues have been adjusted for the population at each age, these speeds represent equivalent (same population percentile) performances among the various ages (typically 35-85) within each venue.

Similarly, each of the four age-grading/handicapping models defines a family of performance isoquants, where an isoquant is a series of combinations of speed and age representing an equal level of performance. Thus, for example, using WMA_AG, a 35 year old male who completes a 5K road race at 11.3 mph (time=16:30) is on the same isoquant and has a performance equal to an 85 year old man running at approximately 6.3 mph (time=29:40).  On the other hand, using the isoquants from the BDR_AH method suggests approximately 5.1 mph (36:35) for an 85 year old is equivalent to 11.3 mph at age 35.
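The speed and time pairs quoted above follow from simple arithmetic on the 5K distance. Here is a minimal sketch of that conversion; the function names are my own and are not taken from any of the four methods.

```python
# Convert between average speed (mph) and 5K finishing time.
# The helper names here are illustrative, not part of any published method.

FIVE_K_MILES = 5.0 / 1.609344  # 5 kilometers expressed in miles


def five_k_time(speed_mph):
    """Return the 5K finishing time as (minutes, seconds) for a speed in mph."""
    total_seconds = round(FIVE_K_MILES / speed_mph * 3600)
    return divmod(total_seconds, 60)


def five_k_speed(minutes, seconds=0):
    """Return the average speed in mph for a 5K finished in the given time."""
    hours = (minutes * 60 + seconds) / 3600
    return FIVE_K_MILES / hours


minutes, seconds = five_k_time(11.3)
print(f"11.3 mph -> {minutes}:{seconds:02d}")
```

Because the speeds in the text are rounded to one decimal place, times recomputed from them for the 85 year old can differ from the quoted times by a few seconds.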

So the question becomes, how well does each of the four models fit the data in each of the 11 venues? To answer this question, the isoquant with the least mean squared error from the observed speeds (population adjusted) was determined for each model and each venue. Since YALE_AG is only applicable to males, this gives a total of 77 families of isoquants.  (Even though these models are, in general, non-linear in speed, each of the 77 families of isoquants is defined by a single parameter, so finding the optimum fit is relatively straightforward.)
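To make the fitting step concrete, here is a minimal sketch under a simplifying assumption: the family of isoquants is taken to be a fixed age curve multiplied by a single scale parameter k, which has a closed-form least-squares solution. This multiplicative form, the R2 formula (1 minus the ratio of residual to total sum of squares), and the numbers below are all my illustrative assumptions. A family that is not a simple multiple of one curve would need a one-dimensional numerical search instead.

```python
# One-parameter least-squares fit of an isoquant family to observed speeds.
# ASSUMPTION: isoquants have the form k * shape[i], where shape is a model's
# age curve; the values used below are made up for illustration only.


def fit_isoquant(observed, shape):
    """Return the scale k that minimizes sum((observed - k * shape) ** 2)."""
    numerator = sum(o * s for o, s in zip(observed, shape))
    denominator = sum(s * s for s in shape)
    return numerator / denominator


def r_squared(observed, fitted):
    """Return 1 - SSE/SST for the fitted values."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical age curve and observed speeds (mph) at six ages.
shape = [1.00, 0.93, 0.85, 0.76, 0.65, 0.52]
observed = [10.1, 9.2, 8.6, 7.5, 6.6, 5.1]

k = fit_isoquant(observed, shape)
fitted = [k * s for s in shape]
print(f"k = {k:.3f}, R2 = {r_squared(observed, fitted):.3f}")
```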

The Coefficient of Determination (R2) was used to measure the fit of each method across the eleven venues. Tables 1a and 1b show these Coefficients of Determination.  The Coefficients of Determination are color coded as under 90% = Red, between 90% and 95% = Yellow, and over 95% = Green.  Note that only BDR_AH was Green across all venues.

The second best Coefficients of Determination were obtained by WMA_AG. Even though this method did well at the world and state levels, its performance was not very good at the local level. For example, in Table 1a, the Coefficient of Determination using the WMA_AG method for Age Group Winners in Local races with 500-999 finishers is 89%.  Thus the WMA_AG method explains only 89% of the variability in running speed across the various ages.  Figure 1 shows the best fit of the WMA_AG method to the male data for age group winners in Local races with between 500 and 999 finishers as well as for 5K world and state records.  As you can see from this figure, at the Local level, the WMA_AG method gives an “unfair” advantage to runners between the mid-forties and early fifties.  On the other hand, starting in the late sixties, this method progressively disadvantages the older runners.

Nevertheless, WMA_AG is substantially better than UD_5KH, which is the worst method across all venues. Figure 2 gives an example of the fit of this method.

Grading Age Grading

In the previous section we looked at the performance of four handicapping methods in terms of how well each fits the observed speeds between the ages of 35 and 85 and across eleven different venues representing a wide range of competitive ability.  In this section, we look directly at the handicapped speeds generated by each model.

Quite simply, an ideal handicapping system will meet two requirements:

  1. Performances which are equal, as they are within each venue, receive the same handicapped speed.
  2. The differences in competitive ability between the different venues are preserved in the handicapped speeds.

Since all of the methods equate the handicapped speed and the actual speed at some point between the ages of 25 and 35, the second of the above requirements will be fulfilled provided the first is met. Consequently, I will focus on the first requirement and grade each method on its ability to achieve consistency among the handicapped speeds within each venue.

One measure of within venue consistency is the average deviation (i.e. average absolute error) among the handicapped speeds between the ages of 35 and 85.  Thus, if the handicapped speeds within a venue are all the same, the handicapping method performs ideally and the average error is zero.  On the other hand, as the differences among the handicapped speeds become larger, the average error grows proportionally.

For example, using the YALE_AG method, the average deviation among the male 5K state record handicapped speeds is 0.471 mph (miles per hour).

To make the interpretation of the average deviation in handicapped speeds somewhat more intuitive, miles per hour can be converted to the same scale as age (i.e. years) as follows:

Between the ages of 35 and 85, the male 5K actual state record speeds declined at an average rate of 0.125 mph/year.  Thus the average deviation of 0.471 mph in handicapped speeds is equivalent to (0.471 mph) ÷ (0.125 mph/year) = 3.77 years.
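The conversion can be sketched in a few lines. I assume here that the average deviation is the mean absolute deviation of the handicapped speeds from their own mean, which is one reasonable reading of the description above. The function names are illustrative.

```python
# Average deviation of handicapped speeds, expressed in years of aging.
# ASSUMPTION: deviation is measured from the mean handicapped speed.


def average_deviation(speeds_mph):
    """Return the mean absolute deviation of the speeds from their mean."""
    mean = sum(speeds_mph) / len(speeds_mph)
    return sum(abs(s - mean) for s in speeds_mph) / len(speeds_mph)


def error_in_years(avg_dev_mph, decline_mph_per_year):
    """Convert a deviation in mph to years using the venue's rate of decline."""
    return avg_dev_mph / decline_mph_per_year


# The male 5K state record example quoted in the text.
print(round(error_in_years(0.471, 0.125), 2))
```

An ideal method gives identical handicapped speeds at every age, so average_deviation returns zero and the error in years is zero as well.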

So for this venue, the average error in the handicapped speeds is 3.77 years. Is this good or bad and what letter grade should be assigned to the performance of the YALE_AG method in the male 5K state record venue?  To answer this question, consider the following two observations:

  1. Differences in age among individuals who have the same integer age are generally regarded as inconsequential by all race venues. Although we may record the exact age in days, state and world records are generally not maintained for intervals shorter than a year. Thus we do not see separate records maintained for ages 50.0, 50.1, 50.2, etc. Local race results generally show finishers’ ages in whole numbers. Consequently, for a particular venue, a handicapping system having an average absolute error rate of less than one year can be considered to have very good performance.
  2. Differences in age of 5 years or more are usually regarded as quite significant. Larger local races will most commonly separate individuals differing in age by 5 or more years into separate age groups. Even though ARRS reports single age records, above 40 they highlight the best performance in each 5-year interval. USATF maintains American individual masters records in 5 year intervals. If a handicapping system has an average absolute error rate greater than 5 years for a particular venue, we can conclude that it fails for that venue.

Having established that an average error of less than one year is an “A” and an average error of five or more years is an “F”, the intervening spread can be partitioned uniformly to provide the following grade scale:

  •    “A”: Error less than 1.00 years
  •    “B”: Error between 1.00 and 2.33 years
  •    “C”: Error between 2.33 and 3.67 years
  •    “D”: Error between 3.67 and 5.00 years
  •    “F”: Error equal to 5.00 years or more
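The scale above, together with the GPA calculation, can be written as a short sketch. The cut points are the exact thirds of the interval from 1 to 5 years. Treating each lower bound as inclusive, using the usual A=4 through F=0 grade points, and taking the GPA as an unweighted mean across courses are my assumptions.

```python
# Letter grade from the average error in years, and GPA from letter grades.
# ASSUMPTIONS: lower bounds are inclusive; A=4, B=3, C=2, D=1, F=0;
# GPA is the unweighted mean of grade points across courses.

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def letter_grade(error_years):
    """Map an average error in years to a letter grade."""
    b_c_cut = 1.0 + 4.0 / 3.0  # about 2.33 years
    c_d_cut = 1.0 + 8.0 / 3.0  # about 3.67 years
    if error_years < 1.0:
        return "A"
    if error_years < b_c_cut:
        return "B"
    if error_years < c_d_cut:
        return "C"
    if error_years < 5.0:
        return "D"
    return "F"


def gpa(grades):
    """Return the mean grade points for a list of letter grades."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)


# The 3.77 year error from the male 5K state record example.
print(letter_grade(3.77))
```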

Using this scale, report cards with the grades achieved are shown for each handicapping method in Tables 2a and 2b.

Summary of Results

For the local races, only the BDR_AH method showed a reasonable performance, with grades of B and Coefficients of Determination at 97% or above for both males and females in each local venue. On the other hand, the other three methods all had Coefficients of Determination at 90% or below on all local venues and received straight “F’s” on the male local venues.

The UD_5KH failed on all eleven venues for both genders. Nonetheless, at the world level, the other three methods all performed reasonably well with BDR_AH getting straight “A’s”, WMA_AG getting “A’s” and “B’s”, and YALE_AG receiving “C’s” and one “B”.  However, this might be expected since all three of these methods are directly derived from the same single age world records that were used in the evaluation process.  Possibly a useful question is “Why didn’t all three of these methods get straight “A’s” at the world level?”

The following discussion provides more detail on each of the methods.

UD_5KH Discussion

Even though the UD_5KH method performed poorly, with some methodological adjustments, it does have the potential to provide a significant insight into the aging process itself: Can a unified approach explain the effect of age on individuals ranging in ability from the general population (measuring VO2max) to world class athletes achieving a single age record?

Among the methodological issues with the UD_5KH, as published by Vanderburgh and Laubach in 2007, is that it relies on two studies on the “Changes in Aerobic Power” of Men and Women by Jackson et al., 1995 and 1996. The studies by Jackson et al. assume a linear effect of age; however, a 2005 study by Fleg et al. showed “Accelerated longitudinal decline of aerobic capacity in healthy older adults.”  (See also Ades and Toth, 2005).  Also, the Jackson age coefficients, as used by Vanderburgh, are “corrected” for physical activity and body composition, which are themselves highly correlated with age.  This causes the decline with age to be substantially underestimated, as is graphically illustrated in Figure 2.

It should additionally be noted that the Jackson studies, especially for women, underrepresent older adults in that the oldest woman was only 64.

Vanderburgh and Laubach also published “Validation of a 5K Age and Weight Run Handicap Model” in 2006.  Unfortunately this study is of limited value due to questionable statistical methodology.  Two assumptions should have been challenged during peer review.  First, the assumption is made that the absence of a linear correlation between age and handicapped run times implies there is no relation between age and handicapped run times.  Second, individuals whose times were outliers to the model were successively excluded for (supposed) lack of sufficient effort.  (In other words, when data that do not fit the model very well are excluded, the model fits the remaining data better!)

Nevertheless, this approach has significant potential, and I hope it is revisited with the appropriate methodological corrections.

YALE_AG and WMA_AG Discussion

A fundamental assumption of both the YALE_AG and the WMA_AG methods is that for each age a unique frontier or upper biological limit to human performance exists, and that individuals of different ages who have performances at the upper biological limit for their age can be regarded as having equal performances.

A second assumption is that individuals of differing ages who perform below the biological limit but at the same percentage of the biological limit for their age can be regarded as having equal performance.

As far as I can tell, these assumptions and the resulting models have not been tested until this article. The WMA_AG performs somewhat better than the YALE_AG, primarily because it has more parameters which allow a closer fit to the data.

The second assumption is particularly problematic because not only has it not been tested, but the alternatives do not appear to have been considered. For example, with males, the YALE_AG 5K “biological limit” at age 80 takes 44% more time and is about 4.36 mph slower than at age 40. Now suppose a less than world class 40 year old can run a 5K in 18 minutes. What is the equivalent time for an 80 year old?  44% more time suggests about 26 minutes, whereas 4.36 mph slower suggests about 31 minutes.
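The two readings of the example can be checked with a short sketch. The 44% and 4.36 mph figures come from the text above; the function names are my own.

```python
# Two ways to carry a 40 year old's 18 minute 5K forward to age 80:
# scale the time by a percentage, or subtract a fixed amount of speed.

FIVE_K_MILES = 5.0 / 1.609344  # 5 kilometers expressed in miles


def time_by_percent(base_minutes, extra_time_fraction):
    """Equivalent time if the older runner takes a fixed fraction more time."""
    return base_minutes * (1.0 + extra_time_fraction)


def time_by_speed_drop(base_minutes, speed_drop_mph):
    """Equivalent time if the older runner is a fixed number of mph slower."""
    base_speed = FIVE_K_MILES / (base_minutes / 60.0)
    slower_speed = base_speed - speed_drop_mph
    if slower_speed <= 0:
        raise ValueError("speed drop exceeds the base speed")
    return FIVE_K_MILES / slower_speed * 60.0


print(round(time_by_percent(18.0, 0.44), 1))      # percentage of time
print(round(time_by_speed_drop(18.0, 4.36), 1))   # absolute change in speed
```

The gap between the two answers widens as the base performance gets slower, which is why the choice matters most at the local level.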

As in the above example, I reworked the YALE_AG handicapped performances for the 3 male Local venues using absolute changes in speed rather than % change in time. This simple adjustment to the YALE_AG model resulted in a dramatic improvement, reducing the average deviation by more than 50% and changing the Local venue grades from three “F’s” to three “C’s”!

BDR_AH Discussion

Although BDR_AH did well for all of the venues considered here, even the least competitive venue (Age Group Winners in midsized 5K races) represents athletes who are well above average.  Moreover, even though the BDR_AH model fits the local venues reasonably well, the fit for the state and world venues is consistently better.  Consequently, the trend is somewhat worrisome in that most of the applications of age handicapping/grading probably occur at performance levels equal to local age group winners and below.

This raises the question: how should the effect of aging on maximal athletic performance be modeled across the full spectrum of human ability, ranging from ordinary individuals up to the most elite world class athletes? (Since a large percentage of older individuals may not be able to complete an endurance race at a running speed, their maximal performance may need to be measured on a treadmill or in some other more controlled venue.)

At the other end of the spectrum, what is the best way to estimate the frontier or upper limit of human performance at each age? Should it be directly estimated as with YALE_AG and WMA_AG using only “non-dominated” single age running records?  Or should it be extrapolated from models, such as BDR_AH, which are initially derived to fit the entire complement of single age running records?

The evaluations in this study were based on the age range 35-85. This was done for consistency with common conventions related to “masters” runners.  However, many local races start the masters category at age 40, and Racing Among the Ages suggests that the last major inflection points for both participation rates and straggling rates occur between the ages of 40 and 50 in local 5K races.  So, what is the best age range to use in evaluating the performance of the various age handicapping methods for masters runners?  And this of course raises another question:  how should we evaluate the performance of age handicapping methods at other life stages?

Future research into all of these questions promises to deliver important insights into the aging process and human performance.