Digression Alert: Coding in R Blog
Years ago, I began a blog that was part first-person narrative, part tutorial. If you don’t know, I am an adjunct professor who has used R to teach statistics and sports analytics. Sometimes I come across a problem that isn’t necessarily worth teaching my students but is still a learning opportunity, so I share it in blog form.
Now that this Substack has its own life, I figured I would migrate those posts here. Hope this post helps you!
Learning Tidymodels with Baseball Data
Before the 2021 Major League Baseball season was completed, I attended my first L.A. R Users Group meeting, where I was introduced to the tidymodels package. The name says it all: using syntax from the tidyverse to create statistical models. RStudio software engineer Max Kuhn walked us through how to create a number of statistical models, all in one straightforward workflow. The talk left me inspired both to master this package and to write this post.
I already code in the tidyverse because the language is intuitive and makes it easy to convey what I am trying to do; communicating code and statistical approaches is paramount in my work. But I was inspired to write for another reason. Most educational lectures use datasets any statistics student has probably memorized. They are almost always useful for understanding neat and tidy concepts and trends, but not always applicable to what a data scientist will actually study. So, as someone who has made sports his vocation, I am always looking for ways to adapt these tutorials to my domain. This post aims to teach you a bit about tidymodels and baseball data, and to expose you to advanced modeling techniques that could shape the future of America’s Pastime.
Installing Necessary Packages
There are two main packages to install. The first is tidymodels itself:
install.packages("tidymodels")
The second package is for the data. While there are several resources for loading baseball data into your workspace, this tutorial will highlight the baseballr package.^[https://billpetti.github.io/baseballr/] Bill Petti has done a masterful job keeping this package updated so that users can scrape sophisticated data from websites like Baseball-Reference.com. Note: this package can also scrape Baseball Savant and FanGraphs. To install it:
# install.packages("devtools")
devtools::install_github("BillPetti/baseballr")
Exploring the Data
Runs scored and runs allowed have provided a clear foundation for explaining which teams win the most games. But which other factors should go into a model predicting the winning percentage for a team’s entire season?
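The classic illustration is Bill James’s Pythagorean expectation, which estimates winning percentage from runs scored and runs allowed alone; the pythW-L% column in the standings scraped below is Baseball-Reference’s take on the same idea (the site uses a modified exponent, so its numbers differ slightly from this basic version):
# Pythagorean expectation: expected winning percentage from runs scored (RS)
# and runs allowed (RA). The classic exponent is 2.
pyth_wpct <- function(RS, RA) RS^2 / (RS^2 + RA^2)
pyth_wpct(RS = 289, RA = 229)  # 2020 Rays' totals from the table below, about 0.614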
Data acquisition is basically a three-step process: scrape team records, scrape team statistics from a separate page, and join everything into one dataframe. For this exercise I will only look at the last ten completed seasons (2011-2020), so each row represents one team from one season (30 teams times 10 seasons = 300 observations). Using the standings_on_date_bref() command, here is how to scrape team standings:
library(baseballr)

al_standings_2011 <- standings_on_date_bref("2011-09-28", "AL Overall",
from = FALSE)
nl_standings_2011 <- standings_on_date_bref("2011-09-28", "NL Overall",
from = FALSE)
al_standings_2012 <- standings_on_date_bref("2012-10-03", "AL Overall",
from = FALSE)
nl_standings_2012 <- standings_on_date_bref("2012-10-03", "NL Overall",
from = FALSE)
al_standings_2013 <- standings_on_date_bref("2013-09-30", "AL Overall",
from = FALSE)
nl_standings_2013 <- standings_on_date_bref("2013-09-30", "NL Overall",
from = FALSE)
al_standings_2014 <- standings_on_date_bref("2014-09-28", "AL Overall",
from = FALSE)
nl_standings_2014 <- standings_on_date_bref("2014-09-28", "NL Overall",
from = FALSE)
al_standings_2015 <- standings_on_date_bref("2015-10-04", "AL Overall",
from = FALSE)
nl_standings_2015 <- standings_on_date_bref("2015-10-04", "NL Overall",
from = FALSE)
al_standings_2016 <- standings_on_date_bref("2016-10-02", "AL Overall",
from = FALSE)
nl_standings_2016 <- standings_on_date_bref("2016-10-02", "NL Overall",
from = FALSE)
al_standings_2017 <- standings_on_date_bref("2017-10-01", "AL Overall",
from = FALSE)
nl_standings_2017 <- standings_on_date_bref("2017-10-01", "NL Overall",
from = FALSE)
al_standings_2018 <- standings_on_date_bref("2018-10-01", "AL Overall",
from = FALSE)
nl_standings_2018 <- standings_on_date_bref("2018-10-01", "NL Overall",
from = FALSE)
al_standings_2019 <- standings_on_date_bref("2019-10-01", "AL Overall",
from = FALSE)
nl_standings_2019 <- standings_on_date_bref("2019-10-01", "NL Overall",
from = FALSE)
al_standings_2020 <- standings_on_date_bref("2020-09-27", "AL Overall",
from = FALSE)
nl_standings_2020 <- standings_on_date_bref("2020-09-27", "NL Overall",
from = FALSE)
standings_2011 <- dplyr::bind_rows(al_standings_2011$`AL Overall_up to_2011-09-28`,
nl_standings_2011$`NL Overall_up to_2011-09-28`)
standings_2012 <- dplyr::bind_rows(al_standings_2012$`AL Overall_up to_2012-10-03`,
nl_standings_2012$`NL Overall_up to_2012-10-03`)
standings_2013 <- dplyr::bind_rows(al_standings_2013$`AL Overall_up to_2013-09-30`,
nl_standings_2013$`NL Overall_up to_2013-09-30`)
standings_2014 <- dplyr::bind_rows(al_standings_2014$`AL Overall_up to_2014-09-28`,
nl_standings_2014$`NL Overall_up to_2014-09-28`)
standings_2015 <- dplyr::bind_rows(al_standings_2015$`AL Overall_up to_2015-10-04`,
nl_standings_2015$`NL Overall_up to_2015-10-04`)
standings_2016 <- dplyr::bind_rows(al_standings_2016$`AL Overall_up to_2016-10-02`,
nl_standings_2016$`NL Overall_up to_2016-10-02`)
standings_2017 <- dplyr::bind_rows(al_standings_2017$`AL Overall_up to_2017-10-01`,
nl_standings_2017$`NL Overall_up to_2017-10-01`)
standings_2018 <- dplyr::bind_rows(al_standings_2018$`AL Overall_up to_2018-10-01`,
nl_standings_2018$`NL Overall_up to_2018-10-01`)
standings_2019 <- dplyr::bind_rows(al_standings_2019$`AL Overall_up to_2019-10-01`,
nl_standings_2019$`NL Overall_up to_2019-10-01`)
standings_2020 <- dplyr::bind_rows(al_standings_2020$`AL Overall_up to_2020-09-27`,
nl_standings_2020$`NL Overall_up to_2020-09-27`)
While this command is straightforward, each season ended on a different date, so I prefer writing this redundant season-by-season code over a for loop; it keeps me from missing any games. A quick inspection of the first few results shows what has been assembled:
head(standings_2020)
# A tibble: 6 x 8
# Tm W L `W-L%` GB RS RA `pythW-L%`
# <chr> <int> <int> <dbl> <chr> <int> <int> <dbl>
#1 TBR 40 20 0.667 -- 289 229 0.605
#2 MIN 36 24 0.6 4.0 269 215 0.601
#3 OAK 36 24 0.6 4.0 274 232 0.576
#4 CHW 35 25 0.583 5.0 306 246 0.599
#5 CLE 35 25 0.583 5.0 248 209 0.578
#6 NYY 33 27 0.55 7.0 315 270 0.57
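If you would rather not copy that block for every season, the same tables can be built by iterating over a named vector of end dates. The sketch below is just an alternative to the explicit code above; the season_end_dates and scrape_season_standings names are mine, and it assumes the same package behavior shown above (each call returning a one-element named list):
library(dplyr)
library(purrr)

season_end_dates <- c("2011" = "2011-09-28", "2012" = "2012-10-03",
                      "2013" = "2013-09-30", "2014" = "2014-09-28",
                      "2015" = "2015-10-04", "2016" = "2016-10-02",
                      "2017" = "2017-10-01", "2018" = "2018-10-01",
                      "2019" = "2019-10-01", "2020" = "2020-09-27")

scrape_season_standings <- function(date) {
  al <- standings_on_date_bref(date, "AL Overall", from = FALSE)
  nl <- standings_on_date_bref(date, "NL Overall", from = FALSE)
  bind_rows(al[[1]], nl[[1]])  # combine the AL and NL tables for that date
}

# A named list of ten tibbles, one per season (e.g., standings_list[["2020"]])
standings_list <- map(season_end_dates, scrape_season_standings)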
Next, for explanatory variables, let’s look at standard batting box-score statistics. I have performed similar exercises combining hitting and pitching data, but those predictors often cancel each other out, so for ease I will stick to just the offense. Here, the statistics will be scraped more manually, so the XML and RCurl packages will be needed:
#install.packages("XML")
#install.packages("RCurl")
#https://cran.r-project.org/web/packages/XML/index.html and
#https://cran.r-project.org/web/packages/RCurl/index.html for more info
library(XML)
library(RCurl)
library(dplyr)   # for the pipe and filter() used below
site_2020 <- paste0("https://www.baseball-reference.com/leagues/MLB/2020.shtml")
url_2020 <- getURL(site_2020)
batting_2020 <- readHTMLTable(url_2020)
batting_2020 <- batting_2020$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2019 <- paste0("https://www.baseball-reference.com/leagues/MLB/2019.shtml")
url_2019 <- getURL(site_2019)
batting_2019 <- readHTMLTable(url_2019)
batting_2019 <- batting_2019$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2018 <- paste0("https://www.baseball-reference.com/leagues/MLB/2018.shtml")
url_2018 <- getURL(site_2018)
batting_2018 <- readHTMLTable(url_2018)
batting_2018 <- batting_2018$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2017 <- paste0("https://www.baseball-reference.com/leagues/MLB/2017.shtml")
url_2017 <- getURL(site_2017)
batting_2017 <- readHTMLTable(url_2017)
batting_2017 <- batting_2017$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2016 <- paste0("https://www.baseball-reference.com/leagues/MLB/2016.shtml")
url_2016 <- getURL(site_2016)
batting_2016 <- readHTMLTable(url_2016)
batting_2016 <- batting_2016$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2015 <- paste0("https://www.baseball-reference.com/leagues/MLB/2015.shtml")
url_2015 <- getURL(site_2015)
batting_2015 <- readHTMLTable(url_2015)
batting_2015 <- batting_2015$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2014 <- paste0("https://www.baseball-reference.com/leagues/MLB/2014.shtml")
url_2014 <- getURL(site_2014)
batting_2014 <- readHTMLTable(url_2014)
batting_2014 <- batting_2014$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2013 <- paste0("https://www.baseball-reference.com/leagues/MLB/2013.shtml")
url_2013 <- getURL(site_2013)
batting_2013 <- readHTMLTable(url_2013)
batting_2013 <- batting_2013$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2012 <- paste0("https://www.baseball-reference.com/leagues/MLB/2012.shtml")
url_2012 <- getURL(site_2012)
batting_2012 <- readHTMLTable(url_2012)
batting_2012 <- batting_2012$teams_standard_batting %>% filter(Tm != 'LgAvg')
site_2011 <- paste0("https://www.baseball-reference.com/leagues/MLB/2011.shtml")
url_2011 <- getURL(site_2011)
batting_2011 <- readHTMLTable(url_2011)
batting_2011 <- batting_2011$teams_standard_batting %>% filter(Tm != 'LgAvg')
One thing to note is that these batting tables include a league summary line, which is why the code above filters out the 'LgAvg' row. Note: for a complete list of variable definitions, click here.
Again, a quick inspection of these data:
head(batting_2011)
# Tm Bat BatAge R/G G PA AB R H 2B 3B HR RBI SB CS BB SO
#1 ARI 51 28.3 4.51 162 6096 5421 731 1357 293 37 172 702 133 55 531 1249
#2 ATL 45 28.9 3.96 162 6169 5528 641 1345 244 16 173 606 77 44 504 1260
#3 BAL 50 28.3 4.37 162 6156 5585 708 1434 273 13 191 684 81 25 452 1120
#4 BOS 49 30.0 5.40 162 6414 5710 875 1600 352 35 203 842 102 42 578 1108
#5 CHC 42 29.2 4.04 162 6130 5549 654 1423 285 36 148 610 69 23 425 1202
#6 CHW 42 30.0 4.04 162 6159 5502 654 1387 252 16 154 625 81 53 475 989
# BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB LOB
#1 .250 .322 .413 .736 99 2240 82 61 50 33 55 1064
#2 .243 .308 .387 .695 90 2140 113 28 75 30 45 1109
#3 .257 .316 .413 .729 97 2306 154 52 24 43 24 1069
#4 .280 .349 .461 .810 116 2631 136 50 22 50 52 1190
#5 .256 .314 .401 .715 95 2224 123 59 60 35 35 1141
#6 .252 .319 .388 .706 89 2133 125 84 52 46 31 1126
Finally, it is time to merge each season’s batting and league standings dataframes, then join all 10 seasons together:
merge_2011 <- merge(batting_2011,standings_2011,by="Tm")
merge_2012 <- merge(batting_2012,standings_2012,by="Tm")
merge_2013 <- merge(batting_2013,standings_2013,by="Tm")
merge_2014 <- merge(batting_2014,standings_2014,by="Tm")
merge_2015 <- merge(batting_2015,standings_2015,by="Tm")
merge_2016 <- merge(batting_2016,standings_2016,by="Tm")
merge_2017 <- merge(batting_2017,standings_2017,by="Tm")
merge_2018 <- merge(batting_2018,standings_2018,by="Tm")
merge_2019 <- merge(batting_2019,standings_2019,by="Tm")
merge_2020 <- merge(batting_2020,standings_2020,by="Tm")
final_data <- rbind(merge_2011, merge_2012, merge_2013, merge_2014,
merge_2015, merge_2016, merge_2017, merge_2018,
merge_2019, merge_2020) %>% rename(record = 'W-L%')
One cosmetic change was renaming W-L% to record, since the original name contains characters that can cause problems in R code. Now, one more quick inspection:
head(final_data)
#Tm #Bat BatAge R/G G PA AB R H 2B 3B HR RBI SB CS BB SO
#1 ARI 51 28.3 4.51 162 6096 5421 731 1357 293 37 172 702 133 55 531 1249
#2 ATL 45 28.9 3.96 162 6169 5528 641 1345 244 16 173 606 77 44 504 1260
#3 BAL 50 28.3 4.37 162 6156 5585 708 1434 273 13 191 684 81 25 452 1120
#4 BOS 49 30.0 5.40 162 6414 5710 875 1600 352 35 203 842 102 42 578 1108
#5 CHC 42 29.2 4.04 162 6130 5549 654 1423 285 36 148 610 69 23 425 1202
#6 CHW 42 30.0 4.04 162 6159 5502 654 1387 252 16 154 625 81 53 475 989
# BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB LOB W L record GB RS
#1 .250 .322 .413 .736 99 2240 82 61 50 33 55 1064 94 68 0.580 8.0 731
#2 .243 .308 .387 .695 90 2140 113 28 75 30 45 1109 89 73 0.549 13.0 641
#3 .257 .316 .413 .729 97 2306 154 52 24 43 24 1069 69 93 0.426 28.0 708
#4 .280 .349 .461 .810 116 2631 136 50 22 50 52 1190 90 72 0.556 7.0 875
#5 .256 .314 .401 .715 95 2224 123 59 60 35 35 1141 71 91 0.438 31.0 654
#6 .252 .319 .388 .706 89 2133 125 84 52 46 31 1126 79 83 0.488 18.0 654
# RA pythW-L%
#1 662 0.545
#2 605 0.526
#3 860 0.412
#4 737 0.578
#5 756 0.434
#6 706 0.465
This final dataframe features 300 observations: 30 teams across 10 seasons, with basic hitting statistics merged with each team’s record, winning percentage, runs scored and runs allowed. However, there is one issue worth resolving at this stage: because of the way these data were scraped, variables that should be numeric are stored as characters:
sapply(final_data, class)
# Tm #Bat BatAge R/G G PA
"character" "character" "character" "character" "character" "character"
# AB R H 2B 3B HR
"character" "character" "character" "character" "character" "character"
# RBI SB CS BB SO BA
"character" "character" "character" "character" "character" "character"
# OBP SLG OPS OPS+ TB GDP
"character" "character" "character" "character" "character" "character"
# HBP SH SF IBB LOB W
"character" "character" "character" "character" "character" "integer"
# L record GB RS RA pythW-L%
# "integer" "numeric" "character" "integer" "integer" "numeric"
Fortunately, there is a quick fix to convert every variable except the team name to numeric. These class changes could be handled in the first step of tidymodels, but it is often better to do some steps outside of the “pre-processing” stage if you want the changes to be permanent:
library(magrittr)
cols = c(2:36)
final_data[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
sapply(final_data, class)
# Tm #Bat BatAge R/G G PA
"character" "numeric" "numeric" "numeric" "numeric" "numeric"
# AB R H 2B 3B HR
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# RBI SB CS BB SO BA
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# OBP SLG OPS OPS+ TB GDP
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# HBP SH SF IBB LOB W
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# L record GB RS RA pythW-L%
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
One issue to consider now is which sets of variables to include. Because of COVID-19, Major League Baseball shortened its 2020 season: while every other season in the dataframe had 162 games, the 2020 campaign had only 60. Variables representing totals, like total hits or total home runs, would therefore skew the results. Instead, the explanatory variables should be rates or per-game averages, which stay on a comparable scale regardless of season length and remain roughly normally distributed across the pooled data. While statistics like BA (batting average) are already in the dataframe, per-game stolen base, caught stealing and grounded-into-double-play rates are not.
Let’s add those variables and also rename OPS+ so the plus sign doesn’t cause problems in code:
final_data <- final_data %>%
mutate(SB.avg = SB/G, CS.avg = CS/G, GDP.avg = GDP/G) %>%
rename(OPS.plus = `OPS+`)
Now, we can proceed with tidymodels.
tidymodels
If you were to model a handful of these explanatory variables against win percentage to determine which modeling technique is most predictive, the process would be conventional (a minimal base-R sketch follows this list):
run the desired models using training data
predict win percentage using testing data
choose the model that predicts best according to the most important goodness-of-fit criteria
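As a point of reference, a minimal base-R sketch of that conventional process might look like the following; the split, seed and predictors here are ad hoc choices for illustration only:
# Ad hoc 80/20 split (independent of the tidymodels split used later)
set.seed(42)
train_rows <- sample(nrow(final_data), size = floor(0.8 * nrow(final_data)))
conv_train <- final_data[train_rows, ]
conv_test  <- final_data[-train_rows, ]

# Fit on training data, predict on testing data, score with RMSE
conv_fit   <- lm(record ~ OBP + SLG + SB.avg, data = conv_train)
conv_preds <- predict(conv_fit, newdata = conv_test)
sqrt(mean((conv_test$record - conv_preds)^2))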
Though tidymodels offers the same framework, the package has its own vocabulary and workflow that ultimately make model specification far easier and offer the flexibility in tuning parameters needed for more complex machine learning techniques.^[https://www.tidymodels.org/start/tuning/]
In the past, models had to be coded individually: a linear regression would have its own block of code, a Bayesian model another block, and so on, and it was up to the user to save summary metrics to judge goodness-of-fit, not to mention dealing with each model’s own syntax. The caret package alleviated some of this burden by making a number of models available with ease. Consider tidymodels another evolutionary step toward trying out a number of models with little effort. Other advantages of the package will become apparent at each step of the experiment.
Just like with any experiment, tidymodels can be broken down into three steps:
recipe: preprocess the data and specify the model formula
prep: execute those preprocessing steps
bake: apply them to test the models on the testing data
A Simple Introduction
This dataframe (and the experiment) assumes uniformity, meaning it does not matter how the data are split between training and testing. Here, let’s perform an 80-20 split and set a seed so the results are reproducible:
library(tidymodels)  # loads rsample, recipes, parsnip, workflows, tune, yardstick and broom

set.seed(1234)
data_split <- initial_split(final_data, prop = 0.80)
data_split
data_train <- training(data_split)
data_test <- testing(data_split)
Let’s first create a simple recipe to specify a linear regression:
simple_data <-
recipe(record ~ OBP + SLG + OPS.plus + SB.avg
+ CS.avg + GDP.avg, data = data_train)
The astute data scientist and baseball fan might notice a problem: OPS.plus is already built from OBP and SLG, so there are reasonable concerns about multicollinearity. Fortunately, creating interaction terms is easy in tidymodels thanks to step functions, which allow a variety of pre-processing operations to be applied to the training data in just a few lines of code. Here is how interaction terms are created with the step_interact() command:
simple_data <-
recipe(record ~ OBP + SLG + OPS.plus + SB.avg
+ CS.avg + GDP.avg, data = data_train) %>%
step_interact(terms = ~ OBP:OPS.plus + SLG:OPS.plus) %>%
prep()
simple_data
#Data Recipe
#Inputs:
# role #variables
# outcome 1
# predictor 6
#Training data contained 240 data points and no missing data.
#Operations:
#Interactions with OBP:OPS.plus + SLG:OPS.plus [trained]
The prep step was already chained onto the end of the recipe for convenience, and the baking stage is straightforward with just the one model:
test_ex <- bake(simple_data, new_data = data_test)
head(test_ex)
# A tibble: 6 x 9
# OBP SLG OPS.plus SB.avg CS.avg GDP.avg record OBP_x_OPS.plus
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0.322 0.413 99 0.821 0.340 0.506 0.58 31.9
#2 0.308 0.387 90 0.475 0.272 0.698 0.549 27.7
#3 0.314 0.401 95 0.426 0.142 0.759 0.438 29.8
#4 0.325 0.425 102 0.580 0.191 0.704 0.593 33.2
#5 0.305 0.349 85 1.05 0.272 0.648 0.438 25.9
#6 0.292 0.348 84 0.772 0.247 0.506 0.414 24.5
# … with 1 more variable: OPS.plus_x_SLG <dbl>
The bake command produces a tibble of the test data containing the explanatory variables used (including the interaction terms) along with the outcome, record.
If you want to look at the model itself, its summary metrics and its predictions, a little backtracking is required. First, use the prep command again on the recipe above, saving the result as a new object. One piece of advice Max Kuhn gave in his presentation was to include _fit in the model’s name as a reminder that it is a fitted model. Then bake two objects: one from the training data and one from the testing data. Next, fit a model with the same variables on the prepped training data, use the glance() command to look at summary statistics, and use the predict() command to see the results. The code looks like this, with key output included:
data_rec_prepped <- prep(simple_data)
data_train_prepped <- bake(data_rec_prepped, new_data = NULL)
data_test_prepped <- bake(data_rec_prepped, new_data = data_test)
lm_fit <- lm(record ~ ., data = data_train_prepped)
glance(lm_fit)
# A tibble: 1 x 12
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #<dbl>
#1 0.472 0.454 0.0559 25.8 2.70e-28 8 356. -693. -658.
View(tidy(lm_fit))
predict(lm_fit, data_test_prepped %>% head())
# 1 2 3 4 5 6
#0.5123714 0.4431140 0.4952766 0.5352913 0.4599048 0.4092434
The View() command opens a table of the model’s coefficient estimates and p-values.
One coefficient is not intuitive: SLG is negative, implying that predicted record goes down as slugging percentage goes up; the interaction term, however, is positive. Also, the only variable with a p-value greater than .05 was stolen base average; every other variable is statistically significant at the 95% confidence level.
The summary metrics show an R-squared of 0.472, which suggests improvements would be needed before any serious analysis. Again, pitching and defensive data would likely help the goodness-of-fit, but for this tutorial, more advanced modeling techniques will be the next approach.
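One way to make sense of the negative SLG coefficient is to remember that, with the interaction in the model, the slope for SLG depends on OPS.plus: it equals the SLG coefficient plus the interaction coefficient times OPS.plus. A quick sketch, using the baked column name shown earlier (OPS.plus_x_SLG):
coefs <- coef(lm_fit)

# Net effect of SLG evaluated at the average OPS+ in the training data
slg_slope_at_mean_ops <- coefs["SLG"] +
  coefs["OPS.plus_x_SLG"] * mean(data_train_prepped$OPS.plus)
slg_slope_at_mean_ops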
A Workflow
Another advantage of tidymodels is the ability to streamline the model building and selection process through a workflow. The interface is simpler, with fewer objects to keep track of; recipes and prep work are more easily executed; and more advanced models can be summoned concisely.^[https://workflows.tidymodels.org]
The simplified example above used the basic linear model already included in R, but more advanced models may need outside packages. Fortunately, the set_engine() command states which package or system will be used to fit the model; think of the engine as the computational foundation on which more complex models are built. For instance, interaction terms were created in the earlier example because of concerns about multicollinearity. Principal Component Analysis (PCA) can also alleviate such concerns by reducing the predictors to a smaller set of components that are uncorrelated with one another. Because PCA here is just a preprocessing step feeding a standard linear regression, the engine in our baseball experiment can stay set to lm:
reg_model <- linear_reg() %>% set_engine("lm")
Then, just like before, the user can create a recipe, only this time including the step_pca() and step_normalize() step functions to perform PCA and to center and scale the numeric data, respectively. As a standard, let’s set the number of new features to 10:
pca_rec <- recipe(record ~ OBP + SLG + OPS.plus + SB.avg
+ CS.avg + GDP.avg, data = data_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp = 10)
A linear regression model and a PCA recipe have now been written. These can be combined into one workflow:
pca_lm_wflow <- workflow() %>%
add_model(reg_model) %>%
add_recipe(pca_rec)
Using the fit() command, estimates can now be extracted from the training data and predictions from the testing data:
estimate <- fit(pca_lm_wflow, data_train)
predict(estimate, data_test)
# A tibble: 60 x 1
# .pred
# <dbl>
# 1 0.510
# 2 0.444
# 3 0.492
# 4 0.532
# 5 0.448
# 6 0.428
# 7 0.514
# 8 0.441
# 9 0.554
#10 0.637
# … with 50 more rows
Once again, the predict() command produces a tibble of predictions.
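To put a number on how well this particular workflow predicts, the predictions can be scored against the actual records with yardstick, which loads with tidymodels; a quick sketch:
# Attach predictions to the test data and compute root mean square error
bind_cols(data_test, predict(estimate, data_test)) %>%
  rmse(truth = record, estimate = .pred)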
Ultimately, this workflow approach is meant to determine which model predicts best, and there are myriad models to choose from. Previously, coding each and every model would be, to put it mildly, a pain; even keeping track of the appropriate options is time consuming at best. To streamline the process, the command parsnip::parsnip_addin() brings up not just the available model types but also generates the code needed for each specification: the tuning parameters, the engines and so on.
For this workflow, let’s combine four different preprocessing “approaches” with two models: a linear regression and a neural network. The latter, which uses the nnet package as its engine, can be specified this way:
nnet_model <-
mlp(penalty = tune(), hidden_units = tune(), epochs = tune()) %>%
set_engine("nnet") %>%
set_mode("regression")
Next, let’s look at another PCA, except we “tune” the number of new features:
simple_rec <- recipe(record ~ OBP + SLG + OPS.plus + SB.avg
+ CS.avg + GDP.avg, data = data_train) %>%
step_normalize(all_numeric_predictors())
pca_rec <- simple_rec %>%
step_pca(all_numeric_predictors(), num_comp = tune())
Two more recipes are a partial least squares approach and a filtered approach using a high-correlation filter:
pls_rec <- simple_rec %>%
step_pls(all_numeric_predictors(), outcome = "record", num_comp = tune())
filter_rec <- simple_rec %>%
step_corr(all_numeric_predictors(), threshold = tune())
All of these objects can now be included in a single workflow:
baseball_wflow <- workflow_set(
preproc = list(basic = simple_rec, pca = pca_rec, pls = pls_rec,
filtered = filter_rec),
models = list(lm = reg_model, nnet = nnet_model)
)
baseball_wflow
# A workflow set/tibble: 8 x 4
# wflow_id info option result
# <chr> <list> <list> <list>
#1 basic_lm <tibble [1 × 4]> <wrkflw__ > <list [0]>
#2 basic_nnet <tibble [1 × 4]> <wrkflw__ > <list [0]>
#3 pca_lm <tibble [1 × 4]> <wrkflw__ > <list [0]>
#4 pca_nnet <tibble [1 × 4]> <wrkflw__ > <list [0]>
#5 pls_lm <tibble [1 × 4]> <wrkflw__ > <list [0]>
#6 pls_nnet <tibble [1 × 4]> <wrkflw__ > <list [0]>
#7 filtered_lm <tibble [1 × 4]> <wrkflw__ > <list [0]>
#8 filtered_nnet <tibble [1 × 4]> <wrkflw__ > <list [0]>
When this workflow set is printed, all eight models are listed. But look closely at the result column: nothing has been populated. So far we have only stated what the workflow set is supposed to do; nothing has actually been run.
To determine which model performs best, we’ll resample the training data with 5-fold cross-validation (one repeat of the V-fold partitioning). Because step_pls() relies on it, it is also imperative to install the mixOmics package first if that hasn’t been done already:
install.packages("BiocManager")
BiocManager::install('mixOmics')
As for the resampling setup, set a seed for reproducible results:
set.seed(999)
data_folds <- vfold_cv(data_train, v = 5, repeats = 1)
Now, it is time to execute the workflow set using the workflow_map() command:
baseball_res <- baseball_wflow %>%
workflow_map(resamples = data_folds,
seed = 1,
grid = 20)
baseball_res
# A workflow set/tibble: 8 x 4
# wflow_id info option result
# <chr> <list> <list> <list>
#1 basic_lm <tibble [1 × 4]> <wrkflw__ > <rsmp[+]>
#2 basic_nnet <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#3 pca_lm <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#4 pca_nnet <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#5 pls_lm <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#6 pls_nnet <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#7 filtered_lm <tibble [1 × 4]> <wrkflw__ > <tune[+]>
#8 filtered_nnet <tibble [1 × 4]> <wrkflw__ > <tune[+]>
Now, when the baseball_res object is called, the result column is populated with resampling and tuning results, represented by “rsmp” and “tune”. The models have been fit successfully.
The next step is ranking model performance using the rank_results() command:
baseball_res %>%
rank_results(select_best = TRUE) %>%
dplyr::select(wflow_id, .metric, mean, std_err, n, rank)
# A tibble: 16 x 6
# wflow_id .metric mean std_err n rank
# <chr> <chr> <dbl> <dbl> <int> <int>
# 1 basic_nnet rmse 0.0573 0.00346 5 1
# 2 basic_nnet rsq 0.422 0.0952 5 1
# 3 pls_nnet rmse 0.0580 0.00339 5 2
# 4 pls_nnet rsq 0.414 0.0929 5 2
# 5 basic_lm rmse 0.0581 0.00337 5 3
# 6 basic_lm rsq 0.419 0.0944 5 3
# 7 filtered_lm rmse 0.0581 0.00337 5 4
# 8 filtered_lm rsq 0.419 0.0944 5 4
# 9 pls_lm rmse 0.0583 0.00340 5 5
#10 pls_lm rsq 0.415 0.0940 5 5
#11 pca_nnet rmse 0.0599 0.00262 5 6
#12 pca_nnet rsq 0.380 0.0805 5 6
#13 pca_lm rmse 0.0602 0.00248 5 7
#14 pca_lm rsq 0.384 0.0759 5 7
#15 filtered_nnet rmse 0.0625 0.00414 5 8
#16 filtered_nnet rsq 0.321 0.0949 5 8
Two metrics are listed when ranking all eight models, root mean square error (rmse) and R-squared (rsq), giving 16 rows in the new tibble. It turns out the basic neural network had the best results. But, just for fun, let’s look at a graph of these summary statistics for all eight models:
baseball_res %>% autoplot(select_best = TRUE)
Using rmse and rsq, it is also visually clear that the basic neural network has the best results, though the tradeoff for choosing another model is not substantial. It is up to the user to decide whether a more intuitive model would be preferred; in fact, Bayesian analysis is available to compare the size of the differences between models.
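From here, a natural finishing step, not shown in the comparison above but sketched below following the usual tidymodels pattern, is to pull the winning workflow, lock in its best tuning parameters and evaluate it once on the held-out test data:
# Best tuning parameters for the winning workflow, judged by rmse
best_params <- baseball_res %>%
  extract_workflow_set_result("basic_nnet") %>%
  select_best(metric = "rmse")

# Finalize that workflow and fit/evaluate it once on the original split
final_fit <- baseball_res %>%
  extract_workflow("basic_nnet") %>%
  finalize_workflow(best_params) %>%
  last_fit(split = data_split)

collect_metrics(final_fit)  # rmse and rsq on the test set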
Conclusion
This tutorial only scratches the surface of how tidymodels can be applied to sports data. Dozens of other models, classification problems and variable selection are just some of the avenues a data scientist can travel to further master this package. There are several other resources I suggest reading beyond what has already been listed, but perhaps the most thorough can be found here.
Adapting Two-Way ANOVA Presentation
For my lesson on adding a second grouping variable to the ANOVA experience, I came across this presentation. The background information is elegant, the data visualizations are clear, and opportunities to learn more are sprinkled throughout. The only problem with the lesson is that the dataset uses British education terminology: it covers job satisfaction across education levels, and the terms “school”, “college” and “university” refer to compulsory school, a two-year university-preparation program, and undergraduate study and beyond, respectively.
Because I did not want to distract my students from the statistical lesson, I wanted to use the tidyverse to rename these education levels to something more American: high school, undergraduate and postgrad:
library(tidyverse)  # for mutate() and str_replace()

# jobsatisfaction is the dataset from the tutorial linked above
jobsatisfaction <- jobsatisfaction %>%
mutate(education_level = str_replace(education_level, "school",
"High School")) %>%
mutate(education_level = str_replace(education_level, "college",
"Undergrad")) %>%
mutate(education_level = str_replace(education_level, "university",
"Postgrad"))
While I wanted the next step to be exploratory data analysis with boxplots, there was just one problem: R automatically alphabetizes these new education levels. It would make more intuitive sense to have Postgrad after Undergrad. To make this change, I used the factor command:
jobsatisfaction$education_level <-
factor(jobsatisfaction$education_level,
levels = c("High School", "Undergrad", "Postgrad"))
From there I was able to walk through the rest of the tutorial with ease. It is one I would highly recommend if you are learning/teaching any ANOVA approach.
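For completeness, the model at the heart of that tutorial is the two-way ANOVA itself; a minimal sketch on the relabeled data, assuming (as in the tutorial's dataset) that the other columns are named gender and score:
# Two-way ANOVA: job satisfaction score by gender and education level
two_way <- aov(score ~ gender * education_level, data = jobsatisfaction)
summary(two_way)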
Randomly Splitting a Class of Students Into Groups
One of the temptations I have when creating group activities in class is to lean on where students are sitting and divide accordingly. Given all of the research on how important it is to promote diversity of thought, it behooves everyone to randomly create pairings and groups within the class.
Originally, I used the sample command to randomly assign just about anything, but the split command is more effective here: it divides a vector into parts. I am using a base R approach, though slice in dplyr could certainly be used as well. In my case, I am looking to divide my class of 12:
# To create reproducible results
set.seed(123)

# Create a character vector of student names
students <- c("Joe", "Cynthia", "Lisa", "Monica", "Emily", "Aaron",
              "Bobby", "Zoe", "Vanessa", "Chloe", "Brandy", "Jen")

# Randomize the students
random_students <- sample(students)

# To create pairs
split(random_students, rep(1:6, each = 2))

# Or to create groups of three
split(random_students, rep(1:4, each = 3))
Depending upon the number of students, simply adjust the numbers as you see fit.
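If you want something reusable for any roster size, here is a small hypothetical helper (make_groups is my own name, not a built-in); when the group size doesn’t divide the class evenly, the last group is simply smaller:
# Hypothetical helper: shuffle a roster and split it into groups of a given size
make_groups <- function(roster, group_size, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  shuffled <- sample(roster)
  split(shuffled, ceiling(seq_along(shuffled) / group_size))
}

make_groups(students, group_size = 3, seed = 123)  # four groups of three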
Displaying Statistical Output
Recently I showed my class how to write functions for calculating z-tests and confidence intervals at the 68% and 95% levels. Because students analyzing these concepts for the first time deserve well-organized output, I wanted to investigate different ways to make these functions print results as elegantly as possible. The main tasks are labeling the key statistics and knowing where to place line breaks.
As with most other things in R, there are options. One is cat(). Quotation marks are used for spacing and “\n” is for line breaks. For our z-test function, I utilized cat():
z.test = function(x, mu, popvar){
  # z-score: how far the sample mean sits from mu, in standard-error units
  z.score <- round((mean(x) - mu)/(popvar/sqrt(length(x))), 3)
  # one-tailed p-value from the standard normal distribution
  one.tail.p <- round(pnorm(abs(z.score), lower.tail = FALSE), 3)
  cat(" z =", z.score, "\n",
      "one-tailed probability =", one.tail.p, "\n",
      "two-tailed probability =", 2*one.tail.p)}
where “x” is the sample vector, “mu” is the population mean and “popvar” is the population standard deviation (the function divides by popvar/sqrt(n), the standard error). Here, the first line is “z = ” followed by the z-score, then a line break, then the probability labels and their corresponding p-values. So if an IQ vector were used with a population mean of 100 and an sd of 15:
z = 1.733
one-tailed probability = 0.042
two-tailed probability = 0.084
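For reference, a call to the function looks like this; the IQ vector below is made up for illustration, so its output will differ from the numbers above:
iq_sample <- c(112, 98, 105, 120, 101, 116, 94, 108, 111, 103)  # hypothetical scores
z.test(iq_sample, mu = 100, popvar = 15)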
The other option is writeLines(). This command is more concise because you don’t need “\n” for line breaks. Here’s how I used it for our confidence interval function:
confidence.intervals <- function(sample.size, sd, mean){
  standard.error <- sd/sqrt(sample.size)
  # 68% interval: mean +/- 1 standard error
  lower.bound <- round(mean - standard.error, 2)
  upper.bound <- round(mean + standard.error, 2)
  # 95% interval: mean +/- 1.96 standard errors
  big.lower.bound <- round(mean - (1.96*standard.error), 2)
  big.upper.bound <- round(mean + (1.96*standard.error), 2)
  return(writeLines(c("68% CI:",
                      lower.bound, upper.bound, "",
                      "95% CI:", big.lower.bound, big.upper.bound)))
}
Here, each element of the character vector is printed on its own line. The sample output looks like this:
68% CI:
78.79
83.21
95% CI:
76.66
85.34
Not only do you now have options for statistical output, you have functions for computing z-tests and confidence intervals. Don’t say I never did anything for you.