lm() model
Highlights & Limitations
- Supports prediction intervals. It uses the qr.solve() function to parse the interval coefficient of each term.
- Supports categorical variables and interactions.
- Only treatment contrasts (contr.treatment) are supported.
- offset is supported.
- In-line functions in the formulas are not supported:
  - OK - wt ~ mpg + am
  - OK - mutate(mtcars, newam = paste0(am)) and then wt ~ mpg + newam
  - Not OK - wt ~ mpg + as.factor(am)
  - Not OK - wt ~ mpg + as.character(am)
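The OK workaround in the list above can be checked with base R. This is a minimal sketch (the names cars, supported and inline are illustrative, not part of tidypredict): it prepares the character column before fitting and confirms that the estimates match those of the in-line as.factor() version, so only the term names differ.

```r
# Prepare the categorical column ahead of time (mirrors the mutate() step).
cars <- mtcars
cars$newam <- paste0(cars$am)

# Formula without in-line functions: the supported form.
supported <- lm(wt ~ mpg + newam, data = cars)

# Same model written with an in-line function: the unsupported form.
inline <- lm(wt ~ mpg + as.factor(am), data = mtcars)

# Term names differ, estimates do not.
names(coef(supported))
names(coef(inline))
all.equal(unname(coef(supported)), unname(coef(inline)))
```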
How it works
library(dplyr)
df <- mtcars %>%
  mutate(char_cyl = paste0("cyl", cyl)) %>%
  select(mpg, wt, char_cyl, am)
model <- lm(mpg ~ wt + char_cyl, offset = am, data = df)
tidypredict_sql() returns a SQL query in which each coefficient (model$coefficients) is applied to the matching variable or categorical variable value. Each categorical level becomes one short CASE WHEN statement that is multiplied by its coefficient. The offset field or value, if one is provided, is appended at the end.
library(tidypredict)
tidypredict_sql(model, dbplyr::simulate_mssql())
## <SQL> ((((32.4105336886021) + ((`wt`) * (-2.83243330448326))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) = 'FALSE') THEN (0.0) END) * (-4.26714873091281))) + ((CASE WHEN (((`char_cyl`) = ('cyl8')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl8')) = 'FALSE') THEN (0.0) END) * (-6.12588309683682))) + (`am`)
Alternatively, use tidypredict_to_column() if the results are to be used or previewed in dplyr.
df %>%
  tidypredict_to_column(model) %>%
  head(10)
##     mpg    wt char_cyl am      fit
## 1  21.0 2.620     cyl6  1 21.72241
## 2  21.0 2.875     cyl6  1 21.00014
## 3  22.8 2.320     cyl4  1 26.83929
## 4  21.4 3.215     cyl6  0 19.03711
## 5  18.7 3.440     cyl8  0 16.54108
## 6  18.1 3.460     cyl6  0 18.34317
## 7  14.3 3.570     cyl8  0 16.17286
## 8  24.4 3.190     cyl4  0 23.37507
## 9  22.8 3.150     cyl4  0 23.48837
## 10 19.2 3.440     cyl6  0 18.39981
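The fit column can be reproduced by hand, which makes the structure of the generated query easier to follow. The following is a minimal base R sketch, not tidypredict's own code: it rebuilds df without dplyr, applies each coefficient to its variable or level indicator, and then adds the offset. The names b and manual are illustrative.

```r
# Rebuild the example data and model with base R only.
df <- mtcars[, c("mpg", "wt", "am")]
df$char_cyl <- paste0("cyl", mtcars$cyl)
model <- lm(mpg ~ wt + char_cyl, offset = am, data = df)

b <- coef(model)

# Same shape as the generated SQL: intercept, numeric term,
# one 0/1 indicator per non-reference level, then the offset.
manual <- b[["(Intercept)"]] +
  df$wt * b[["wt"]] +
  ifelse(df$char_cyl == "cyl6", 1, 0) * b[["char_cylcyl6"]] +
  ifelse(df$char_cyl == "cyl8", 1, 0) * b[["char_cylcyl8"]] +
  df$am

# fitted() already includes the offset for this model.
all.equal(manual, unname(fitted(model)))
head(manual, 3)
```

The reference level (cyl4) has no term of its own: both indicators are 0 for those rows, so only the intercept, the wt term and the offset contribute.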
Prediction intervals
Use tidypredict_sql_interval() to get the SQL query that computes the prediction interval. The query returns the distance from the fit to each bound, not the bounds themselves. The interval level defaults to 0.95.
tidypredict_sql_interval(model, dbplyr::simulate_mssql())
## <SQL> (2.04840714179524) * SQRT((((((-0.176776695296637) * (-0.176776695296637) * (6.63799055122669)) + (((-0.590557271637747) + ((`wt`) * (0.183559646169165))) * ((-0.590557271637747) + ((`wt`) * (0.183559646169165))) * (6.63799055122669))) + ((((-0.126215672528828) + ((`wt`) * (0.0101118696567173))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) = 'FALSE') THEN (0.0) END) * (0.428266330860589))) * (((-0.126215672528828) + ((`wt`) * (0.0101118696567173))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) = 'FALSE') THEN (0.0) END) * (0.428266330860589))) * (6.63799055122669))) + (((((0.386215468111418) + ((`wt`) * (-0.230516217152034))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) = 'FALSE') THEN (0.0) END) * (0.332336511639638))) + ((CASE WHEN (((`char_cyl`) = ('cyl8')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl8')) = 'FALSE') THEN (0.0) END) * (0.646203930513815))) * ((((0.386215468111418) + ((`wt`) * (-0.230516217152034))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) = 'FALSE') THEN (0.0) END) * (0.332336511639638))) + ((CASE WHEN (((`char_cyl`) = ('cyl8')) = 'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl8')) = 'FALSE') THEN (0.0) END) * (0.646203930513815))) * (6.63799055122669))) + (6.63799055122669))
Prediction intervals also work in tidypredict_to_column(); just set the add_interval argument to TRUE.
df %>%
  tidypredict_to_column(model, add_interval = TRUE) %>%
  head(10)
##     mpg    wt char_cyl am      fit    upper    lower
## 1  21.0 2.620     cyl6  1 21.72241 27.41716 16.02765
## 2  21.0 2.875     cyl6  1 21.00014 26.65467 15.34560
## 3  22.8 2.320     cyl4  1 26.83929 32.35180 21.32678
## 4  21.4 3.215     cyl6  0 19.03711 24.68113 13.39309
## 5  18.7 3.440     cyl8  0 16.54108 22.07276 11.00940
## 6  18.1 3.460     cyl6  0 18.34317 24.01030 12.67603
## 7  14.3 3.570     cyl8  0 16.17286 21.67635 10.66938
## 8  24.4 3.190     cyl4  0 23.37507 29.06408 17.68606
## 9  22.8 3.150     cyl4  0 23.48837 29.16231 17.81443
## 10 19.2 3.440     cyl6  0 18.39981 24.06411 12.73552
Under the hood
The parser reads several parts of the lm object to tabulate all of the needed variables. One entry per coefficient is added to the final table. Those entries hold the results of qr.solve(), already computed and placed in columns with a qr_ prefix. There is one qr_ column per coefficient.
Other variables are added at the end. Some variables are not required for every parsed model. For example, offset is listed because it is part of the formula (call) of the model; if a given model had no offset, that line would not exist.
parse_model(model)
## # A tibble: 10 x 10
##    labels  estimate type  field_1 field_2    qr_1    qr_2    qr_3    qr_4
##    <chr>      <dbl> <chr> <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>
##  1 (Inter~     32.4 term  <NA>    <NA>    - 0.177 - 0.591 - 0.126   0.386
##  2 wt        - 2.83 term  <NA>    {{:}}         0   0.184  0.0101 - 0.231
##  3 char_c~   - 4.27 term  cyl6    <NA>          0       0   0.428   0.332
##  4 char_c~   - 6.13 term  cyl8    <NA>          0       0       0   0.646
##  5 labels         0 vari~ char_c~ wt           NA      NA      NA      NA
##  6 model         NA vari~ <NA>    <NA>         NA      NA      NA      NA
##  7 version       NA vari~ <NA>    <NA>         NA      NA      NA      NA
##  8 residu~       NA vari~ <NA>    <NA>         NA      NA      NA      NA
##  9 sigma2        NA vari~ <NA>    <NA>         NA      NA      NA      NA
## 10 offset        NA vari~ <NA>    <NA>         NA      NA      NA      NA
## # ... with 1 more variable: vals <chr>
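The qr_ columns and the sigma2 entry are the ingredients of the usual prediction interval for a linear model. As a hedged base R sketch (names such as r_inv and half_width are illustrative, and this is not necessarily how the package organizes the computation internally), the inverse of the R factor can be obtained with qr.solve(), combined with the residual variance, and checked against predict():

```r
# Rebuild the example data and model with base R only.
df <- mtcars[, c("mpg", "wt", "am")]
df$char_cyl <- paste0("cyl", mtcars$cyl)
model <- lm(mpg ~ wt + char_cyl, offset = am, data = df)

X <- model.matrix(model)             # one column per coefficient
r_inv <- qr.solve(qr.R(model$qr))    # inverse of the R factor of the QR decomposition

# Residual variance and the t quantile for a 0.95 interval.
sigma2 <- sum(residuals(model)^2) / df.residual(model)
t_crit <- qt(0.975, df.residual(model))

# Each column of X %*% r_inv is one per-coefficient term of the interval formula;
# the terms are squared, scaled by sigma2, summed, and sigma2 is added once more.
half_width <- t_crit * sqrt(rowSums((X %*% r_inv)^2) * sigma2 + sigma2)

# Reference: predict() on the training data (it warns about future responses).
ref <- suppressWarnings(predict(model, interval = "prediction", level = 0.95))
all.equal(unname(half_width), unname(ref[, "upr"] - ref[, "fit"]))

# First row of the inverse, for comparison with the qr_ columns of row 1.
round(r_inv[1, ], 3)
```

Adding half_width to the fit and subtracting it from the fit gives the upper and lower columns.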
The output from parse_model() is transformed into a dplyr, a.k.a. Tidy Eval, formula. All categorical variables are handled with ifelse(), as the output below shows.
tidypredict_fit(model)
## ((((32.4105336886021) + ((wt) * (-2.83243330448326))) + ((ifelse((char_cyl) ==
## ("cyl6"), 1, 0)) * (-4.26714873091281))) + ((ifelse((char_cyl) ==
## ("cyl8"), 1, 0)) * (-6.12588309683682))) + (am)
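To illustrate what a formula like this is, the sketch below builds a comparable unevaluated expression with base R's bquote() and evaluates it against the data with eval(). It only mimics the idea: tidypredict builds its expression from the parse_model() output, and the names fit_expr and manual are hypothetical.

```r
# Rebuild the example data and model with base R only.
df <- mtcars[, c("mpg", "wt", "am")]
df$char_cyl <- paste0("cyl", mtcars$cyl)
model <- lm(mpg ~ wt + char_cyl, offset = am, data = df)

b <- coef(model)

# bquote() splices the numeric coefficients into an unevaluated call;
# column names (wt, char_cyl, am) stay as symbols to be resolved later.
fit_expr <- bquote(
  .(b[["(Intercept)"]]) +
    wt * .(b[["wt"]]) +
    ifelse(char_cyl == "cyl6", 1, 0) * .(b[["char_cylcyl6"]]) +
    ifelse(char_cyl == "cyl8", 1, 0) * .(b[["char_cylcyl8"]]) +
    am
)

fit_expr                      # a language object, not a vector of predictions

# Evaluating the expression with the data frame as the environment yields the fit.
manual <- eval(fit_expr, df)
all.equal(manual, unname(fitted(model)))
```

Because the expression carries only constants and column names, it can be evaluated wherever those columns exist, which is what allows it to be translated to SQL.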
tidypredict_interval() puts together the Tidy Eval formula for the prediction interval:
tidypredict_interval(model)
## (2.04840714179524) * sqrt((((((-0.176776695296637) * (-0.176776695296637) *
## (6.63799055122669)) + (((-0.590557271637747) + ((wt) * (0.183559646169165))) *
## ((-0.590557271637747) + ((wt) * (0.183559646169165))) * (6.63799055122669))) +
## ((((-0.126215672528828) + ((wt) * (0.0101118696567173))) +
## ((ifelse((char_cyl) == ("cyl6"), 1, 0)) * (0.428266330860589))) *
## (((-0.126215672528828) + ((wt) * (0.0101118696567173))) +
## ((ifelse((char_cyl) == ("cyl6"), 1, 0)) * (0.428266330860589))) *
## (6.63799055122669))) + (((((0.386215468111418) + ((wt) *
## (-0.230516217152034))) + ((ifelse((char_cyl) == ("cyl6"),
## 1, 0)) * (0.332336511639638))) + ((ifelse((char_cyl) == ("cyl8"),
## 1, 0)) * (0.646203930513815))) * ((((0.386215468111418) +
## ((wt) * (-0.230516217152034))) + ((ifelse((char_cyl) == ("cyl6"),
## 1, 0)) * (0.332336511639638))) + ((ifelse((char_cyl) == ("cyl8"),
## 1, 0)) * (0.646203930513815))) * (6.63799055122669))) + (6.63799055122669))
From there, the Tidy Eval formula can be used anywhere it can be evaluated. tidypredict provides three paths:
- Use it directly inside dplyr: mutate(df, !! tidypredict_fit(model))
- Add tidypredict_to_column(model) to a piped command set
- Use tidypredict_sql(model, dbplyr::simulate_mssql()) to retrieve the SQL statement
The same applies to the prediction interval functions.
How it performs
Testing the tidypredict results is easy. The tidypredict_test() function automatically uses the lm model object's data frame to compare the results of tidypredict_fit(), and optionally tidypredict_interval(), with those given by predict():
tidypredict_test(model)
## tidypredict test results
## Difference threshold: 1e-12
##
## All results are within the difference threshold
To include prediction intervals in the test, set the include_intervals argument to TRUE:
tidypredict_test(model, include_intervals = TRUE)
## tidypredict test results
## Difference threshold: 1e-12
##
## All results are within the difference threshold