--- title: "R and Polars expressions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{R and Polars expressions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` When we use the `tidyverse`, we use R expressions in mainly three places: `filter()`, `mutate()`, and `summarize()`. ```{r echo=FALSE} library(dplyr, warn.conflicts = FALSE) ``` ```{r eval=FALSE} library(dplyr, warn.conflicts = FALSE) filter(mtcars, am + gear > carb) mutate(mtcars, x = (qsec - mean(qsec) / sd(qsec))) mtcars |> group_by(cyl) |> summarize(x = mean(qsec) / sd(qsec)) ``` This is very convenient but creates a challenge for `tidypolars`. Indeed, while it is possible to pass R functions directly to a Polars Data/LazyFrame, it is **strongly discouraged** to do so because it doesn't take advantage of Polars optimizations. Indeed, Polars comes with dozens of built-in functions for maths (`median`, `var`, `arccos`, ...), string manipulation (`len_chars`, `starts`, ...), and date-time (`hour`, `quarter`, `ordinal_day`, ...). All of these functions are optimized internally and are ran in parallel under the hood, which will not be the case if we pass R functions. However, using these Polars expressions would imply that we need to learn these new functions and this new syntax. To avoid doing that, `tidypolars` will automatically translate R expressions into Polars ones. Basically, **you can keep writing R expressions in most situations**, and they will automatically be translated to Polars syntax. However, there are some situations where this might not work, so this vignette explains the process and the limitations. # How does `tidypolars` translate R expressions into Polars expressions? When `tidypolars` receives an expression, it runs a function `translate()` several times until all components are translated to their Polars equivalent. There are four possible components: single values, column names, external objects, and functions. ## Single values, column names, and external objects If you pass a single value, like `x = 1` or `x = "a"`, it is wrapped into `pl$lit()`. This is also the case for external objects with the difference that these need to be wrapped in `{{ }}` and are evaluated before being wrapped into `pl$lit()`. Column names, like `x = mpg`, are wrapped into `pl$col()`. ``` x = "a" -> x = pl$lit("a") x = {{ some_value }} -> x = pl$lit(*value*) x = mpg -> x = pl$col("mpg") ``` ## Functions Functions are split into two categories: built-in functions (i.e functions provided by base R or by other packages), and user-defined functions (UDF) that are written by the user (you). ### Built-in functions In the first case, `tidypolars` checks the function name and whether it has already been translated internally. For example, if we call the R function `mean(x, trim = 2)`, then it looks for a translation of `mean()`. You can see the list of supported R functions at the bottom of this vignette. Note that most of essential base R functions are supported, as well as many functions from `dplyr` or from `stringr` for example. Now that `tidypolars` knows that a translation of `mean()` exists, it parses the arguments in the call to translate them to the Polars syntax: internally, `x` is converted to `pl$col("x")` if there is a column `"x"` in the data. Sometimes, additional arguments do not have their equivalent in Polars. This is the case for the argument `trim` here. In this case, `tidypolars` ignores this argument and warns the user: ```{r} library(tidypolars) library(polars) mtcars |> as_polars_df() |> mutate(x = mean(mpg, trim = 2)) ``` ### User-defined functions User-defined functions (UDF) are more challenging. Indeed, it is technically possible to inspect the code inside a UDF, but rewriting it to match Polars syntax would be extremely complicated. In this situation, you will have to rewrite your custom function using Polars syntax so that it returns a Polars expression. For example, we could make a function to standardize a column like this: ```{r} pl_standardize <- function(x) { (x - x$mean()) / x$std() } ``` Remember that the column name used as `x` will end up wrapped into `pl$col()`, so to check that your function returns a Polars expression, you have to provide a `pl$col()` call: ```{r} pl_standardize(pl$col("mpg")) ``` This function correctly returns a Polars expression, so we can now use it like any other function: ```{r} mtcars |> as_polars_df() |> mutate(x = pl_standardize(mpg)) ``` ### Special case: `across()` [`across()`](https://dplyr.tidyverse.org/reference/across.html) is a very useful function that applies a function (or a list of functions) to a selection of columns. It accepts built-in functions, UDFs, and anonymous functions. ```{r} mtcars |> as_polars_df() |> mutate( across( .cols = contains("a"), list(mean = mean, stand = pl_standardize, ~ sd(.x)) ) ) ``` Similarly, UDFs and anonymous functions will error if they don't return a Polars expression: ```{r error=TRUE} mtcars |> as_polars_df() |> mutate( across( .cols = contains("a"), .fns = list( mean = mean, function(x) { (x - mean(x)) / sd(x) }, ~ sd(.x) ) ) ) ``` ## List of base R and `tidyverse` functions supported by `tidypolars` ```{r echo=FALSE, message=FALSE} library(dplyr) library(knitr) out <- tribble( ~Package, ~Function, "`base`", "`abs`", "`base`", "`acos`", "`base`", "`acosh`", "`base`", "`all`", "`base`", "`any`", "`base`", "`asin`", "`base`", "`asinh`", "`base`", "`atan`", "`base`", "`atanh`", "`base`", "`ceiling`", "`base`", "`cos`", "`base`", "`cosh`", "`base`", "`cummin`", "`base`", "`cumsum`", "`base`", "`diff`", "`base`", "`exp`", "`base`", "`floor`", "`base`", "`grepl`", "`base`", "`ifelse`", "`base`", "`ISOdatetime`", "`base`", "`length`", "`base`", "`log`", "`base`", "`log10`", "`base`", "`max`", "`base`", "`mean`", "`base`", "`min`", "`base`", "`nchar`", "`base`", "`paste0`", "`base`", "`paste`", "`base`", "`rank`", "`base`", "`rev`", "`base`", "`round`", "`base`", "`sin`", "`base`", "`sinh`", "`base`", "`sort`", "`base`", "`sqrt`", "`base`", "`strptime`", "`base`", "`tan`", "`base`", "`tanh`", "`base`", "`tolower`", "`base`", "`toupper`", "`base`", "`unique`", "`base`", "`which.min`", "`base`", "`which.max`", "`dplyr`", "`between`", "`dplyr`", "`case_match`", "`dplyr`", "`case_when`", "`dplyr`", "`coalesce`", "`dplyr`", "`consecutive_id`", "`dplyr`", "`dense_rank`", "`dplyr`", "`first`", "`dplyr`", "`group_keys`", "`dplyr`", "`group_vars`", "`dplyr`", "`if_else`", "`dplyr`", "`lag`", "`dplyr`", "`last`", "`dplyr`", "`min_rank`", "`dplyr`", "`n`", "`dplyr`", "`nth`", "`dplyr`", "`n_distinct`", "`dplyr`", "`row_number`", "`lubridate`", "`ddays`", "`lubridate`", "`dhours`", "`lubridate`", "`dmilliseconds`", "`lubridate`", "`dminutes`", "`lubridate`", "`dseconds`", "`lubridate`", "`dweeks`", "`lubridate`", "`make_date`", "`lubridate`", "`make_datetime`", "`lubridate`", "`wday`", "`stats`", "`median`", "`stats`", "`lag`", "`stats`", "`sd`", "`stats`", "`var`", "`stringr`", "`regex`", "`stringr`", "`str_count`", "`stringr`", "`str_dup`", "`stringr`", "`str_ends`", "`stringr`", "`str_extract`", "`stringr`", "`str_extract_all`", "`stringr`", "`str_length`", "`stringr`", "`str_pad`", "`stringr`", "`str_remove`", "`stringr`", "`str_remove_all`", "`stringr`", "`str_replace`", "`stringr`", "`str_replace_all`", "`stringr`", "`str_split`", "`stringr`", "`str_split_i`", "`stringr`", "`str_squish`", "`stringr`", "`str_starts`", "`stringr`", "`str_sub`", "`stringr`", "`str_trim`", "`stringr`", "`str_to_lower`", "`stringr`", "`str_to_title`", "`stringr`", "`str_to_upper`", "`stringr`", "`str_trunc`", "`stringr`", "`word`", "`tidyr`", "`replace_na`", "`tools`", "`toTitleCase`" ) |> mutate(Notes = case_when( Package == "`lubridate`" & Function == "`make_datetime`" ~ "In `lubridate::make_datetime()`, when there is an overflow (for example `hours = 25`), then it is automatically converted to the higher unit (for example 1 day and 1h). In Polars, this returns `NA`.", Package == "`lubridate`" & Function == "`wday`" ~ "Requires `week_start == 7`. If `label = TRUE`, it returns a string variable and not a factor as in `lubridate`.", Package == "`dplyr`" & Function == "`row_number`" ~ "Doesn't work when `x` is missing.", .default = "" )) |> arrange(Package) kable(out) ```