---
title: "R and Polars expressions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{R and Polars expressions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


In the `tidyverse`, R expressions mainly appear in three places: `filter()`,
`mutate()`, and `summarize()`.

```{r echo=FALSE}
library(dplyr, warn.conflicts = FALSE)
```


```{r eval=FALSE}
library(dplyr, warn.conflicts = FALSE)

filter(mtcars, am + gear > carb)
mutate(mtcars, x = (qsec - mean(qsec)) / sd(qsec))
mtcars |> 
  group_by(cyl) |> 
  summarize(x = mean(qsec) / sd(qsec))
```

This is very convenient but creates a challenge for `tidypolars`. While it is
possible to pass R functions directly to a Polars DataFrame or LazyFrame, it is
**strongly discouraged** because doing so bypasses Polars' optimizations.

Polars comes with dozens of built-in functions for maths (`median`,
`var`, `arccos`, ...), string manipulation (`len_chars`, `starts_with`, ...), and
date-time (`hour`, `quarter`, `ordinal_day`, ...). All of these functions are
optimized internally and run in parallel under the hood, which is not the case
for R functions passed to Polars.

Using these Polars expressions directly would require learning new functions
and a new syntax. To avoid this, `tidypolars` automatically translates R
expressions into Polars ones. Basically, **you can keep writing R expressions
in most situations**, and they will automatically be translated to Polars
syntax.

There are some situations where this might not work, so this vignette
explains the process and its limitations.

# How does `tidypolars` translate R expressions into Polars expressions?

When `tidypolars` receives an expression, it runs a function `translate()` 
several times until all components are translated to their Polars equivalent. 
There are four possible components: single values, column names, external objects,
and functions.


## Single values, column names, and external objects

If you pass a single value, like `x = 1` or `x = "a"`, it is wrapped into 
`pl$lit()`. This is also the case for external objects with the difference that
these need to be wrapped in `{{ }}` and are evaluated before being wrapped into
`pl$lit()`.

Column names, like `x = mpg`, are wrapped into `pl$col()`. 

```
x = "a"               ->  x = pl$lit("a")
x = {{ some_value }}  ->  x = pl$lit(*value*)
x = mpg               ->  x = pl$col("mpg")
```
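The wrapping rules above can be sketched in a few lines of base R. This is only
a toy illustration and not the internal `translate()` function of `tidypolars`:
the name `toy_translate()` and its arguments are hypothetical, and it returns
unevaluated calls, so it runs without `polars` installed.

```r
# Toy illustration only: this is NOT the internal translate() of tidypolars.
# It builds unevaluated calls, so `pl` does not need to exist.
toy_translate <- function(expr, columns, env = parent.frame()) {
  # External object wrapped in {{ }}: evaluate it first, then wrap in pl$lit()
  if (is.call(expr) && identical(expr[[1]], as.name("{")) &&
      length(expr) == 2 && is.call(expr[[2]]) &&
      identical(expr[[2]][[1]], as.name("{"))) {
    value <- eval(expr[[2]][[2]], env)
    return(bquote(pl$lit(.(value))))
  }
  # Column name: wrap in pl$col()
  if (is.symbol(expr) && as.character(expr) %in% columns) {
    return(bquote(pl$col(.(as.character(expr)))))
  }
  # Single value: wrap in pl$lit()
  if (is.numeric(expr) || is.character(expr) || is.logical(expr)) {
    return(bquote(pl$lit(.(expr))))
  }
  stop("not handled in this sketch: ", deparse(expr))
}

some_value <- 10
toy_translate(quote("a"), columns = names(mtcars))              # pl$lit("a")
toy_translate(quote({{ some_value }}), columns = names(mtcars)) # pl$lit(10)
toy_translate(quote(mpg), columns = names(mtcars))              # pl$col("mpg")
```

Function calls are deliberately left out of this sketch; they are covered in
the next section.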

## Functions


Functions are split into two categories: built-in functions (i.e., functions
provided by base R or by other packages) and user-defined functions (UDFs),
which are written by the user (you).


### Built-in functions

In the first case, `tidypolars` checks the function name and whether a
translation for it already exists internally. For example, if we call the R
function `mean(x, trim = 2)`, then it looks for a translation of `mean()`. You
can see the list of supported R functions at the bottom of this vignette. Note
that most essential base R functions are supported, as well as many functions
from `dplyr` and `stringr`, for example.

Now that `tidypolars` knows that a translation of `mean()` exists, it parses
the arguments in the call to translate them to the Polars syntax: internally,
`x` is converted to `pl$col("x")` if there is a column `"x"` in the data. 
Sometimes, additional arguments do not have their equivalent in Polars. This is
the case for the argument `trim` here. In this case, `tidypolars` ignores
this argument and warns the user:

```{r}
library(tidypolars)
library(polars)

mtcars |> 
  as_polars_df() |> 
  mutate(x = mean(mpg, trim = 2))
```
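To make the lookup and argument handling described above more concrete, here is
a minimal base R sketch. It is an assumption-laden toy: `toy_registry` and
`toy_translate_call()` are hypothetical names that do not mirror the internals
of `tidypolars`, and the result is an unevaluated call rather than a real
Polars expression.

```r
# Toy illustration only: the registry and helper are hypothetical.
# Each entry maps an R function to a Polars method name and lists the
# R arguments that have a counterpart in this sketch.
toy_registry <- list(
  mean = list(method = "mean", supported_args = "x"),
  sd   = list(method = "std",  supported_args = "x")
)

toy_translate_call <- function(call, columns) {
  fn <- as.character(call[[1]])
  entry <- toy_registry[[fn]]
  if (is.null(entry)) {
    stop("no translation registered for ", fn, "()")
  }

  # Name the arguments the way the R function itself would see them
  args <- as.list(match.call(match.fun(fn), call))[-1]

  # Arguments without a counterpart are ignored, with a warning
  dropped <- setdiff(names(args), entry$supported_args)
  if (length(dropped) > 0) {
    warning("ignoring unsupported argument(s): ",
            paste(dropped, collapse = ", "))
  }

  # The column name becomes pl$col("<name>")
  x <- args[["x"]]
  stopifnot(is.symbol(x), as.character(x) %in% columns)
  col <- bquote(pl$col(.(as.character(x))))

  # Build the unevaluated call pl$col("<name>")$<method>()
  as.call(list(call("$", col, as.name(entry$method))))
}

# Warns that `trim` is ignored and returns the call pl$col("mpg")$mean()
toy_translate_call(quote(mean(mpg, trim = 2)), columns = names(mtcars))
```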


### User-defined functions

User-defined functions (UDFs) are more challenging. While it is technically
possible to inspect the code inside a UDF, rewriting it to match Polars syntax
would be extremely complicated. In this situation, you have to rewrite your
custom function using Polars syntax so that it returns a Polars expression. For
example, we could write a function to standardize a column like this:

```{r}
pl_standardize <- function(x) {
  (x - x$mean()) / x$std()
}
```

Remember that the column name used as `x` will end up wrapped into `pl$col()`, 
so to check that your function returns a Polars expression, you have to provide
a `pl$col()` call:

```{r}
pl_standardize(pl$col("mpg"))
```

This function correctly returns a Polars expression, so we can now use it like
any other function:

```{r}
mtcars |> 
  as_polars_df() |> 
  mutate(x = pl_standardize(mpg))
```
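The same pattern applies to other transformations. As a further sketch, the
hypothetical helper `pl_rescale()` below rescales a column to the [0, 1] range.
It assumes that the `$min()` and `$max()` methods and the arithmetic operators
are available on Polars expressions in the installed version of `polars`.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)
library(polars)

# Hypothetical helper: min-max rescaling written with Polars methods,
# so that it returns a Polars expression
pl_rescale <- function(x) {
  (x - x$min()) / (x$max() - x$min())
}

# Check that the helper returns a Polars expression
pl_rescale(pl$col("mpg"))

# Use it like any other function inside mutate()
mtcars |>
  as_polars_df() |>
  mutate(mpg_scaled = pl_rescale(mpg))
```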


### Special case: `across()`

[`across()`](https://dplyr.tidyverse.org/reference/across.html) is a very useful
function that applies a function (or a list of functions) to a selection of 
columns. It accepts built-in functions, UDFs, and anonymous functions.

```{r}
mtcars |> 
  as_polars_df() |> 
  mutate(
    across(
      .cols = contains("a"),
      list(mean = mean, stand = pl_standardize, ~ sd(.x))
    )
  )
```

As with standalone UDFs, UDFs and anonymous functions passed to `across()` will
error if they don't return a Polars expression:

```{r error=TRUE}
mtcars |> 
  as_polars_df() |> 
  mutate(
    across(
      .cols = contains("a"),
      .fns = list(
        mean = mean,  
        function(x) {
           (x - mean(x)) / sd(x)
        },
        ~ sd(.x)
      )
    )
  )
```


## List of base R and `tidyverse` functions supported by `tidypolars` 


```{r echo=FALSE, message=FALSE}
library(dplyr)
library(knitr)
out <- tribble(
  ~Package, ~Function,
  "`base`",             "`abs`",
  "`base`",             "`acos`", 
  "`base`",             "`acosh`", 
  "`base`",             "`all`",
  "`base`",             "`any`",
  "`base`",             "`asin`",
  "`base`",             "`asinh`", 
  "`base`",             "`atan`", 
  "`base`",             "`atanh`",
  "`base`",             "`ceiling`",
  "`base`",             "`cos`", 
  "`base`",             "`cosh`",
  "`base`",             "`cummin`",
  "`base`",             "`cumsum`", 
  "`base`",             "`diff`", 
  "`base`",             "`exp`", 
  "`base`",             "`floor`",
  "`base`",             "`grepl`",
  "`base`",             "`ifelse`",
  "`base`",             "`ISOdatetime`",
  "`base`",             "`length`",
  "`base`",             "`log`", 
  "`base`",             "`log10`",
  "`base`",             "`max`", 
  "`base`",             "`mean`", 
  "`base`",             "`min`",
  "`base`",             "`nchar`",
  "`base`",             "`paste0`",
  "`base`",             "`paste`",
  "`base`",             "`rank`",
  "`base`",             "`rev`",
  "`base`",             "`round`",
  "`base`",             "`sin`", 
  "`base`",             "`sinh`", 
  "`base`",             "`sort`", 
  "`base`",             "`sqrt`", 
  "`base`",             "`strptime`",
  "`base`",             "`tan`", 
  "`base`",             "`tanh`",
  "`base`",             "`tolower`",
  "`base`",             "`toupper`",
  "`base`",             "`unique`",
  "`base`",             "`which.min`",
  "`base`",             "`which.max`",
  "`dplyr`",            "`between`",
  "`dplyr`",            "`case_match`",
  "`dplyr`",            "`case_when`",
  "`dplyr`",            "`coalesce`",
  "`dplyr`",            "`consecutive_id`",
  "`dplyr`",            "`dense_rank`",
  "`dplyr`",            "`first`",
  "`dplyr`",            "`group_keys`",
  "`dplyr`",            "`group_vars`",
  "`dplyr`",            "`if_else`",
  "`dplyr`",            "`lag`",
  "`dplyr`",            "`last`",
  "`dplyr`",            "`min_rank`",
  "`dplyr`",            "`n`",
  "`dplyr`",            "`nth`",
  "`dplyr`",            "`n_distinct`",
  "`dplyr`",            "`row_number`",
  "`lubridate`",        "`ddays`",
  "`lubridate`",        "`dhours`",
  "`lubridate`",        "`dmilliseconds`",
  "`lubridate`",        "`dminutes`",
  "`lubridate`",        "`dseconds`",
  "`lubridate`",        "`dweeks`",
  "`lubridate`",        "`make_date`",
  "`lubridate`",        "`make_datetime`",
  "`lubridate`",        "`wday`",
  "`stats`",            "`median`", 
  "`stats`",            "`lag`", 
  "`stats`",            "`sd`", 
  "`stats`",            "`var`",
  "`stringr`",          "`regex`",
  "`stringr`",          "`str_count`",
  "`stringr`",          "`str_dup`",
  "`stringr`",          "`str_ends`",
  "`stringr`",          "`str_extract`",
  "`stringr`",          "`str_extract_all`",
  "`stringr`",          "`str_length`",
  "`stringr`",          "`str_pad`",
  "`stringr`",          "`str_remove`",
  "`stringr`",          "`str_remove_all`",
  "`stringr`",          "`str_replace`",
  "`stringr`",          "`str_replace_all`",
  "`stringr`",          "`str_split`",
  "`stringr`",          "`str_split_i`",
  "`stringr`",          "`str_squish`",
  "`stringr`",          "`str_starts`",
  "`stringr`",          "`str_sub`",
  "`stringr`",          "`str_trim`",
  "`stringr`",          "`str_to_lower`",
  "`stringr`",          "`str_to_title`",
  "`stringr`",          "`str_to_upper`",
  "`stringr`",          "`str_trunc`",
  "`stringr`",          "`word`",
  "`tidyr`",            "`replace_na`",
  "`tools`",            "`toTitleCase`"
) |>
  mutate(Notes = case_when(
    Package == "`lubridate`" & Function == "`make_datetime`" ~ "In `lubridate::make_datetime()`, an overflowing value (for example `hour = 25`) is automatically carried over to the next higher unit (for example 1 day and 1 hour). In Polars, this returns `NA`.",
    Package == "`lubridate`" & Function == "`wday`" ~ "Requires `week_start == 7`. If `label = TRUE`, it returns a string variable and not a factor as in `lubridate`.",
    Package == "`dplyr`" & Function == "`row_number`" ~ "Doesn't work when `x` is missing.",
    .default = ""
  )) |> 
  arrange(Package)

kable(out) 
```