--- title: "Getting started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{css, echo=FALSE} .custom_note { border: solid 3px #0b788e; background-color: #08505e; padding: 5px; margin-bottom: 10px; border-radius: 3px; } .custom_note > p, .custom_note > p > code { color: white; background: #08505e; } .custom_note > p > code > a:any-link { text-decoration-color: white; } ``` Using `tidypolars` requires importing data as Polars `DataFrame`s or `LazyFrame`s. You can read files with the [various `read_*_polars()` functions](https://tidypolars.etiennebacher.com/reference/#import-data) (such as `read_parquet_polars()`) to import them as `DataFrame`s, or with `scan_*_polars()` functions (such as `scan_parquet_polars()`) to import them as `LazyFrame`s. There are several functions to import various file formats, such as CSV, Parquet, or JSON.

Note: in examples or some tutorials, the functions as_polars_df() and as_polars_lf() are sometimes used to convert an existing R data.frame to a Polars DataFrame or LazyFrame. Those are merely convenience functions to quickly convert an existing dataset to Polars, which is useful for showcase purposes. However, this conversion from R to Polars has some cost and it hurts the performance. In real-life usecases, be sure to load the data with the read_\*() or the scan_\*() functions mentioned above.

Here, we're going to use the `who` dataset that is available in the `tidyr` package. I import it both as a classic R `data.frame` and as a Polars `DataFrame` so that we can easily compare `dplyr` and `tidypolars` functions. ```{r setup} library(polars) library(tidypolars) library(dplyr, warn.conflicts = FALSE) library(tidyr, warn.conflicts = FALSE) who_df <- tidyr::who who_pl <- as_polars_df(tidyr::who) ``` `tidypolars` provides methods for `dplyr` and `tidyr` S3 generics. In simpler words, it means that you can use the same functions on a Polars `DataFrame` or `LazyFrame` as in a classic `tidyverse` workflow and it should just work (if it doesn't, please [open an issue](https://github.com/etiennebacher/tidypolars/issues)). Note that you still need to load `dplyr` and `tidyr` in your code. Here's an example of some `dplyr` and `tidyr` code on the classic R `data.frame`: ```{r} who_df |> filter(year > 1990) |> drop_na(newrel_f3544) |> select(iso3, year, matches("^newrel(.*)_f")) |> arrange(iso3, year) |> rename_with(.fn = toupper) |> head() ``` We can simply use our Polars dataset instead: ```{r} who_pl |> filter(year > 1990) |> drop_na(newrel_f3544) |> select(iso3, year, matches("^newrel(.*)_f")) |> arrange(iso3, year) |> rename_with(.fn = toupper) |> head() ``` If you use Polars lazy API, you need to call `compute()` at the end of the chained expression to evaluate the query: ```{r} who_pl_lazy <- as_polars_lf(tidyr::who) who_pl_lazy |> filter(year > 1990) |> drop_na(newrel_f3544) |> select(iso3, year, matches("^newrel(.*)_f")) |> arrange(iso3, year) |> rename_with(.fn = toupper) |> compute() |> head() ```

Note: Several functions trigger the evaluation of a lazy query: `compute()`, `collect()`, `as.data.frame()`, and `as_tibble()`. If you want to return a Polars DataFrame, use `compute()`. If you want to return a standard R data.frame, for example to use it in statistical analysis, use any of the three other functions. Be warned that if the dataset is too big compared to your available memory, this will crash the R session.

`tidypolars` also supports many functions from `base`, `lubridate` or `stringr`. When these are used inside `filter()`, `mutate()` or `summarize()`, `tidypolars` will automatically convert them to use the Polars engine under the hood. Take a look at the vignette ["R and Polars expressions"](https://tidypolars.etiennebacher.com/articles/r-and-polars-expressions) for more information.