Title: | DBI Package for the DuckDB Database Management System |
---|---|
Description: | The DuckDB project is an embedded analytical data management system with support for the Structured Query Language (SQL). This package includes all of DuckDB and an R Database Interface (DBI) connector. |
Authors: | Hannes Mühleisen [aut] (ORCID: <https://orcid.org/0000-0001-8552-0029>), Mark Raasveldt [aut] (ORCID: <https://orcid.org/0000-0001-5005-6844>), Kirill Müller [cre] (ORCID: <https://orcid.org/0000-0002-1416-3412>), Stichting DuckDB Foundation [cph], Apache Software Foundation [cph], PostgreSQL Global Development Group [cph], The Regents of the University of California [cph], Cameron Desrochers [cph], Victor Zverovich [cph], RAD Game Tools [cph], Valve Software [cph], Rich Geldreich [cph], Tenacious Software LLC [cph], The RE2 Authors [cph], Google Inc. [cph], Facebook Inc. [cph], Steven G. Johnson [cph], Jiahao Chen [cph], Tony Kelman [cph], Jonas Fonseca [cph], Lukas Fittl [cph], Salvatore Sanfilippo [cph], Art.sy, Inc. [cph], Oran Agra [cph], Redis Labs, Inc. [cph], Melissa O'Neill [cph], PCG Project contributors [cph] |
Maintainer: | Kirill Müller <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.4.0 |
Built: | 2025-09-20 01:24:14 UTC |
Source: | https://github.com/duckdb/duckdb-r |
This is a SQL backend for dbplyr tailored to DuckDB's capabilities. It largely follows the PostgreSQL backend, but maps more functions.
tbl_file() is an experimental variant of dplyr::tbl() to directly access files on disk. It is safer than dplyr::tbl() because there is no risk of misinterpreting the request, and paths with special characters are supported.

tbl_function() is an experimental variant of dplyr::tbl() to create a lazy table from a table-generating function, useful for reading nonstandard CSV files or other data sources. It is safer than dplyr::tbl() because there is no risk of misinterpreting the query. See https://duckdb.org/docs/data/overview for details on data importing functions. As an alternative, use dplyr::tbl(src, dplyr::sql("SELECT ... FROM ...")) for custom SQL queries.

tbl_query() is deprecated in favor of tbl_function().

Use simulate_duckdb() with lazy_frame() to see simulated SQL without opening a DuckDB connection.
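For example, a minimal sketch (assuming the dbplyr package is installed; the columns and the toupper() call are only illustrative) of previewing the generated SQL:

library(dplyr, warn.conflicts = FALSE)

# Simulated connection: no DuckDB database is opened.
lf <- dbplyr::lazy_frame(a = 1:3, b = c("x", "y", "z"), con = simulate_duckdb())

lf %>%
  mutate(b = toupper(b)) %>%
  show_query()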
tbl_file(src = NULL, path, ..., cache = FALSE)

tbl_function(src, query, ..., cache = FALSE)

tbl_query(src, query, ...)

simulate_duckdb(...)
src | A duckdb connection object |
path | Path to existing Parquet, CSV or JSON file |
... | Any parameters to be forwarded |
cache | Enable object cache for Parquet files |
query | SQL code, omitting the FROM clause |
library(dplyr, warn.conflicts = FALSE)
con <- DBI::dbConnect(duckdb(), dbdir = ":memory:")

db <- copy_to(con, data.frame(a = 1:3, b = letters[2:4]))

db %>%
  filter(a > 1) %>%
  select(b)

path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[2:4]), path, row.names = FALSE)

db_csv <- tbl_file(con, path)
db_csv %>%
  summarize(sum_a = sum(a))

db_csv_fun <- tbl_function(con, paste0("read_csv_auto('", path, "')"))
db_csv_fun %>%
  count()

DBI::dbDisconnect(con)
default_conn() returns a default, built-in connection.
default_conn()
Currently, the connection is established with duckdb(environment_scan = TRUE) and dbConnect(timezone_out = "", array = "matrix"), so that data frames are automatically available as tables, timestamps are returned in the local timezone, and DuckDB's array type is returned as an R matrix. The details of how the connection is established are subject to change. In particular, returning the output as a tibble or other object may be supported in the future.
This connection is intended for interactive use. There is no way for this or other packages to comprehensively track the state of this connection, so scripts and packages should manage their own connections.
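As a rough sketch of that advice (an illustration based on the settings described above, not the exact internals of default_conn()), a script might manage an equivalent connection explicitly:

library(DBI)
library(duckdb)

# Explicitly managed connection with settings similar to default_conn().
con <- dbConnect(
  duckdb(environment_scan = TRUE),
  timezone_out = "",
  array = "matrix"
)
dbGetQuery(con, "SELECT 42 AS answer")
dbDisconnect(con)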
A DuckDB connection object
conn <- default_conn()
sql_query("SELECT 42", conn = conn)
duckdb() creates or reuses a database instance.

duckdb_shutdown() shuts down a database instance.

duckdb_adbc() returns an adbcdrivermanager::adbc_driver() for use with Arrow Database Connectivity via the adbcdrivermanager package.

dbConnect() connects to a database instance.

dbDisconnect() closes a DuckDB database connection. The associated DuckDB database instance is shut down automatically; it is no longer necessary to set shutdown = TRUE or to call duckdb_shutdown().
duckdb(
  dbdir = DBDIR_MEMORY,
  read_only = FALSE,
  bigint = "numeric",
  config = list(),
  ...,
  environment_scan = FALSE
)

duckdb_shutdown(drv)

duckdb_adbc()

## S4 method for signature 'duckdb_driver'
dbConnect(
  drv,
  dbdir = DBDIR_MEMORY,
  ...,
  debug = getOption("duckdb.debug", FALSE),
  read_only = FALSE,
  timezone_out = "UTC",
  tz_out_convert = c("with", "force"),
  config = list(),
  bigint = "numeric",
  array = "none"
)

## S4 method for signature 'duckdb_connection'
dbDisconnect(conn, ..., shutdown = TRUE)
dbdir | Location for database files. Should be a path to an existing directory in the file system. With the default (or ""), all data is kept in RAM. |
read_only | Set to TRUE for read-only operation. |
bigint | How 64-bit integers should be returned. There are two options: "numeric" and "integer64". With "numeric", bigint values are returned as doubles; with "integer64", they are returned in bit64 encoding (requires the bit64 package). |
config | Named list with DuckDB configuration flags, see https://duckdb.org/docs/configuration/overview#configuration-reference for the possible options. These flags are only applied when the database object is instantiated. Subsequent connections will silently ignore these flags. |
... | Reserved for future extensions, must be empty. |
environment_scan | Set to TRUE to automatically make data frames in the calling environment available as tables in queries. |
drv | Object returned by duckdb(). |
debug | Print additional debug information, such as queries. |
timezone_out | The time zone returned to R, defaults to "UTC", which is currently the only time zone supported by DuckDB. To display datetime values in the local time zone, set to Sys.timezone() or "". |
tz_out_convert | How to convert timestamp columns to the time zone specified in timezone_out. There are two options: "with" and "force". With "with", the timestamp is returned as it would appear in the specified time zone. With "force", the timestamp keeps the same clock time as in the database, but the new time zone is attached. |
array | How arrays should be returned. There are two options: "none" (the default) and "matrix". With "matrix", DuckDB's fixed-size arrays are returned as R matrices. |
conn | A duckdb_connection object, created by dbConnect(). |
shutdown | Unused. The database instance is shut down automatically. |
The behavior of tz_out_convert = "force" at DST transitions depends on how R handles translation from the underlying time representation to a human-readable format. If the timestamp is invalid in the target time zone, the resulting value may be NA or an adjusted time.
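A brief, hedged sketch contrasting the two conversion modes (the time zone and timestamp literal are arbitrary; assumes DBI and duckdb are attached as in the examples below):

con_with <- dbConnect(duckdb(), timezone_out = "Europe/Zurich", tz_out_convert = "with")
con_force <- dbConnect(duckdb(), timezone_out = "Europe/Zurich", tz_out_convert = "force")

sql <- "SELECT TIMESTAMP '2024-01-01 12:00:00' AS ts"
dbGetQuery(con_with, sql)$ts   # same instant, clock time rendered in Europe/Zurich
dbGetQuery(con_force, sql)$ts  # same clock time as stored, Europe/Zurich label attached

dbDisconnect(con_with)
dbDisconnect(con_force)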
duckdb() returns an object of class duckdb_driver.

dbDisconnect() and duckdb_shutdown() are called for their side effect.

duckdb_adbc() returns an object of class "adbc_driver".

dbConnect() returns an object of class duckdb_connection.
library(adbcdrivermanager)

with_adbc(db <- adbc_database_init(duckdb_adbc()), {
  as.data.frame(read_adbc(db, "SELECT 1 as one;"))
})

drv <- duckdb()
con <- dbConnect(drv)
dbGetQuery(con, "SELECT 'Hello, world!'")
dbDisconnect(con)
duckdb_shutdown(drv)

# Shorter:
con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT 'Hello, world!'")
dbDisconnect(con, shutdown = TRUE)
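As an additional, hedged sketch (the table name and literal are made up for illustration, and the bit64 package must be installed), bigint = "integer64" returns 64-bit integers without loss of precision:

con <- dbConnect(duckdb(bigint = "integer64"))
dbExecute(con, "CREATE TABLE big_values AS SELECT 9007199254740993 AS x")
dbGetQuery(con, "SELECT x FROM big_values")  # x comes back as bit64::integer64
dbDisconnect(con)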
Directly reads a CSV file into DuckDB and tries to detect and create the correct schema for it. This is usually much faster than reading the data into R and then writing it to DuckDB.
duckdb_read_csv(
  conn,
  name,
  files,
  ...,
  header = TRUE,
  na.strings = "",
  nrow.check = 500,
  delim = ",",
  quote = "\"",
  col.names = NULL,
  col.types = NULL,
  lower.case.names = FALSE,
  sep = delim,
  transaction = TRUE,
  temporary = FALSE
)
conn | A DuckDB connection, created by dbConnect(duckdb()). |
name | The name of the table to create or append to. |
files | One or more CSV file names; all files should have the same structure. |
... | Reserved for future extensions, must be empty. |
header | Whether or not the CSV files have a separate header in the first line |
na.strings | Which strings in the CSV files should be considered to be NULL |
nrow.check | How many rows should be read from the CSV file to figure out data types |
delim | Which field separator should be used |
quote | Which quote character is used for columns in the CSV file |
col.names | Override the detected or generated column names |
col.types | Character vector of column types in the same order as col.names, or a named character vector where names are column names and values are types. Valid types are DuckDB data types, e.g. VARCHAR, DOUBLE, DATE, BIGINT, BOOLEAN, etc. |
lower.case.names | Transform column names to lower case |
sep | Alias for delim, for compatibility |
transaction | Should a transaction be used for the entire operation? |
temporary | Set to TRUE to create a temporary table |
If the table already exists in the database, the CSV data is appended to it. Otherwise, the table is created.
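For instance, a minimal sketch of this append behavior (the table name is illustrative):

con <- dbConnect(duckdb())
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), path, row.names = FALSE)
duckdb_read_csv(con, "appended", path)  # creates the table
duckdb_read_csv(con, "appended", path)  # appends to the existing table
dbGetQuery(con, "SELECT count(*) AS n FROM appended")  # 6 rows
dbDisconnect(con)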
The number of rows in the resulting table, invisibly.
con <- dbConnect(duckdb())

data <- data.frame(a = 1:3, b = letters[1:3])
path <- tempfile(fileext = ".csv")
write.csv(data, path, row.names = FALSE)

duckdb_read_csv(con, "data", path)
dbReadTable(con, "data")

dbDisconnect(con)

# Providing data types for columns
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)

con <- dbConnect(duckdb())
duckdb_read_csv(con, "iris", path,
  col.types = c(
    Sepal.Length = "DOUBLE",
    Sepal.Width = "DOUBLE",
    Petal.Length = "DOUBLE",
    Petal.Width = "DOUBLE",
    Species = "VARCHAR"
  )
)
dbReadTable(con, "iris")
dbDisconnect(con)
duckdb_register() registers a data frame as a virtual table (view) in a DuckDB connection. No data is copied.
duckdb_register(conn, name, df, overwrite = FALSE, experimental = FALSE)

duckdb_unregister(conn, name)
conn | A DuckDB connection, created by dbConnect(duckdb()). |
name | The name for the virtual table that is registered or unregistered |
df | A data.frame with the data to be registered |
overwrite | Should an existing registration be overwritten? |
experimental | Enable experimental optimizations |
duckdb_unregister() unregisters a previously registered data frame.

These functions are called for their side effect.
con <- dbConnect(duckdb())

data <- data.frame(a = 1:3, b = letters[1:3])
duckdb_register(con, "data", data)

dbReadTable(con, "data")
duckdb_unregister(con, "data")

dbDisconnect(con)
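A small additional sketch (the view name is illustrative): re-registering under the same name with overwrite = TRUE replaces the earlier registration:

con <- dbConnect(duckdb())
duckdb_register(con, "data", data.frame(x = 1:3))
duckdb_register(con, "data", data.frame(x = 4:6), overwrite = TRUE)
dbReadTable(con, "data")  # reflects the second data frame
duckdb_unregister(con, "data")
dbDisconnect(con)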
duckdb_register_arrow() registers an Arrow data source as a virtual table (view) in a DuckDB connection. No data is copied.
duckdb_register_arrow(conn, name, arrow_scannable, use_async = NULL)

duckdb_unregister_arrow(conn, name)

duckdb_list_arrow(conn)
conn | A DuckDB connection, created by dbConnect(duckdb()). |
name | The name for the virtual table that is registered or unregistered |
arrow_scannable | A scannable Arrow object |
use_async | Switch to the asynchronous scanner (deprecated) |
duckdb_unregister_arrow() unregisters a previously registered Arrow data source.

These functions are called for their side effect.
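A hedged sketch, assuming the arrow package is installed (the table and view names are illustrative):

con <- dbConnect(duckdb())
tbl_arrow <- arrow::arrow_table(a = 1:3, b = letters[1:3])

duckdb_register_arrow(con, "arrow_data", tbl_arrow)
dbGetQuery(con, "SELECT sum(a) AS total FROM arrow_data")

duckdb_unregister_arrow(con, "arrow_data")
dbDisconnect(con)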
sql_query() runs an arbitrary SQL query using DBI::dbGetQuery() and returns a data.frame with the query results.

sql_exec() runs an arbitrary SQL statement using DBI::dbExecute() and returns the number of affected rows.

These functions are intended as an easy way to interactively run DuckDB without having to manage connections. By default, data frame objects are available as views.

Scripts and packages should manage their own connections and prefer the DBI methods for more control.
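For reference, a hedged sketch of equivalent interactive steps with an explicitly managed DBI connection, as recommended above:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE test (a INTEGER, b VARCHAR)")
dbExecute(con, "INSERT INTO test VALUES (1, 'one'), (2, 'two')")
dbGetQuery(con, "SELECT * FROM test")
dbDisconnect(con)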
sql_query(sql, conn = default_conn())

sql_exec(sql, conn = default_conn())
sql | A SQL string |
conn | An optional connection, defaults to default_conn() |
sql_query() returns a data frame with the query result; sql_exec() returns the number of affected rows.
# Queries
sql_query("SELECT 42")

# Statements with side effects
sql_exec("CREATE TABLE test (a INTEGER, b VARCHAR)")
sql_exec("INSERT INTO test VALUES (1, 'one'), (2, 'two')")
sql_query("FROM test")

# Data frames available as views
sql_query("FROM mtcars")