Read Parquet, CSV, and other files using DuckDB

df_from_file() uses arbitrary table functions to read data. See https://duckdb.org/docs/data/overview for documentation of the available functions and their options. To read multiple files with the same schema, pass a wildcard or a character vector to the path argument.

duckplyr_df_from_file() is a thin wrapper around df_from_file() that calls as_duckplyr_df() on the output.

These functions ingest data from a file using a table function. The results are transparently converted to a data frame, but the data is only read when the resulting data frame is actually accessed.

df_from_csv() reads a CSV file using the read_csv_auto() table function.

duckplyr_df_from_csv() is a thin wrapper around df_from_csv() that calls as_duckplyr_df() on the output.

df_from_parquet() reads a Parquet file using the read_parquet() table function.

duckplyr_df_from_parquet() is a thin wrapper around df_from_parquet() that calls as_duckplyr_df() on the output.

df_to_parquet() writes a data frame to a Parquet file via DuckDB. If the data frame is a duckplyr_df, the materialization occurs outside of R. An existing file will be overwritten. This function requires duckdb >= 0.10.0.

Usage

df_from_file(path, table_function, ..., options = list(), class = NULL)

duckplyr_df_from_file(
  path,
  table_function,
  ...,
  options = list(),
  class = NULL
)

df_from_csv(path, ..., options = list(), class = NULL)

duckplyr_df_from_csv(path, ..., options = list(), class = NULL)

df_from_parquet(path, ..., options = list(), class = NULL)

duckplyr_df_from_parquet(path, ..., options = list(), class = NULL)

df_to_parquet(data, path)

Arguments

path: Path to files. The glob patterns * and ? are supported.
table_function: The name of a table-valued DuckDB function such as "read_parquet", "read_csv", "read_csv_auto" or "read_json".
...: These dots are for future extensions and must be empty.
options: Arguments to the DuckDB function indicated by table_function.
class: The class of the output. By default, a tibble is created. The returned object will always be a data frame. Use class = "data.frame" or class = character() to create a plain data frame.
data: A data frame to be written to disk.

Value

A data frame for df_from_file(), or a duckplyr_df for duckplyr_df_from_file(), extended by the provided class.

Examples

# Create simple CSV file
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

# The call returns immediately, the data is not read yet
df <- df_from_csv(path)

# Materialization only upon access
names(df)
#> [1] "a" "b"
df$a
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(/tmp/RtmpRxm7mw/duckplyr_test_1af84418813e.csv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (BIGINT)
#> - b (VARCHAR)
#> 
#> [1] 1 2 3

# Return as tibble, specify column types:
df_from_file(
  path,
  "read_csv",
  options = list(delim = ",", types = list(c("DOUBLE", "VARCHAR"))),
  class = class(tibble())
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv(/tmp/RtmpRxm7mw/duckplyr_test_1af84418813e.csv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> - b (VARCHAR)
#> 
#> # A tibble: 3 × 2
#>       a b    
#>   <dbl> <chr>
#> 1     1 d    
#> 2     2 e    
#> 3     3 f    
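
# Return a plain data frame instead of a tibble.
# Sketch based on the `class` argument described under Arguments;
# `df_plain` is an illustrative name, not part of the package.
df_plain <- df_from_csv(path, class = "data.frame")
is.data.frame(df_plain)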

# Read multiple files at once
path2 <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 4:6, b = letters[7:9]), path2, row.names = FALSE)

duckplyr_df_from_csv(file.path(tempdir(), "duckplyr_test_*.csv"))
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_csv_auto(/tmp/RtmpRxm7mw/duckplyr_test_*.csv)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (BIGINT)
#> - b (VARCHAR)
#> 
#> # A tibble: 6 × 2
#>       a b    
#>   <dbl> <chr>
#> 1     1 d    
#> 2     2 e    
#> 3     3 f    
#> 4     4 g    
#> 5     5 h    
#> 6     6 i    
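
# Alternatively, pass a character vector of paths instead of a wildcard.
# Sketch assuming both files share the same schema, as they do here;
# `df_both` is an illustrative name, not part of the package.
df_both <- duckplyr_df_from_csv(c(path, path2))
names(df_both)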

unlink(c(path, path2))

# Write a Parquet file:
path_parquet <- tempfile(fileext = ".parquet")
df_to_parquet(df, path_parquet)

# With a duckplyr_df, the materialization occurs outside of R:
df %>%
  as_duckplyr_df() %>%
  mutate(b = a + 1) %>%
  df_to_parquet(path_parquet)

duckplyr_df_from_parquet(path_parquet)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> read_parquet(/tmp/RtmpRxm7mw/file1af88ed3962.parquet)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
#> - b (DOUBLE)
#> 
#> # A tibble: 3 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4
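
# The same file can also be read through the generic interface.
# Sketch assuming "read_parquet" as the table function, as listed
# under Arguments; `df_pq` is an illustrative name, not part of the package.
df_pq <- df_from_file(path_parquet, "read_parquet")
names(df_pq)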

unlink(path_parquet)