connector is a package that provides a set of functions to connect to different data sources (such as databases, APIs, and file systems) and read and write data from them. It is designed to be a generic, extensible package, so that new data sources can be added easily.
Introduction
This vignette demonstrates how to use the connector package to connect to either a file system or a database to access different types of data.
First let’s load all the packages we’re going to use in this vignette:
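# A sketch of the library calls referenced above; the two extension packages
# must be installed in addition to connector itself.
library(connector)
library(connector.databricks)
library(connector.sharepoint)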
You may notice we’re not loading only the connector package, but also connector.databricks and connector.sharepoint. These packages extend the functionality of the connector package to connect to Databricks and SharePoint, respectively.
Connector configuration
The main function in this package is connect(). Based on a configuration file or a list, it creates a connectors() object with a connector for each of the specified datasources. The configuration file can be in any format that can be read through read_file() and contains a list. If a yaml file is provided, expressions are evaluated when parsing it using yaml::read_yaml() with eval.expr = TRUE.
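For reference, values tagged with !expr in a yaml configuration are evaluated while the file is parsed. A small standalone sketch of this behaviour, independent of connector, could look like this:
# Sketch: yaml expressions are evaluated when eval.expr = TRUE
cfg_text <- 'root_path: !expr file.path(tempdir(), "demo")'
yaml::read_yaml(text = cfg_text, eval.expr = TRUE)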
The input list (or configuration file) has to have the following structure:
- Only the metadata, env, connections, and datasources fields are allowed.
- All elements must be named.
- datasources is mandatory.
- metadata and env must each be a list of named character vectors of length 1.
- datasources must be a list of unnamed lists.
- Each datasource must have a named character element name and a named list element backend.
- For each backend, the type element must be provided.
Here is an example of a configuration file with different types of connections:
metadata:
  trial: "yyyy"
  project: "nnxxxx"
  instance: "current"
  http_path: !!expr Sys.getenv("DATABRICKS_HTTP_PATH")
  catalog: !!expr Sys.getenv("DATABRICKS_CATALOG_NAME")
  sharepoint: "https://my_organisation.sharepoint.com"
  root_path: "/home/my_root_path"

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.root_path}/{metadata.project}/{metadata.trial}/{metadata.instance}/adam"

  - name: "tfl"
    backend:
      type: "connector.databricks::connector_databricks_volume"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_tfl"
      path: "output"

  - name: "metadata"
    backend:
      type: "connector.databricks::connector_databricks_dbi"
      http_path: "{metadata.http_path}"
      catalog: "{metadata.catalog}"
      schema: "{metadata.project}_{metadata.trial}_metadata"

  - name: "output_sh"
    backend:
      type: "connector.sharepoint::connector_sharepoint"
      site_url: "{metadata.sharepoint}/sites/{metadata.project}_{metadata.trial}_output"
In this example we have multiple connections to multiple data sources. ADaM data is stored on the file system, metadata resides in Databricks Tables, and TFL data is also on Databricks, but on Volumes, because that is where the unstructured data is stored. Finally, the output is stored on SharePoint.
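The same configuration can also be passed to connect() as an R list instead of a file. A minimal sketch covering only the ADaM datasource, with the same placeholder values as above, could look like this:
# Sketch: passing the configuration as a list instead of a yaml file
config <- list(
  metadata = list(
    trial = "yyyy",
    project = "nnxxxx",
    instance = "current",
    root_path = "/home/my_root_path"
  ),
  datasources = list(
    list(
      name = "adam",
      backend = list(
        type = "connector::connector_fs",
        path = "{metadata.root_path}/{metadata.project}/{metadata.trial}/{metadata.instance}/adam"
      )
    )
  )
)

db <- connect(config = config)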
Connect to datasources
When we want to connect to the datasources, we can use the connect() function.
# Load data connections
db <- connect()
This creates a list of connectors, which can be accessed by their names:
# ADaM data
db$adam
# TFL data
db$tfl
# Metadata
db$metadata
# Output on SharePoint
db$output_sh
After the setup is done, we can use these connections to read and write data from the datasources.
Reading and writing data
The connector packages provide a set of functions to read and write data from the datasources. They all have a similar interface, so it’s easy to switch between them.
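Because every connector responds to the same generic verbs, switching a pipeline from one backend to another mostly means piping into a different connector. A small sketch, using an illustrative iris file name, could look like this:
# Same generics, different backends: the file system connector...
db$adam |> write_cnt(iris, "iris.rds", overwrite = TRUE)
db$adam |> list_content_cnt()
db$adam |> read_cnt("iris")

# ...and the very same write call against, for example, the Databricks Volumes connector
db$tfl |> write_cnt(iris, "iris.rds")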
Now we will show how to read and write different types of data from and to different datasources. In these examples we will be using the iris and mtcars datasets.
Here is an example of writing data to the ADaM datasource:
library(dplyr)

# Manipulate data
## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Mtcars data
cars <- mtcars |>
  filter(mpg > 22)

mean_for_all_mtcars <- mtcars |>
  group_by(gear) |>
  summarise(
    across(
      everything(),
      list("mean" = mean, "median" = median, "sd" = sd, "min" = min, "max" = max),
      .names = "{.col}_{.fn}"
    )
  ) |>
  tidyr::pivot_longer(
    cols = -gear,
    names_to = c(".value", "stat"),
    names_sep = "_"
  )

## Store data
db$adam |>
  write_cnt(setosa, "setosa", overwrite = TRUE)

db$adam |>
  write_cnt(mean_for_all_iris, "mean_iris", overwrite = TRUE)

db$adam |>
  write_cnt(cars, "cars_mpg", overwrite = TRUE)

db$adam |>
  write_cnt(mean_for_all_mtcars, "mean_mtcars", overwrite = TRUE)
Now, let’s read the data back, manipulate it a bit, and write it to SharePoint and Databricks. This way we can save different types of data in different formats.
library(gt)
library(tidyr)
library(ggplot2)

# List and load data
db$adam |>
  list_content_cnt()

table <- db$adam |>
  read_cnt("mean_mtcars")

gttable <- table |>
  gt(groupname_col = "gear")

## Save table
db$tfl$write_cnt(gttable$`_data`, "tmeanallmtcars.csv")
db$tfl$write_cnt(gttable, "tmeanallmtcars.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".docx")
gtsave(gttable, tmp_file)
db$output_sh$upload_cnt(tmp_file, "tmeanallmtcars.docx")
# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs using Databricks Volumes
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Using Sharepoint
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
db$output_sh$upload_cnt(tmp_file, "fsetosa.png")

## Using Databricks Volumes
db$tfl$upload_cnt(contents = tmp_file, file_path = "fsetosa.png")
Working example
Here is an example anyone can run to see how the connector package works. We will use the configuration file provided below, which only uses the file system as the type of connection for ADaM and TFL data.
metadata:
  adam_path: !expr file.path(tempdir(), "adam")
  tfl_path: !expr file.path(tempdir(), "tfl")

datasources:
  - name: "adam"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.adam_path}"

  - name: "tfl"
    backend:
      type: "connector::connector_fs"
      path: "{metadata.tfl_path}"
Now, let’s run the example:
library(connector)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
# Let's create ADaM and TFL directories in temporary directory of the session
dir.create(file.path(tempdir(), "adam"))
dir.create(file.path(tempdir(), "tfl"))
# Get example configuration from file
config_file <- system.file("config", "config_file_system.yml", package = "connector")
# Load data connections
db <- connect(config = config_file)
#> ────────────────────────────────────────────────────────────────────────────────
#> Connection to:
#> → adam
#> • connector_fs
#> • /tmp/Rtmpav0T1L/adam
#> ────────────────────────────────────────────────────────────────────────────────
#> Connection to:
#> → tfl
#> • connector_fs
#> • /tmp/Rtmpav0T1L/tfl
## Iris data
setosa <- iris |>
  filter(Species == "setosa")

mean_for_all_iris <- iris |>
  group_by(Species) |>
  summarise_all(list(mean, median, sd, min, max))

## Store data
db$adam |>
  write_cnt(setosa, "setosa.rds")

db$adam |>
  write_cnt(mean_for_all_iris, "mean_iris.rds")

## List and load data
db$adam |>
  list_content_cnt()
#> [1] "mean_iris.rds" "setosa.rds"
# Manipulate data
setosa_fsetosa <- db$adam |>
  read_cnt("setosa") |>
  filter(Sepal.Length > 5)

fsetosa <- ggplot(setosa_fsetosa) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point()

## Store TFLs
db$tfl$write_cnt(fsetosa$data, "fsetosa.csv")
db$tfl$write_cnt(fsetosa, "fsetosa.rds")

## Store images
tmp_file <- tempfile(fileext = ".png")
ggsave(tmp_file, fsetosa)
#> Saving 7.29 x 4.51 in image
db$tfl$upload_cnt(tmp_file, "fsetosa.png")
# Check if everything is written into temporary TFL directory
db$tfl$list_content_cnt()
#> [1] "fsetosa.csv" "fsetosa.png" "fsetosa.rds"
Conclusion
In this vignette we showed how to connect to different datasources and how to read and write data from them. We used the connector.databricks and connector.sharepoint packages to extend the functionality of the connector package to connect to Databricks and SharePoint, respectively. This package is designed to be extensible, so that new data sources can be added easily.