Query database tables and views over a DBI connection using
data.table’s [i, j, by] syntax, attach
database schemas to the search path, and programmatically load database
catalogs.
This vignette assumes that you are already fluent with
data.table’s syntax and that you know how to open a
database connection using the DBI package.
If you haven’t alreay done so, install the dbi.table
package from CRAN.
This section uses the Chinook Database (included in the package) to demonstrate how to
dbi.table using the
dbi.table function,dbi.table using data.table’s
[i, j, by] syntax,dbi.attach
function, anddbi.catalog
function.The function chinook.duckdb returns an open
duckdb (DBI) connection to the sample Chinook Database.
This connection is a typical DBI connection as returned by
DBI::dbConnect that can be used as the conn
argument in DBI package functions. Let’s get started by loading the
package and opening the connection.
dbi.tableThe dbi.table function takes 2 arguments: a DBI
connection, and an Id indentifying a database table or view.
The object my_album is a dbi.table, a data
structure that represents an SQL query (which we refer to as the
dbi.table’s underlying SQL query). The
print method displays a preview of the underlying SQL
query.
## <chinook_duckdb> Album 
##  AlbumId                                 Title ArtistId
##    <int>                                <char>    <int>
##        1 For Those About To Rock We Salute You        1
##        2                     Balls to the Wall        2
##        3                     Restless and Wild        2
##        4                     Let There Be Rock        1
##        5                              Big Ones        3
##  ---The preview has a format similar to a data.table with
two notable exceptions.
The row numbers are omitted. SQL queries do not necessarily
return the result set in a reliable order (even on subsequent
evaluations of the same query), and dbi.table does not make
any extra effort to order the rows by default.
Only the first 5 rows of the dbi.table are displayed
(data.table displays the first 5 and the last 5). Again,
since the result set does not have a reliable order, it is not possible
to say which rows are the first and which are the last. The rows
displayed are the first 5 returned by the RDBMS.
The function as.data.table executes the
dbi.table’s underlying SQL query and retrieves the result
set as a data.table. Pro tip: calling the extract method
([]) with no arguments is a shortcut for
as.data.table.
##      AlbumId                                                        Title ArtistId
##        <int>                                                       <char>    <int>
##   1:       1                        For Those About To Rock We Salute You        1
##   2:       2                                            Balls to the Wall        2
##   3:       3                                            Restless and Wild        2
##   4:       4                                            Let There Be Rock        1
##   5:       5                                                     Big Ones        3
##  ---                                                                              
## 343:     343                                       Respighi:Pines of Rome      226
## 344:     344 Schubert: The Late String Quartets & String Quintet (3 CD's)      272
## 345:     345                                          Monteverdi: L'Orfeo      273
## 346:     346                                        Mozart: Chamber Music      274
## 347:     347           Koyaanisqatsi (Soundtrack from the Motion Picture)      275Since the result set is instantiated locally as a
data.table, the row numbers and the last 5 rows are
displayed.
Note: by default, as.data.table (and its []
shortcut) fetches a maximum of 10,000 rows. To override this limit,
either set the option dbitable.max.fetch or call
as.data.table and provide the n argument
(e.g., n = -1 to fetch the entire result set).
Lastly, the csql function displays the
dbi.table’s underlying SQL query.
## SELECT Album.AlbumId AS AlbumId,
##        Album.Title AS Title,
##        Album.ArtistId AS ArtistId
## 
##   FROM Album AS Album
## 
##  LIMIT 10000The underlying SQL query of a newly created dbi.table
selects all the columns from the database table.
dbi.table using data.table SyntaxThis extract from data.table’s Introduction to
data.table vignette pretty much sums up what dbi.table
does.
In general, dbi.table should be able to handle basic
data.table syntax. SQL translation is done by
dbplyr::translate_sql_ which works with a variety of R
functions. However, complicated expressions (e.g., custom functions in
j, nested aggregation functions, most special symbols) do
not work.
Best practice is to use dbi.table to subset and wrangle
on the database, then data.table to fine tune locally.
When i is a logical expression of the variables in the
dbi.table then it becomes the WHERE clause in the
dbi.table’s underlying SQL query.
## SELECT Album.AlbumId AS AlbumId,
##        Album.Title AS Title,
##        Album.ArtistId AS ArtistId
## 
##   FROM Album AS Album
## 
##  WHERE Album.AlbumId = (Album.ArtistId + 1)
## 
##  LIMIT 10000## <chinook_duckdb> Album 
##  AlbumId               Title ArtistId
##    <int>              <char>    <int>
##        3   Restless and Wild        2
##       59 Deep Purple In Rock       58
##       88            Faceless       87When i is a call to order (or
chorder), it becomes the ORDER BY clause in the
dbi.table’s underlying SQL query.
## SELECT Album.AlbumId AS AlbumId,
##        Album.Title AS Title,
##        Album.ArtistId AS ArtistId
## 
##   FROM Album AS Album
## 
##  ORDER BY LENGTH(Album.Title), Album.AlbumId DESC
## 
##  LIMIT 10000## <chinook_duckdb> Album 
##  AlbumId  Title ArtistId
##    <int> <char>    <int>
##      131     IV       22
##      239    War      150
##      236    Pop      150
##      182    Vs.      118
##      181    Ten      118
##  ---When j is a list of expressions of the variables in the
dbi.table, then j becomes the SELECT
clause in the dbi.table’s underlying SQL query.
## <chinook_duckdb> Album 
##  AlbumId                                 Title
##    <int>                                <char>
##        1 For Those About To Rock We Salute You
##        2                     Balls to the Wall
##        3                     Restless and Wild
##        4                     Let There Be Rock
##        5                              Big Ones
##  ---When by is a list of expressions of the variables in the
dbi.table, then by becomes the GROUP
BY clause in the dbi.table’s underlying SQL query.
## SELECT Album.ArtistId AS ArtistId,
##        COUNT(*) AS "# of Albums"
## 
##   FROM Album AS Album
## 
##  GROUP BY Album.ArtistId
## 
##  LIMIT 10000## <chinook_duckdb> Album 
##  ArtistId # of Albums
##     <int>       <num>
##         1           2
##         2           2
##         3           1
##         4           1
##         5           1
##  ---The dbi.attach function attaches a DBI
connection to the search path. This means that dbi.attach
creates a dbi.table for each table and view in the schema
associated with the DBI connection, then assigns these
dbi.tables to an environment on the search path.
A quick look at the search path shows the database attached in position 2.
## [1] ".GlobalEnv"            "duckdb:chinook_duckdb" "package:dbi.table"The tables and views in the database schema are queriable as
dbi.tables in the attached environment
duckdb:chinook_duckdb.
##  [1] "Album"         "Artist"        "Customer"      "Employee"      "Genre"        
##  [6] "Invoice"       "InvoiceLine"   "MediaType"     "Playlist"      "PlaylistTrack"
## [11] "Track"Note: Attaching a DBI connection is intended for an interactive exploratory analysis of a database (schema). For programatic use cases, see the Load a Database Catalog section.
dbi.tablesTwo dbi.tables that share the same connection (for
example, all the dbi.tables in an attached schema share the
same connection) can be merged. Merging two dbi.tables
results in an SQL join that describes the same result set as the
associated data.table merge. That is,
and
are the same data.table up to row order.
## SELECT Album.ArtistId AS ArtistId,
##        Album.AlbumId AS AlbumId,
##        Album.Title AS Title,
##        Artist."Name" AS "Name"
## 
##   FROM chinook_duckdb.main.Album AS Album
## 
##  INNER JOIN chinook_duckdb.main.Artist AS Artist
##     ON Album.ArtistId = Artist.ArtistId
## 
##  LIMIT 10000If dbi.table can determine the foreign key
constraints between x and y, and if there is
only one foreign key (either x referring to y,
or y referring to x), then this foreign key is
used as the default by when merging. Otherwise,
dbi.table uses the same algorithm to determine the default
by columns as data.table.
## SELECT Customer.SupportRepId AS SupportRepId,
##        Customer.CustomerId AS CustomerId,
##        Customer.PostalCode AS "PostalCode.x",
##        Customer.Address AS "Address.x",
##        Customer.Fax AS "Fax.x",
##        Customer.Email AS "Email.x",
##        Customer.City AS "City.x",
##        Customer.Country AS "Country.x",
##        Customer.FirstName AS "FirstName.x",
##        Customer.Phone AS "Phone.x",
##        Customer.Company AS Company,
##        Customer.LastName AS "LastName.x",
##        Customer.State AS "State.x",
##        Employee.Fax AS "Fax.y",
##        Employee.Email AS "Email.y",
##        Employee.FirstName AS "FirstName.y",
##        Employee.HireDate AS HireDate,
##        Employee.Address AS "Address.y",
##        Employee.Country AS "Country.y",
##        Employee.PostalCode AS "PostalCode.y",
##        Employee.Phone AS "Phone.y",
##        Employee.LastName AS "LastName.y",
##        Employee.ReportsTo AS ReportsTo,
##        Employee.City AS "City.y",
##        Employee.Title AS Title,
##        Employee.BirthDate AS BirthDate,
##        Employee.State AS "State.y"
## 
##   FROM chinook_duckdb.main.Customer AS Customer
## 
##  INNER JOIN chinook_duckdb.main.Employee AS Employee
##     ON Customer.SupportRepId = Employee.EmployeeId
## 
##  LIMIT 10000In this case, the Customer table has a foreign key
SupportRepId that refers to the Employee
table’s primary key EmployeeId.
When the y argument is omitted, dbi.table’s
merge uses the foreign key constraints that x
references to determine the y (or ys) to merge
with.
When y is missing, merge.dbi.table merges
with the tables referenced by x’s foreign keys (one merge
for each foreign key).
## SELECT Track.MediaTypeId AS MediaTypeId,
##        Track.GenreId AS GenreId,
##        Track.AlbumId AS AlbumId,
##        Track.TrackId AS TrackId,
##        Track."Milliseconds" AS "Milliseconds",
##        Track.Composer AS Composer,
##        Track.Bytes AS Bytes,
##        Track."Name" AS "Name",
##        Track.UnitPrice AS UnitPrice,
##        Album.Title AS "AlbumId.Title",
##        Album.ArtistId AS "AlbumId.ArtistId",
##        Genre."Name" AS "GenreId.Name",
##        MediaType."Name" AS "MediaTypeId.Name"
## 
##   FROM chinook_duckdb.main.Track AS Track
## 
##   LEFT OUTER JOIN chinook_duckdb.main.Album AS Album
##     ON Track.AlbumId = Album.AlbumId
## 
##   LEFT OUTER JOIN chinook_duckdb.main.Genre AS Genre
##     ON Track.GenreId = Genre.GenreId
## 
##   LEFT OUTER JOIN chinook_duckdb.main.MediaType AS MediaType
##     ON Track.MediaTypeId = MediaType.MediaTypeId
## 
##  LIMIT 10000When the optional recursive argument is
TRUE, merge.dbi.table calls merge
with recuresive = TRUE on each referenced table and merges
x with the result. In this example, Track has
a foreign key that references Album and Album
has a foreign key that references Artist. The columns from
Artist are included in the result set when
merge is called with recuresive = TRUE.
## WITH CTE2 AS (
## SELECT Album.ArtistId AS "AlbumId.ArtistId",
##        Album.AlbumId AS "AlbumId.AlbumId",
##        Album.Title AS "AlbumId.Title",
##        Artist."Name" AS "AlbumId.ArtistId.Name"
## 
##   FROM chinook_duckdb.main.Album AS Album
## 
##   LEFT OUTER JOIN chinook_duckdb.main.Artist AS Artist
##     ON Album.ArtistId = Artist.ArtistId
## )
## 
## SELECT Track.MediaTypeId AS MediaTypeId,
##        Track.GenreId AS GenreId,
##        Track.AlbumId AS AlbumId,
##        Track.TrackId AS TrackId,
##        Track."Milliseconds" AS "Milliseconds",
##        Track.Composer AS Composer,
##        Track.Bytes AS Bytes,
##        Track."Name" AS "Name",
##        Track.UnitPrice AS UnitPrice,
##        CTE2."AlbumId.ArtistId" AS "AlbumId.ArtistId",
##        CTE2."AlbumId.Title" AS "AlbumId.Title",
##        CTE2."AlbumId.ArtistId.Name" AS "AlbumId.ArtistId.Name",
##        Genre."Name" AS "GenreId.Name",
##        MediaType."Name" AS "MediaTypeId.Name"
## 
##   FROM chinook_duckdb.main.Track AS Track
## 
##   LEFT OUTER JOIN CTE2 AS CTE2
##     ON Track.AlbumId = CTE2."AlbumId.AlbumId"
## 
##   LEFT OUTER JOIN chinook_duckdb.main.Genre AS Genre
##     ON Track.GenreId = Genre.GenreId
## 
##   LEFT OUTER JOIN chinook_duckdb.main.MediaType AS MediaType
##     ON Track.MediaTypeId = MediaType.MediaTypeId
## 
##  LIMIT 10000As a best practice for programatic use, it is better to load the catalog in order to avoid modifying the search path.
Printing the catalog lists its schemas.
## <Database Catalog> duckdb::chinook_duckdb (2 schemas containing 19 objects) 
## [1] "information_schema" "main"Individual tables can be accessed using
catalog$schema$table syntax.
## <chinook_duckdb> Album 
## Key (non-strict): <AlbumId>
##  AlbumId                                 Title ArtistId
##    <int>                                <char>    <int>
##        1 For Those About To Rock We Salute You        1
##        2                     Balls to the Wall        2
##        3                     Restless and Wild        2
##        4                     Let There Be Rock        1
##        5                              Big Ones        3
##  ---This section provides a brief explanation of what the
dbi.table package is trying to do.
Suppose that x is a dbi.table and that
e is an expression involving x that returns
either a dbi.table or a data.table.
Since dbi.table’s syntax is a subset of
data.table’s syntax, if e can be evaluated
successfully (i.e., eval(e) does not throw an error), then
e should also be able to be successfully evaluated when
x is a data.table. There are thus 2 paths to
the final data.table result:
evaluate e then coerce the result using
as.data.table, or
coerce x to a data.table then evaluate
e.
Path 2 is referred to as the reference implementation and describes
the correct answer: the reference result set. The
design goal of dbi.table is to get the same result set as
the reference result set, up to row order.
result_set <- as.data.table(eval(e))
x <- as.data.table(x)
reference_result_set <- eval(e)
all.equal(reference_result_set, result_set, ignore.row.order = TRUE)## [1] TRUEThe dbi.table package includes the function
reference.test that compares the result set to the
reference result set in the more general case where expr
(the function’s first argument) is an expression involving 1 or more
dbi.tables.
x <- dbi.table(chinook, DBI::Id("Album"))
reference.test({
  x[, .("# of Albums" = .N), .(ArtistId)]
})## [1] TRUEThis function is used extensively in dbi.table’s
unit/regression tests.
We used the chinook.duckdb function to open a DBI
connection at the beginning of this vignette and now it is up to us to
close it.
However, this leaves our R session in a wonky state. The environment
“duckdb:chinook_duckdb” is still attached and there are several
dbi.tables in the global environment - all of these
dbi.tables are associated with an invalid DBI
connection.
## Error in `dbSendQuery()`:
## ! rapi_prepare: Invalid connectionThe R objects associated with our now-closed DBI connection need to be cleaned up manually (or you could just restart R).
Alternatively, when using either dbi.attach or
dbi.catalog, the first arguement can be a zero-argument
function that returns an open DBI connection. When
dbi.table uses a function to open the DBI connection, then
that connection belongs to dbi.table and
dbi.table will take care of closing it when it is no longer
needed.
When dbi.table is managing the connection, then all the
user has to do is detach (or delete if a catalog). The DBI connection
will be closed when the object is garbage collected.
Further, when dbi.table owns the connection, it is able
to reconnect in the event that the connection unexpectedly drops.