In contrast to other programming languages, R has no widely
established and undisputed style guide (e.g. PEP 8 for Python). As a
data scientist, I helped to establish a company-wide R style guide.
While it mainly relies on the tidyverse style guide, we
generally decided to be more explicit in our coding practice. In
particular, we always refer to functions from packages that are not
part of base R with the double colon operator ::. While it is relatively
easy to establish such a convention in new projects, it is challenging
to adapt ongoing projects and legacy code. origin allows
for much faster conversion of both legacy code and code that is
currently being written.
The main purpose of origin is to add pkg:: to R function calls,
so that a bare call such as fun() becomes pkg::fun().
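As a small, hedged illustration of this kind of change, the sketch below contrasts the two calling styles. It does not call origin itself; it uses the tools package (shipped with R) and a made-up file name purely as an example.

```r
# Before: the call relies on the package being attached via library().
library(tools)
ext_attached <- file_ext("analysis.R")

# After: the package a function comes from is stated explicitly.
ext_explicit <- tools::file_ext("analysis.R")

# Both styles call the same function and return the same value.
print(identical(ext_attached, ext_explicit))
```

The second form keeps working even if the library() call is removed, which is the property the explicit style is after.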
In general, you can either originize some selected text (more on that
later in Addins), a whole script, or all scripts in a specific folder,
e.g. your project folder. There is a specifically designed function for
each purpose, yet they all share the same options. Therefore, only
originize_file() is presented in detail as an example,
with its default options.
originize_file(file = "testscript.R",
               pkgs = .packages(), 
               overwrite = TRUE,
               ask_before_applying_changes = TRUE,
               ignore_comments = TRUE,
               check_conflicts = TRUE,
               add_base_packages = FALSE,
               check_base_conflicts = TRUE, 
               check_local_conflicts = TRUE,
               excluded_functions = list(dplyr = c("%>%", "across"),
                                         data.table = c(":=", "%like%"),
                                         # exclude from all packages:
                                         c("first", "last")), 
               verbose = TRUE, 
               use_markers = TRUE)

- pkgs: which packages to check for functions used in the
  code (see Considered Packages). The default is all
  packages attached via library() or require().
- overwrite: actually insert pkg:: into the
  code. Otherwise, logging shows only what would happen. Note
  that ask_before_applying_changes still lets you keep
  control over your code before origin changes anything.
- ask_before_applying_changes: whether changes should be
  applied immediately or the user must approve them first.
- check_conflicts: whether origin should check for
  potential namespace conflicts, i.e. a used function is defined in more
  than one considered package. User input is required to resolve the issue.
  Setting this to TRUE is strongly encouraged.
- add_base_packages: whether base packages should also be added,
  e.g. base::sum().
- check_base_conflicts: whether origin should also check for
  conflicts with base R functions.
- check_local_conflicts: whether origin should also check for
  conflicts with functions defined locally anywhere in your project. Note
  that it does not check the environment but solely parses files and scans
  them for function definitions.
- excluded_functions: a (named) list of functions to
  exclude from checking.
- verbose: whether logging is performed, either in
  the console or via the Markers tab in RStudio.
- use_markers: whether to use the Markers tab in
  RStudio.
- filetypes: which file types to consider. Currently,
  origin supports .R, .Rmd, and .Qmd (Quarto) files.

Besides using regular R functions to originize files, there are also
useful addins delivered with origin. These addins are
designed to be used on the fly while coding. You can either originize
selected text, the currently opened file, or all scripts in the
currently opened project. To give you as much control as when using
the functions, each function argument corresponds to an option that can
be set and is then used inside the addins.
In fact, most function arguments of origin first check
whether an option has been declared and use the assigned value as their
default. This yields the same outcome regardless of whether you use the
addin or call a function directly.
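The option-backed default described above can be sketched with base R's getOption(). This is only an illustration of the pattern: the function run_sketch() and the option name mypkg.verbose are hypothetical and are not origin's actual names.

```r
# Hypothetical function: the default of `verbose` is read from an option
# if one has been set, and falls back to FALSE otherwise.
run_sketch <- function(verbose = getOption("mypkg.verbose", default = FALSE)) {
  verbose
}

# Make sure the (hypothetical) option starts out unset; keep the old state.
old <- options(mypkg.verbose = NULL)

default_unset <- run_sketch()                 # no option set: fallback is used
options(mypkg.verbose = TRUE)
default_set <- run_sketch()                   # option set: its value is used
explicit <- run_sketch(verbose = FALSE)       # explicit argument wins

# Restore the previous option state.
options(old)
```

With this pattern, an addin that calls the function without arguments and a script that calls it directly pick up the same settings.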
Since origin changes files on disk, it is very important
that the user has full control over what happens. User input is
therefore required before critical steps.
Most importantly, the user must be aware of what the originized file(s) would look like. For this, all changes and potentially missed changes are presented, either in the Markers tab (recommended) or in the console.
- pkg:: is inserted prior to a function.
- Potentially missed changes: origin highlights such cases in the
  logging output.
- Infix functions like %>% are exported by packages but cannot be called with
  the pkg::fun() convention. Such functions are highlighted
  by default to point out to the user that they stem from a package. When using
  dplyr-style code, consider excluding the pipe operator via
  excluded_functions.

Due to the variety of R packages, function names are not necessarily unique
across all packages out there. By default, R masks previously attached
functions with those attached afterwards. origin mimics this
rule by giving a higher priority to those packages that are listed
first. In case there is a conflict regarding a used
function, the affected functions are listed along with the packages from which
they stem.
Used functions in mutliple Packages!
filter: dplyr, stats first: data.table, dplyr
Order in which relevant packges are evaluated: data.table >> dplyr >> stats
Do you want to proceed? 1: YES 2: NO
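The priority rule behind this message can be sketched in a few lines of base R. The export lists below are hand-written toy data (hypothetical and heavily shortened), and the code is not origin's implementation; it only illustrates "first listed package wins".

```r
# Toy export lists, ordered by priority (first entry = highest priority).
exports <- list(
  data.table = c("first", "last", "between"),
  dplyr      = c("filter", "first", "last", "between"),
  stats      = c("filter", "sd")
)

# Functions that appear in the (imaginary) script being checked.
used_functions <- c("filter", "first", "sd")

# For each used function, collect the packages exporting it, in priority order.
providers <- lapply(used_functions, function(f) {
  names(exports)[vapply(exports, function(e) f %in% e, logical(1))]
})
names(providers) <- used_functions

# A conflict is a function provided by more than one considered package.
conflicts <- providers[lengths(providers) > 1]

# Resolution: the first package in priority order is chosen.
resolved <- vapply(providers, function(p) p[1], character(1))

print(conflicts)
print(resolved)
```

In this toy setup, filter resolves to dplyr and first to data.table, mirroring the evaluation order shown in the message.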
Just as packages mask each other's functions, locally
defined custom functions mask package functions. In case you defined your own last
function in your project, origin should
not add dplyr:: to it. Therefore, your
project is searched for function definitions, and local functions have
higher priority than those exported by packages. Note that, depending on
the project size, this process can take quite some time. In this case,
set the argument/option path_to_local_functions to a
subdirectory, or set check_local_conflicts to FALSE
to skip this feature.
Locally defined and used functions mask exported functions from packages
last: dplyr
Local functions have higher priority. In case you want to use an exported version of a function listed above set pkg::fun manually
Got it? 1: YES 2: NO 3: Show files
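Detecting function definitions by parsing rather than by inspecting the environment can be sketched as follows. This is a simplified assumption of how such a scan might work (top-level assignments only), not origin's actual code; the script text and the helper find_local_functions() are made up for illustration.

```r
# Toy script text standing in for a project file (hypothetical content).
script <- '
last <- function(x) x[length(x)]
helper = function(a, b) a + b
value <- 42
result <- last(1:3)
'

# Return the names that are assigned a function at the top level of the code.
# The code is only parsed, never evaluated.
find_local_functions <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)
  is_fun_def <- function(e) {
    is.call(e) &&
      (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("="))) &&
      is.name(e[[2]]) &&
      is.call(e[[3]]) &&
      identical(e[[3]][[1]], as.name("function"))
  }
  defs <- Filter(is_fun_def, as.list(exprs))
  vapply(defs, function(e) as.character(e[[2]]), character(1))
}

local_funs <- find_local_functions(script)
print(local_funs)
```

Names found this way would then take precedence over identically named package exports, such as last from dplyr.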
When originizing a complete folder or project, many R
scripts might be checked. In case the user is unaware that there are
many files in the selected folder, resulting in a long run time of
origin, a warning is triggered and user input is
required.
You are about to originize 99 files.
Proceed? 1: YES 2: NO 3: Show files
Before the proposed changes are applied eventually, a final user input is required.
Happy with the result? 😀
1: YES 2: NO
Whether or not to add pkg:: to each (imported) function
is a controversial issue in the R
community. While the tidyverse style guide does not mention explicit
namespacing, R Packages and the Google
R style guide are in favor of it.
Pros
Cons
- (minimal) performance issue
- more writing required
- longer code
- infix functions like %>% cannot be called via
  magrittr::%>% and workarounds are still required here.
  Either use
  library(magrittr, include.only = "%>%")
  or
  `%>%` <- magrittr::`%>%`
- calling library() at the top of a script clearly
  indicates which packages are needed. A package that is not yet installed throws
  an error right away, rather than only when a function cannot be found later in the
  script. However, one can use the include.only argument and
  set it to NULL. No functions are attached to the search
  list then.
  library(magrittr, include.only = NULL)

As a new feature, origin exports the function
check_pkg_usage. Suppose you take over a project or have
built up a huge barrage of library calls over time: which of
those are actually still needed? Just run all those
library(...) calls and then call
check_pkg_usage()
== Package Usage Report ================================================
-- Used Packages: 2 ----------------------------------------------------
v data.table
v testthat  
-- Unused Packages: 1 --------------------------------------------------
i dplyr
-- Possible Namespace Conflicts:  1 -----------------------------------
x last      data.table >> dplyr
-- Specifically (`pkg::fun()`) further used Packages: 2 ----------------
i purrr
-- Functions with unknown origin: 1 ------------------------------------
x map

The output shows that
- we had attached 3 packages: {data.table}, {testthat}, and {dplyr}
- functions from {data.table} and {testthat} are used
- {dplyr} functions are not used
- there is a namespace conflict for the function last between
  {data.table} and {dplyr}
- additionally, we use purrr:: on some occasions
- we use the map() function, which is not exported from {data.table},
  {testthat}, or {dplyr}. Note that map is exported from
  {purrr}, which is used elsewhere, but here our code would fail since
  {purrr} is not attached and map cannot be found.
A Markers tab shows all unknown functions and all unknown packages that are used explicitly.
Taking a closer look at result:
as.data.frame(result)
#>       pkg         fun n_calls namespaced conflict conflict_pkgs
#> 1    base        %in%      53      FALSE    FALSE            NA
#> 2    base   .packages       8      FALSE    FALSE            NA
#> 3    base      Filter       3      FALSE    FALSE            NA
#> 4    base         Map       1      FALSE    FALSE            NA
#> 5    base      Reduce       5      FALSE    FALSE            NA
#> ...

It first shows a lot of base functions. Even though they are not explicitly attached by the user, base R packages are always attached. The printed report does not show them, but they are available if you want to dive deeper into the functions that are used in the project.
#>             pkg              fun n_calls namespaced conflict conflict_pkgs
#> 110  data.table           %like%      10      FALSE    FALSE            NA
#> 111  data.table               :=       1      FALSE    FALSE            NA
#> 112  data.table               CJ       1      FALSE    FALSE            NA
#> 113  data.table    as.data.table       1      FALSE    FALSE            NA
#> 114  data.table    as.data.table       3       TRUE    FALSE            NA
#> 115  data.table             last       2       TRUE    FALSE            NA
#> 116  data.table             last       1      FALSE     TRUE         dplyr

Going further, there are a number of {data.table} functions that have
been used. Some are listed twice because they were sometimes called via
data.table:: and sometimes not. Furthermore, last
is marked with conflict = TRUE. This is because {dplyr}
exports a last function as well. However, since
{data.table} has a higher priority than {dplyr} in this project,
{origin} considers it a {data.table} function. Note that if a
function is namespaced via ::, no conflict is reported.
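How a call can be classified as namespaced or not may be sketched with base R's parse data. This is an assumption about one possible approach, not a description of origin's internals; the code string is a made-up example and only utils::getParseData() from base R is used.

```r
# Made-up code to analyse; it is parsed but never evaluated.
code <- "x <- data.table::last(dt)\ny <- last(dt)\nz <- sum(1, 2)"

# keep.source = TRUE is needed so that parse data is recorded.
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Terminal tokens appear in source order.
term <- pd[pd$terminal, ]

# Function calls are tagged SYMBOL_FUNCTION_CALL; a call is namespaced
# if the token right before it is the `::` operator (NS_GET).
idx <- which(term$token == "SYMBOL_FUNCTION_CALL")
namespaced <- idx > 1 & term$token[pmax(idx - 1, 1)] == "NS_GET"

calls <- data.frame(fun = term$text[idx], namespaced = namespaced)
print(calls)
```

Tabulating such a data frame by function name and the namespaced flag would give counts similar in spirit to the n_calls and namespaced columns above.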
Finally, at the end of the output:
#>         pkg  fun n_calls namespaced conflict conflict_pkgs
#> 219    <NA>  map       1      FALSE       NA            NA
#> 220  dplyr  <NA>       0         NA       NA            NA

Here we see the map function, which could not be assigned
to any of the given packages, and the {dplyr} package, which has not been
used.
Locally defined functions are also detected via parsing. They also have a higher priority than functions exported from other packages.