parsermd
The goal of parsermd is to extract the content of an R Markdown file to allow for programmatic interactions with the document’s contents (i.e. code chunks and markdown text). The aim is to capture the fundamental structure of the document, so we do not attempt to parse every detail of the Rmd. Specifically, the YAML front matter, markdown text, and R code are read as lines of text, allowing them to be processed using other tools.
Installation
parsermd can be installed from CRAN with:
install.packages("parsermd")
You can install the latest development version of parsermd from GitHub with:
remotes::install_github("rundel/parsermd")
Parsing Rmds
The following example shows the abstract syntax tree (AST) that results from parsing a simple Rmd file:
rmd = parsermd::parse_rmd(system.file("examples/minimal.Rmd", package = "parsermd"))
The R Markdown document is parsed and stored in a flat, ordered list object containing tagged elements. By default the package will present a hierarchical view of the document where chunks and markdown text are nested within headings, which is shown by the default print method for rmd_ast objects.
print(rmd)
#> ├── YAML [4 fields]
#> ├── Heading [h1] - Setup
#> │   └── Chunk [r, 1 option, 1 line] - setup
#> └── Heading [h1] - Content
#>     ├── Heading [h2] - R Markdown
#>     │   ├── Markdown [6 lines]
#>     │   ├── Chunk [r, 0 options, 1 line] - cars
#>     │   └── Chunk [r, 0 options, 1 line] - unnamed-chunk-1
#>     └── Heading [h2] - Including Plots
#>         ├── Markdown [2 lines]
#>         ├── Chunk [r, 1 option, 1 line] - pressure
#>         └── Markdown [2 lines]
If you would prefer to see the underlying flat structure, this can be printed by setting flat = TRUE with print.
print(rmd, flat = TRUE)
#> ├── YAML [4 fields]
#> ├── Heading [h1] - Setup
#> ├── Chunk [r, 1 option, 1 line] - setup
#> ├── Heading [h1] - Content
#> ├── Heading [h2] - R Markdown
#> ├── Markdown [6 lines]
#> ├── Chunk [r, 0 options, 1 line] - cars
#> ├── Chunk [r, 0 options, 1 line] - unnamed-chunk-1
#> ├── Heading [h2] - Including Plots
#> ├── Markdown [2 lines]
#> ├── Chunk [r, 1 option, 1 line] - pressure
#> └── Markdown [2 lines]
Additionally, to ease manipulation of the AST, the package supports transforming the object into a tidy tibble with as_tibble or as.data.frame (both return a tibble).
as_tibble(rmd)
#> # A tibble: 12 × 5
#>    sec_h1  sec_h2          type         label           ast
#>    <chr>   <chr>           <chr>        <chr>           <rmd_ast>
#>  1 NA      NA              rmd_yaml     NA              <yaml>
#>  2 Setup   NA              rmd_heading  NA              <heading [h1]>
#>  3 Setup   NA              rmd_chunk    setup           <chunk [r]>
#>  4 Content NA              rmd_heading  NA              <heading [h1]>
#>  5 Content R Markdown      rmd_heading  NA              <heading [h2]>
#>  6 Content R Markdown      rmd_markdown NA              <markdown>
#>  7 Content R Markdown      rmd_chunk    cars            <chunk [r]>
#>  8 Content R Markdown      rmd_chunk    unnamed-chunk-1 <chunk [r]>
#>  9 Content Including Plots rmd_heading  NA              <heading [h2]>
#> 10 Content Including Plots rmd_markdown NA              <markdown>
#> 11 Content Including Plots rmd_chunk    pressure        <chunk [r]>
#> 12 Content Including Plots rmd_markdown NA              <markdown>
It is also possible to convert these data frames back into an rmd_ast.
as_ast( as_tibble(rmd) )
#> ├── YAML [4 fields]
#> ├── Heading [h1] - Setup
#> │   └── Chunk [r, 1 option, 1 line] - setup
#> └── Heading [h1] - Content
#>     ├── Heading [h2] - R Markdown
#>     │   ├── Markdown [6 lines]
#>     │   ├── Chunk [r, 0 options, 1 line] - cars
#>     │   └── Chunk [r, 0 options, 1 line] - unnamed-chunk-1
#>     └── Heading [h2] - Including Plots
#>         ├── Markdown [2 lines]
#>         ├── Chunk [r, 1 option, 1 line] - pressure
#>         └── Markdown [2 lines]
Finally, we can also convert the rmd_ast back into an R Markdown document via as_document:
cat(
  as_document(rmd),
  sep = "\n"
)
#> ---
#> title: Minimal
#> author: Colin Rundel
#> date: 7/21/2020
#> output: html_document
#> ---
#>
#> # Setup
#>
#> ```{r setup, include = FALSE}
#> knitr::opts_chunk$set(echo = TRUE)
#> ```
#>
#> # Content
#>
#> ## R Markdown
#>
#> This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML,
#> PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
#>
#> When you click the **Knit** button a document will be generated that includes both content as well
#> as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
#>
#>
#> ```{r cars}
#> summary(cars)
#> ```
#>
#> ```{r unnamed-chunk-1}
#> knitr::knit_patterns$get()
#> ```
#>
#> ## Including Plots
#>
#> You can also embed plots, for example:
#>
#>
#> ```{r pressure, echo = FALSE}
#> plot(pressure)
#> ```
#>
#> Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code
#> that generated the plot.
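Since as_document returns a character vector of lines, the result can also be written to disk and parsed again. The following is a minimal sketch of that round trip; the temporary output path is an assumption made for illustration, and the example does not claim that the re-parsed AST is identical to the original in every case.

```r
library(parsermd)

# Parse the bundled example used above
rmd <- parse_rmd(system.file("examples/minimal.Rmd", package = "parsermd"))

# as_document() returns one string per line, so writeLines() can save it.
# The output location is a hypothetical temporary file.
out_path <- tempfile(fileext = ".Rmd")
writeLines(as_document(rmd), out_path)

# Parse the written file again and print its structure for comparison
rmd2 <- parse_rmd(out_path)
print(rmd2)
```

Comparing the printed tree of rmd2 against the tree of rmd is a quick way to check that no nodes were lost along the way.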
Working with the AST
Once we have parsed an R Markdown document, there are a variety of things that we can do with our new abstract syntax tree (AST). Below we demonstrate some of the basic functionality within parsermd to manipulate and edit these objects as well as check their properties.
rmd = parse_rmd(system.file("examples/hw01-student.Rmd", package="parsermd"))
rmd
#> ├── YAML [2 fields]
#> ├── Heading [h3] - Load packages
#> │   └── Chunk [r, 1 option, 2 lines] - load-packages
#> ├── Heading [h3] - Exercise 1
#> │   ├── Markdown [2 lines]
#> │   └── Heading [h4] - Solution
#> │       └── Markdown [5 lines]
#> ├── Heading [h3] - Exercise 2
#> │   ├── Markdown [2 lines]
#> │   └── Heading [h4] - Solution
#> │       ├── Markdown [2 lines]
#> │       ├── Chunk [r, 2 options, 5 lines] - plot-dino
#> │       ├── Markdown [2 lines]
#> │       └── Chunk [r, 0 options, 2 lines] - cor-dino
#> └── Heading [h3] - Exercise 3
#>     ├── Markdown [2 lines]
#>     └── Heading [h4] - Solution
#>         ├── Chunk [r, 2 options, 5 lines] - plot-star
#>         └── Chunk [r, 0 options, 2 lines] - cor-star
Say we are interested in examining the solution a student entered for Exercise 1. We can access it using the rmd_select function and its selection helper functions, specifically the by_section helper.
rmd_select(rmd, by_section( c("Exercise 1", "Solution") ))
#> └── Heading [h3] - Exercise 1
#>     └── Heading [h4] - Solution
#>         └── Markdown [5 lines]
To view the content instead of the AST we can use the as_document() function,
rmd_select(rmd, by_section( c("Exercise 1", "Solution") )) |>
  as_document()
#> [1] "### Exercise 1"
#> [2] ""
#> [3] "#### Solution"
#> [4] ""
#> [5] "2 columns, 13 rows, 3 variables: "
#> [6] "dataset: indicates which dataset the data are from "
#> [7] "x: x-values "
#> [8] "y: y-values "
#> [9] ""
#> [10] ""
Note that this gives us the Exercise 1 and Solution headings as well as the contained markdown text. If we only want the markdown text, we can refine our selector to only include nodes with the type rmd_markdown via the has_type helper.
rmd_select(rmd, by_section(c("Exercise 1", "Solution")) & has_type("rmd_markdown")) |>
  as_document()
#> [1] "2 columns, 13 rows, 3 variables: "
#> [2] "dataset: indicates which dataset the data are from "
#> [3] "x: x-values "
#> [4] "y: y-values "
#> [5] ""
#> [6] ""
This approach uses the tidyselect & operator within the selection to find the intersection of the selectors by_section(c("Exercise 1", "Solution")) and has_type("rmd_markdown"). Alternatively, the same result can be achieved by chaining multiple rmd_select calls together,
rmd_select(rmd, by_section(c("Exercise 1", "Solution"))) |>
  rmd_select(has_type("rmd_markdown")) |>
  as_document()
#> [1] "2 columns, 13 rows, 3 variables: "
#> [2] "dataset: indicates which dataset the data are from "
#> [3] "x: x-values "
#> [4] "y: y-values "
#> [5] ""
#> [6] ""
Wildcards
One useful feature of the by_section() and has_label() selection helpers is that they support glob style pattern matching. As such we can do the following to extract all of the solutions from our document:
rmd_select(rmd, by_section(c("Exercise *", "Solution")))
#> ├── Heading [h3] - Exercise 1
#> │   └── Heading [h4] - Solution
#> │       └── Markdown [5 lines]
#> ├── Heading [h3] - Exercise 2
#> │   └── Heading [h4] - Solution
#> │       ├── Markdown [2 lines]
#> │       ├── Chunk [r, 2 options, 5 lines] - plot-dino
#> │       ├── Markdown [2 lines]
#> │       └── Chunk [r, 0 options, 2 lines] - cor-dino
#> └── Heading [h3] - Exercise 3
#>     └── Heading [h4] - Solution
#>         ├── Chunk [r, 2 options, 5 lines] - plot-star
#>         └── Chunk [r, 0 options, 2 lines] - cor-star
Similarly, to extract only the chunks that involve plotting, we can match chunk labels with a “plot” prefix:
rmd_select(rmd, has_label("plot*"))
#> ├── Chunk [r, 2 options, 5 lines] - plot-dino
#> └── Chunk [r, 2 options, 5 lines] - plot-star
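Glob patterns can also be combined with boolean operators. The sketch below assumes that rmd_select accepts the other standard tidyselect operators (| for union and ! for negation) in the same way it accepts & above; only & is demonstrated in this document, so treat the other two as an assumption to verify against your installed version.

```r
library(parsermd)

rmd <- parse_rmd(system.file("examples/hw01-student.Rmd", package = "parsermd"))

# Union (assumed to work like `&`): chunks whose label starts with
# "plot" or with "cor"
plot_or_cor <- rmd_select(rmd, has_label("plot*") | has_label("cor*"))

# Negation (also an assumption): chunks whose label does not start with "plot"
non_plot <- rmd_select(rmd, has_type("rmd_chunk") & !has_label("plot*"))

print(plot_or_cor)
print(non_plot)
```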
ast as a tibble
As mentioned earlier, the ast can also be represented as a tibble, in which case we construct several columns using the properties of the ast (sections, type, and chunk label).
tbl = as_tibble(rmd)
tbl
#> # A tibble: 19 × 5
#>    sec_h3        sec_h4   type         label         ast
#>    <chr>         <chr>    <chr>        <chr>         <rmd_ast>
#>  1 NA            NA       rmd_yaml     NA            <yaml>
#>  2 Load packages NA       rmd_heading  NA            <heading [h3]>
#>  3 Load packages NA       rmd_chunk    load-packages <chunk [r]>
#>  4 Exercise 1    NA       rmd_heading  NA            <heading [h3]>
#>  5 Exercise 1    NA       rmd_markdown NA            <markdown>
#>  6 Exercise 1    Solution rmd_heading  NA            <heading [h4]>
#>  7 Exercise 1    Solution rmd_markdown NA            <markdown>
#>  8 Exercise 2    NA       rmd_heading  NA            <heading [h3]>
#>  9 Exercise 2    NA       rmd_markdown NA            <markdown>
#> 10 Exercise 2    Solution rmd_heading  NA            <heading [h4]>
#> 11 Exercise 2    Solution rmd_markdown NA            <markdown>
#> 12 Exercise 2    Solution rmd_chunk    plot-dino     <chunk [r]>
#> 13 Exercise 2    Solution rmd_markdown NA            <markdown>
#> 14 Exercise 2    Solution rmd_chunk    cor-dino      <chunk [r]>
#> 15 Exercise 3    NA       rmd_heading  NA            <heading [h3]>
#> 16 Exercise 3    NA       rmd_markdown NA            <markdown>
#> 17 Exercise 3    Solution rmd_heading  NA            <heading [h4]>
#> 18 Exercise 3    Solution rmd_chunk    plot-star     <chunk [r]>
#> 19 Exercise 3    Solution rmd_chunk    cor-star      <chunk [r]>
All of the functions above also work with this tibble representation, and allow for the same manipulations of the underlying ast.
rmd_select(tbl, by_section(c("Exercise *", "Solution")))
#> # A tibble: 13 × 5
#>    sec_h3     sec_h4   type         label     ast
#>    <chr>      <chr>    <chr>        <chr>     <rmd_ast>
#>  1 Exercise 1 NA       rmd_heading  NA        <heading [h3]>
#>  2 Exercise 1 Solution rmd_heading  NA        <heading [h4]>
#>  3 Exercise 1 Solution rmd_markdown NA        <markdown>
#>  4 Exercise 2 NA       rmd_heading  NA        <heading [h3]>
#>  5 Exercise 2 Solution rmd_heading  NA        <heading [h4]>
#>  6 Exercise 2 Solution rmd_markdown NA        <markdown>
#>  7 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>
#>  8 Exercise 2 Solution rmd_markdown NA        <markdown>
#>  9 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>
#> 10 Exercise 3 NA       rmd_heading  NA        <heading [h3]>
#> 11 Exercise 3 Solution rmd_heading  NA        <heading [h4]>
#> 12 Exercise 3 Solution rmd_chunk    plot-star <chunk [r]>
#> 13 Exercise 3 Solution rmd_chunk    cor-star  <chunk [r]>
As the complete ast is stored directly in the ast column, we can also manipulate this tibble using dplyr or similar packages and have these changes persist. For example, we can use the rmd_node_length function to return the number of lines in the various nodes of the ast and add a new lines column to our tibble.
tbl_lines = tbl |>
  dplyr::mutate(lines = rmd_node_length(ast))
tbl_lines
#> # A tibble: 19 × 6
#>    sec_h3        sec_h4   type         label         ast            lines
#>    <chr>         <chr>    <chr>        <chr>         <rmd_ast>      <int>
#>  1 NA            NA       rmd_yaml     NA            <yaml>             2
#>  2 Load packages NA       rmd_heading  NA            <heading [h3]>    NA
#>  3 Load packages NA       rmd_chunk    load-packages <chunk [r]>        2
#>  4 Exercise 1    NA       rmd_heading  NA            <heading [h3]>    NA
#>  5 Exercise 1    NA       rmd_markdown NA            <markdown>         2
#>  6 Exercise 1    Solution rmd_heading  NA            <heading [h4]>    NA
#>  7 Exercise 1    Solution rmd_markdown NA            <markdown>         5
#>  8 Exercise 2    NA       rmd_heading  NA            <heading [h3]>    NA
#>  9 Exercise 2    NA       rmd_markdown NA            <markdown>         2
#> 10 Exercise 2    Solution rmd_heading  NA            <heading [h4]>    NA
#> 11 Exercise 2    Solution rmd_markdown NA            <markdown>         2
#> 12 Exercise 2    Solution rmd_chunk    plot-dino     <chunk [r]>        5
#> 13 Exercise 2    Solution rmd_markdown NA            <markdown>         2
#> 14 Exercise 2    Solution rmd_chunk    cor-dino      <chunk [r]>        2
#> 15 Exercise 3    NA       rmd_heading  NA            <heading [h3]>    NA
#> 16 Exercise 3    NA       rmd_markdown NA            <markdown>         2
#> 17 Exercise 3    Solution rmd_heading  NA            <heading [h4]>    NA
#> 18 Exercise 3    Solution rmd_chunk    plot-star     <chunk [r]>        5
#> 19 Exercise 3    Solution rmd_chunk    cor-star      <chunk [r]>        2
Now we can apply rmd_select to this updated tibble
rmd_select(tbl_lines, by_section(c("Exercise 2", "Solution")))
#> # A tibble: 6 × 6
#>   sec_h3     sec_h4   type         label     ast            lines
#>   <chr>      <chr>    <chr>        <chr>     <rmd_ast>      <int>
#> 1 Exercise 2 NA       rmd_heading  NA        <heading [h3]>    NA
#> 2 Exercise 2 Solution rmd_heading  NA        <heading [h4]>    NA
#> 3 Exercise 2 Solution rmd_markdown NA        <markdown>         2
#> 4 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>        5
#> 5 Exercise 2 Solution rmd_markdown NA        <markdown>         2
#> 6 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>        2
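and see that our new lines column is maintained.
Note that using the rmd_select function is optional here; we can accomplish a similar task using dplyr::filter or any similar approach. Unlike rmd_select, the filter below does not retain the parent Exercise 2 heading, since that row's sec_h4 value is NA.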
tbl_lines |>
  dplyr::filter(
    sec_h3 == "Exercise 2",
    sec_h4 == "Solution"
  )
#> # A tibble: 5 × 6
#>   sec_h3     sec_h4   type         label     ast            lines
#>   <chr>      <chr>    <chr>        <chr>     <rmd_ast>      <int>
#> 1 Exercise 2 Solution rmd_heading  NA        <heading [h4]>    NA
#> 2 Exercise 2 Solution rmd_markdown NA        <markdown>         2
#> 3 Exercise 2 Solution rmd_chunk    plot-dino <chunk [r]>        5
#> 4 Exercise 2 Solution rmd_markdown NA        <markdown>         2
#> 5 Exercise 2 Solution rmd_chunk    cor-dino  <chunk [r]>        2
As such, it is possible to mix and match between parsermd’s built-in functions and any of your other preferred data manipulation packages.
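As one illustration of mixing the two, the sketch below totals the lines of code per top-level section using only functions shown above plus standard dplyr verbs. The column name sec_h3 is specific to this example document, whose top-level headings are h3; other documents will produce different section columns.

```r
library(parsermd)

rmd <- parse_rmd(system.file("examples/hw01-student.Rmd", package = "parsermd"))

# Count lines per node, keep only code chunks, then total them per section.
# sec_h3 exists because this particular document uses h3 for its sections.
chunk_lines <- as_tibble(rmd) |>
  dplyr::mutate(lines = rmd_node_length(ast)) |>
  dplyr::filter(type == "rmd_chunk") |>
  dplyr::group_by(sec_h3) |>
  dplyr::summarise(code_lines = sum(lines))

print(chunk_lines)
```

A summary like this could be used, for example, to spot exercises where a student submitted little or no code.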
One small note of caution is that when converting back to an ast with as_ast, or to a document with as_document, only the structure of the ast column matters, so changes made to the section columns, type column, or label column will not affect the output in any way. This is particularly important when headings are filtered out, as their section values may still appear in the tibble while the headings themselves are no longer in the ast. rmd_select attempts to avoid this by recalculating these specific columns as part of the subsetting process.
tbl |>
  dplyr::filter(
    sec_h3 == "Exercise 2",
    sec_h4 == "Solution",
    type == "rmd_chunk"
  )
#> # A tibble: 2 × 5
#>   sec_h3     sec_h4   type      label     ast
#>   <chr>      <chr>    <chr>     <chr>     <rmd_ast>
#> 1 Exercise 2 Solution rmd_chunk plot-dino <chunk [r]>
#> 2 Exercise 2 Solution rmd_chunk cor-dino  <chunk [r]>
tbl |>
  dplyr::filter(
    sec_h3 == "Exercise 2",
    sec_h4 == "Solution",
    type == "rmd_chunk"
  ) |>
  as_document() |>
  cat(sep="\n")
#> ```{r plot-dino, fig.height = 3, fig.width = 6}
#> dino_data <- datasaurus_dozen %>%
#>   filter(dataset == "dino")
#>
#> ggplot(data = dino_data, mapping = aes(x = x, y = y)) +
#>   geom_point()
#> ```
#>
#> ```{r cor-dino}
#> dino_data %>%
#>   summarize(r = cor(x, y))
#> ```
tbl |>
  rmd_select(
    by_section(c("Exercise 2", "Solution")) &
      has_type("rmd_chunk")
  ) |>
  as_document() |>
  cat(sep="\n")
#> ```{r plot-dino, fig.height = 3, fig.width = 6}
#> dino_data <- datasaurus_dozen %>%
#>   filter(dataset == "dino")
#>
#> ggplot(data = dino_data, mapping = aes(x = x, y = y)) +
#>   geom_point()
#> ```
#>
#> ```{r cor-dino}
#> dino_data %>%
#>   summarize(r = cor(x, y))
#> ```
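The caution about derived columns can be checked directly. The following sketch overwrites the label column with a placeholder value (the string "edited" is arbitrary) and compares the documents rebuilt from the original and the modified tibble; based on the behaviour described above, the two should match because only the ast column is consulted.

```r
library(parsermd)

rmd <- parse_rmd(system.file("examples/hw01-student.Rmd", package = "parsermd"))
tbl <- as_tibble(rmd)

# Overwrite a derived column; the nodes stored in the ast column are untouched
edited <- dplyr::mutate(tbl, label = "edited")

# Compare the documents rebuilt from the original and the edited tibble
same <- identical(as_document(edited), as_document(tbl))
print(same)
```

To actually rename a chunk, the node inside the ast column would have to be changed rather than the label column.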