August 18, 2019

Map over each row of a dataframe in R with purrr

I often find myself wanting to do something a bit more complicated with each entry in a dataset in R. All my data lives in data frames or tibbles that I hand over to the next step in my data handling with magrittr's %>%. When I finally found an elegant way to add more complex functions to this data flow, I was intrigued:

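library(purrr)   # pmap_dfr
library(dplyr)   # mutate and %>%
library(tibble)  # tibble

df_with_time <- df %>%
  pmap_dfr(function(...) {
    # collect all columns of the current row in a one-row tibble
    current <- tibble(...)
    # do cool stuff and access content from current row with
    print(current$column_name)
    # return
    current %>%
      mutate(date = Sys.Date())
  })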

⚠️ The three dots ... are not just a random placeholder but very intentionally placed!

They are usually placed after the first one or two arguments in a function; this blog post explains how. Here they collect every column of the current row as a named argument, which frees us from typing every column name as a function parameter.

On the first line we “catch” all arguments passed to our function and save them in a one-row tibble. You can choose any name for this, but I usually stick to “current” as it reminds me that the contents inside are from the current iteration over the data frame – our current row.

With the $ operator we can now access the contents of a specific column. When we’re done we can return a modified version of the current row, or anything else that can be bound into a data frame. The results are simply row-bound together.
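To make the pattern above concrete, here is a minimal, self-contained sketch. The toy data frame and its column names (name, value) are made up for illustration; the point is that ordinary if/else logic can be used on the current row before returning it.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy data, only for illustration
df <- tibble(
  name = c("a", "b", "c"),
  value = c(10, -5, 0)
)

df_with_kind <- df %>%
  pmap_dfr(function(...) {
    current <- tibble(...)
    # plain if/else works because current holds exactly one row
    label <- if (current$value > 0) {
      "positive"
    } else if (current$value < 0) {
      "negative"
    } else {
      "zero"
    }
    # return the row with one extra column
    current %>%
      mutate(kind = label)
  })

print(df_with_kind)
```

Recent purrr releases (1.0.0 and later) mark pmap_dfr as superseded in favour of pmap() followed by list_rbind(); the call above still works there.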

If you don’t want to return anything, use pwalk instead of pmap_dfr. As described in the docs, this is a signal to other programmers that the function is executed for its side effects:

df %>%
  pwalk(function(...) {
    current <- tibble(...)
    # do cool stuff and access content from current row with
    print(current$column_name)
    # no return
  })
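A small runnable sketch of the pwalk variant, again with invented data (the id and status columns are assumptions). It also shows that pwalk hands back its input, so the data frame can keep flowing through a pipe.

```r
library(purrr)
library(tibble)

# Hypothetical toy data, only for illustration
df <- tibble(id = 1:3, status = c("ok", "failed", "ok"))

result <- df %>%
  pwalk(function(...) {
    current <- tibble(...)
    # side effect only: report rows that need a second look
    if (current$status == "failed") {
      message("Row ", current$id, " needs attention")
    }
  })

# pwalk() returns its input invisibly, so result is the original data frame
print(identical(result, df))
```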

purrr <3 lists

The functions map and walk (as well as reduce, by the way) from the purrr package were designed to work with lists and vectors.

When you only need to iterate over one column of a data frame, it’s even easier with these functions:

df$column_name %>%
  walk(function(current_value) {
    # do great stuff
  })
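If you want the results back instead of just side effects, the typed map variants work the same way. The sketch below uses an invented word column; map_int and reduce are regular purrr functions.

```r
library(purrr)
library(tibble)

# Hypothetical toy data, only for illustration
df <- tibble(word = c("map", "walk", "reduce"))

# map_int() returns an integer vector with one value per element
word_lengths <- df$word %>%
  map_int(function(current_value) {
    nchar(current_value)
  })

# reduce() collapses a vector into a single value, here by summing
total_length <- reduce(word_lengths, `+`)

print(word_lengths)
print(total_length)
```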

I would go so far as to say that these functions can replace every loop you’ve ever written, with the added benefit of functions and scope (they don’t pollute your global environment).

So if you want to execute some code for every year between 1999 and 2017, instead of:

for(year in 1999:2017) {
  print(year)
}

…it’s now:

1999:2017 %>%
  walk(function(year) {
    print(year)
  })

🙌🏼 Bonus: If you’re not sure that every iteration will execute successfully (e.g. when you’re scraping a website and have to take errors into account), you can wrap your functions with possibly or safely and provide a fallback value for failing iterations. Read more about it in the docs.
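A minimal sketch of both wrappers. The function risky_log is a made-up stand-in for something like a scraping call that sometimes fails.

```r
library(purrr)
library(magrittr)

# Hypothetical function that fails for some inputs
risky_log <- function(x) {
  if (x < 0) stop("negative input")
  log(x)
}

# possibly() returns the fallback value whenever the wrapped function errors
possible_log <- possibly(risky_log, otherwise = NA_real_)
results <- c(1, -1, 10) %>%
  map_dbl(possible_log)

# safely() never throws; every call yields a list with result and error
safe_log <- safely(risky_log)
details <- c(1, -1) %>%
  map(safe_log)

print(results)
print(details[[2]]$error)
```

possibly is the simpler choice when a fallback value is enough; safely keeps the error object around in case you want to inspect what went wrong afterwards.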

One downside

Regarding performance: there are faster ways to apply functions to datasets. Iterating over 20,000 rows of a data frame took 7 to 9 seconds on my MacBook Pro.
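One of those faster ways, when the per-row logic is simple enough, is a plain vectorised mutate. The sketch below uses invented width and height columns and only shows that both approaches give the same column; it makes no claim about timings.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical toy data, only for illustration
df <- tibble(width = c(2, 4, 1), height = c(3, 5, 2))

# row by row: one small tibble is built and bound per row
row_wise <- df %>%
  pmap_dfr(function(...) {
    current <- tibble(...)
    current %>%
      mutate(area = width * height)
  })

# vectorised: the whole column is computed in one step
vectorised <- df %>%
  mutate(area = width * height)

print(identical(row_wise$area, vectorised$area))
```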

But when coding interactively / iteratively, the execution time of a few lines of code is much less important than in other areas of software development. Sure, the code is reproducible and I will run it a dozen times until I’ve finished the project, but that’s not a whole lot when you compare it to other software that thousands of clients download and execute. That’s why I value developer experience much higher here.

And this is the most convenient way I found for this kind of stuff. Using apply, I always had difficulties regarding data types. But I am happy to hear about your favorite way to iterate over a data frame – write me on Twitter!



Written by Angelo Zehr, data journalist at SRF Data and teacher.

