June 20, 2019

How to extract tabular data from PDFs with R

So far, I have only extracted data from tables using Tabula. I was delighted to find out that there is a package that lets me automate this process with R.

As a use case, we are going to read in some data about movie admissions in Switzerland. You can find an example PDF here or in the public Github repo, which also contains the final code.

Meet tabulizer

The docs look very straightforward – I was sure I’d be up and running within minutes. Or so I thought…

⚠️ tl;dr: I do not recommend using tabulizer for most use cases. If you still want to see it in action, read along. Otherwise, you might as well skip to the next chapter.

Unfortunately, to run tabulizer we first need to get rJava running.

Install Java

I am using Mac OS, so I’ll guide you through the steps you need to take on this platform:

When I tried to load tabulizer, I got an error mentioning rJava.so (Reason: image not found). I checked whether Java was installed on my computer by typing java -version into my terminal.

If it is not installed, get the Java Development Kit from Oracle here. As of May 2019, I needed version 11, not the latest one, which would have been 12. I installed the latest one first and ran into the error: WARNING: Initial Java 12 release has broken JNI support and does NOT work. Use stable Java 11 (or watch for 12u if available).

Also, check if your JAVA_HOME is set correctly by typing echo $JAVA_HOME in your Terminal. It should return the location of your Java home (in my case that’s /Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home).

If it doesn’t, open your profile file (e.g. ~/.zshrc or ~/.bash_profile depending on the Terminal you use) and append the following line (your version number of the jdk might be different, depending on which version you installed): export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home. After changing the file, run source ~/.zshrc or source ~/.bash_profile to apply the changes, or restart your Terminal.

Check if that worked by running $JAVA_HOME/bin/java -version. The command should return some information about your java version like java version "11.0.3" 2019-04-16 LTS etc.

After installing, run sudo R CMD javareconf in the terminal and restart RStudio. Then try loading tabulizer again (restarting the R session can help as well).

Use it like Tabula

What’s pretty cool: in Tabula we can set the positions of tables in a document by hand – with tabulizer we can do the same thing. After installing the packages shiny and miniUI, we can run the following command:

library(tabulizer)
locate_areas("input/pdf/top10adm-02_08.pdf")

In the RStudio Viewer a small interface will appear where we can drag and resize a rectangle:

[Screenshot: the locate_areas selection interface in the RStudio Viewer]

After submitting, the tool prints the coordinates of the rectangle we’ve drawn:

      top      left    bottom     right 
 98.96016  30.81076 773.23108 554.12550

We now can pass these numbers to extract_tables in the argument area. Important: we also need to set guess to FALSE.

test <- extract_tables(
  "input/pdf/top10adm-02_08.pdf",
  guess = FALSE,
  area = list(c(98, 30, 773, 554))
)

The result is a list – its first entry is the data frame we are looking for, and it looks really promising:

[Screenshot: result of extract_tables for the first file]
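To work with it further, the table can be pulled out of that list; a minimal sketch (object names are just for illustration):

# the first list element holds the extracted table
df <- as.data.frame(test[[1]], stringsAsFactors = FALSE)
head(df)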

This is exactly what we wanted, but when we try a second file ("input/pdf/top10adm-02_12.pdf") with a slightly different look, we suddenly get this:

[Screenshot: result of extract_tables for the second file, with garbled cells]

😧 Here the cell values are a mess! Setting method = "lattice" or method = "stream" manually does not help.
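For reference, forcing one of the extraction methods just means adding the method argument to the call above (a sketch; the area from the first file is reused here):

test2 <- extract_tables(
  "input/pdf/top10adm-02_12.pdf",
  guess = FALSE,
  area = list(c(98, 30, 773, 554)),
  method = "stream" # or "lattice" – neither produced clean cells here
)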

Luckily, tabulizer offers a second function that reads plain text rather than tabular data from a PDF: extract_text.

extract_text("input/pdf/top10adm-02_12.pdf")

Its result looks like this:

[Screenshot: result of extract_text for the second file]

Not bad – with regular expressions we could extract the numbers we need. But when I tried extract_text on the first file, I was quite surprised: the ranks, which were at the very left, suddenly got appended to another row:

[Screenshot: result of extract_text for the first file, with ranks appended to other rows]

😩 I start to lose patience with this package. For some files, it works pretty well, for others not at all. Plus: We installed Java for this? Meh.

pdftools to the rescue

The package pdftools offers two similar functions to extract data from PDFs: pdf_data and pdf_text.
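Both take the path to a PDF; a minimal sketch with the example file from above (object names are just for illustration):

library(pdftools)

# one data frame of words (with their positions on the page) per page
words <- pdf_data("input/pdf/top10adm-02_08.pdf")[[1]]

# one long character string per page
text <- pdf_text("input/pdf/top10adm-02_08.pdf")[1]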

When we run pdf_data, we get this table as a result:

width height x y space text
57 18 100 47 TRUE Weekend
29 16 46 71 TRUE Week
102 12 441 59 FALSE www.filmdistribution.ch
31 11 285 128 FALSE WDSMP
36 11 97 301 TRUE Weekend
31 11 285 367 FALSE WDSMP
20 11 285 487 FALSE Xenix
36 11 97 523 TRUE Weekend
31 11 285 572 FALSE WDSMP
31 11 285 675 FALSE WDSMP
36 11 97 711 TRUE Weekend

Uhm… that’s not exactly what I expected. Sure, we could identify movie titles by x and y position, but that seems ridiculously complicated to do.

So let’s try pdf_text. Its result looks like this:

[Screenshot: result of pdf_text]

Now that looks like tabular data to me! Sure, it’s only text separated by white space, but what if we split every line like this?

[Screenshot: pdf_text result with the split positions marked]

By identifying the positions of the numbers (e.g. at the 55th character in each line), we can now split each line at these fixed positions and extract the values.

library(pdftools)
library(dplyr)
library(stringr)
library(tibble)

# current_path holds the path to one PDF, e.g. "input/pdf/top10adm-02_08.pdf"
result <- pdf_text(current_path) %>%
  # split file content by newline character
  str_split("\n") %>%
  # convert to tibble and assign unique column names
  as_tibble(.name_repair = make.names) %>%
  # extract the data of each column by position
  mutate(
    rank = str_sub(X, 0, 5),
    movie = str_sub(X, 6, 55),
    distributor = str_sub(X, 56, 67),
    screens = str_sub(X, 68, 75),
    admissions = str_sub(X, 76, 92),
    total = str_sub(X, 93)
  ) %>%
  # remove original string
  select(-X) %>%
  # remove white spaces around values
  mutate_all(str_trim)

After removing the top and bottom of the document we finally have what we want:

region rank movie distributor screens admissions total
Deutschschweiz 1 Smurfs, The (3D) Disney 79 27910 28999
Deutschschweiz 2 Cars 2 (3D) Disney 105 27442 72305
Deutschschweiz 3 Harry Potter 7: The Deathly Hallows - Part 2 (3D) WB 77 20807 287859
Deutschschweiz 4 Super 8 Universal 57 18392 20017
Deutschschweiz 5 Bridesmaids Universal 37 13408 63861
Deutschschweiz 6 Green Lantern (3D) WB 36 6131 19096
Deutschschweiz 7 Insidious Elite 19 5726 26100
Deutschschweiz 8 Zookeeper Disney 33 5720 85142
Deutschschweiz 9 Bad Teacher Disney 22 4858 135208
Deutschschweiz 10 Rien A Declarer Pathé 12 4666 93706

(truncated after 10 entries)

👉🏼 Update: After reading in about 800 PDF files, this approach worked in 99% of the cases. In the remaining 1%, the column positions differed a little, which resulted in wrong cutoffs. So instead I used good old regular expressions to extract the information:

library(tidyr)
library(glue)

# current_filecontent holds the text of one PDF (e.g. as returned by pdf_text)
# convert the long blob of text into a one-column dataframe with one row per line
current_lines <- current_filecontent %>%
  str_split("\n") %>%
  first() %>%
  as_tibble()

# prepare pieces of regular expression with name of group and pattern
movie_data_search <- c(
  "region" = "^\\s*(Deutschschweiz|Suisse Romande|Svizzera Italiana)|",
  "rank" = "^\\s*(\\d+)\\s+",
  "title" = "(.*?)\\s+",
  "distributor" = "([\\w]+!?)\\s+",
  "screens" = "(\\d+)\\s+",
  "admissions" = "([\\d']+)\\s+",
  "total" = "([\\d']+)\\s*$"
)

# we match the lines against the search above
result <- tidyr::extract(
  current_lines,
  1, # from first column
  into = names(movie_data_search),
  # collapse values from above and make the regex case insensitive
  regex = paste0("(?i)", glue_collapse(movie_data_search))
)
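Lines that the regex does not match at all end up with NA in every extracted column, and the region header lines only fill the region column. One way to get from there to the tidy table shown earlier (a sketch, assuming the pipeline above):

result <- result %>%
  # carry each region header down to the movie rows below it
  tidyr::fill(region) %>%
  # drop header/footer lines that matched neither pattern
  filter(!is.na(rank))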

Weird files

For a bunch of files neither tabulizer nor pdftools could extract meaningful information:

  • top10adm-02_11.pdf
  • top10adm-03_11.pdf
  • top10adm-05_11.pdf

Why is this? If we try to read the first file with pdf_text("input/pdf/top10adm-02_11.pdf"), we receive something that looks like this:

\u008b\0036)9\003\020\003$6')\n7RS\003\024\023\003\020\003:HHNHQG\003$GPLVVLRQ\003_\0036:,7=(5/$1'                                                     ZZZ\021ILOPGLVWULEXWLRQ\021FK\n:HHN\003\026\023\017\003\025\023\024\024                   7KXUVGD\\\017\003\025\033\021\023\032\021\025\023\024\024\003\020\0036XQGD\\\017\003\026\024\021\023\032\021\025\023\024\024\n 'HXWVFKVFKZHL]\003 *6                                                       6FUHHQV $GPLVVLRQ      WRWDO\003$GP\003WR\020GDWH\003 *6\n     \024 &DUV\003\025\003 \026'                                              'LVQH\\       \024\023\033       \026\025 \030\033\032                       \026\027 \034\024\032\n…

Weird, right? Does anyone recognize this? I have no idea what’s going on. So how do we get any values out of these?

The last resort: tesseract

We are going to try to crack the last couple of files with OCR. We install the tesseract package, load it with library(tesseract) and set up the engine with English, as a lot of the movie titles are in English. We then read our first weird file:

eng <- tesseract("eng")
test <- ocr("input/pdf/top10adm-02_11.pdf", engine = eng)
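A note on the input: tesseract works on images, so if ocr() does not accept the PDF directly, one option (an assumption on my side, not necessarily the exact step used here) is to render the page to a PNG with pdftools first:

library(pdftools)
library(tesseract)

# render the first page of the PDF to a PNG at a decent resolution
png_file <- pdf_convert("input/pdf/top10adm-02_11.pdf", dpi = 300, pages = 1)

eng <- tesseract("eng")
test <- ocr(png_file, engine = eng)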

And tadaa, the result is not tabular, but with regular expressions we might be able to tell the cell contents apart. But wait, I spot some difficulties coming up:

[Screenshot: tesseract OCR result]

Why is that? We are reading in machine-created text – no handwriting or anything like that. The PNG is super clean, black-on-white text, perfectly aligned. But still, if we collect all the lines that do not match our regular expression, we are left with quite a few:

[Screenshot: OCR lines not matched by the regular expression]
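Collecting those non-matching lines can be done with the same regex as above; a rough sketch:

# split the OCR output into lines, same as before
ocr_lines <- test %>%
  str_split("\n") %>%
  first() %>%
  as_tibble()

# keep only the lines that the regex does not match at all
errors <- ocr_lines %>%
  filter(!str_detect(value, paste0("(?i)", glue_collapse(movie_data_search))))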

And these are only the errors that did not make it past the regular expression. When I checked the numbers, I found out that a lot of 5s and 9s were read incorrectly:

[Screenshot: example of a misread number]

Here even the apostrophe got confused with a dot?! Quite a bummer – tesseract states that “the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts” and the font in the document is Arial Narrow – not exactly a very exotic font. 😑

I added a quality check by summing all the admissions and checking whether the sum matches the number in the original PDF (where the total number of admissions was listed).
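In code, such a check could look roughly like this (result is the extracted table for one file; total_from_pdf is a hypothetical variable holding the total printed in that PDF):

checked_sum <- result %>%
  # strip any thousands separators before converting to numbers
  mutate(admissions = as.integer(str_remove_all(admissions, "[^0-9]"))) %>%
  summarise(total = sum(admissions, na.rm = TRUE)) %>%
  pull(total)

# compare against the total listed in the original PDF
if (checked_sum != total_from_pdf) {
  warning("Sum of admissions does not match the total in the PDF, check manually")
}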

If the two numbers did not match, I opened the file and manually looked for the error. Not a lot of fun.

📦 You can find the whole code in this Github repo.

If you have any idea how I could improve the OCR process, hit me up on Twitter.



Written by Angelo Zehr, data journalist at SRF Data and teacher.
