So far, I have only extracted data from tables using Tabula. I was delighted to find out that there is a package that lets me automate this process with R.
As a use case, we are going to read in some data about movie admissions in Switzerland. You can find an example PDF here or in the public GitHub repo, which also contains the final code.
Meet tabulizer
The docs look very straightforward – I was sure I’d be up and running within minutes. Or so I thought…
⚠️ tl;dr: I do not recommend using tabulizer for most use cases. If you still want to see it in action, read along. Otherwise, you might as well skip to the next chapter.
Unfortunately, to run tabulizer we first need to get rJava running.
Install Java
I am using Mac OS, so I’ll guide you through the steps you need to take on this platform:
When I tried to load tabulizer, I got the following error: rJava.so Reason: image not found. I checked whether Java was installed on my computer by typing java -version into my terminal.
If it is not installed, get the Java Development Kit from Oracle here. As of May 2019, I needed version 11, not the latest release, which would have been 12. I installed 12 first and ran into the error WARNING: Initial Java 12 release has broken JNI support and does NOT work. Use stable Java 11 (or watch for 12u if available).
Also, check whether your JAVA_HOME is set correctly by typing echo $JAVA_HOME in your Terminal. It should return the location of your Java home (in my case that’s /Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home).
If it doesn’t, open your profile file (e.g. ~/.zshrc or ~/.bash_profile, depending on the shell you use) and append the following line (your jdk version number might differ, depending on which version you installed):
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home. After changing the file, run source ~/.zshrc or source ~/.bash_profile to apply the changes, or restart your Terminal.
Check whether that worked by running $JAVA_HOME/bin/java -version. The command should return some information about your Java version, like java version "11.0.3" 2019-04-16 LTS.
After installing, run sudo R CMD javareconf in the terminal and restart RStudio. If loading tabulizer still fails, restart the R session and try again.
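Taken together, the macOS setup amounts to a few terminal commands – a sketch assuming JDK 11.0.3; the jdk path will differ for other versions:

```shell
# append to ~/.zshrc or ~/.bash_profile (path assumes JDK 11.0.3)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home

# verify that the JDK is found
"$JAVA_HOME/bin/java" -version

# tell R where Java lives, then restart RStudio
sudo R CMD javareconf
```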
Use it like Tabula
What’s pretty cool: in Tabula we can set the positions of tables in a document by hand – and with tabulizer we can do the same thing. After installing the packages shiny and miniUI, we can run the following command:
```r
locate_areas("input/pdf/top10adm-02_08.pdf")
```

In the RStudio Viewer a small interface will appear where we can drag and resize a rectangle:
After submitting, the tool logs out the coordinates of the rectangle we’ve drawn:
```
      top     left   bottom    right
 98.96016 30.81076 773.23108 554.12550
```

We can now pass these numbers to extract_tables in the area argument. Important: we also need to set guess to FALSE.
```r
test <- extract_tables(
  "input/pdf/top10adm-02_08.pdf",
  guess = FALSE,
  area = list(c(98, 30, 773, 554))
)
```

The result is a list – its first entry is the data frame we are looking for, and it looks really promising:
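Since extract_tables() returns character matrices rather than data frames, a little post-processing is usually needed. A minimal sketch with made-up values, assuming the first row of the matrix holds the header:

```r
# stand-in for the first matrix returned by extract_tables() (made-up values)
m <- matrix(c("Rank", "Movie", "1", "Cars 2 (3D)"), nrow = 2, byrow = TRUE)

# promote the first row to column names and drop it from the data
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]
```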
This is exactly what we wanted, but when we try a second file ("input/pdf/top10adm-02_12.pdf") with a slightly different look, we suddenly get this:
😧 Here the cell values are a mess! Setting method = "lattice" or method = "stream" manually does not help either.
Luckily, tabulizer offers a second command that reads not tabular data but plain text from a PDF: extract_text.
```r
extract_text("input/pdf/top10adm-02_12.pdf")
```

Its result looks like this:
Not bad – with regular expressions we could extract the numbers we need. But when I tried extract_text on the first file, I was quite surprised: the ranks, which sit at the very left, suddenly get appended to another row:
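The regular-expression route hinted at above could look like this – a sketch on a made-up line, assuming the three numbers (screens, admissions and total, with Swiss ' thousands separators) sit at the end of each row:

```r
library(stringr)

# a made-up line in the shape extract_text() returns
line <- "1 Cars 2 (3D) Disney 108 32'587 34'917"

# capture the three trailing number columns
nums <- str_match(line, "(\\d+)\\s+([\\d']+)\\s+([\\d']+)\\s*$")
nums[2:4]  # screens, admissions, total
```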
😩 I’m starting to lose patience with this package. For some files it works pretty well, for others not at all. Plus: we installed Java for this? Meh.
pdftools to the rescue
The package pdftools offers two similar functions to extract data from PDFs: pdf_data and pdf_text.
When we run pdf_data, we get this table as a result:
| width | height | x | y | space | text |
|---|---|---|---|---|---|
| 57 | 18 | 100 | 47 | TRUE | Weekend |
| 29 | 16 | 46 | 71 | TRUE | Week |
| 102 | 12 | 441 | 59 | FALSE | www.filmdistribution.ch |
| 31 | 11 | 285 | 128 | FALSE | WDSMP |
| 36 | 11 | 97 | 301 | TRUE | Weekend |
| 31 | 11 | 285 | 367 | FALSE | WDSMP |
| 20 | 11 | 285 | 487 | FALSE | Xenix |
| 36 | 11 | 97 | 523 | TRUE | Weekend |
| 31 | 11 | 285 | 572 | FALSE | WDSMP |
| 31 | 11 | 285 | 675 | FALSE | WDSMP |
| 36 | 11 | 97 | 711 | TRUE | Weekend |
Uhm… that’s not exactly what I expected. Sure, we could identify movie titles by their x and y positions, but that seems ridiculously complicated.
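In principle, the x/y information could be used to rebuild lines, by grouping words that share a y coordinate. A sketch on a tiny, made-up stand-in for pdf_data() output:

```r
library(dplyr)

# made-up stand-in for a few words of one page of pdf_data() output
words <- tibble::tibble(
  x    = c(100, 160, 46),
  y    = c(47, 47, 71),
  text = c("Weekend", "Admission", "Week")
)

# words sharing a y coordinate belong to the same line;
# sort left to right within each line and paste them together
lines <- words %>%
  arrange(y, x) %>%
  group_by(y) %>%
  summarise(line = paste(text, collapse = " "), .groups = "drop")
```

In practice the y values can jitter by a pixel or two, so some rounding or binning would be needed on top – which is exactly the kind of complication that made me look elsewhere.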
So let’s try pdf_text. Its result looks like this:
Now that looks like tabular data to me! Sure, it’s only text separated by white space – but what if we split every line like this?
By identifying the position of the numbers (e.g. at the 55th character in each line), we can split each line at these fixed positions and extract the numbers.
```r
result <- pdf_text(current_path) %>%
  # split file content by newline character
  str_split("\n") %>%
  # convert to tibble and assign unique column names
  as_tibble(.name_repair = make.names) %>%
  # extract the data of each column by position
  mutate(
    rank = str_sub(X, 0, 5),
    movie = str_sub(X, 6, 55),
    distributor = str_sub(X, 56, 67),
    screens = str_sub(X, 68, 75),
    admissions = str_sub(X, 76, 92),
    total = str_sub(X, 93)
  ) %>%
  # remove original string
  select(-X) %>%
  # remove white spaces around values
  mutate_all(str_trim)
```

After removing the top and bottom of the document, we finally have what we want:
| region | rank | movie | distributor | screens | admissions | total |
|---|---|---|---|---|---|---|
| Deutschschweiz | 1 | Smurfs, The (3D) | Disney | 79 | 27910 | 28999 |
| Deutschschweiz | 2 | Cars 2 (3D) | Disney | 105 | 27442 | 72305 |
| Deutschschweiz | 3 | Harry Potter 7: The Deathly Hallows - Part 2 (3D) | WB | 77 | 20807 | 287859 |
| Deutschschweiz | 4 | Super 8 | Universal | 57 | 18392 | 20017 |
| Deutschschweiz | 5 | Bridesmaids | Universal | 37 | 13408 | 63861 |
| Deutschschweiz | 6 | Green Lantern (3D) | WB | 36 | 6131 | 19096 |
| Deutschschweiz | 7 | Insidious | Elite | 19 | 5726 | 26100 |
| Deutschschweiz | 8 | Zookeeper | Disney | 33 | 5720 | 85142 |
| Deutschschweiz | 9 | Bad Teacher | Disney | 22 | 4858 | 135208 |
| Deutschschweiz | 10 | Rien A Declarer | Pathé | 12 | 4666 | 93706 |
(truncated after 10 entries)
👉🏼 Update: after reading in about 800 PDF files, this approach worked in 99% of the cases. In the remaining 1%, the column positions differed a little, which resulted in wrong cutoffs. So instead I used good old regular expressions to extract the information:
```r
# convert the long blob of text into a single-column data frame of lines
current_lines <- current_filecontent %>%
  str_split("\n") %>%
  first() %>%
  as_tibble()

# prepare the pieces of the regular expression: name of group and pattern
movie_data_search <- c(
  "region" = "^\\s*(Deutschschweiz|Suisse Romande|Svizzera Italiana)|",
  "rank" = "^\\s*(\\d+)\\s+",
  "title" = "(.*?)\\s+",
  "distributor" = "([\\w]+!?)\\s+",
  "screens" = "(\\d+)\\s+",
  "admissions" = "([\\d']+)\\s+",
  "total" = "([\\d']+)\\s*$"
)

# match the lines against the search above
result <- tidyr::extract(
  current_lines,
  1, # from first column
  into = names(movie_data_search),
  # collapse the patterns from above and make the regex case insensitive
  regex = paste0("(?i)", glue_collapse(movie_data_search))
)
```

Weird files
For a bunch of files neither tabulizer nor pdftools could extract meaningful information:
- top10adm-02_11.pdf
- top10adm-03_11.pdf
- top10adm-05_11.pdf
- …
Why is this? If we try to read the first file with pdf_text("input/pdf/top10adm-02_11.pdf"), we receive something that looks like this:
```
\u008b\0036)9\003\020\003$6')
7RS\003\024\023\003\020\003:HHNHQG\003$GPLVVLRQ\003_\0036:,7=(5/$1' ZZZ\021ILOPGLVWULEXWLRQ\021FK
:HHN\003\026\023\017\003\025\023\024\024 7KXUVGD\\\017\003\025\033\021\023\032\021\025\023\024\024\003\020\0036XQGD\\\017\003\026\024\021\023\032\021\025\023\024\024
 'HXWVFKVFKZHL]\003 *6 6FUHHQV $GPLVVLRQ WRWDO\003$GP\003WR\020GDWH\003 *6
 \024 &DUV\003\025\003 \026' 'LVQH\\ \024\023\033 \026\025 \030\033\032 \026\027 \034\024\032
…
```

Weird, right? Does anyone recognize this? I have no idea what’s going on. So how do we get any values out of these?
The last resort: tesseract
We are going to try to crack the last couple of files with OCR. We install the tesseract package and set up the engine with English, as a lot of the movie titles are in English. We then read our first weird file:
```r
eng <- tesseract("eng")
test <- ocr("input/pdf/top10adm-02_11.pdf", engine = eng)
```

And tadaa, the result is not tabular, but with regular expressions we might be able to tell the cell contents apart. But wait, I spot some difficulties coming up:
Why is that? We are reading in machine-created text – no handwriting or anything like that. The png is super clean – black-on-white text, perfectly aligned – but still, if we collect all the lines that do not match our regular expression, we are left with quite a few:
And these are only the errors that failed the regular expression. When I checked the numbers, I found that a lot of 5s and 9s were read incorrectly:
Here, even the apostrophe got confused with a dot?! Quite a bummer – tesseract states that “the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts”, and the font in the document is Arial Narrow – not exactly an exotic font. 😑
I added a quality check by summing all the admissions and checking whether the sum matches the number in the original PDF (where the total number of admissions was listed).
If the two numbers did not match, I opened the file and manually looked for the error. Not a lot of fun.
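The check itself is simple – a sketch with made-up values, where total_admissions_from_pdf stands in for the figure printed in the original PDF:

```r
# made-up extracted admissions column (Swiss ' thousands separator)
admissions <- c("27'910", "27'442", "20'807")

# made-up figure as printed in the original PDF
total_admissions_from_pdf <- 76159

# strip the separators, sum up, compare
checksum <- sum(as.numeric(gsub("'", "", admissions)))
checksum == total_admissions_from_pdf  # if FALSE, flag the file for manual review
```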
📦 You can find the whole code in this Github repo.
If you have any idea how I could improve the OCR process, let me know – hit me up on Twitter.