So far, I’ve only extracted data from tables using Tabula. I was delighted to find out that there is a package that lets me automate this process with R.
The docs look very straightforward – I was sure I’d be up and running within minutes. Or so I thought…
⚠️ tl;dr: I do not recommend using `tabulizer` for most use cases. If you still want to see it in action, read along. Otherwise you might as well skip to the next chapter.
Unfortunately, to run `tabulizer` we first need to get Java up and running. I am using macOS, so I’ll guide you through the steps you need to take on this platform:
When I tried to load `tabulizer`, I got the following error: `rJava.so Reason: image not found`. I checked if I have Java installed on my computer by typing `java -version` into my terminal.
If it is not installed, get the Java Development Kit from Oracle here. As of May 2019 I needed version 11, not the latest one, which would have been 12. I installed that one first and ran into this error:
`WARNING: Initial Java 12 release has broken JNI support and does NOT work. Use stable Java 11 (or watch for 12u if available).`
Also, check if your `JAVA_HOME` is set correctly by typing `echo $JAVA_HOME` in your Terminal. It should return the location of your Java Home (in my case that’s `/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home`).
If it doesn’t, open your profile file (e.g. `~/.bash_profile` or `~/.zshrc`, depending on the shell you use) and append the following line (your JDK version number might differ, depending on which version you installed):
`export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home`. After changing the file, run `source` on it (e.g. `source ~/.zshrc`) to apply the changes, or restart your Terminal.
Check if that worked by running `$JAVA_HOME/bin/java -version`. The command should return some information about your Java version, like `java version "11.0.3" 2019-04-16 LTS`.
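You can also check from inside R whether the variable reached your session. `Sys.getenv` returns an empty string for unset variables, so a quick sanity check (a small sketch, not part of the original setup steps) looks like this:

```r
# Sys.getenv() returns "" for unset variables,
# so we can branch on the string length
java_home <- Sys.getenv("JAVA_HOME")
if (nchar(java_home) == 0) {
  message("JAVA_HOME is not set in this R session")
} else {
  message("JAVA_HOME = ", java_home)
}
```

Note that RStudio does not always inherit your shell environment, so this can differ from what `echo $JAVA_HOME` prints in the Terminal.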
After installing, run `sudo R CMD javareconf` in the terminal so R picks up the new Java configuration, then restart RStudio (and possibly R itself) and try loading the package again.
Use it like Tabula
Now, what’s pretty cool: in Tabula we can mark the positions of tables in a document by hand – with `tabulizer` we can do the same thing. After installing the helper packages it needs for this (e.g. `miniUI`), we can run `tabulizer`’s `locate_areas()` function on our PDF:
In the RStudio Viewer a small interface will appear where we can drag and resize a rectangle:
On submit, the tool logs the coordinates of the rectangle we’ve drawn:

```
      top     left   bottom    right
 98.96016 30.81076 773.23108 554.12550
```
We can now pass these numbers to `extract_tables` via the `area` argument. Important: we also need to set `guess = FALSE`, otherwise `tabulizer` ignores our area and tries to detect the table itself:
```r
test <- extract_tables(
  "input/pdf/top10adm-02_08.pdf",
  guess = FALSE,
  area = list(c(98, 30, 773, 554))
)
```
The result is a list – its first entry is the data frame we are looking for, and it looks really promising:
This is exactly what we wanted. But when we try a second file (`"input/pdf/top10adm-02_12.pdf"`) with a slightly different look, we suddenly get this:
😧 Here the cell values are a mess! Setting `method = "lattice"` or `method = "stream"` manually does not help either.
`tabulizer` offers a second function, `extract_text`, which reads not tabular data but plain text from a PDF. Its result looks like this:
Not bad – with regular expressions we could extract the numbers we need. But when I tried `extract_text` on the first file, I was quite surprised: the ranks, which were at the very left, suddenly get appended to another row:
😩 I’m starting to lose patience with this package. For some files it works pretty well, for others not at all. Plus: we installed Java for this? Meh.
pdftools to the rescue
`pdftools` offers two similar functions to extract data from PDFs: `pdf_data` and `pdf_text`.
When we run `pdf_data`, we get this table as a result:
Uhm… that’s not exactly what I expected. Sure, we could identify movie titles by their `y` position, but that seems ridiculously complicated to do.
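To give an idea of what that would involve: `pdf_data` returns one row per word along with its coordinates, so rebuilding lines means grouping words by their `y` position. A minimal sketch with a mocked-up data frame (the real `pdf_data` output has more columns):

```r
# mock of the word-per-row output pdf_data() produces
words <- data.frame(
  x    = c(10, 60, 120, 10, 60),
  y    = c(100, 100, 100, 115, 115),
  text = c("1", "Smurfs,", "The", "2", "Cars"),
  stringsAsFactors = FALSE
)

# rebuild the text lines by pasting together all words
# that share the same y position
lines <- tapply(words$text, words$y, paste, collapse = " ")
lines[["100"]]  # "1 Smurfs, The"
```

In a real document the `y` values of one line can jitter by a pixel or two, which is exactly why this route gets complicated fast.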
So let’s try `pdf_text`. Its result looks like this:
Now that looks like tabular data to me! Sure, it’s only text separated by white space, but what if we split every line at fixed positions?
By identifying the position of the numbers (e.g. at the 55th character in each line) we can split each line at these fixed positions and extract the numbers.
```r
result <- pdf_text(current_path) %>%
  # split file content by newline character
  str_split("\n") %>%
  # convert to tibble and assign unique column names
  as_tibble(.name_repair = make.names) %>%
  # extract the data of each column by position
  mutate(
    rank = str_sub(X, 0, 5),
    movie = str_sub(X, 6, 55),
    distributor = str_sub(X, 56, 67),
    screens = str_sub(X, 68, 75),
    admissions = str_sub(X, 76, 92),
    total = str_sub(X, 93)
  ) %>%
  # remove original string
  select(-X) %>%
  # remove white spaces around values
  mutate_all(str_trim)
```
After removing the top and bottom of the document we finally have what we want:
| region | rank | movie | distributor | screens | admissions | total |
| --- | --- | --- | --- | --- | --- | --- |
| Deutschschweiz | 1 | Smurfs, The (3D) | Disney | 79 | 27910 | 28999 |
| Deutschschweiz | 2 | Cars 2 (3D) | Disney | 105 | 27442 | 72305 |
| Deutschschweiz | 3 | Harry Potter 7: The Deathly Hallows - Part 2 (3D) | WB | 77 | 20807 | 287859 |
| Deutschschweiz | 6 | Green Lantern (3D) | WB | 36 | 6131 | 19096 |
| Deutschschweiz | 10 | Rien A Declarer | Pathé | 12 | 4666 | 93706 |
(truncated after 10 entries)
👉🏼 Update: After reading in about 800 PDF files, this approach worked in 99% of the cases. In the remaining 1% the column positions differed a little, which resulted in wrong cutoffs. So instead I used good old regular expressions to extract the information:
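A hedged sketch of the idea in base R (the pattern and the sample line are mine, not the original expressions): instead of fixed character positions, capture the rank at the start, the three number columns at the end, and the title and distributor in between.

```r
# hedged sketch (my own pattern, not the original): the title is
# separated from the distributor by at least two spaces, the three
# number columns always close the line
line <- "  1   Smurfs, The (3D)        Disney    79    27910    28999"
pattern <- "^\\s*(\\d+)\\s+(.+?)\\s{2,}(\\S+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*$"
m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
m[-1]  # rank, movie, distributor, screens, admissions, total
```

Because the pattern anchors on white-space runs instead of column indices, slightly shifted columns no longer produce wrong cutoffs.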
For the following files, neither `tabulizer` nor `pdftools` could extract meaningful information:
Why is this? If we try to read the first file with `pdf_text("input/pdf/top10adm-30_11.pdf")`, we receive something that looks like this:
```
\u008b\0036)9\003\020\003$6')\n7RS\003\024\023\003\020\003:HHNHQG\003$GPLVVLRQ\003_\0036:,7=(5/$1' ZZZ\021ILOPGLVWULEXWLRQ\021FK\n:HHN\003\026\023\017\003\025\023\024\024 7KXUVGD\\\017\003\025\033\021\023\032\021\025\023\024\024\003\020\0036XQGD\\\017\003\026\024\021\023\032\021\025\023\024\024\n 'HXWVFKVFKZHL]\003 *6 6FUHHQV $GPLVVLRQ WRWDO\003$GP\003WR\020GDWH\003 *6\n \024 &DUV\003\025\003 \026' 'LVQH\\ \024\023\033 \026\025 \030\033\032 \026\027 \034\024\032\n…
```
Weird, right? Does anyone recognize this? I have no idea what’s going on.
The last resort: OCR
We are going to try to crack the last couple of files with OCR. We install the `tesseract` package, load it with `library(tesseract)`, and set up the engine with English, as a lot of the movie titles are English. We then read our first weird file:
```r
eng <- tesseract("eng")
test <- ocr("input/pdf/top10adm-02_11.pdf", engine = eng)
```
And tadaa: the result is not tabular, but looks pretty decent. Using our regular expressions, we should be able to read in the data:
But wait. Upon closer inspection we detect some errors, which surprised me: we are reading in machine-created text – no handwriting or anything like that. The PNG is super clean, black-on-white text, perfectly aligned, but still:
And these are only the errors that did not get through the regular expression.
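One way to catch such slips before they enter the data set is to validate every captured number field and flag anything that isn’t purely digits. A small sketch (the sample values are made up):

```r
# hedged sketch: flag OCR'd number fields containing non-digits,
# e.g. a letter "O" where a zero should be (sample values are made up)
ocr_values <- c("27910", "2791O", "28999")
bad <- grepl("\\D", ocr_values)
ocr_values[bad]  # "2791O"
```

Flagged values can then be checked against the source PDF by hand, which is tedious but at least keeps silent corruption out of the final table.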
Quite a bummer – tesseract states that “the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts”, and the font in the document is Arial Narrow – not exactly an exotic font.