The docs look very straightforward – I was sure I’d be up and running within minutes. Or so I thought…
⚠️ tl;dr: I do not recommend using tabulizer for most use cases. If you still want to see it in action, read along. Otherwise, you might as well skip to the next chapter.
Unfortunately, to run tabulizer we first need to get Java and rJava working. I am using macOS, so I’ll guide you through the steps you need to take on this platform:
When I tried to load tabulizer, I got an error about rJava.so (Reason: image not found). I checked if Java was installed on my computer by typing java -version into my terminal.
If it is not installed, get the Java Development Kit from Oracle here. As of May 2019, I needed version 11, not the latest one, which would have been 12. I had installed version 12 first and ran into this error: WARNING: Initial Java 12 release has broken JNI support and does NOT work. Use stable Java 11 (or watch for 12u if available).
Also, check if your JAVA_HOME is set correctly by typing echo $JAVA_HOME in your Terminal. It should return the location of your Java Home (in my case that’s /Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home). If it doesn’t, open your profile file (e.g. ~/.zshrc or ~/.bash_profile, depending on the Terminal you use) and append the following line (your jdk version number might be different, depending on which version you installed): export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.3.jdk/Contents/Home. After changing the file, run source ~/.zshrc or source ~/.bash_profile to apply the changes, or restart your Terminal.
Check if that worked by running $JAVA_HOME/bin/java -version. The command should return some information about your Java version, like java version "11.0.3" 2019-04-16 LTS.
After installing, run sudo R CMD javareconf in the terminal and restart RStudio. Then open the project again, maybe restart the R session, and try loading tabulizer once more.
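A quick check that everything worked – a minimal sketch, assuming tabulizer is installed from GitHub (it was not on CRAN at the time of writing, and the repository names below are the ones from the ropensci docs):

```r
# install tabulizer from GitHub (uncomment if not installed yet)
# install.packages("remotes")
# remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

# if the Java setup worked, these should now load without the rJava.so error
library(rJava)
library(tabulizer)
```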
Use it like Tabula
What’s pretty cool: in Tabula we can set the positions of tables in a document by hand – with tabulizer we can do the same thing. After installing the additional packages this requires (like miniUI), we can run the following command:
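A minimal sketch of that call – presumably tabulizer’s interactive selector locate_areas(), here with the sample file used in the example below:

```r
# opens a small widget in the RStudio Viewer where the table area
# can be selected by dragging and resizing a rectangle
locate_areas("input/pdf/top10adm-02_08.pdf")
```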
In the RStudio Viewer a small interface will appear where we can drag and resize a rectangle:
After submitting, the tool logs the coordinates of the rectangle we’ve drawn:

```
      top      left    bottom     right
 98.96016  30.81076 773.23108 554.12550
```
We can now pass these numbers to extract_tables in the area argument. Important: we also need to set guess = FALSE so that our coordinates are actually used:
```r
test <- extract_tables(
  "input/pdf/top10adm-02_08.pdf",
  guess = FALSE,
  area = list(c(98, 30, 773, 554))
)
```
The result is a list – its first entry is the data frame we are looking for, and it looks really promising:
This is exactly what we wanted, but when we try a second file ("input/pdf/top10adm-02_12.pdf") with a slightly different look, we suddenly get this:
😧 Here the cell values are a mess! Setting method = "lattice" or method = "stream" manually does not help.
tabulizer offers a second command, extract_text, to read not tabular data but plain text from a PDF.
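A minimal sketch of that call, using the second sample file:

```r
# returns the plain text content of the PDF as a character string
text <- extract_text("input/pdf/top10adm-02_12.pdf")
```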
Its result looks like this:
Not bad: with regular expressions we could extract the numbers we need. But when I tried extract_text on the first file, I was quite surprised: the ranks, which were at the very left, suddenly got appended to another row:
😩 I start to lose patience with this package. For some files, it works pretty well, for others not at all. Plus: We installed Java for this? Meh.
pdftools to the rescue
pdftools offers two similar functions to extract data from PDFs: pdf_data and pdf_text.
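A minimal sketch of both calls, using the first sample file from above:

```r
library(pdftools)

# pdf_data() returns, per page, a data frame of words together
# with their x/y positions on the page
data_result <- pdf_data("input/pdf/top10adm-02_08.pdf")

# pdf_text() returns the text of each page as one long string
text_result <- pdf_text("input/pdf/top10adm-02_08.pdf")
```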
When we run pdf_data, we get this table as a result:
Uhm… that’s not exactly what I expected. Sure, we could identify movie titles by y position, but that seems ridiculously complicated to do.
So let’s try pdf_text. Its result looks like this:
Now that looks like tabular data to me! Sure, it’s only text separated by white spaces, but what if we split every line like this?
By identifying the positions of the numbers (e.g. at the 55th character in each line) we can split each line at these fixed positions and extract the numbers.
```r
library(pdftools)
library(tidyverse) # for stringr, dplyr and tibble

result <- pdf_text(current_path) %>%
  # split file content by newline character
  str_split("\n") %>%
  # convert to tibble and assign unique column names
  as_tibble(.name_repair = make.names) %>%
  # extract the data of each column by position
  mutate(
    rank = str_sub(X, 0, 5),
    movie = str_sub(X, 6, 55),
    distributor = str_sub(X, 56, 67),
    screens = str_sub(X, 68, 75),
    admissions = str_sub(X, 76, 92),
    total = str_sub(X, 93)
  ) %>%
  # remove original string
  select(-X) %>%
  # remove white spaces around values
  mutate_all(str_trim)
```
After removing the top and bottom of the document we finally have what we want:
| region | rank | movie | distributor | screens | admissions | total |
|---|---|---|---|---|---|---|
| Deutschschweiz | 1 | Smurfs, The (3D) | Disney | 79 | 27910 | 28999 |
| Deutschschweiz | 2 | Cars 2 (3D) | Disney | 105 | 27442 | 72305 |
| Deutschschweiz | 3 | Harry Potter 7: The Deathly Hallows - Part 2 (3D) | WB | 77 | 20807 | 287859 |
| Deutschschweiz | 6 | Green Lantern (3D) | WB | 36 | 6131 | 19096 |
| Deutschschweiz | 10 | Rien A Declarer | Pathé | 12 | 4666 | 93706 |
(truncated after 10 entries)
👉🏼 Update: After reading in about 800 PDF files, this approach worked in 99% of the cases. In the remaining 1%, the column positions differed a little, which resulted in wrong cutoffs. So instead I used good old regular expressions to extract the information:
```r
library(glue) # for glue_collapse()

# convert long blob of text into a single rowed dataframe
current_lines <- current_filecontent %>%
  str_split("\n") %>%
  first() %>%
  as_tibble()

# prepare pieces of regular expression with name of group and pattern
movie_data_search <- c(
  "region" = "^\\s*(Deutschschweiz|Suisse Romande|Svizzera Italiana)|",
  "rank" = "^\\s*(\\d+)\\s+",
  "title" = "(.*?)\\s+",
  "distributor" = "([\\w]+!?)\\s+",
  "screens" = "(\\d+)\\s+",
  "admissions" = "([\\d']+)\\s+",
  "total" = "([\\d']+)\\s*$"
)

# we match the lines against the search above
result <- tidyr::extract(
  current_lines,
  1, # from first column
  into = names(movie_data_search),
  # collapse values from above and make the regex case insensitive
  regex = paste0("(?i)", glue_collapse(movie_data_search))
)
```
For a bunch of files, neither tabulizer nor pdftools could extract meaningful information. Why is this? If we try to read the first of these files with pdf_text("input/pdf/top10adm-02_11.pdf"), we receive something that looks like this:
```
\u008b\0036)9\003\020\003$6')\n7RS\003\024\023\003\020\003:HHNHQG\003$GPLVVLRQ\003_\0036:,7=(5/$1' ZZZ\021ILOPGLVWULEXWLRQ\021FK\n:HHN\003\026\023\017\003\025\023\024\024 7KXUVGD\\\017\003\025\033\021\023\032\021\025\023\024\024\003\020\0036XQGD\\\017\003\026\024\021\023\032\021\025\023\024\024\n 'HXWVFKVFKZHL]\003 *6 6FUHHQV $GPLVVLRQ WRWDO\003$GP\003WR\020GDWH\003 *6\n \024 &DUV\003\025\003 \026' 'LVQH\\ \024\023\033 \026\025 \030\033\032 \026\027 \034\024\032
…
```
Weird, right? Does anyone recognize this? I have no idea what’s going on. So how do we get any values out of these?
The last resort: OCR
We are going to try to crack the last couple of files with OCR. We install the tesseract package, load it with library(tesseract) and set up the engine with English, as a lot of the movie titles are in English. We then read our first weird file:
```r
eng <- tesseract("eng")
test <- ocr("input/pdf/top10adm-02_11.pdf", engine = eng)
```
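One caveat: as far as I know, tesseract’s ocr() expects an image, so for a PDF the pages usually get rendered first with pdftools::pdf_convert() – presumably what happened here, given the PNG mentioned below. A sketch:

```r
# render the PDF pages to clean, high-resolution PNGs first
pngs <- pdftools::pdf_convert("input/pdf/top10adm-02_11.pdf", dpi = 300)

# then run OCR on the resulting images
test <- ocr(pngs, engine = eng)
```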
And tadaa: the result is not tabular, but with regular expressions we might be able to tell the cell contents apart. But wait, I spot some difficulties coming up:
Why is that? We are reading in machine-created text – no handwriting or anything like that. The PNG is super clean, black-on-white text, perfectly aligned. But still, if we collect all the lines that do not match our regular expression, we are left with quite a few:
And these are only the errors that did not get through the regular expression. When I checked the numbers, I found out that a lot of 9s were read incorrectly:
Here even the apostrophe got confused with a dot?! Quite a bummer – tesseract states that “the existing model data provided has been trained on about 400000 text lines spanning about 4500 fonts” and the font in the document is Arial Narrow – not exactly a very exotic font. 😑
I added a quality check by summing all the admissions and checking whether the sum matches the number in the original PDF (where the total number of admissions was listed).
If the two numbers did not match, I opened the file and manually looked for the error. Not a lot of fun.
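Roughly what that check looked like – a sketch; expected_total (the total listed in the PDF) and the column names are illustrative assumptions:

```r
library(dplyr)
library(stringr)

# sum up the parsed admissions; the numbers in the PDF use
# apostrophes as thousand separators, so strip those first
admission_sum <- result %>%
  mutate(admissions = as.numeric(str_remove_all(admissions, "'"))) %>%
  summarise(sum = sum(admissions, na.rm = TRUE)) %>%
  pull(sum)

# flag files where the sum does not match the total given in the PDF
if (admission_sum != expected_total) {
  warning("admissions do not add up, check file manually: ", current_path)
}
```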
📦 You can find the whole code in this GitHub repo.
If you have any idea how I could improve the OCR process, let me know and hit me up on Twitter.