Useful Tools

Extracting text-based PDF screenplays

Occasionally you’ll get annoying PDFs that you can’t easily parse with your own code in a language like C, PHP, or Java. In this case, if it’s a text-based export from something like Final Draft, you can use pdftotext, a free and open source command-line tool that originated in Xpdf and now ships with Poppler.

Install pdftotext

On OS X, it is part of the Homebrew poppler formula:

brew install poppler

On Linux, it ships in the poppler-utils package:

yum install poppler-utils # CentOS
apt-get install poppler-utils # Debian

Convert from the command line (+/- password)

Usage from the command line using a file without password protection:

pdftotext script.pdf script.txt

For a password-protected file, pass the user password with -upw:

pdftotext -upw 'password' script.pdf script.txt
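The two invocations above can be wrapped in a small shell helper. This is only a sketch: the function names pdf_to_text and convert_all are hypothetical, the output name is assumed to be the input name with .pdf swapped for .txt, and -layout (which asks pdftotext to preserve the physical page layout) is an optional choice that tends to suit screenplay indentation.

```shell
#!/bin/sh
# Hypothetical helper: convert one PDF to text, with an optional user password.
# Usage: pdf_to_text input.pdf [password]
pdf_to_text() {
    pdf=$1
    password=${2:-}
    txt="${pdf%.pdf}.txt"   # script.pdf -> script.txt

    if [ -n "$password" ]; then
        pdftotext -layout -upw "$password" "$pdf" "$txt"
    else
        pdftotext -layout "$pdf" "$txt"
    fi
}

# Hypothetical batch wrapper: convert every PDF in the current directory,
# assuming none of them needs a password.
convert_all() {
    for f in ./*.pdf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        pdf_to_text "$f" || echo "failed: $f" >&2
    done
}
```

Call pdf_to_text script.pdf for an unprotected file, or pdf_to_text script.pdf 'password' for a protected one.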

Web Back-Ends

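The same command can be called from a back-end web server process, e.g. from PHP in a Laravel app. Escape every argument so that a password or file name cannot inject shell commands:

// escapeshellarg() stops the password or paths from injecting shell commands.
$cmd = 'pdftotext -layout -upw '.escapeshellarg($password_text).' '.escapeshellarg($pdf_file_path).' '.escapeshellarg($output_txt_path);
exec($cmd, $pdftotext_output, $exit_code);

// A non-zero exit code means the conversion failed.
$contents = $exit_code === 0 ? file_get_contents($output_txt_path) : null;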

For Node.js, there are plenty of pdftotext packages available on NPM:

https://www.npmjs.com/browse/keyword/pdftotext

Extracting image-based PDF screenplays

More often, especially with older scripts, you’ll have a PDF that contains image scans of each page. Script vendors often do this, along with disabling printing and so on, in the belief that it’s a form of “copy protection”.

For this type of file, you’re going to need Optical Character Recognition (OCR).
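If it isn’t obvious which kind of PDF you have, one rough check is to see how much text pdftotext can extract. The sketch below assumes pdftotext is installed; the function names and the 100-character threshold are hypothetical and worth tuning against your own files.

```shell
#!/bin/sh
# Count the non-whitespace characters pdftotext can extract.
# Passing "-" as the output file makes pdftotext write to stdout.
pdf_text_chars() {
    pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical heuristic: fewer than 100 extractable characters
# suggests the pages are scanned images and the file needs OCR.
needs_ocr() {
    chars=$(pdf_text_chars "$1")
    [ "${chars:-0}" -lt 100 ]
}
```

For example, if needs_ocr script.pdf; then echo "image-based"; else echo "text-based"; fi prints which route to take. A scanned PDF that already carries an OCR text layer will count as text-based under this check.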

First, OCR only works effectively on high-resolution images, so you need to convert the PDF to a high-resolution TIFF.

Note: there are plenty of Tesseract packages available for NPM:

https://www.npmjs.com/browse/keyword/tesseract

Export to TIFF

On OS X, open the PDF in Preview and use File > Export to save it as a TIFF file. This can take a long time and produce a file that is dozens of GBs in size.

You can also do this programmatically with ImageMagick. The -density 300 option renders each page at 300 DPI:

convert -density 300 /path/to/script.pdf -depth 8 -strip -background white -alpha off script.tiff

Install Tesseract

Tesseract (https://github.com/tesseract-ocr/tesseract) is an open-source OCR engine that can be installed on OS X and/or Linux.

On OS X:

brew install tesseract

On Linux:

apt-get install tesseract-ocr # Debian

Perform OCR on your TIFF file

OCR isn’t perfect, so the output will need manual correction. But it’s better than typing the whole thing out by hand.

Tesseract takes an output base name and appends .txt itself, so this writes script.txt:

tesseract script.tiff script
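The ImageMagick and Tesseract steps can be chained into one function. This is a minimal sketch, assuming both tools are installed and that convert is the ImageMagick binary (newer ImageMagick releases name it magick); the function name ocr_pdf is hypothetical, and deleting the intermediate TIFF is a design choice rather than a requirement.

```shell
#!/bin/sh
# Hypothetical wrapper: OCR a scanned PDF into <name>.txt.
# Usage: ocr_pdf script.pdf
ocr_pdf() {
    pdf=$1
    base="${pdf%.pdf}"   # script.pdf -> script

    # Render the pages at 300 DPI into a multi-page TIFF,
    # using the same flags as the ImageMagick command shown earlier.
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$base.tiff" || return 1

    # Tesseract appends ".txt" to the output base name itself.
    tesseract "$base.tiff" "$base" || return 1

    # The intermediate TIFF can be very large, so remove it once OCR succeeds.
    rm -f "$base.tiff"
}
```

Running ocr_pdf script.pdf should leave script.txt next to the PDF, ready for manual correction.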