Useful Tools

Extracting text-based PDF screenplays

Occasionally you’ll get PDFs that are awkward to parse directly from a language like C, PHP, or Java. If the file is a text-based export from something like Final Draft, you can use pdftotext, a free and open-source command-line utility that originated in Xpdf and also ships with Poppler.

Install pdftotext

On OS X, it is part of the Homebrew poppler package:

brew install poppler

On Linux, it is in the poppler-utils package:

yum install poppler-utils # CentOS
apt-get install poppler-utils # Debian

Convert from the command line (with or without a password)

For a file without password protection:

pdftotext script.pdf script.txt

For a password-protected file, pass the user password with -upw:

pdftotext -upw 'password' script.pdf script.txt
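
If you run both variants often, the two commands above can be folded into one small shell function. This is only a sketch: extract_text is a hypothetical helper name, and it assumes pdftotext is already on your PATH.

```shell
#!/bin/sh
# extract_text PDF TXT [PASSWORD]
# Hypothetical helper: runs pdftotext and adds -upw only when a password is given.
extract_text() {
    pdf=$1
    txt=$2
    password=${3:-}

    if [ -n "$password" ]; then
        # Password-protected file: pass the user password.
        pdftotext -upw "$password" "$pdf" "$txt"
    else
        # Unprotected file: plain conversion.
        pdftotext "$pdf" "$txt"
    fi
}
```

Call it as extract_text script.pdf script.txt for an unprotected file, or add the password as a third argument. Quoting every variable keeps paths and passwords with spaces intact.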

Web Back-Ends

pdftotext can also be called from a back-end web server process, e.g. from PHP in a Laravel app:

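// Escape each argument so that spaces or shell metacharacters cannot alter the command.
$cmd = 'pdftotext -layout -upw '.escapeshellarg($password_text).' '.escapeshellarg($pdf_file_path).' '.escapeshellarg($output_txt_path);
exec($cmd, $pdftotext_output, $exit_code);

// pdftotext exits with 0 on success; only read the output file in that case.
$contents = ($exit_code === 0) ? file_get_contents($output_txt_path) : false;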

There are also plenty of packages available on NPM.


Extracting image-based PDF screenplays

More often, especially with older scripts, you’ll have a PDF that contains image scans of each page. Script vendors often do this, along with disabling printing and so on, thinking it’s a form of “copy protection”.

For this type of file, you’re going to need Optical Character Recognition (OCR).

First off, OCR only works well on high-resolution images, so you need to convert the PDF to TIFF format.

Note: there are plenty of packages available on NPM for this too.

Export to TIFF

On OS X, open the PDF in Preview and use File > Export to save it as a TIFF file. This can take a long time and produce a file that is dozens of GB in size.

You can also do this programmatically with ImageMagick:

convert -density 300 /path/to/script.pdf -depth 8 -strip -background white -alpha off script.tiff

Install Tesseract

Tesseract (https://github.com/tesseract-ocr/tesseract) is an open-source OCR engine that can be installed on OS X and/or Linux.

On OS X:

brew install tesseract

On Linux:

apt-get install tesseract-ocr
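
Before running the conversion, it can help to confirm that everything is actually installed. A rough sketch, where require_tools is a hypothetical helper name and the check relies only on the standard command -v lookup:

```shell
#!/bin/sh
# require_tools NAME...
# Hypothetical helper: prints each command that is missing from PATH
# and returns non-zero if any of them could not be found.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the workflow described here you would run require_tools pdftotext convert tesseract and install whatever it reports as missing.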

Perform OCR on your TIFF file

OCR isn’t perfect, so the output will need manual correction, but it’s better than typing the whole thing out by hand. Tesseract appends .txt to the output name itself, so pass the name without the extension:

tesseract script.tiff script
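
If you don’t know in advance which kind of PDF you have, one option is to try pdftotext first and only fall back to OCR when nothing useful comes out. The following is a minimal sketch of that idea, not a tested production script: has_text and pdf_to_text are hypothetical names, and "no non-whitespace characters in the output" is an assumed heuristic for "this PDF is image-based".

```shell
#!/bin/sh
# has_text FILE
# Succeeds if FILE exists and contains at least one non-whitespace character.
has_text() {
    [ -f "$1" ] && grep -q '[^[:space:]]' "$1"
}

# pdf_to_text PDF BASENAME
# Hypothetical end-to-end helper: writes BASENAME.txt.
pdf_to_text() {
    pdf=$1
    base=$2

    # Step 1: try direct text extraction.
    pdftotext -layout "$pdf" "$base.txt"
    if has_text "$base.txt"; then
        return 0
    fi

    # Step 2: the output was empty (assumed to mean image scans), so fall back
    # to OCR. Tesseract appends .txt to the output base on its own.
    convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$base.tiff" &&
        tesseract "$base.tiff" "$base"
}
```

Running pdf_to_text script.pdf script would leave the result in script.txt either way. A scanned PDF that also carries a thin text layer (page numbers, watermarks) would pass the check, so treat the heuristic as a starting point.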