The problem: thirty years of screenplays, no way to read them as a set
Every studio has it. A network drive, a SharePoint, an S3 bucket, a legacy
vault — somewhere between 3,000 and 30,000 screenplays, accumulated across
decades of production, archived in whatever format the writer happened to
use. Final Draft .fdx files next to .fadein files next to scanned
PDFs next to Word documents that were printed to PDF.
You can find a given title by filename. You cannot, today, answer:
- “Show me every night-exterior scene in our horror catalogue.”
- “Which scripts feature a character named ‘GRACE’ who speaks more than 40 lines?”
- “What percentage of our 2017–2024 drama slate is set in New York?”
- “Return all scripts whose logline involves a heist and whose genre tag includes ‘noir’.”
Not because the information isn’t there — it’s all on the page — but because the information isn’t data yet.
ScreenJSON is how it becomes data. Every screenplay is converted to a structured document with UUIDs, typed elements, scene-level breakdown tags, and optional pre-computed embeddings, then dropped into a database. Once ingestion has run, every question above becomes a query.
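To make "becomes a query" concrete, here is a minimal sketch that answers the night-exterior question over plain Python dicts. The field names (`genre`, `scenes`, `heading`, `context`, `setting`, `time`) and the two sample titles are illustrative assumptions, not the published ScreenJSON schema or real catalogue data.

```python
from typing import Iterable

# Hypothetical, simplified documents: the real ScreenJSON schema may name
# and nest these fields differently.
catalogue = [
    {
        "title": "Hollow Creek",
        "genre": ["horror"],
        "scenes": [
            {"heading": {"context": "EXT", "setting": "FOREST", "time": "NIGHT"}},
            {"heading": {"context": "INT", "setting": "CABIN", "time": "NIGHT"}},
        ],
    },
    {
        "title": "Last Train Home",
        "genre": ["drama"],
        "scenes": [
            {"heading": {"context": "EXT", "setting": "PLATFORM", "time": "NIGHT"}},
        ],
    },
]


def night_exteriors(docs: Iterable[dict], genre: str) -> list[tuple[str, str]]:
    """Return (title, setting) for every EXT/NIGHT scene in the given genre."""
    hits = []
    for doc in docs:
        if genre not in doc.get("genre", []):
            continue  # skip scripts outside the requested genre
        for scene in doc.get("scenes", []):
            heading = scene.get("heading", {})
            if heading.get("context") == "EXT" and heading.get("time") == "NIGHT":
                hits.append((doc["title"], heading.get("setting", "")))
    return hits


print(night_exteriors(catalogue, "horror"))
```

In production the same filter would be pushed down to whichever database holds the documents. What matters is that the scene heading is a typed field rather than a line of text to be parsed.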
The ingestion pipeline
For a single writer’s desk, the CLI is enough. For a studio archive, you want Greenlight — a job queue microservice with S3 ingestion that wraps the same CLI.
The pipeline is opinionated and straightforward:
S3 ingest → Redis job → Worker claims → screenjson convert → Upload to S3
Point Greenlight at a bucket. Drop files in. Workers pick them up, convert them to ScreenJSON, validate against the schema, optionally encrypt the text runs for sensitive material, and write the result to an output bucket. From there, a second pipeline can index into Elasticsearch, MongoDB, DynamoDB, Cassandra, or Redis — all of which are first-class destinations in the CLI.
A realistic task pipeline:
{
  "tasks": [
    "convert-fdx-to-screenjson",
    "validate-screenjson",
    "encrypt-screenjson"
  ]
}
Three tasks, one submission, one final deliverable.
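For intuition, a task list like this can be thought of as a chain of handlers applied in order. The sketch below is not Greenlight's implementation: the handler functions, the job dict, and the `run_pipeline` helper are all hypothetical stand-ins. Only the three task names come from the submission above.

```python
import json
from typing import Callable


# Hypothetical handlers: each takes a job dict and returns an updated copy.
def convert(job: dict) -> dict:
    return {**job, "format": "screenjson"}


def validate(job: dict) -> dict:
    if job.get("format") != "screenjson":
        raise ValueError("nothing to validate: conversion has not run")
    return {**job, "valid": True}


def encrypt(job: dict) -> dict:
    return {**job, "encrypted": True}


HANDLERS: dict[str, Callable[[dict], dict]] = {
    "convert-fdx-to-screenjson": convert,
    "validate-screenjson": validate,
    "encrypt-screenjson": encrypt,
}


def run_pipeline(submission: str, job: dict) -> dict:
    """Apply each named task in order; an exception aborts the chain."""
    for name in json.loads(submission)["tasks"]:
        job = HANDLERS[name](job)
    return job


submission = (
    '{"tasks": ["convert-fdx-to-screenjson", '
    '"validate-screenjson", "encrypt-screenjson"]}'
)
result = run_pipeline(submission, {"source": "script.fdx"})
print(result)
```

Ordering matters: validation sits before encryption so that the validator sees plain text runs.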
Choosing a storage layer
Different studios answer this differently, so ScreenJSON is backend-agnostic.
Elasticsearch is the obvious choice if your primary workload is search-heavy: faceted browsing of your catalogue, full-text search across dialogue, boolean queries combining scene metadata and free text. The native CLI integration writes directly into a named index.
MongoDB is a strong default for a document-oriented system of record. The whole screenplay is one document, nested structure is preserved, and every node’s UUID becomes a natural secondary key.
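One way to picture UUIDs as secondary keys is a flat lookup from each node's id to its position in the document. This sketch assumes every node carries an `id` field and uses short strings in place of real UUIDs. The nesting (`scenes`, `body`) is a guess at the shape, not the actual schema.

```python
def index_by_uuid(node, path="$", index=None):
    """Walk a nested document and map each node's id to its path."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        if "id" in node:
            index[node["id"]] = path
        for key, value in node.items():
            index_by_uuid(value, f"{path}.{key}", index)
    elif isinstance(node, list):
        for position, item in enumerate(node):
            index_by_uuid(item, f"{path}[{position}]", index)
    return index


# Hypothetical document; ids are shortened stand-ins for UUIDs.
document = {
    "id": "doc-1",
    "scenes": [
        {"id": "scene-1", "body": [{"id": "el-1", "type": "action"}]},
    ],
}

print(index_by_uuid(document))
```

A database index on the id field does the same job at scale: any scene or element can be fetched directly, without knowing where it sits in the tree.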
DynamoDB wins when your access pattern is “by title, by id, in
bulk, globally distributed”. Partition by id, sort by version or
revision, and let the schema’s UUID-first design do the rest.
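A small sketch of that key design, under assumptions: the attribute names `pk` and `sk` and the `rev#` prefix are arbitrary choices made here, not anything ScreenJSON or DynamoDB prescribes. The one detail worth showing is zero-padding. String sort keys are compared character by character, so an unpadded revision 10 would sort before revision 2.

```python
def dynamo_key(doc_id: str, revision: int) -> dict:
    """Build a partition/sort key pair for one revision of a document."""
    # Zero-pad so lexicographic order matches numeric order.
    return {"pk": doc_id, "sk": f"rev#{revision:06d}"}


keys = [dynamo_key("doc-1", revision) for revision in (10, 2, 1)]
ordered = sorted(key["sk"] for key in keys)
print(ordered)
```

With keys shaped this way, the latest revision of a script is the last item in its partition.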
Cassandra and Redis cover the edge cases — time-series access to revision history, or a hot cache of the most recently viewed scripts.
Whichever you pick, the on-disk document is the same: a ScreenJSON file.
Making the catalogue actually searchable
A screenplay in JSON isn’t automatically a search index. Two ScreenJSON features earn their keep here.
Taggables, genres, themes. The document root has three slug-keyed
arrays — taggable, genre, themes —
that exist precisely so you can faceted-browse the catalogue without
reparsing free text. Populate them during the ingestion pipeline (you can
derive a sensible first pass from a quick LLM call) and promote them to
Elasticsearch keyword fields.
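As a sketch of that promotion, the helpers below build an Elasticsearch index mapping that declares the three arrays as `keyword` fields, plus a filter query over them. The helper names are made up for illustration. The bodies are plain dicts that you would pass to whichever Elasticsearch client you use.

```python
import json

FACET_FIELDS = ("taggable", "genre", "themes")


def facet_mapping(fields=FACET_FIELDS) -> dict:
    """Index mapping that stores each facet array as exact-match keywords."""
    return {
        "mappings": {
            "properties": {name: {"type": "keyword"} for name in fields}
        }
    }


def facet_filter(**facets: str) -> dict:
    """Bool query with one term filter per facet; all must match."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}} for field, value in facets.items()
                ]
            }
        }
    }


print(json.dumps(facet_mapping(), indent=2))
print(json.dumps(facet_filter(genre="noir", themes="heist"), indent=2))
```

Because the values are slugs, exact-match `term` filters are enough: no analyzers and no free-text parsing.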
Optional embeddings and passages. The analysis block can
hold pre-computed embeddings, keyed by the UUID of the scene, element, or
character they describe. If semantic search on dialogue is a primary
workflow — “find me the scene where someone makes a speech about
forgiveness” — run the embedding generation as a pipeline step, once per
ingest, and you never have to recompute them.
Passages (retrieval-sized chunks) and summaries (scene- or document-level)
complete the picture. Everything in analysis is additive and
discardable: you can throw it away and regenerate it without touching the
canonical document.
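A rough sketch of how UUID-keyed embeddings support semantic lookup: rank stored vectors by cosine similarity to a query vector and return the ids. The `embeddings` key, the three-dimensional toy vectors, and the short ids are assumptions for illustration. Real embeddings have hundreds of dimensions and would usually sit in a vector index rather than a Python dict.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], embeddings: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [node_id for node_id, _ in ranked[:k]]


# Hypothetical analysis block: toy vectors keyed by stand-in scene ids.
analysis = {
    "embeddings": {
        "scene-1": [0.9, 0.1, 0.0],
        "scene-2": [0.0, 1.0, 0.0],
        "scene-3": [0.7, 0.7, 0.0],
    }
}

print(nearest([1.0, 0.0, 0.0], analysis["embeddings"], k=2))
```

The returned ids point straight back into the canonical document, so deleting the whole analysis block loses nothing that cannot be regenerated.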
On rights, licensing, and encryption
A studio catalogue has different tiers of sensitivity. Scripts from shelved projects, unreleased sequels, draft-stage commissions, and legally contested titles can’t sit unencrypted in the same bucket as reference materials. ScreenJSON encrypts text, not structure.
That means:
- The scene count, runtime estimate, cast size, and revision history stay visible to your indexer, your scheduler, your reporting tools.
- The dialogue and action stay opaque unless someone with the key opens them.
- You can run an encrypted archive through search-adjacent tooling — classifying by scene count, by location, by genre tag — without ever letting that tooling see a word of the script.
Keys live in your KMS. The CLI reads them from an environment variable.
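To illustrate "text, not structure", the sketch below walks a document and passes only the text runs through a sealing function, leaving ids, types, and nesting untouched. It is not the CLI's implementation. The `text` key is an assumed field name, and `placeholder_seal` is deliberately not encryption: it stands in for a real authenticated cipher from a vetted library so the example runs with no dependencies.

```python
from typing import Any, Callable

TEXT_KEYS = {"text"}  # assumption: text runs live under a "text" key


def seal_text(node: Any, seal: Callable[[str], str]) -> Any:
    """Return a copy of node with text runs sealed and structure unchanged."""
    if isinstance(node, dict):
        return {
            key: seal(value) if key in TEXT_KEYS and isinstance(value, str)
            else seal_text(value, seal)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [seal_text(item, seal) for item in node]
    return node


def placeholder_seal(text: str) -> str:
    # NOT encryption. Swap in a real cipher keyed from your KMS.
    return "<sealed>"


# Hypothetical document shape with stand-in ids.
document = {
    "id": "doc-1",
    "scenes": [
        {"id": "scene-1", "body": [
            {"id": "el-1", "type": "action", "text": "Rain hammers the window."},
            {"id": "el-2", "type": "dialogue", "text": "We should go."},
        ]},
    ],
}

sealed = seal_text(document, placeholder_seal)

# Structural questions still work on the sealed copy.
print(len(sealed["scenes"]), [el["type"] for el in sealed["scenes"][0]["body"]])
```

An indexer handed `sealed` can count scenes and element types, but never sees a line of dialogue.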
On versioning the schema itself
The schema is versioned. A document carries the ScreenJSON version it was authored against. A studio ingestion pipeline should pin to a known good version and upgrade deliberately. The reference validator will tell you, on every document, whether it conforms.
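Pinning can be as simple as a gate in front of the indexer. In this sketch the `version` field name and the version strings are placeholders, not real ScreenJSON releases. Documents authored against anything other than the pinned set are held back for review rather than silently indexed.

```python
PINNED_VERSIONS = {"1.0.0"}  # hypothetical: the versions your pipeline has tested


def partition_by_version(docs, pinned=PINNED_VERSIONS):
    """Split documents into (accepted, held) by their declared schema version."""
    accepted, held = [], []
    for doc in docs:
        target = accepted if doc.get("version") in pinned else held
        target.append(doc)
    return accepted, held


docs = [
    {"id": "doc-1", "version": "1.0.0"},
    {"id": "doc-2", "version": "2.0.0"},
    {"id": "doc-3"},  # no declared version: hold it
]

accepted, held = partition_by_version(docs)
print([d["id"] for d in accepted], [d["id"] for d in held])
```

Upgrading deliberately then means adding a version to the pinned set once the validator and your index mappings have been checked against it.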
Governance and the open layer
The schema is open, published at
https://screenjson.com/draft/<version>/schema, and
changes go through a public draft process. The tooling downstream of the
schema (the CLI, the batch service, the viewer) is available under a
commercial license for studio-scale deployments, while the reference tool is
MIT-licensed.
Put plainly: you never have to trust that ScreenJSON Inc. will still be here in five years. The schema is public, so your files will still open.
A suggested rollout
1. Pilot: point Greenlight at 500 scripts from one production year. Measure ingestion throughput, validation failures, and search relevance.
2. Hardening: encode your studio’s taggable taxonomy, your internal genre slugs, your contributor roles. Work with your legal and licensing teams on encryption policy.
3. Index: stand up Elasticsearch (or your storage of choice) with a schema mapping that matches the ScreenJSON fields you care about most: heading, characters, tags, genre, themes, logline.
4. Backfill: run the rest of the catalogue through Greenlight in batches, over nights or weekends.
5. Build on top: internal search, producer breakdown tools, dialogue discovery, rights compliance tooling. The hard part was step 4; the interesting part is everything after.