Film Corpus 2.0

If you use this data in your research, please refer to the citation information for Film Corpus 1.0.

Overview: This corpus is an updated version of Film Corpus 1.0. It contains the complete scripts of 1068 films as txt files, scraped in November 2015 using Scrapy. It also contains 960 film scripts in which the dialog has been separated from the scene descriptions.

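The Data: Film scripts are classified by genre, and one film can belong to multiple genres, so the genre counts sum to more than the number of films. There are fewer separated scripts (960) than complete scripts (1068) because the dialog and scene descriptions are separated automatically with our own script, and not every film script could be separated. The distribution of the corpus by genre is: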

Genre       Number of films
Action      290
Adventure   166
Animation   35
Comedy      347
Crime       201
Drama       579
Family      39
Fantasy     113
Horror      149
Musical     22
Mystery     107
Romance     192
Sci-Fi      155
Thriller    373
War         26
Western     13
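
Because one film can carry several genre labels, the counts above do not sum to the number of films. The following is a minimal sketch of that arithmetic, not part of the corpus tooling. It assumes the table covers all 1068 complete scripts (the page does not say whether it instead covers the 960 separated scripts), and the names GENRE_COUNTS, TOTAL_SCRIPTS, and summarize are hypothetical, introduced only for illustration.

```python
# Genre counts copied from the table above.
GENRE_COUNTS = {
    "Action": 290, "Adventure": 166, "Animation": 35, "Comedy": 347,
    "Crime": 201, "Drama": 579, "Family": 39, "Fantasy": 113,
    "Horror": 149, "Musical": 22, "Mystery": 107, "Romance": 192,
    "Sci-Fi": 155, "Thriller": 373, "War": 26, "Western": 13,
}

# Assumption: the table is computed over the 1068 complete scripts.
TOTAL_SCRIPTS = 1068


def summarize(counts, total_films):
    """Return the total number of genre labels, the mean number of
    labels per film, and each genre's share of the films."""
    total_labels = sum(counts.values())
    return {
        "total_labels": total_labels,
        "mean_genres_per_film": total_labels / total_films,
        "share_by_genre": {g: n / total_films for g, n in counts.items()},
    }


if __name__ == "__main__":
    summary = summarize(GENRE_COUNTS, TOTAL_SCRIPTS)
    print("Total genre labels:", summary["total_labels"])
    print("Mean genres per film: %.2f" % summary["mean_genres_per_film"])
    # List genres from most to least common with their share of films.
    for genre, share in sorted(summary["share_by_genre"].items(),
                               key=lambda item: item[1], reverse=True):
        print("%-10s %5.1f%%" % (genre, 100 * share))
```

Under that assumption, the 2807 genre labels spread over 1068 films give roughly 2.6 genres per film, so per-genre subsets overlap heavily and should not be treated as disjoint samples.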


Works that use this corpus:

  • The scene descriptions have been used to infer pairs of contingent/causal events in different genres, as described in: Hu, Zhichao, Elahe Rahimtoroghi, Larissa Munishkina, Reid Swanson, and Marilyn A. Walker. "Unsupervised Induction of Contingent Event Pairs from Film Scenes." In Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, Washington, USA, 2013.
  • Dialogue has been used for developing statistical stylistic character models as described in: Walker, Marilyn A., Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. "Perceived or Not Perceived: Film Character Models for Expressive NLG." In International Conference on Interactive Digital Storytelling (ICIDS), Vancouver, Canada, 2011. BEST PAPER AWARD.
  • Lin, Grace I., and Marilyn A. Walker. "All the World's a Stage: Learning Character Models from Film." In Proceedings of the Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), Stanford, California, USA, 2011. BEST STUDENT PAPER AWARD.

Download: Fill out the following form to download the Film Corpus 2.0.

Contact: Please direct questions to Zhichao Hu: zhu [at] ucsc [dot] edu
