If you use this data in your research, please cite it using the citation information given for the Film Corpus 1.0.
Overview: This corpus is an updated version of the Film Corpus 1.0. It contains the complete scripts of 1068 films as txt files, scraped from imsdb.com in November 2015 using Scrapy. It also contains 960 film scripts in which the dialog has been separated from the scene descriptions.
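Since the scripts are distributed as plain txt files, they can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus tooling: the function name `load_scripts` and the idea of pointing it at a single root directory are assumptions for illustration, and the actual directory layout of the download may differ.

```python
from pathlib import Path


def load_scripts(root):
    """Return {relative path: script text} for every .txt file under root.

    `root` is a hypothetical path to wherever the corpus was unpacked.
    """
    root = Path(root)
    scripts = {}
    for path in sorted(root.rglob("*.txt")):
        # Scraped web text is not guaranteed to be clean UTF-8, so
        # undecodable bytes are replaced instead of raising an error.
        text = path.read_text(encoding="utf-8", errors="replace")
        scripts[path.relative_to(root).as_posix()] = text
    return scripts
```

Keying the result by relative path keeps any folder information (for example a genre folder, if the download uses one) attached to each script.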
The Data: Film scripts are classified by genre, and one film can belong to multiple genres. Separated versions exist for only 960 of the 1068 scripts because we separate the dialog from the scene descriptions automatically with our own script. The distribution of the corpus by genre is:
| Genre | Number of films |
|---|---|
| Action | 290 |
| Adventure | 166 |
| Animation | 35 |
| Biography | 3 |
| Comedy | 347 |
| Crime | 201 |
| Drama | 579 |
| Family | 39 |
| Fantasy | 113 |
| Film-Noir | 4 |
| History | 3 |
| Horror | 149 |
| Music | 5 |
| Musical | 22 |
| Mystery | 107 |
| Romance | 192 |
| Sci-Fi | 155 |
| Short | 3 |
| Sport | 2 |
| Thriller | 373 |
| War | 26 |
| Western | 13 |
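Because one film can carry several genre labels, the per-genre counts add up to more than the number of films. The short sketch below works through that arithmetic using the counts listed above. It assumes the counts cover all 1068 complete scripts rather than only the 960 separated ones; the variable names are illustrative.

```python
# Per-genre film counts, copied from the distribution above.
GENRE_COUNTS = {
    "Action": 290, "Adventure": 166, "Animation": 35, "Biography": 3,
    "Comedy": 347, "Crime": 201, "Drama": 579, "Family": 39,
    "Fantasy": 113, "Film-Noir": 4, "History": 3, "Horror": 149,
    "Music": 5, "Musical": 22, "Mystery": 107, "Romance": 192,
    "Sci-Fi": 155, "Short": 3, "Sport": 2, "Thriller": 373,
    "War": 26, "Western": 13,
}

# Assumption: the distribution is computed over all 1068 complete scripts.
TOTAL_FILMS = 1068

# Each film contributes one label per genre it belongs to.
total_labels = sum(GENRE_COUNTS.values())
labels_per_film = total_labels / TOTAL_FILMS

print(total_labels)               # 2827
print(round(labels_per_film, 2))  # 2.65
```

Under that assumption, a film has about 2.65 genre labels on average, so per-genre subsets of the corpus overlap heavily and should not be treated as disjoint samples.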
Works that use this corpus:
- The scene descriptions have been used to infer pairs of contingent/causal events in different genres, as described in: Hu, Zhichao, Elahe Rahimtoroghi, Larissa Munishkina, Reid Swanson, and Marilyn A. Walker. "Unsupervised Induction of Contingent Event Pairs from Film Scenes." In Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, Washington, USA, 2013.
- Dialogue has been used for developing statistical stylistic character models as described in: Walker, Marilyn A., Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. "Perceived or Not Perceived: Film Character Models for Expressive NLG." In International Conference on Interactive Digital Storytelling (ICIDS), Vancouver, Canada, 2011. BEST PAPER AWARD.
Download: Fill out the following form to download the Film Corpus 2.0.
GitHub: https://github.com/zhichaohu/film-corpus-2
Contact: Please direct questions to Zhichao Hu: zhu [at] ucsc [dot] edu
Website last updated June 21, 2024.