Film Corpus 1.0

Updated Corpus: Film Corpus 2.0

If you use this data in your research, please refer to and cite:

Overview: The film corpus consists of 862 film scripts from The Internet Movie Script Database (IMSDb) website, representing 7,400 characters, with a total of 664,000 lines of dialogue and 9,599,000 tokens. Our snapshot of IMSDb is from May 19, 2010.

Download: Fill out the following form to download the Film Corpus 1.0.

User Information