Film Corpus 1.0

Description: The film corpus consists of 862 film scripts from the Internet Movie Script Database (IMSDb) website (http://www.imsdb.com/), representing 7,400 characters, with a total of 664,000 lines of dialogue and 9,599,000 tokens. Our snapshot of IMSDb was taken on May 19, 2010.
 
Download: (May 21, 2012) Film corpus [110 MB] (Note that this is only an *initial* version. A more up-to-date and better-verified version is currently in development.)

Paper: Marilyn A. Walker, Grace I. Lin, and Jennifer E. Sawyer. "An Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style", LREC 2012. Poster
 
Contact: Grace I. Lin (glin@soe.ucsc.edu)