Sarcasm Corpus V2

If you use this data in your research, please refer to and cite: Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff and Marilyn Walker. "Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue." In The 17th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), Los Angeles, California, USA, 2016.

Overview: The Sarcasm Corpus V2 is a subset of the Internet Argument Corpus (IAC, also available for download), including response text from quote-response pairs annotated for sarcasm. It is an update to the Sarcasm Corpus V1, and contains data representing three categories of sarcasm: general sarcasm, hyperbole, and rhetorical questions.

The Data: This download is currently a random sample of the dataset; the full corpus will be released as we continue to add more data.

The data is presented in "quote-response" pairs, where the quote functions as a "dialogic parent" to the response. The quote can be a post earlier in the thread, or a quote from a post earlier in the thread (thus, more than one response post may map to the same quote post). The sarcasm annotations relate only to the *response*, but we include the quote text for context.

The largest subset of the sample is the general sarcasm (GEN) subset, containing 1,630 quote-response pairs per class (sarcastic and not-sarcastic). The hyperbole (HYP) and rhetorical question (RQ) samples are smaller, containing 291 and 425 quote-response pairs per class, respectively. Some of the posts in the HYP and RQ samples also appear in the GEN subset, where they are sarcastic or not-sarcastic instances of hyperbole or rhetorical questions. HYP response posts contain cue words signaling hyperbole, and RQ response posts contain question-answer pairs in which the speaker continues with their turn after asking the question (not allowing a direct answer to it).
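As a quick sanity check on a downloaded copy, the per-class figures above can be turned into expected row counts. The following is a minimal sketch, not part of the corpus distribution. It assumes that every quote-response pair in each subset appears as its own row in the sample, including HYP and RQ posts that also occur in GEN; the names PAIRS_PER_CLASS and expected_rows are illustrative only.

```python
# Pairs per class for each subset, taken from the description above.
PAIRS_PER_CLASS = {"GEN": 1630, "HYP": 291, "RQ": 425}

# The two class labels used in the sample.
CLASSES = ("sarc", "notsarc")


def expected_rows(pairs_per_class=PAIRS_PER_CLASS, classes=CLASSES):
    """Return (rows per subset, total rows), assuming balanced classes
    and one row per quote-response pair (an assumption, see above)."""
    per_subset = {name: n * len(classes) for name, n in pairs_per_class.items()}
    return per_subset, sum(per_subset.values())


per_subset, total = expected_rows()
print(per_subset)  # {'GEN': 3260, 'HYP': 582, 'RQ': 850}
print(total)       # 4692
```

If the counts in an actual download differ, the one-row-per-pair assumption is the first thing to revisit.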

The sample is a single CSV file with the following fields:

    • Corpus: the corpus type - one of GEN (general sarcasm), HYP (hyperbole), and RQ (rhetorical questions).
    • Label: the class label of the response utterance - one of "sarc" (sarcastic) or "notsarc" (not-sarcastic).
    • ID: a unique ID for the quote-response pair - {corpus}_{label}_{ID}. Each quote-response is independent, i.e. pairs with the same ID numbers across different datasets are not related.
    • Quote Text: the text of the dialogic parent of the response post, for context.
    • Response Text: the text of the response to the quote, annotated for sarcasm (i.e. the sarcasm label relates to this utterance).
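The field list above can be read with Python's standard csv module. The following is a minimal sketch under stated assumptions: the header names ("Corpus", "Label", "ID", "Quote Text", "Response Text") are taken from the field list and may not match the header row of the actual file, and the two rows in SAMPLE_CSV are invented placeholders, not corpus data. The helper names parse_pair_id and load_pairs are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented two-row excerpt for illustration; the header names are assumed
# from the field list and may differ in the real file.
SAMPLE_CSV = '''Corpus,Label,ID,Quote Text,Response Text
GEN,sarc,GEN_sarc_0001,"Example quote.","Example response."
RQ,notsarc,RQ_notsarc_0001,"Another quote.","Another response, with a comma."
'''


def parse_pair_id(pair_id):
    """Split an ID of the form {corpus}_{label}_{ID} into its three parts."""
    corpus, label, number = pair_id.split("_", 2)
    return corpus, label, number


def load_pairs(handle):
    """Read quote-response rows and check each ID against its own columns."""
    rows = []
    for row in csv.DictReader(handle):
        corpus, label, _ = parse_pair_id(row["ID"])
        if (corpus, label) != (row["Corpus"], row["Label"]):
            raise ValueError(f"ID {row['ID']!r} disagrees with its Corpus/Label columns")
        rows.append(row)
    return rows


rows = load_pairs(io.StringIO(SAMPLE_CSV))

# Count rows per (corpus, label) pair, e.g. to compare against the stated sizes.
counts = Counter((r["Corpus"], r["Label"]) for r in rows)
print(counts)
```

To read a downloaded file instead, pass an open file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the sample was saved. Because ID numbers are only unique within a subset and class, the full ID string (or the Corpus, Label, and number together) is the safer key when joining or deduplicating rows.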

Download: Fill out the following form to download the Sarcasm Corpus V2.

Contact: Please direct questions to Shereen Oraby: soraby [at] ucsc [dot] edu

Last updated: 09/12/2016
