Paraphrase Acquisition via Crowdsourcing and Machine Learning
Steven Burrows, Martin Potthast, and Benno Stein
To paraphrase means to rewrite content whilst preserving the original meaning. Paraphrasing is important in
fields such as text reuse in journalism, anonymising work, and improving the quality of customer-written reviews.
This paper contributes to paraphrase acquisition and focuses on two aspects that are not addressed by
current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge
of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective,
and without crowdsourcing the creation of test corpora is unacceptably expensive at realistic orders of
magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating
paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are
limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase
Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection
competition. This corpus, crowdsourced via Amazon's Mechanical Turk, comprises passage-level paraphrases:
4 067 positive samples and 3 792 negative samples that failed our criteria. In this paper, we review
the lessons learned at PAN 2010, and explain in detail the method used to construct the corpus. The empirical
contributions include machine learning experiments that explore whether passage-level paraphrases can be identified in a
two-class classification problem using paraphrase similarity features. We find that a k-nearest-neighbor classifier
can correctly distinguish between paraphrased and non-paraphrased samples with 0.980 precision at 0.523
recall. This result implies that just under half of our samples must be discarded (the remaining fraction of 0.477), but
our cost analysis shows that the automation we introduce yields an 18% financial saving and over 100 hours
of time returned to the researchers when repeating a similar corpus design. On the other hand, when building
an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial
outcome is cost-neutral, whilst still returning over 70 hours of time to the researchers. The work presented here
is the first to join the paraphrasing and plagiarism communities.
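
As an illustration of how the precision and recall figures above relate to the discarded fraction, the following is a minimal sketch, not the evaluation code used for the corpus. The function name precision_recall and the toy label lists are hypothetical; only the recall value of 0.523 is taken from the text. Recall is the share of true paraphrases that the classifier accepts, so the share that must be discarded is one minus recall.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # accepted paraphrases
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # accepted non-paraphrases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rejected paraphrases
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (hypothetical, for illustration only): 1 = paraphrase, 0 = not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)

# Recall reported in the abstract; the complement is the discarded fraction.
REPORTED_RECALL = 0.523
discarded_fraction = round(1.0 - REPORTED_RECALL, 3)
```

On the toy labels the classifier accepts only half of the true paraphrases but makes no false acceptances, which mirrors the high-precision, moderate-recall operating point described above.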