The Textual Entailment Specialized Data Sets are the result of a feasibility study carried out jointly by FBK-Irst, CELCT and Bar-Ilan University on a methodology for decomposing complex Textual Entailment (TE) pairs into monothematic Text-Hypothesis (T-H) pairs, i.e. pairs in which a single linguistic phenomenon relevant to entailment is highlighted and isolated. The expected benefit of specialized data sets rests on the intuition that investigating linguistic phenomena separately, i.e. decomposing the complexity of the TE problem, should make it easier to develop specific strategies for coping with each of them.
The methodology for creating the monothematic pairs starts from an existing RTE pair and consists of the following steps:
1. Identify the linguistic phenomena present in the original RTE pair.
2. Apply an annotation procedure to isolate each phenomenon and create the related monothematic pair.
3. Group together all the monothematic T-H pairs relating to the same linguistic phenomenon, thereby creating specialized data sets.
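The grouping step can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory representation of monothematic pairs: the class `MonothematicPair`, its field names, the function `build_specialized_datasets`, the phenomenon labels and the toy sentences are all invented for illustration and do not reflect the actual format or content of the released data sets. The first two steps (identifying phenomena and isolating each one) are manual annotation and are not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MonothematicPair:
    """Hypothetical record for one monothematic T-H pair."""
    original_id: str   # id of the RTE pair it was derived from (assumed field)
    phenomenon: str    # linguistic phenomenon isolated in this pair
    text: str          # T
    hypothesis: str    # H
    judgment: str      # "entailment", "contradiction" or "unknown"


def build_specialized_datasets(pairs):
    """Group monothematic pairs by phenomenon, one list per phenomenon."""
    datasets = defaultdict(list)
    for pair in pairs:
        datasets[pair.phenomenon].append(pair)
    return dict(datasets)


# Invented toy pairs, not taken from the actual data sets.
pairs = [
    MonothematicPair("p1", "synonymy",
                     "The firm bought the plant.",
                     "The firm purchased the plant.", "entailment"),
    MonothematicPair("p1", "apposition",
                     "Rossi, the mayor, spoke.",
                     "Rossi is the mayor.", "entailment"),
    MonothematicPair("p2", "synonymy",
                     "The talks began on Monday.",
                     "The talks started on Monday.", "entailment"),
]

datasets = build_specialized_datasets(pairs)
for phenomenon, members in sorted(datasets.items()):
    print(phenomenon, len(members))
```

Each key of `datasets` then corresponds to one specialized data set, which may contain pairs derived from several different original RTE pairs.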
The methodology was applied to a sample of 90 T-H pairs randomly extracted from the RTE-5 data set (30 entailment, 30 contradiction and 30 unknown examples). Two annotators with skills in linguistics annotated the linguistic phenomena underlying the entailment, contradiction and unknown relations in these pairs, using both fine-grained and macro categories. From the 90 annotated pairs, 203 monothematic pairs were created (157 entailment, 33 contradiction and 13 unknown examples). These pilot data sets can be used both to advance the understanding of the linguistic phenomena involved in entailment judgments and as a first step toward the creation of large-scale specialized data sets.
The data sets are freely available for research purposes. Here it is possible to download:
Contact: Luisa Bentivogli