Textual Entailment Graph Dataset
This dataset was created within the EU-funded project EXCITEMENT (EXploring Customer Interactions through Textual EntailMENT) as gold standard data to evaluate the task of automatic Textual Entailment Graph (TEG) generation.
A TEG is a directed graph where each node is a complete natural language text (textual fragment) fi and each edge (fi, fj) represents an entailment relation from fi to fj. A textual entailment (fi, fj) holds if the meaning of fi implies the meaning of fj, according to the standard definition of textual entailment which states that fi entails fj if, typically, a human reading fi would infer that fj is most likely true.
Given a set of textual fragments (graph nodes), the task of constructing a TEG is to recognize all the entailments among the fragments, that is, to decide which directed edges connect which pairs of nodes. The main difference between this task and the traditional Recognizing Textual Entailment (RTE) task is that the text pairs are not independent. The nodes in the graph are interconnected via entailment edges, which should not represent contradictory decisions. For example, if the edges (u,v) and (v,w) are in the graph, then the edge (u,w) is implied by transitivity.
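The transitivity constraint can be illustrated with a minimal sketch. It assumes a TEG is held in memory as a list of node labels plus a set of (source, target) pairs; this representation and the function name `transitive_closure` are illustrative assumptions, not the format in which the dataset is distributed.

```python
def transitive_closure(nodes, edges):
    """Return edges plus every edge implied by transitivity (Warshall's algorithm).

    nodes: iterable of hashable node labels (assumed representation)
    edges: set of (source, target) pairs, meaning "source entails target"
    """
    nodes = list(nodes)
    closure = set(edges)
    for k in nodes:              # k is the allowed intermediate node
        for i in nodes:
            if (i, k) not in closure:
                continue
            for j in nodes:
                # Skip trivial self-loops; every text entails itself.
                if i != j and (k, j) in closure:
                    closure.add((i, j))
    return closure


# The example from the text: (u,v) and (v,w) imply (u,w).
nodes = ["u", "v", "w"]
edges = {("u", "v"), ("v", "w")}
closure = transitive_closure(nodes, edges)
missing = closure - edges        # edges a consistent graph must also contain
print(sorted(missing))
```

A set of pairwise entailment decisions is consistent with transitivity exactly when `missing` is empty, which is one simple way to check system output before comparing it with the gold standard.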
Our motivating scenario was text exploration, in particular the analysis of customer dissatisfaction, and the dataset was constructed from a collection of feedback sent by customers of a given company. Textual fragments were manually extracted in such a way that each fragment contains a single proposition where a customer states a reason for dissatisfaction with the company, like “No vegetarian snacks in the dining car” or “There’s not enough food selection on train”.
To reduce the annotation complexity, as well as to allow evaluation of TEG generation for particular subtopics (clusters) of the target collection, fragments were manually clustered into subtopics such as “legroom”, “internet connection”, “food choice”. Then, given the textual fragments within each cluster as input, a gold standard TEG was built for each of the clusters as a pipeline of two separate sub-tasks, namely (i) further decomposing each input fragment and constructing its individual textual entailment graph, and (ii) merging the fragment graphs into a single integrated TEG.
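As a rough sketch of sub-task (ii), merging can be viewed as taking the union of the fragment graphs and then adding the entailment edges that hold between nodes of different fragment graphs. The function name `merge_fragment_graphs`, the (nodes, edges) tuple representation, and the node labels below are all hypothetical; the actual annotation procedure was manual and is not specified here.

```python
def merge_fragment_graphs(fragment_graphs, cross_edges):
    """Merge fragment graphs into a single graph (illustrative only).

    fragment_graphs: list of (nodes, edges) pairs, one per input fragment
    cross_edges: entailment edges between nodes of different fragment graphs
    """
    nodes, edges = set(), set()
    for g_nodes, g_edges in fragment_graphs:
        nodes |= set(g_nodes)    # identical node labels collapse into one node
        edges |= set(g_edges)
    for src, dst in cross_edges:
        if src not in nodes or dst not in nodes:
            raise ValueError(f"unknown node in cross edge: {(src, dst)}")
        edges.add((src, dst))
    return nodes, edges


# Hypothetical labels: a1 -> a2 inside one fragment graph, b1 alone in another.
graph_a = ({"a1", "a2"}, {("a1", "a2")})
graph_b = ({"b1"}, set())
merged_nodes, merged_edges = merge_fragment_graphs(
    [graph_a, graph_b], cross_edges={("a2", "b1")}
)
```

After merging, the edges implied by transitivity (here, a1 to b1) would still need to be added for the integrated graph to be consistent.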
The Textual Entailment Graph Dataset is available for English and Italian and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
English Dataset
The English dataset contains a text collection generated on the basis of 389 emails sent by customers of a railway company. Textual fragments expressing reasons for dissatisfaction are clustered into 29 different topics corresponding to 29 TEGs, for a total of 756 nodes and 7830 edges.
The English dataset is the result of a joint effort of:
- NICE SYSTEMS LTD, Israel
- FONDAZIONE BRUNO KESSLER (FBK), Italy
- BAR ILAN UNIVERSITY (BIU), Israel
- DEUTSCHES FORSCHUNGSZENTRUM FUER KUENSTLICHE INTELLIGENZ GMBH (DFKI), Germany
To obtain the data, please fill in the form: Form
Italian Dataset
The Italian dataset contains a text collection generated on the basis of 292 posts taken from the official page of a mobile service provider on a social network. Textual fragments expressing reasons for dissatisfaction are clustered into 18 different topics corresponding to 18 TEGs, for a total of 760 nodes and 2318 edges.
The Italian dataset is the result of a joint effort of:
- ALMAWAVE SRL, Italy
- FONDAZIONE BRUNO KESSLER (FBK), Italy
- BAR ILAN UNIVERSITY (BIU), Israel
To obtain the data, please fill in the form: Form
Contact: Bernardo Magnini