We have extracted from Wikipedia a large set of sentences, to which we have automatically assigned a frame label using a WSD approach.
These sentences can be used either to extend the set of sentences already annotated for each frame (after manual validation) or as training data for frame identification.
The WSD algorithm applied is described in:
The extraction methodology was first presented in:
The data can be freely downloaded and used for research purposes. Since they were extracted from Wikipedia, the Creative Commons license applies.
The datasets comprise:
For English: 2,535,260 sentences extracted from Wikipedia, each provided with a frame label. In each sentence, the lexical unit is delimited by tabs. The frame label set was taken from FrameNet 1.4; the lexical units, however, may not all be present in FrameNet, and they can be used as input to extend the current lexical unit sets.
For Italian: 610,397 sentences extracted from Wikipedia, each provided with a frame label, as in the English dataset. In this case, all terms identified as lexical units could be acquired to populate the frames for Italian.
You can download the English file here (458 MB!).
The file has 5 columns:
Frame \t Lexical_unit(s) \t Wikipage_mapped_to_LU(s) \t Wikipage_where_sentence_occurs \t Sentence
Note that the candidate lexical unit in the sentence is put between tabs.
The second column can contain several lexical units, because their meaning can correspond to the same Wikipedia concept. For example, both marsh.n and marshland.n in the Biological_area frame have been assigned to the Marsh page.
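The layout above can be read line by line. Since the sentence itself contains two extra tabs marking the lexical unit, a naive split on all tabs would break the sentence apart; splitting off only the first four columns avoids this. The sketch below is illustrative, not an official reader: it assumes UTF-8 plain text, the 5-column order listed above, and a synthetic example line (the separator between multiple lexical units in the second column is an assumption).

```python
def parse_english_line(line):
    """Parse one line of the English dataset (5 tab-separated columns).

    The fifth column (the sentence) contains two additional tabs that
    delimit the candidate lexical unit, so we split off only the first
    four columns and treat the remainder as the sentence.
    """
    frame, lus, lu_pages, sentence_page, sentence = line.rstrip("\n").split("\t", 4)
    # Inside the sentence, the lexical unit occurrence sits between two tabs.
    before, lu, after = sentence.split("\t")
    return {
        "frame": frame,
        "lexical_units": lus,          # may list several LUs, e.g. marsh.n and marshland.n
        "lu_pages": lu_pages,          # Wikipedia page(s) mapped to the LU(s)
        "sentence_page": sentence_page,
        "sentence": before + lu + after,
        "lu_in_sentence": lu,
    }

# Synthetic example line, modeled on the Biological_area example above:
example = "Biological_area\tmarsh.n marshland.n\tMarsh\tSome_article\tThe \tmarsh\t was drained."
row = parse_english_line(example)
```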
You can download the Italian file here (111 MB!).
The file has 4 columns:
Frame \t Wikipage_mapped_to_the_frame \t Wikipage_where_sentence_occurs \t Sentence
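For the Italian file the same splitting trick applies, with one fewer column. The sketch below assumes that, as in the English dataset, the candidate term inside the sentence is delimited by tabs; the example line is synthetic.

```python
def parse_italian_line(line):
    """Parse one line of the Italian dataset (4 tab-separated columns).

    Split off only the first three columns; the remainder is the sentence,
    assumed to carry the candidate term between two tabs, as in English.
    """
    frame, frame_page, sentence_page, sentence = line.rstrip("\n").split("\t", 3)
    before, term, after = sentence.split("\t")
    return frame, frame_page, sentence_page, before + term + after, term

# Synthetic example line:
record = parse_italian_line("Some_frame\tSome_page\tSome_article\tLa \tpalude\t era vasta.")
```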
Contact: Katerina Tymoshenko (k.tymoshenko[at]gmail[dot]com) and Sara Tonelli