Task 1: POS Tagging

Introduction

There is an ongoing discussion about whether the problem of part-of-speech tagging is already solved, at least for English (see Manning 2011), since taggers now reach accuracies similar to or higher than human inter-annotator agreement, which is ca. 97%. In the case of languages with rich morphology, such as Polish, there is however no doubt that the accuracies of around 91% delivered by taggers leave much to be desired, and more work is needed before this task can be proclaimed solved.

The aim of this proposed task is therefore to stimulate research into potentially new approaches to POS tagging of Polish, which would help close the gap between the tagging accuracy of systems available for English and those for languages with rich morphology.

Task definition

Subtask (A): Morphosyntactic disambiguation and guessing

Given a sequence of segments, each with a set of possible morphosyntactic interpretations, the goal of the task is to select the correct interpretation for each of the segments and to provide an interpretation for segments for which only the 'ign' interpretation has been given (segments unknown to the morphosyntactic dictionary).
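
As an illustration only (not a required or suggested approach), the sketch below shows a naive unigram baseline for this subtask: for each known segment, pick the candidate tag seen most often in the training data, and fall back to a fixed default tag for 'ign' segments. All names and tag values in the sketch (disambiguate, tag_frequencies, DEFAULT_TAG) are hypothetical.

    # Naive unigram baseline; illustrative only, all names are hypothetical.
    from collections import Counter

    DEFAULT_TAG = "subst:sg:nom:m3"   # assumed fallback for unknown segments

    def disambiguate(segments, tag_frequencies):
        """segments: list of (orth, [candidate_tags]) pairs;
        tag_frequencies: Counter of tags seen in the gold training data."""
        chosen = []
        for orth, candidates in segments:
            if candidates == ["ign"]:
                # Unknown word: a real system would guess from suffixes, context, etc.
                chosen.append(DEFAULT_TAG)
            else:
                # Known word: pick the candidate observed most often in training.
                chosen.append(max(candidates, key=lambda tag: tag_frequencies[tag]))
        return chosen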

Subtask (B): Lemmatisation

Given a sequence of segments, each with a set of possible morphosyntactic interpretations, the goal of the task is to select the correct lemma for each of the segments and to provide a lemma for segments for which only the 'ign' interpretation has been given (segments unknown to the morphosyntactic dictionary).

Complete system (C): POS tagging

Given a raw text in Polish, the goal of the task is to segment the text into individual flexemes and to provide the correct lemma and POS tag for each of the segments.

Training data

June 12: training data is published.

train-raw.txt.gz - Raw text

train-analyzed.xml.gz - The text after morphosyntactic analysis (processed with the Morfeusz analyzer).

train-gold.xml.gz - Gold standard - segmented and manually POS-tagged data in XCES format.

Test data

July 31: the test data is published.

test-analyzed.xml.gz - Subtask (A) and (B) test data

test-raw.txt.gz - Subtask (C) test data

Evaluation script

Please use the linked tagger-eval.py script to evaluate the performance of your method against the provided training data. We will use the same script to assess the accuracy of the submitted algorithms against the final test data.

The script requires the corpus2 library to be installed on your system. Please read the linked wiki page for further installation instructions.

Evaluation procedure

Subtask (A): Morphosyntactic disambiguation and guessing

Given the (train/test)-analyzed.xml.gz file, you should provide a corpus in XCES format which contains disambiguated POS tags for each of the segments (see train-gold.xml.gz for a reference). For the system evaluation we will calculate three key statistics: the accuracy of the system in selecting the correct tag for known words (segments for which some interpretations have been provided in the *-analyzed.xml.gz file), the accuracy of the system in guessing the tags for unknown words (segments for which only the 'ign' interpretation has been given in the *-analyzed.xml.gz file), and the overall system accuracy.
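
For clarity, the following sketch shows how these three figures can be computed, assuming each segment is reduced to an (is_known, predicted_tag, gold_tag) triple; the authoritative implementation is the linked tagger-eval.py script, which may differ in details.

    def tagging_accuracies(items):
        """items: iterable of (is_known, predicted_tag, gold_tag) triples."""
        known_hits = known_total = unknown_hits = unknown_total = 0
        for is_known, predicted, gold in items:
            hit = int(predicted == gold)
            if is_known:
                known_total += 1
                known_hits += hit
            else:
                unknown_total += 1
                unknown_hits += hit
        overall = (known_hits + unknown_hits) / max(known_total + unknown_total, 1)
        return (known_hits / max(known_total, 1),      # accuracy on known words
                unknown_hits / max(unknown_total, 1),  # accuracy on unknown words
                overall)                               # overall accuracy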

Subtask (B): Lemmatisation

Given the (train/test)-analyzed.xml.gz file, you should provide a corpus in XCES format which contains disambiguated lemmas for each of the segments (see train-gold.xml.gz for a reference). For the system evaluation we will calculate three key statistics: the accuracy of the system in selecting the correct lemma for known words (segments for which some interpretations have been provided in the *-analyzed.xml.gz file), the accuracy of the system in guessing the lemmas for unknown words (segments for which only the 'ign' interpretation has been given in the *-analyzed.xml.gz file), and the overall system accuracy.

Complete system (C): POS tagging

Given the raw text (train/test)-raw.txt.gz, you should perform text segmentation and provide a corpus in XCES format which contains disambiguated POS tags and lemmas for each of the segments (see train-gold.xml.gz for a reference).

For the system evaluation we will calculate the following statistics: the accuracy of the system in selecting the correct lemma and tag for known words (segments for which some interpretations have been provided in the *-analyzed.xml.gz file), the accuracy of the system in guessing the lemmas and tags for unknown words (segments for which only the 'ign' interpretation has been given in the *-analyzed.xml.gz file), and the overall system accuracy in selecting lemmas and tags (six values in total).

In the case of a segmentation error (a particular word has been segmented differently in the gold standard and by your system), we will count that word as a tagging mistake, both for the POS tag and for the lemma.
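
To make this penalty concrete, here is a hedged sketch that compares segments by their character spans in the raw text; the actual accounting is done by tagger-eval.py and may differ in details.

    def score_with_segmentation(gold_segments, system_segments):
        """Each segment is a (start, end, tag) triple over the raw text.
        A gold segment whose span was not reproduced exactly by the system
        counts as a mistake, regardless of the tag the system assigned."""
        system_by_span = {(start, end): tag for start, end, tag in system_segments}
        correct = 0
        for start, end, gold_tag in gold_segments:
            # .get() returns None on a segmentation mismatch, so the check fails.
            if system_by_span.get((start, end)) == gold_tag:
                correct += 1
        return correct / max(len(gold_segments), 1)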

References

Manning, Christopher. "Part-of-speech tagging from 97% to 100%: is it time for some linguistics?" Computational Linguistics and Intelligent Text Processing (2011): 171–189.

Radziszewski, Adam, and Szymon Acedański. "Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers." International Conference on Text, Speech and Dialogue. Springer Berlin Heidelberg, 2012.

Kobyliński, Łukasz. "PoliTa: A multitagger for Polish." LREC 2014, pp. 2949–2954, Reykjavík, Iceland, 2014. ELRA.

Waszczuk, Jakub. "Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language." COLING 2012.

Radziszewski, Adam. "A tiered CRF tagger for Polish." Intelligent Tools for Building a Scientific Information Platform. Springer Berlin Heidelberg, 2013. 215–230.

Acedański, Szymon. "A morphosyntactic Brill tagger for inflectional languages." International Conference on Natural Language Processing. Springer Berlin Heidelberg, 2010.
 

Task 2: Sentiment analysis

Introduction

Sentiment analysis is a vital research area, approached at different levels: the phrase level (either in the context of opinion targets/aspects, or of phrases defined as syntactic sub-trees) and the sentence level (related to the task of tweet-level analysis).

The aim of this task is to promote research on this topic in the context of the Polish language, to provide reference data sets to work with, and to motivate potentially new methods.

Task definition

Given a set of syntactic dependency trees, the goal of the task is to provide the correct sentiment for each sub-tree (phrase); phrases correspond to sub-trees of the dependency parse tree. The annotations assign sentiment values to whole phrases (and in some cases, sentences), regardless of their type.

Methods applied to this task often include deep learning. Typically, applications compute sentiment recursively, starting from leaves and smaller phrases, then expanding to larger phrases and taking into account the sentiment values already computed for their nested sub-phrases. This is equivalent to recursively folding the tree in a bottom-up fashion. The sentence-level sentiment is then the value produced by your predictive model after folding the whole sentence.
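
A minimal sketch of such bottom-up folding over a single tree is given below; the combine function is a stand-in for whatever model you train (for example, a recursive neural network) and is not part of the task data.

    def fold_sentiment(node, combine):
        """node: {'word_sentiment': value, 'children': [child nodes, ...]}.
        Returns the predicted sentiment of the phrase rooted at this node
        (in the gold data this corresponds to the node's sentiment field)."""
        child_values = [fold_sentiment(child, combine) for child in node["children"]]
        if not child_values:
            return node["word_sentiment"]        # leaf: the token's own sentiment
        return combine(node["word_sentiment"], child_values)

    # Folding the root of a sentence's tree yields the sentence-level sentiment.
    # A trivial placeholder for combine could be, for instance:
    #   combine = lambda own, children: max(min(own + sum(children), 1), -1)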

An existing dataset of this kind is the Stanford Sentiment Treebank.

Training data

June 12: training data is published. 

The data set is a selection of:

  • sentiment-bearing sentences from the Skladnica treebank, dependency version. 
  • product reviews of two types (perfumes and clothing) with automatic dependency parse information.

sentiment-treebank.tar.gz - for each phrase: its sentiment, dependency labels, POS, tokens

The sentiment annotation for each token corresponds to the overall sentiment of the whole phrase rooted at that token, inclusive of the token itself. Specifically:

  • for every leaf token or word, its sentiment corresponds to that word's or token's own sentiment
  • for every non-leaf token or word (a node that has a non-empty set of children), the sentiment field describes the sentiment of the whole phrase formed by the sub-tree rooted at this token (that is, this token and all tokens below it), as illustrated in the sketch below
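
For instance, a hypothetical two-token phrase, with all values made up for illustration (the released files use a columnar text format; the structure below only shows the logic of the annotation):

    # Hypothetical phrase "bardzo dobry" ("very good"); values are made up.
    phrase = {
        "orth": "dobry", "sentiment": 1,        # non-leaf: sentiment of the whole
        "children": [                           # phrase "bardzo dobry"
            {"orth": "bardzo", "sentiment": 0, "children": []},  # leaf: own sentiment
        ],
    }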

 

Test data

Aug 1: the test data is published.

poleval_test.tar.gz - Task 2 test data set

Evaluation script

Please use the linked sent-eval.py script to evaluate the performance of your method against the provided test data. The script expects two files passed as arguments: one with your results, the other with the gold (reference) annotation. It computes the micro accuracy between your system's output and the gold annotations: the global proportion of correctly labelled sentiment scores, pooled over cases from all classes. The script expects data formatted exactly the same as the training data.

We count sentiment scores for each leaf and each phrase. We assign the same weight (1) to all sentiment scores, regardless of whether they represent sentiments of leaves, phrases or sentences. We do not distinguish between types of errors.
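
A minimal sketch of this scoring, assuming your output and the gold file have been flattened into equally ordered lists of sentiment labels (the authoritative implementation is the linked sent-eval.py script):

    def micro_accuracy(predicted, gold):
        """Every leaf, phrase and sentence score counts once, with weight 1."""
        if len(predicted) != len(gold):
            raise ValueError("label sequences must be aligned")
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold) if gold else 0.0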

References

https://nlp.stanford.edu/sentiment/treebank.html