The CLARIN:EL Research Infrastructure provides language processing tools and web services. Through the CLARIN:EL Central Inventory, users can access and use all the available language processing web services, such as word analysis, word recognition, sentence splitting, part-of-speech tagging, morphological and syntactic analysis, named entity recognition, term extraction and sentiment analysis tools.
Language processing tools and web services all work in the same way: each one receives a text as input (in a format compatible with its documented specifications), processes it according to its function, and outputs the processed result (annotated text).
The metadata descriptions of all the available language processing web services and tools are open to everyone (i.e. registered and unregistered users) through the CLARIN:EL Central Inventory. In addition, CLARIN:EL registered users can use the language processing web services themselves, always in accordance with the relevant licensing terms.
CLARIN:EL Language Processing tools & web services
Tokenization
Tokenization is used for detecting words and phrases in texts. More specifically, these tools are used for splitting textual content (e.g. a document) into smaller units, such as sentences, words, punctuation marks, numbers or symbols. These units are called tokens.
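The splitting described above can be illustrated with a minimal sketch. It assumes a simple regular-expression approach and does not reflect how any CLARIN:EL tokenizer is implemented; the pattern and the function name tokenize are illustrative only.

```python
import re

# Illustrative pattern (an assumption, not a CLARIN:EL specification):
# a number with optional decimal part, a run of word characters,
# or any single non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The moon costs 3.50 euros, right?"))
# ['The', 'moon', 'costs', '3.50', 'euros', ',', 'right', '?']
```

Production tokenizers also have to deal with abbreviations, clitics, URLs and language-specific conventions, which this sketch ignores.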
Available tools & web services
Lemmatization
Lemmatization groups together the different inflected forms of a word under a single base form, called the lemma. The output of lemmatization is a proper word. For example, a lemmatizer should map gone, going and went to go.
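As a rough illustration of the mapping in the example above, the following sketch uses a tiny hand-written lookup table. The table and the function name lemmatize are hypothetical; real lemmatizers rely on full morphological lexica or trained models.

```python
# Hypothetical lookup table covering only the forms of "go".
LEMMA_TABLE = {
    "gone": "go",
    "going": "go",
    "went": "go",
    "goes": "go",
}


def lemmatize(token: str) -> str:
    """Return the lemma of a token, or the token itself if it is unknown."""
    return LEMMA_TABLE.get(token.lower(), token)


print([lemmatize(word) for word in ["gone", "going", "went"]])
# ['go', 'go', 'go']
```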
Available tools & web services
PoS Tagging
PoS Tagging is used for annotating every word of a text with the corresponding part-of-speech tag (e.g. noun, verb, adjective, adverb, etc.) based on its context and definition. The result is a PoS tag assigned to each token of the text.
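To show how both the definition of a word and its context can influence the tag, here is a minimal rule-based sketch. The lexicon, the tag names and the two context rules are assumptions made for illustration and are not taken from any of the taggers listed below.

```python
# Hypothetical lexicon: each word form maps to its possible tags.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "i": ["PRON"],
    "read": ["VERB"],
    "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],  # ambiguous: needs context
}


def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Assign one tag per token using the lexicon and the previous tag."""
    tagged: list[tuple[str, str]] = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["NOUN"])  # unknown words default to NOUN
        previous_tag = tagged[-1][1] if tagged else None
        if len(candidates) > 1 and "NOUN" in candidates and previous_tag == "DET":
            tag = "NOUN"  # after a determiner, prefer the noun reading
        elif len(candidates) > 1 and "VERB" in candidates and previous_tag in (None, "PRON"):
            tag = "VERB"  # sentence-initially or after a pronoun, prefer the verb reading
        else:
            tag = candidates[0]
        tagged.append((token, tag))
    return tagged


print(pos_tag(["I", "read", "the", "book"]))
# [('I', 'PRON'), ('read', 'VERB'), ('the', 'DET'), ('book', 'NOUN')]
print(pos_tag(["Book", "a", "flight"]))
# [('Book', 'VERB'), ('a', 'DET'), ('flight', 'NOUN')]
```

The same word form, book, receives a different tag in each sentence, which is the context-dependence the description refers to.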
Available tools & web services
ILSP Feature-based multi-tiered POS Tagger
OpenNLP Part-of-Speech Tagger (English)
Named Entity Recognition
Named Entity Recognition is used in various information extraction applications for the automatic recognition and classification of Named Entities in texts into predefined classes such as Person, Location, Organization and GPE (Geo-political entity). The result is a tag with the corresponding category for each named entity identified in the text.
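The kind of output described above can be sketched with a simple gazetteer lookup. This is only one possible technique (many recognizers are statistical or neural), and the gazetteer entries and the function name recognize_entities are hypothetical.

```python
import re

# Hypothetical gazetteer mapping known names to entity classes.
GAZETTEER = {
    "Maria Callas": "PERSON",
    "Mount Olympus": "LOCATION",
    "United Nations": "ORGANIZATION",
    "Greece": "GPE",
}


def recognize_entities(text: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, surface form, class) for each gazetteer match."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), match.end(), name, label))
    return sorted(entities)  # ordered by position in the text


for start, end, name, label in recognize_entities("Maria Callas performed in Greece."):
    print(start, end, name, label)
```

Each result carries character offsets, so the tag can be attached to the exact span of text it refers to.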
Available tools & web services
Sentence Splitting
Sentence splitting is used for detecting sentences in texts. More specifically, these tools identify the boundaries of a sentence by making use of punctuation marks and further detecting whether they mark the end of a sentence or not.
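The two steps described above, finding punctuation marks and then deciding whether each one really ends a sentence, can be sketched as follows. The abbreviation list and the function name split_sentences are assumptions for illustration, not the logic of the tools listed below.

```python
import re

# Hypothetical list of abbreviations whose final period does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text: str) -> list[str]:
    """Split text at punctuation marks that plausibly end a sentence."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        # A boundary must be followed by whitespace or the end of the text,
        # which rules out the period inside a number such as 3.50.
        if end < len(text) and not text[end].isspace():
            continue
        # A period that closes a known abbreviation is not a boundary.
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # keep trailing text without final punctuation
    return sentences


print(split_sentences("Dr. Smith paid 3.50 euros. Was it too much? No!"))
# ['Dr. Smith paid 3.50 euros.', 'Was it too much?', 'No!']
```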
Available tools & web services
ILSP Sentence splitter and Tokenizer for Greek
OpenNLP Sentence Detector (English)
Dependency Parsing
Dependency parsers create tree representations for each input sentence, where each word depends on a head word and is assigned a label depicting its relation to the head word (e.g. subject, object, etc.). Thus, in the sentence Astronomers discovered a new moon, a dependency parser recognizes that the words Astronomers and moon are the subject and the object of the word discovered.
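The tree for the example sentence can be written down as a small data structure. The sketch below is a hand-written analysis, not the output of an actual parser; the class name DependencyArc and the labels other than subject and object are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DependencyArc:
    index: int     # 1-based position of the word in the sentence
    form: str      # the word itself
    head: int      # position of the head word, 0 for the root
    relation: str  # label of the relation to the head


# Hand-written analysis of "Astronomers discovered a new moon".
PARSE = [
    DependencyArc(1, "Astronomers", 2, "subject"),
    DependencyArc(2, "discovered", 0, "root"),
    DependencyArc(3, "a", 5, "determiner"),
    DependencyArc(4, "new", 5, "modifier"),
    DependencyArc(5, "moon", 2, "object"),
]


def dependents_of(parse: list[DependencyArc], head_form: str) -> dict[str, str]:
    """Map each relation label to the word attached to the given head."""
    head_index = next(arc.index for arc in parse if arc.form == head_form)
    return {arc.relation: arc.form for arc in parse if arc.head == head_index}


print(dependents_of(PARSE, "discovered"))
# {'subject': 'Astronomers', 'object': 'moon'}
```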
Available tools & web services
Chunking
Chunking involves the identification and segmentation of a text into groups of words, which are related to each other at the syntactic level, such as nominal groups or verbal groups, without further specifying their internal structure or their syntactic role in the sentence.
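A minimal sketch of this grouping, assuming the input has already been PoS-tagged, is shown below. The tag-to-chunk mapping and the labels NP (nominal group) and VP (verbal group) are illustrative assumptions.

```python
from itertools import groupby

# Hypothetical mapping from PoS tags to the chunk type they belong to.
CHUNK_OF_TAG = {"DET": "NP", "ADJ": "NP", "NOUN": "NP", "VERB": "VP", "AUX": "VP"}


def chunk(tagged_tokens: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Group consecutive tokens whose tags map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged_tokens, key=lambda pair: CHUNK_OF_TAG.get(pair[1])):
        words = [word for word, _ in group]
        if label is not None:  # tokens with unmapped tags stay outside any chunk
            chunks.append((label, words))
    return chunks


tagged = [
    ("Astronomers", "NOUN"),
    ("discovered", "VERB"),
    ("a", "DET"),
    ("new", "ADJ"),
    ("moon", "NOUN"),
]
print(chunk(tagged))
# [('NP', ['Astronomers']), ('VP', ['discovered']), ('NP', ['a', 'new', 'moon'])]
```

The output groups the words but, as described above, says nothing about the internal structure of each group or its role in the sentence. This toy rule would also merge two adjacent nominal groups into one, which a real chunker has to avoid.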
Available tools & web services
Manual Text Annotation
Annotation is the practice of adding interpretative linguistic information, also known as tags or labels, to words or sets of words in a text or corpus. Annotation can be applied both to raw data and to data that have already been processed. It can be done automatically (see all the previous tools and web services) or manually, by human annotators.
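One common way to store such labels is as stand-off annotations, i.e. records that point to character offsets in the unchanged text. The sketch below assumes this representation; the class name Annotation, the source field and the label CELESTIAL_BODY are hypothetical and do not describe a CLARIN:EL format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character covered
    end: int     # offset just past the last character covered
    label: str   # the tag added to the span
    source: str  # "automatic" or "manual"


text = "Astronomers discovered a new moon"

# Labels produced by an earlier automatic step (already processed data).
annotations = [
    Annotation(0, 11, "NOUN", "automatic"),
    Annotation(12, 22, "VERB", "automatic"),
]

# A human annotator adds a label to a set of words on top of them.
annotations.append(Annotation(23, 33, "CELESTIAL_BODY", "manual"))


def covered_text(source_text: str, annotation: Annotation) -> str:
    """Return the stretch of text an annotation refers to."""
    return source_text[annotation.start:annotation.end]


for annotation in annotations:
    print(annotation.source, annotation.label, repr(covered_text(text, annotation)))
```

Because the text itself is never modified, automatic and manual annotations can coexist on the same data.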