Trankit’s Documentation

News: Trankit v1.0.0 is out. The new version provides pretrained pipelines based on XLM-Roberta Large, which substantially outperform our previous pipelines based on XLM-Roberta Base. Check out the performance comparison here. Trankit v1.0.0 also provides a brand new command-line interface that makes Trankit easier to use for users who are not familiar with Python. Finally, the new version has a brand new Auto Mode in which a language detector activates the language-specific models for processing the input, avoiding the need to switch back and forth between languages in a multilingual pipeline.

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks in over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing, while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies v2.5 treebanks. Our pipeline also obtains competitive or better named entity recognition (NER) performance compared to existing popular toolkits on 11 public NER datasets covering 8 languages.

Trankit’s Github Repo is available at: https://github.com/nlp-uoregon/trankit

Trankit’s Demo Website is hosted at: http://nlp.uoregon.edu/trankit

Citation

If you use Trankit in your research or software, please cite our following paper:

@inproceedings{nguyen2021trankit,
   title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
   author={Nguyen, Minh Van and Lai, Viet and Veyseh, Amir Pouran Ben and Nguyen, Thien Huu},
   booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
   year={2021}
}

News: Trankit v1.0.0 is out

What’s new in Trankit v1.0.0? Let’s check it out below.

Installation

pip install trankit==1.0.1

Trankit large

Starting from version 1.0.0, Trankit supports using XLM-Roberta large as the multilingual embedding (i.e., Trankit large), which further boosts the performance over 90 Universal Dependencies treebanks. The usage of the large version is the same as before, except that users need to specify the embedding when initializing the pipeline. Below are examples of initializing a monolingual and a multilingual pipeline.

from trankit import Pipeline 

# Initialize an English pipeline with XLM-Roberta large
p = Pipeline('english', embedding='xlm-roberta-large')

# Initialize a multilingual pipeline for ['english', 'chinese', 'arabic'] with XLM-Roberta large
p = Pipeline('english', embedding='xlm-roberta-large')
p.add('chinese')
p.add('arabic')

Currently, the argument embedding can only be set to one of the two values ['xlm-roberta-large', 'xlm-roberta-base']. If the argument embedding is not specifically set, Trankit will use 'xlm-roberta-base' for its multilingual embedding by default.

Auto mode for multilingual pipelines

Starting from version v1.0.0, Trankit supports a handy Auto Mode in which users do not have to set a particular language active before processing the input. In the Auto Mode, Trankit will automatically detect the language of the input and use the corresponding language-specific models, thus avoiding switching back and forth between languages in a multilingual pipeline. Specifically, there are two methods to turn on the Auto Mode.

The first method is to initialize a multilingual pipeline with a special code 'auto'. After the initialization, the pipeline would be automatically set in the Auto Mode.

from trankit import Pipeline

p = Pipeline('auto')

# Tokenizing an English input
en_output = p.tokenize('''I figured I would put it out there anyways.''') 

# POS, Morphological tagging and Dependency parsing a French input
fr_output = p.posdep('''On pourra toujours parler à propos d'Averroès de "décentrement du Sujet".''')

# NER tagging a Vietnamese input
vi_output = p.ner('''Cuộc tiêm thử nghiệm tiến hành tại Học viện Quân y, Hà Nội''')

Note that the multilingual pipeline in this case is initialized with all supported languages, and all of these languages are considered during language detection. In cases where we want to constrain the detected language to a specific set, the second method is used:

from trankit import Pipeline

p = Pipeline('english')
p.add('french')
p.add('vietnamese')
p.set_auto(True)

# Tokenizing an English input
en_output = p.tokenize('''I figured I would put it out there anyways.''') 

# POS, Morphological tagging and Dependency parsing a French input
fr_output = p.posdep('''On pourra toujours parler à propos d'Averroès de "décentrement du Sujet".''')

# NER tagging a Vietnamese input
vi_output = p.ner('''Cuộc tiêm thử nghiệm tiến hành tại Học viện Quân y, Hà Nội''')

In this way, we are guaranteed that the detected language can only be one of the added languages ["english", "french", "vietnamese"]. If at some later point we want to turn off the Auto Mode, this can be done with a single line of code:

p.set_auto(False)

After this, our multilingual pipeline can be used in the manual mode, where we explicitly set a particular language active. As a final note, we use langid to perform language detection. The detected language for each input can be inspected by accessing the field "lang" of the output. Thanks to loretoparisi for suggesting this feature.
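Conceptually, the Auto Mode behaves like the following sketch (our simplified illustration, not Trankit's internal code; detect_language here is a crude stub standing in for langid, and the per-language pipelines are placeholders):

```python
# Simplified sketch of Auto Mode dispatch (not Trankit's actual implementation):
# a language detector picks which language-specific models process each input,
# and the detected language is reported back in the output's 'lang' field.

def detect_language(text, candidates):
    """Stub standing in for langid: a crude heuristic over the candidates."""
    if any('\u00C0' <= ch <= '\u024F' for ch in text):  # accented Latin letters
        return 'french' if 'french' in candidates else candidates[0]
    return 'english' if 'english' in candidates else candidates[0]

def auto_process(text, pipelines):
    lang = detect_language(text, list(pipelines))
    output = pipelines[lang](text)  # run the language-specific models
    output['lang'] = lang           # detected language, as in Trankit's output
    return output

# Placeholder "pipelines" standing in for the real language-specific models
pipelines = {
    'english': lambda t: {'text': t, 'tokens': t.split()},
    'french':  lambda t: {'text': t, 'tokens': t.split()},
}
out = auto_process('I figured I would put it out there anyways.', pipelines)
print(out['lang'])  # english
```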

Command-line interface

Starting from version v1.0.0, Trankit supports processing text via a command-line interface. This makes Trankit accessible to users who are not familiar with the Python programming language. Please check out this page for tutorials and examples.

Installation

Installing Trankit is easily done via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all its dependencies automatically.

From source

git clone https://github.com/nlp-uoregon/trankit
cd trankit
pip install -e .

This first clones our GitHub repo and then installs Trankit from source.

Fixing the compatibility issue of Trankit with Transformers

Previous versions of Trankit encountered a compatibility issue with recent versions of transformers. To fix this issue, please install the new version of Trankit as follows:

pip install trankit==1.0.1

If you encounter any other problem with the installation, please raise an issue here to let us know. Thanks.

Quick examples

Initialize a pipeline

Monolingual usage

Before using any function of Trankit, we need to initialize a pipeline. Here is how we can do it for English:

from trankit import Pipeline

p = Pipeline('english')

In this example, Trankit receives the string 'english' specifying which language package it should use to initialize a pipeline. To know which language packages are supported, we can check this table or directly print out the attribute trankit.supported_langs:

import trankit

print(trankit.supported_langs)
# Output: ['afrikaans', 'ancient-greek-perseus', 'ancient-greek', 'arabic', 'armenian', 'basque', 'belarusian', 'bulgarian', 'catalan', 'chinese', 'traditional-chinese', 'classical-chinese', 'croatian', 'czech-cac', 'czech-cltt', 'czech-fictree', 'czech', 'danish', 'dutch', 'dutch-lassysmall', 'english', 'english-gum', 'english-lines', 'english-partut', 'estonian', 'estonian-ewt', 'finnish-ftb', 'finnish', 'french', 'french-partut', 'french-sequoia', 'french-spoken', 'galician', 'galician-treegal', 'german', 'german-hdt', 'greek', 'hebrew', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'italian-partut', 'italian-postwita', 'italian-twittiro', 'italian-vit', 'japanese', 'kazakh', 'korean', 'korean-kaist', 'kurmanji', 'latin', 'latin-perseus', 'latin-proiel', 'latvian', 'lithuanian', 'lithuanian-hse', 'marathi', 'norwegian-nynorsk', 'norwegian-nynorsklia', 'norwegian-bokmaal', 'old-french', 'old-russian', 'persian', 'polish-lfg', 'polish', 'portuguese', 'portuguese-gsd', 'romanian-nonstandard', 'romanian', 'russian-gsd', 'russian', 'russian-taiga', 'scottish-gaelic', 'serbian', 'slovak', 'slovenian', 'slovenian-sst', 'spanish', 'spanish-gsd', 'swedish-lines', 'swedish', 'tamil', 'telugu', 'turkish', 'ukrainian', 'urdu', 'uyghur', 'vietnamese']

By default, Trankit tries to use a GPU if one is available. However, we can force it to run on CPU by setting the argument gpu=False:

from trankit import Pipeline

p = Pipeline('english', gpu=False)

Another argument that we can use is cache_dir. By default, Trankit checks if the pretrained model files exist; if they don't, it downloads all pretrained files, including the shared XLMR-related files and the separate language-related files, and stores them in ./cache/trankit. We can change this location by setting the argument cache_dir:

from trankit import Pipeline

p = Pipeline('english', cache_dir='./path-to-your-desired-location/')

Multilingual usage

Processing multilingual inputs is easy and effective with Trankit. For example, to initialize a pipeline that can process inputs in the 3 languages English, Chinese, and Arabic, we can do as follows:

from trankit import Pipeline

p = Pipeline('english')
p.add('chinese')
p.add('arabic')

Each time the add function is called for a particular language (e.g., 'chinese' and 'arabic' in this case), Trankit only downloads the language-related files, so the download is very fast. Here is what will show up when the above snippet is executed:

from trankit import Pipeline

p = Pipeline('english')
# Output:
# Downloading: 100%|██| 5.07M/5.07M [00:00<00:00, 9.28MB/s]
# http://nlp.uoregon.edu/download/trankit/english.zip
# Downloading: 100%|█| 47.9M/47.9M [00:00<00:00, 89.2MiB/s]
# Loading pretrained XLM-Roberta, this may take a while...
# Downloading: 100%|███████| 512/512 [00:00<00:00, 330kB/s]
# Downloading: 100%|██| 1.12G/1.12G [00:14<00:00, 74.8MB/s]
# Loading tokenizer for english
# Loading tagger for english
# Loading lemmatizer for english
# Loading NER tagger for english
# ==================================================
# Active language: english
# ==================================================

p.add('chinese')
# http://nlp.uoregon.edu/download/trankit/chinese.zip
# Downloading: 100%|█| 40.4M/40.4M [00:00<00:00, 81.3MiB/s]
# Loading tokenizer for chinese
# Loading tagger for chinese
# Loading lemmatizer for chinese
# Loading NER tagger for chinese
# ==================================================
# Added languages: ['english', 'chinese']
# Active language: english
# ==================================================

p.add('arabic')
# http://nlp.uoregon.edu/download/trankit/arabic.zip
# Downloading: 100%|█| 38.6M/38.6M [00:00<00:00, 76.8MiB/s]
# Loading tokenizer for arabic
# Loading tagger for arabic
# Loading multi-word expander for arabic
# Loading lemmatizer for arabic
# Loading NER tagger for arabic
# ==================================================
# Added languages: ['english', 'chinese', 'arabic']
# Active language: english
# ==================================================

As we can see, each time a new language is added, the list of added languages increases. However, the active language remains the same, i.e., 'english'. This indicates that the pipeline can work with inputs in the 3 specified languages, but it currently assumes that the inputs it receives are in 'english'. To change this assumption, we need to “tell” the pipeline that we’re going to process inputs of a particular language, for example:

p.set_active('chinese')
# ==================================================
# Active language: chinese
# ==================================================

From now on, the pipeline is ready to process 'chinese' inputs. To make sure that the language has been activated successfully, we can access the attribute active_lang of the pipeline:

print(p.active_lang)
# 'chinese'

Document-level processing

The following lines of code show the basic use of Trankit with English inputs.

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

# perform all tasks on the input
all = p(doc_text)

Here, doc_text is assumed to be a document. Sentence segmentation and tokenization are then done jointly. For each sentence, Trankit performs part-of-speech tagging, morphological feature tagging, dependency parsing, and also named entity recognition (NER) if a pretrained NER model for that language is available. The result of the entire process is stored in the variable all, a hierarchical native Python dictionary from which we can retrieve different types of information at both the document and sentence level. The output would look like this (we use […] to improve the visualization):

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # sentence span
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This', 'upos': 'PRON', 'xpos': 'DT',
          'feats': 'Number=Sing|PronType=Dem',
          'head': 3, 'deprel': 'nsubj', 'lemma': 'this', 'ner': 'O',
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}

Below we show some examples of accessing different parts of the output.

Text

At document level, we have two fields to access: 'text', storing the input string, and 'sentences', storing the tagged sentences of the input. Suppose that we want to get the text form of the second sentence. We would first access the field 'sentences' of the output to get the list of sentences, use an index to locate the sentence in the list, then finally access its 'text' field:

sent_text = all['sentences'][1]['text'] 
print(sent_text)
# Output: This is Trankit.

Span

We can also get the text form of the sentence manually with the 'dspan' field, which provides the span-based location of the sentence in the document:

dspan = all['sentences'][1]['dspan']
print(dspan) 
# Output: (7, 23)
sent_text = doc_text[dspan[0]: dspan[1]]
print(sent_text) 
# Output: This is Trankit.

Note that the prefix d in 'dspan' indicates that this information is at the document level.

Token list

Each sentence is associated with a list of tokens, which can be accessed via the 'tokens' field. Each token is in turn a dictionary with different types of information. For example, we can get the information of the first token of the second sentence as follows:

token = all['sentences'][1]['tokens'][0]
print(token)

The information of the token is stored in a Python dictionary:

{
  'id': 1,                   # token index
  'text': 'This',            # text form of the token
  'upos': 'PRON',            # UPOS tag of the token
  'xpos': 'DT',              # XPOS tag of the token
  'feats': 'Number=Sing|PronType=Dem',    # morphological feature of the token
  'head': 3,                 # index of the head token
  'deprel': 'nsubj',         # dependency relation between the current token and the head token
  'dspan': (7, 11),          # document-level span of the token
  'span': (0, 4),            # sentence-level span of the token
  'lemma': 'this',           # lemma of the token
  'ner': 'O'                 # Named Entity Recognition (NER) tag of the token
}

Here, we provide two different types of span for each token: 'dspan' and 'span'. 'dspan' is used for the global location of the token in the document while 'span' provides the local location of the token in the sentence. We can use either one of these two fields to manually retrieve the text form of the token like this:

# retrieve the text form via 'dspan'
dspan = token['dspan']
print(doc_text[dspan[0]: dspan[1]])
# Output: This

# retrieve the text form via 'span'
span = token['span']
print(sent_text[span[0]: span[1]])
# Output: This

Sentence-level processing

In many cases, we may want to use Trankit to process a sentence instead of a document. This can be achieved by setting the tag is_sent=True:

sent_text = '''Hello! This is Trankit.'''
tokens = p(sent_text, is_sent=True)

The output is now a dictionary with a list of all tokens, instead of a list of sentences as before.

{
  'text': 'Hello! This is Trankit.',
  'tokens': [
    {
      'id': 1,
      'text': 'Hello',
      'upos': 'INTJ',
      'xpos': 'UH',
      'head': 5,
      'deprel': 'discourse',
      'lemma': 'hello',
      'ner': 'O',
      'span': (0, 5)
    },
    {'id': 2...},
    {'id': 3...},
    {'id': 4...},
    {'id': 5...},
    {'id': 6...},
  ]
}

For more examples on other functions, please refer to the following sections: Sentence Segmentation, Tokenization, Part-of-speech, Morphological tagging and Dependency parsing, Lemmatization, Named entity recognition, and Building a customized pipeline.

How Trankit works

Note: The best way to understand how Trankit works is to look at our technical paper, which is available at: https://arxiv.org/pdf/2101.03289.pdf

In this section, we briefly present the most important details of the technologies used by Trankit.

Natural Language Processing (NLP) pipelines of current state-of-the-art multilingual NLP toolkits such as UDPipe (Straka, 2018) and Stanza (Qi et al., 2020) are trained separately and do not share any component, especially the embedding layers that account for most of the model size. This makes their memory usage grow aggressively as pipelines for more languages are simultaneously needed and loaded into memory. Most importantly, these toolkits have not explored contextualized embeddings from pretrained transformer-based language models, which have the potential to significantly improve the performance of NLP tasks, as demonstrated in many prior works (Devlin et al., 2018, Liu et al., 2019b, Conneau et al., 2020). This motivates us to develop Trankit to overcome such limitations.

Trankit Architecture

Overall architecture of Trankit. Among the five models, three (i.e., joint token and sentence splitter; joint model for part-of-speech tagging, morphological tagging, and dependency parsing; and named entity recognizer) in Trankit are transformer-based. They all share a single multilingual pretrained transformer.

First, we utilize the state-of-the-art multilingual pretrained transformer XLM-Roberta (Conneau et al., 2020) to build three components: the joint token and sentence splitter; the joint model for part-of-speech tagging, morphological tagging, and dependency parsing; and the named entity recognizer (see the figure above). As a result, our system advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing while achieving competitive or better performance for tokenization, multi-word token expansion, and lemmatization over the 90 treebanks.

Second, we simultaneously solve the problem of loading pipelines for many languages into memory and the problem of the transformer size with our novel plug-and-play mechanism with Adapters (Pfeiffer et al., 2020a, Pfeiffer et al., 2020b). In particular, a set of adapters (for transformer layers) and task-specific weights (for final predictions) are created for each transformer-based component for each language, while only one single large multilingual pretrained transformer is shared across components and languages. During training, the shared pretrained transformer is fixed while only the adapters and task-specific weights are updated. At inference time, depending on the language of the input text and the current active component, the corresponding trained adapter and task-specific weights are activated and plugged into the pipeline to process the input. This mechanism not only solves the memory problem but also substantially reduces the training time.
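The plug-and-play mechanism described above can be sketched in a few lines (a simplified illustration of the idea, not Trankit's actual implementation; all class and function names here are ours):

```python
# Simplified illustration of the plug-and-play idea: one shared encoder is
# loaded once, while small per-(language, task) adapters and task-specific
# heads are kept in a registry and swapped in at inference time.

class SharedEncoder:
    """Stands in for the single shared (and frozen) XLM-Roberta instance."""
    def encode(self, text):
        return f"repr({text})"  # placeholder for contextualized embeddings

class PluggablePipeline:
    def __init__(self):
        self.encoder = SharedEncoder()  # loaded once, shared across languages
        self.adapters = {}              # (language, task) -> small weights

    def register(self, language, task, adapter):
        self.adapters[(language, task)] = adapter

    def run(self, language, task, text):
        # Activate only the adapter matching the input's language and the
        # current component; the large shared encoder is reused as-is.
        adapter = self.adapters[(language, task)]
        return adapter(self.encoder.encode(text))

p = PluggablePipeline()
p.register('english', 'posdep', lambda h: ('english-posdep', h))
p.register('arabic', 'posdep', lambda h: ('arabic-posdep', h))
print(p.run('english', 'posdep', 'Hello')[0])  # english-posdep
```

Because only the lightweight adapters differ per language, adding a language costs little extra memory on top of the one shared transformer.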

Model performance


Universal Dependencies v2.5

The following table shows the performance comparison between Trankit large (using XLM-Roberta large) and base (using XLM-Roberta base), spaCy v2.3, UDPipe v1.2, and Stanza v1.1.1 on the test treebanks of the 5 major languages in the Universal Dependencies v2.5 corpora. Performance for each system is the F1 score obtained with the official evaluation script of the CoNLL 2018 Shared Task. The meanings of the columns in the following tables, as described by the evaluation script, are:

  • Tokens: how well do the gold tokens match system tokens.

  • Sents.: how well do the gold sentences match system sentences.

  • Words: how well can the gold words be aligned to system words.

  • UPOS: using aligned words, how well does UPOS match.

  • XPOS: using aligned words, how well does XPOS match.

  • UFeats: using aligned words, how well does universal FEATS match.

  • AllTags: using aligned words, how well does UPOS+XPOS+FEATS match.

  • Lemmas: using aligned words, how well does LEMMA match.

  • UAS: using aligned words, how well does HEAD match.

  • LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match.

  • CLAS: using aligned words with content DEPREL, how well does HEAD+DEPREL(ignoring subtypes) match.

  • MLAS: using aligned words with content DEPREL, how well does HEAD+DEPREL(ignoring subtypes)+UPOS+UFEATS+FunctionalChildren(DEPREL+UPOS+UFEATS) match.

  • BLEX: using aligned words with content DEPREL, how well does HEAD+DEPREL(ignoring subtypes)+LEMMAS match.
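All of these metrics are F1 scores over units on which the gold and system analyses agree. As a rough illustration (not the official evaluation script), here is how such an F1 score could be computed for the Tokens metric, representing each token as a character span:

```python
# Illustrative sketch of a span-based F1 score, as used for the Tokens metric
# (the official CoNLL 2018 script is more involved; this shows only the idea).

def span_f1(gold_spans, system_spans):
    """F1 between gold and system tokenizations given as (start, end) spans."""
    gold, system = set(gold_spans), set(system_spans)
    correct = len(gold & system)        # spans both analyses agree on
    if correct == 0:
        return 0.0
    precision = correct / len(system)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold tokenization of "It's here": ["It", "'s", "here"]
gold = [(0, 2), (2, 4), (5, 9)]
# A system that wrongly keeps "It's" as one token
system = [(0, 4), (5, 9)]
print(round(span_f1(gold, system), 2))  # 0.4
```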

| Treebank | System | Tokens | Sents. | Words | UPOS | XPOS | UFeats | Lemmas | UAS | LAS |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic-PADT | Trankit[large] | 99.95 | 96.79 | 99.39 | 96.80 | 94.93 | 94.99 | 94.92 | 89.74 | 85.98 |
| Arabic-PADT | Trankit[base] | 99.93 | 96.59 | 99.22 | 96.31 | 94.08 | 94.28 | 94.65 | 88.39 | 84.68 |
| Arabic-PADT | Stanza | 99.98 | 80.43 | 97.88 | 94.89 | 91.75 | 91.86 | 93.27 | 83.27 | 79.33 |
| Arabic-PADT | UDPipe | 99.98 | 82.09 | 94.58 | 90.36 | 84.00 | 84.16 | 88.46 | 72.67 | 68.14 |
| Chinese-GSD | Trankit[large] | 97.75 | 99.40 | 97.75 | 95.13 | 94.94 | 97.28 | 97.75 | 87.38 | 84.82 |
| Chinese-GSD | Trankit[base] | 97.01 | 99.70 | 97.01 | 94.21 | 94.02 | 96.59 | 97.01 | 85.19 | 82.54 |
| Chinese-GSD | Stanza | 92.83 | 98.80 | 92.83 | 89.12 | 88.93 | 92.11 | 92.83 | 72.88 | 69.82 |
| Chinese-GSD | UDPipe | 90.27 | 99.10 | 90.27 | 84.13 | 84.04 | 89.05 | 90.26 | 61.60 | 57.81 |
| English-EWT | Trankit[large] | 98.67 | 90.49 | 98.67 | 96.65 | 96.47 | 96.75 | 97.25 | 91.29 | 89.40 |
| English-EWT | Trankit[base] | 98.48 | 88.35 | 98.48 | 95.95 | 95.71 | 96.26 | 96.84 | 90.14 | 87.96 |
| English-EWT | Stanza | 99.01 | 81.13 | 99.01 | 95.40 | 95.12 | 96.11 | 97.21 | 86.22 | 83.59 |
| English-EWT | UDPipe | 98.90 | 77.40 | 98.90 | 93.26 | 92.75 | 94.23 | 95.45 | 80.22 | 77.03 |
| English-EWT | spaCy | 97.30 | 61.19 | 97.30 | 86.72 | 90.83 | - | 87.05 | - | - |
| French-GSD | Trankit[large] | 99.77 | 98.80 | 99.72 | 97.89 | - | 97.34 | 97.90 | 94.97 | 93.37 |
| French-GSD | Trankit[base] | 99.70 | 96.63 | 99.66 | 97.85 | - | 97.16 | 97.80 | 94.00 | 92.34 |
| French-GSD | Stanza | 99.68 | 94.92 | 99.48 | 97.30 | - | 96.72 | 97.64 | 91.38 | 89.05 |
| French-GSD | UDPipe | 99.68 | 93.59 | 98.81 | 95.85 | - | 95.55 | 96.61 | 87.14 | 84.26 |
| French-GSD | spaCy | 98.34 | 77.30 | 94.15 | 86.82 | - | - | 87.29 | 67.46 | 60.60 |
| Spanish-Ancora | Trankit[large] | 99.91 | 99.39 | 99.91 | 99.06 | 99.00 | 98.85 | 99.15 | 94.77 | 93.29 |
| Spanish-Ancora | Trankit[base] | 99.94 | 99.13 | 99.93 | 99.02 | 98.94 | 98.80 | 99.17 | 94.11 | 92.41 |
| Spanish-Ancora | Stanza | 99.98 | 99.07 | 99.98 | 98.78 | 98.67 | 98.59 | 99.19 | 92.21 | 90.01 |
| Spanish-Ancora | UDPipe | 99.97 | 98.32 | 99.95 | 98.32 | 98.13 | 98.13 | 98.48 | 88.22 | 85.10 |
| Spanish-Ancora | spaCy | 99.47 | 97.59 | 98.95 | 94.04 | - | - | 79.63 | 86.63 | 84.13 |

As can be seen from the table, both the large and the base version of Trankit outperform the other toolkits across tasks (e.g., POS and morphological tagging), with substantial and significant improvements for sentence segmentation and dependency parsing. For example, English enjoys a 9.36% improvement in sentence segmentation, and improvements of 5.07% and 5.81% in UAS and LAS for dependency parsing. For Arabic, Trankit achieves a remarkable improvement of 16.36% for sentence segmentation, while Chinese observes 14.50% and 15.00% improvements in UAS and LAS for dependency parsing.

Next, we show the detailed performance comparison of Trankit large (using XLM-Roberta large), Trankit base (using XLM-Roberta base), and Stanza v1.1.1 on 90 Universal Dependencies v2.5 treebanks. Over all 90 treebanks, both versions of Trankit outperform the previous state-of-the-art framework Stanza in most of the tasks, particularly for sentence segmentation (+4.28%), POS tagging (+2.00% for UPOS and +2.14% for XPOS), morphological tagging (+2.18%), and dependency parsing (+5.61% for UAS and +6.81% for LAS) while maintaining the competitive performance on tokenization, multi-word token expansion, and lemmatization.

| Treebank | System | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall performance | Trankit[large] | 99.32 | 92.86 | 99.15 | 96.22 | 94.64 | 93.93 | 91.19 | 94.47 | 88.67 | 85.49 | 82.97 | 76.17 | 78.45 |
| Overall performance | Trankit[base] | 99.23 | 91.82 | 99.02 | 95.65 | 94.05 | 93.21 | 90.29 | 94.27 | 87.06 | 83.69 | 80.88 | 73.57 | 76.53 |
| Overall performance | Stanza | 99.26 | 88.58 | 98.90 | 94.21 | 92.50 | 91.75 | 88.62 | 94.15 | 83.06 | 78.68 | 74.65 | 67.83 | 71.28 |
| Overall performance | Trankit[large] - Trankit[base] | 0.09 | 1.04 | 0.13 | 0.56 | 0.59 | 0.72 | 0.91 | 0.20 | 1.61 | 1.80 | 2.10 | 2.60 | 1.92 |
| Overall performance | Trankit[large] - Stanza | 0.07 | 4.28 | 0.26 | 2.00 | 2.14 | 2.18 | 2.57 | 0.32 | 5.61 | 6.81 | 8.32 | 8.34 | 7.17 |
| Afrikaans-AfriBooms | Trankit[large] | 99.84 | 100.00 | 99.84 | 98.87 | 96.63 | 98.68 | 96.54 | 97.47 | 91.61 | 89.35 | 85.30 | 82.64 | 81.65 |
| Afrikaans-AfriBooms | Trankit[base] | 99.72 | 100.00 | 99.72 | 98.55 | 96.10 | 98.26 | 95.89 | 97.39 | 91.03 | 88.79 | 84.46 | 81.31 | 80.91 |
| Afrikaans-AfriBooms | Stanza | 99.75 | 99.65 | 99.75 | 97.56 | 94.27 | 97.03 | 94.24 | 97.48 | 87.51 | 84.45 | 78.58 | 74.70 | 75.39 |
| Ancient_Greek-Perseus | Trankit[large] | 99.71 | 98.70 | 99.71 | 93.97 | 87.25 | 91.66 | 86.88 | 88.52 | 83.48 | 78.56 | 73.79 | 60.72 | 61.97 |
| Ancient_Greek-Perseus | Trankit[base] | 99.66 | 98.70 | 99.66 | 92.72 | 84.89 | 89.95 | 84.30 | 88.18 | 80.95 | 75.57 | 70.26 | 55.75 | 58.84 |
| Ancient_Greek-Perseus | Stanza | 99.98 | 98.85 | 99.98 | 92.54 | 85.22 | 91.06 | 84.98 | 88.26 | 78.75 | 73.35 | 67.88 | 54.22 | 57.54 |
| Ancient_Greek-PROIEL | Trankit[large] | 99.91 | 67.60 | 99.91 | 97.86 | 97.93 | 93.03 | 92.02 | 97.50 | 85.63 | 82.31 | 78.16 | 68.27 | 75.76 |
| Ancient_Greek-PROIEL | Trankit[base] | 99.83 | 58.62 | 99.83 | 97.30 | 97.37 | 91.59 | 90.33 | 97.31 | 83.21 | 79.68 | 74.96 | 64.13 | 72.80 |
| Ancient_Greek-PROIEL | Stanza | 100.00 | 51.65 | 100.00 | 97.38 | 97.75 | 92.09 | 90.96 | 97.42 | 80.34 | 76.33 | 71.37 | 61.23 | 69.23 |
| Arabic-PADT | Trankit[large] | 99.95 | 96.79 | 99.39 | 96.80 | 94.93 | 94.99 | 94.61 | 94.92 | 89.74 | 85.98 | 83.92 | 78.47 | 79.96 |
| Arabic-PADT | Trankit[base] | 99.93 | 96.59 | 99.22 | 96.31 | 94.08 | 94.28 | 93.69 | 94.65 | 88.39 | 84.68 | 82.35 | 76.46 | 78.46 |
| Arabic-PADT | Stanza | 99.98 | 80.43 | 97.88 | 94.89 | 91.75 | 91.86 | 91.51 | 93.27 | 83.27 | 79.33 | 76.24 | 70.58 | 72.79 |
| Armenian-ArmTDP | Trankit[large] | 98.52 | 99.46 | 97.26 | 95.03 | 97.26 | 90.42 | 89.56 | 93.45 | 86.47 | 83.01 | 80.91 | 72.45 | 77.07 |
| Armenian-ArmTDP | Trankit[base] | 98.41 | 98.92 | 97.23 | 94.36 | 97.23 | 88.89 | 87.84 | 93.28 | 84.22 | 80.14 | 77.47 | 67.92 | 73.94 |
| Armenian-ArmTDP | Stanza | 98.96 | 95.22 | 96.58 | 92.49 | 96.58 | 88.19 | 86.94 | 92.27 | 78.18 | 72.46 | 68.50 | 60.39 | 65.88 |
| Basque-BDT | Trankit[large] | 99.89 | 100.00 | 99.89 | 96.95 | 99.89 | 93.96 | 92.30 | 96.69 | 90.22 | 87.47 | 86.54 | 78.10 | 83.21 |
| Basque-BDT | Trankit[base] | 99.81 | 99.92 | 99.81 | 96.23 | 99.81 | 92.35 | 90.40 | 96.46 | 88.24 | 85.25 | 84.29 | 74.20 | 80.92 |
| Basque-BDT | Stanza | 100.00 | 100.00 | 100.00 | 96.23 | 100.00 | 93.09 | 91.34 | 96.52 | 86.19 | 82.76 | 81.29 | 73.56 | 78.26 |
| Belarusian-HSE | Trankit[large] | 99.81 | 82.28 | 99.81 | 96.46 | 31.77 | 82.82 | 27.96 | 80.37 | 77.48 | 73.36 | 70.18 | 51.71 | 52.39 |
| Belarusian-HSE | Trankit[base] | 99.53 | 84.68 | 99.53 | 91.58 | 36.18 | 78.47 | 27.19 | 79.48 | 69.40 | 65.02 | 64.51 | 46.24 | 48.41 |
| Belarusian-HSE | Stanza | 99.38 | 78.24 | 99.38 | 91.92 | 31.34 | 77.73 | 26.31 | 79.48 | 69.28 | 63.88 | 58.49 | 41.88 | 44.05 |
| Bulgarian-BTB | Trankit[large] | 99.78 | 98.79 | 99.78 | 99.29 | 97.75 | 98.30 | 97.29 | 97.37 | 96.30 | 94.21 | 92.19 | 89.64 | 88.79 |
| Bulgarian-BTB | Trankit[base] | 99.84 | 98.25 | 99.84 | 99.18 | 97.42 | 98.10 | 96.91 | 97.36 | 95.81 | 93.47 | 91.23 | 88.28 | 87.80 |
| Bulgarian-BTB | Stanza | 99.93 | 97.27 | 99.93 | 98.68 | 96.35 | 97.59 | 95.75 | 97.29 | 93.37 | 90.21 | 86.84 | 83.71 | 83.62 |
| Catalan-AnCora | Trankit[large] | 99.94 | 99.76 | 99.94 | 99.11 | 99.07 | 98.66 | 98.24 | 98.49 | 95.15 | 93.83 | 91.41 | 89.31 | 90.27 |
| Catalan-AnCora | Trankit[base] | 99.94 | 100.00 | 99.93 | 99.02 | 98.97 | 98.57 | 98.09 | 98.46 | 94.61 | 93.01 | 90.09 | 87.89 | 88.99 |
| Catalan-AnCora | Stanza | 99.99 | 99.84 | 99.98 | 98.75 | 98.66 | 98.29 | 97.74 | 98.47 | 92.84 | 90.56 | 86.25 | 84.07 | 85.31 |
| Chinese-GSD | Trankit[large] | 97.75 | 99.40 | 97.75 | 95.13 | 94.94 | 97.28 | 94.45 | 97.75 | 87.38 | 84.82 | 83.44 | 79.82 | 83.44 |
| Chinese-GSD | Trankit[base] | 97.01 | 99.70 | 97.01 | 94.21 | 94.02 | 96.59 | 93.56 | 97.01 | 85.19 | 82.54 | 80.91 | 77.42 | 80.91 |
| Chinese-GSD | Stanza | 92.83 | 98.80 | 92.83 | 89.12 | 88.93 | 92.11 | 88.18 | 92.83 | 72.88 | 69.82 | 66.81 | 63.26 | 66.81 |
| Classical_Chinese-Kyoto | Trankit[large] | 99.70 | 70.58 | 99.70 | 92.89 | 91.90 | 94.43 | 89.98 | 99.67 | 82.03 | 76.82 | 75.60 | 73.16 | 75.60 |
| Classical_Chinese-Kyoto | Trankit[base] | 99.63 | 61.82 | 99.63 | 92.07 | 91.03 | 93.88 | 88.92 | 99.60 | 78.79 | 73.23 | 72.18 | 69.27 | 72.18 |
| Classical_Chinese-Kyoto | Stanza | 99.47 | 46.95 | 99.47 | 90.25 | 89.64 | 92.68 | 87.34 | 99.45 | 71.81 | 66.08 | 64.54 | 62.61 | 64.54 |
| Croatian-SET | Trankit[large] | 99.93 | 99.08 | 99.93 | 98.58 | 96.55 | 96.85 | 95.99 | 96.71 | 93.86 | 89.74 | 87.77 | 82.19 | 83.56 |
| Croatian-SET | Trankit[base] | 99.92 | 99.16 | 99.92 | 98.38 | 96.08 | 96.52 | 95.44 | 96.60 | 93.34 | 89.36 | 87.16 | 81.12 | 82.91 |
| Croatian-SET | Stanza | 99.96 | 98.15 | 99.96 | 97.88 | 94.86 | 95.32 | 94.22 | 96.67 | 90.27 | 85.56 | 82.43 | 76.37 | 78.78 |
| Czech-CAC | Trankit[large] | 99.99 | 100.00 | 99.98 | 99.64 | 98.17 | 98.01 | 97.65 | 98.30 | 95.48 | 93.97 | 93.06 | 90.13 | 90.92 |
| Czech-CAC | Trankit[base] | 99.96 | 100.00 | 99.95 | 99.42 | 97.40 | 97.12 | 96.68 | 98.13 | 94.97 | 93.27 | 92.08 | 88.13 | 89.85 |
| Czech-CAC | Stanza | 99.99 | 100.00 | 99.97 | 98.76 | 94.79 | 93.52 | 92.65 | 98.00 | 91.70 | 89.19 | 86.84 | 80.14 | 84.89 |
| Czech-CLTT | Trankit[large] | 99.89 | 98.89 | 99.85 | 99.17 | 93.80 | 94.02 | 93.53 | 97.81 | 92.53 | 90.76 | 88.90 | 81.07 | 86.35 |
| Czech-CLTT | Trankit[base] | 99.82 | 100.00 | 99.76 | 98.93 | 93.36 | 93.68 | 93.21 | 97.66 | 90.20 | 88.01 | 85.13 | 77.66 | 82.80 |
| Czech-CLTT | Stanza | 99.93 | 100.00 | 99.84 | 98.92 | 91.89 | 91.97 | 91.28 | 97.48 | 86.67 | 83.38 | 79.35 | 70.70 | 77.56 |
| Czech-FicTree | Trankit[large] | 99.98 | 99.50 | 99.98 | 99.11 | 97.26 | 97.99 | 97.06 | 98.64 | 96.50 | 94.85 | 93.44 | 89.79 | 91.46 |
| Czech-FicTree | Trankit[base] | 99.97 | 99.38 | 99.97 | 98.94 | 96.47 | 97.09 | 96.12 | 98.61 | 95.85 | 93.86 | 92.10 | 87.13 | 90.16 |
| Czech-FicTree | Stanza | 99.97 | 98.60 | 99.96 | 98.31 | 95.23 | 96.01 | 94.58 | 98.43 | 92.69 | 89.81 | 87.30 | 81.94 | 85.42 |
| Czech-PDT | Trankit[large] | 99.95 | 97.87 | 99.95 | 99.32 | 98.19 | 98.19 | 97.77 | 98.54 | 95.24 | 93.65 | 92.79 | 90.18 | 91.01 |
| Czech-PDT | Trankit[base] | 99.94 | 97.85 | 99.94 | 99.23 | 97.81 | 97.77 | 97.34 | 98.49 | 94.81 | 93.18 | 92.09 | 89.11 | 90.33 |
| Czech-PDT | Stanza | 99.97 | 94.14 | 99.97 | 98.50 | 95.38 | 94.61 | 93.67 | 98.55 | 91.00 | 88.64 | 86.91 | 81.12 | 85.45 |
| Danish-DDT | Trankit[large] | 99.81 | 95.46 | 99.81 | 98.71 | 99.81 | 98.33 | 97.87 | 97.35 | 91.75 | 90.33 | 88.81 | 85.66 | 85.42 |
| Danish-DDT | Trankit[base] | 99.79 | 95.19 | 99.79 | 98.35 | 99.79 | 97.79 | 97.19 | 97.18 | 90.41 | 88.78 | 86.94 | 82.58 | 83.38 |
| Danish-DDT | Stanza | 99.96 | 93.57 | 99.96 | 97.75 | 99.96 | 97.38 | 96.45 | 97.32 | 86.83 | 84.19 | 81.20 | 77.13 | 78.46 |
| Dutch-Alpino | Trankit[large] | 99.43 | 90.65 | 99.43 | 96.67 | 95.01 | 96.55 | 94.75 | 96.39 | 94.41 | 92.49 | 89.56 | 84.22 | 85.29 |
| Dutch-Alpino | Trankit[base] | 99.22 | 89.88 | 99.22 | 96.55 | 94.92 | 96.22 | 94.56 | 96.23 | 93.28 | 91.28 | 87.88 | 82.58 | 83.86 |
| Dutch-Alpino | Stanza | 99.96 | 89.98 | 99.96 | 96.33 | 94.76 | 96.28 | 94.13 | 96.97 | 89.56 | 86.44 | 81.22 | 75.76 | 77.80 |
| Dutch-LassySmall | Trankit[large] | 99.36 | 92.60 | 99.36 | 96.52 | 95.57 | 96.63 | 94.99 | 97.37 | 92.25 | 89.68 | 86.51 | 82.56 | 84.19 |
| Dutch-LassySmall | Trankit[base] | 99.21 | 91.09 | 99.21 | 96.20 | 95.18 | 96.25 | 94.53 | 97.19 | 91.09 | 88.18 | 84.63 | 80.47 | 82.44 |
| Dutch-LassySmall | Stanza | 99.90 | 77.95 | 99.90 | 95.97 | 94.87 | 96.22 | 94.05 | 97.59 | 85.34 | 81.93 | 75.54 | 71.98 | 73.49 |
| English-EWT | Trankit[large] | 98.67 | 90.49 | 98.67 | 96.65 | 96.47 | 96.75 | 95.49 | 97.25 | 91.29 | 89.40 | 87.41 | 83.45 | 85.73 |
| English-EWT | Trankit[base] | 98.48 | 88.35 | 98.48 | 95.95 | 95.71 | 96.26 | 94.62 | 96.84 | 90.14 | 87.96 | 85.75 | 81.02 | 83.83 |
| English-EWT | Stanza | 99.01 | 81.13 | 99.01 | 95.40 | 95.12 | 96.11 | 93.90 | 97.21 | 86.22 | 83.59 | 80.21 | 76.02 | 78.50 |
| English-GUM | Trankit[large] | 99.52 | 91.60 | 99.52 | 96.66 | 96.51 | 97.47 | 95.77 | 96.63 | 91.61 | 89.09 | 85.58 | 81.29 | 81.80 |
| English-GUM | Trankit[base] | 99.45 | 91.63 | 99.45 | 96.39 | 96.24 | 97.19 | 95.46 | 96.55 | 91.04 | 88.43 | 84.80 | 80.19 | 80.81 |
| English-GUM | Stanza | 99.82 | 86.35 | 99.82 | 95.89 | 95.91 | 96.87 | 94.99 | 96.80 | 87.06 | 83.57 | 78.42 | 74.68 | 74.97 |
| English-LinES | Trankit[large] | 99.46 | 91.87 | 99.46 | 97.31 | 95.91 | 96.96 | 93.91 | 97.98 | 89.99 | 87.00 | 85.67 | 80.34 | 83.77 |
| English-LinES | Trankit[base] | 99.53 | 93.01 | 99.53 | 97.14 | 95.54 | 96.67 | 93.17 | 98.01 | 89.39 | 86.32 | 84.75 | 78.96 | 82.78 |
| English-LinES | Stanza | 99.95 | 88.49 | 99.95 | 96.88 | 95.18 | 96.76 | 93.11 | 98.32 | 85.82 | 81.97 | 79.04 | 74.47 | 77.31 |
| English-ParTUT | Trankit[large] | 99.71 | 100.00 | 99.65 | 96.86 | 96.65 | 95.77 | 94.63 | 97.71 | 93.51 | 91.10 | 87.37 | 81.25 | 85.06 |
| English-ParTUT | Trankit[base] | 99.66 | 100.00 | 99.60 | 96.79 | 96.55 | 95.94 | 94.67 | 97.64 | 93.15 | 90.95 | 87.21 | 81.37 | 84.96 |
| English-ParTUT | Stanza | 99.68 | 100.00 | 99.59 | 96.15 | 95.83 | 95.21 | 93.92 | 97.45 | 90.31 | 87.35 | 82.56 | 76.19 | 80.53 |
| Estonian-EDT | Trankit[large] | 99.75 | 96.58 | 99.75 | 97.87 | 98.35 | 97.10 | 96.04 | 96.09 | 91.71 | 89.52 | 88.57 | 84.68 | 84.10 |
| Estonian-EDT | Trankit[base] | 99.72 | 96.55 | 99.72 | 97.53 | 98.13 | 96.56 | 95.37 | 95.98 | 90.65 | 88.31 | 87.15 | | |
82.82

82.81

Stanza

99.96

93.32

99.96

97.19

98.04

95.77

94.43

96.05

86.68

83.82

82.41

77.63

78.32

Estonian-EWT

Trankit[large]

97.76

82.58

97.76

94.26

94.93

91.49

89.91

85.71

82.18

78.49

76.41

69.32

64.00

Trankit[base]

96.96

83.72

96.96

92.07

93.16

89.17

86.89

84.65

78.21

73.79

71.59

62.91

59.90

Stanza

99.20

67.14

99.20

88.86

91.70

87.16

83.43

85.62

67.23

60.07

56.21

48.32

47.38

Finnish-FTB

Trankit[large]

99.84

97.36

99.83

98.32

97.29

98.09

96.87

96.94

95.84

94.53

93.54

90.68

90.39

Trankit[base]

99.75

95.83

99.74

97.46

96.23

97.22

95.61

96.58

94.17

92.43

90.84

87.09

87.79

Stanza

100.00

89.59

99.97

95.50

95.12

96.51

93.92

96.16

89.09

86.39

83.80

79.90

81.02

Finnish-TDT

Trankit[large]

99.71

97.22

99.72

98.48

98.78

96.84

96.33

95.59

94.98

93.77

92.92

88.97

88.02

Trankit[base]

99.62

95.98

99.62

97.99

98.44

96.52

95.76

95.39

93.47

91.94

90.78

86.55

86.00

Stanza

99.77

93.05

99.73

96.97

97.72

95.36

94.44

94.98

88.62

86.18

84.66

79.73

80.24

French-GSD

Trankit[large]

99.77

98.80

99.72

97.89

99.72

97.34

96.66

97.90

94.97

93.37

90.17

86.35

87.73

Trankit[base]

99.70

96.63

99.66

97.85

99.66

97.16

96.60

97.80

94.00

92.34

88.66

84.76

86.08

Stanza

99.68

94.92

99.48

97.30

99.47

96.72

96.05

97.64

91.38

89.05

84.38

80.30

82.40

French-ParTUT

Trankit[large]

99.76

98.63

99.65

97.66

97.35

94.55

93.82

96.09

95.05

93.32

90.65

81.49

84.88

Trankit[base]

99.74

98.63

99.69

97.77

97.54

94.20

93.66

96.01

94.20

92.67

89.26

78.71

83.56

Stanza

99.82

100.00

99.37

96.60

96.37

93.98

93.41

95.48

90.71

88.37

83.37

74.41

77.88

French-Sequoia

Trankit[large]

99.81

94.07

99.78

99.22

99.78

98.43

98.13

98.64

95.70

94.85

92.95

90.47

91.15

Trankit[base]

99.73

94.36

99.73

98.90

99.73

97.98

97.57

98.47

94.68

93.59

91.26

88.27

89.44

Stanza

99.90

88.79

99.58

98.19

99.58

97.58

96.94

98.25

90.47

88.34

84.71

81.77

83.31

French-Spoken

Trankit[large]

99.36

53.06

99.19

96.80

96.91

99.19

94.66

96.34

85.70

81.84

76.29

73.65

73.78

Trankit[base]

99.38

39.39

99.18

96.73

96.73

99.18

94.43

96.38

82.40

78.35

71.68

69.01

69.49

Stanza

100.00

22.09

99.45

95.49

97.06

99.45

93.23

96.53

75.82

70.71

62.13

59.57

60.44

Galician-CTG

Trankit[large]

99.76

98.44

99.31

97.30

97.05

99.17

96.77

98.07

85.70

83.14

78.24

72.41

76.90

Trankit[base]

99.76

98.09

99.38

97.17

96.83

99.23

96.54

98.06

85.51

82.81

77.50

71.49

76.20

Stanza

99.89

99.13

99.32

97.21

96.99

99.14

96.71

97.94

85.22

82.66

77.24

71.13

75.96

Galician-TreeGal

Trankit[large]

99.47

95.52

99.22

97.62

95.68

96.50

94.97

91.08

87.17

83.90

80.31

74.67

68.39

Trankit[base]

99.47

94.60

99.06

97.06

94.90

95.89

94.08

90.91

85.38

81.96

77.96

71.57

66.32

Stanza

99.59

89.17

98.41

94.29

91.81

93.36

90.88

94.39

78.04

72.94

65.61

59.06

61.49

German-GSD

Trankit[large]

99.71

89.72

99.72

95.23

97.68

91.68

87.21

96.58

89.01

85.20

81.49

65.82

77.20

Trankit[base]

99.75

92.72

99.75

95.04

97.57

91.51

86.86

96.60

88.73

84.77

80.78

64.76

76.58

Stanza

99.53

85.79

99.53

94.07

96.98

89.52

84.51

96.37

85.39

80.61

75.38

58.57

71.24

German-HDT

Trankit[large]

99.92

99.67

99.92

98.44

98.41

94.05

93.70

97.36

97.63

96.86

95.14

85.73

91.67

Trankit[base]

99.90

99.50

99.90

98.42

98.37

93.95

93.52

97.35

97.38

96.51

94.63

85.02

91.18

Stanza

100.00

97.41

100.00

98.04

97.94

91.77

91.34

97.48

94.91

92.59

88.73

77.26

85.63

Greek-GDT

Trankit[large]

99.85

93.50

99.85

98.41

98.41

96.34

95.84

96.73

95.25

93.87

91.74

85.40

86.54

Trankit[base]

99.75

93.57

99.75

98.04

98.04

95.41

94.73

96.55

94.16

92.80

89.84

82.39

84.83

Stanza

99.88

93.18

99.89

97.84

97.84

94.94

94.33

96.49

91.12

88.78

84.12

78.00

79.48

Hebrew-HTB

Trankit[large]

99.81

99.69

96.31

94.32

94.32

93.03

92.39

93.48

88.41

86.04

82.23

74.92

78.35

Trankit[base]

99.79

100.00

96.03

93.75

93.75

91.96

91.24

93.21

87.02

84.55

80.34

72.38

76.52

Stanza

99.98

99.69

93.19

90.46

90.46

89.24

88.45

90.27

79.18

76.60

71.05

64.51

67.79

Hindi-HDTB

Trankit[large]

99.88

99.91

99.88

98.01

97.70

93.91

92.38

96.54

95.95

92.96

89.79

79.69

88.47

Trankit[base]

99.89

99.64

99.89

97.77

97.38

94.03

92.33

96.54

95.68

92.70

89.59

79.60

88.28

Stanza

100.00

99.44

100.00

97.59

97.08

94.03

92.11

96.66

94.80

91.74

88.20

78.73

87.01

Hungarian-Szeged

Trankit[large]

99.59

99.33

99.59

97.49

99.59

95.23

94.40

94.45

91.31

87.78

86.83

78.95

80.31

Trankit[base]

99.41

98.00

99.41

96.97

99.41

94.47

93.47

94.28

89.43

85.70

85.08

76.13

78.73

Stanza

99.87

97.00

99.87

96.03

99.87

93.76

92.94

94.25

83.62

78.86

77.14

69.46

71.87

Indonesian-GSD

Trankit[large]

99.89

95.54

99.89

93.39

95.06

96.11

89.22

99.53

86.33

79.81

77.99

69.08

77.60

Trankit[base]

99.86

95.37

99.86

93.57

94.18

95.67

88.65

99.49

86.55

80.28

78.64

69.42

78.26

Stanza

99.99

93.78

99.99

93.68

94.79

96.00

89.17

99.61

85.17

79.19

77.04

68.86

76.68

Irish-IDT

Trankit[large]

99.47

98.24

99.47

94.72

93.74

80.90

77.94

92.64

83.47

76.86

70.64

48.55

64.06

Trankit[base]

99.32

97.25

99.32

93.88

92.46

80.36

76.72

92.34

82.52

74.91

67.96

46.29

61.34

Stanza

99.76

95.93

99.76

93.90

92.43

78.19

75.00

91.79

82.65

74.03

66.11

42.98

59.09

Italian-ISDT

Trankit[large]

99.88

99.07

99.86

98.72

98.63

98.32

97.79

98.33

95.73

94.45

91.97

89.08

89.45

Trankit[base]

99.88

98.76

99.87

98.58

98.46

98.20

97.60

98.23

95.31

93.87

90.93

87.81

88.45

Stanza

99.91

98.76

99.76

98.01

97.91

97.72

97.11

98.10

92.79

90.84

86.43

83.60

84.23

Italian-ParTUT

Trankit[large]

99.81

100.00

99.79

98.58

98.42

98.15

97.54

97.84

96.19

94.11

90.65

87.75

87.86

Trankit[base]

99.82

100.00

99.81

98.63

98.41

98.16

97.47

97.94

95.38

93.32

89.28

86.22

86.55

Stanza

99.81

100.00

99.77

97.82

97.76

97.79

96.94

97.57

92.24

90.01

84.39

81.77

82.05

Italian-PoSTWITA

Trankit[large]

99.34

73.95

99.18

96.60

96.43

96.52

95.31

96.41

86.33

82.54

78.49

74.27

75.83

Trankit[base]

99.29

69.95

99.07

96.10

95.91

95.87

94.53

96.30

84.19

80.32

75.33

71.09

72.98

Stanza

99.71

63.70

99.46

96.19

96.04

96.28

95.01

96.70

82.67

78.27

72.20

68.55

70.35

Italian-TWITTIRO

Trankit[large]

99.15

65.72

98.89

95.47

94.90

94.09

91.98

93.12

84.73

79.86

73.94

66.46

66.58

Trankit[base]

99.22

56.00

99.01

95.31

94.74

93.83

91.68

92.96

83.44

78.30

70.79

63.25

63.81

Stanza

99.34

52.40

98.76

94.41

94.01

93.34

91.45

93.17

78.87

72.85

64.64

58.67

59.35

Italian-VIT

Trankit[large]

99.97

98.18

99.84

98.07

97.29

97.76

96.16

98.42

93.02

90.44

86.85

82.53

84.91

Trankit[base]

99.99

96.52

99.81

97.82

97.02

97.39

95.74

98.31

92.39

89.60

85.59

80.70

83.64

Stanza

99.98

94.92

99.49

97.21

96.23

96.79

94.99

98.01

89.32

85.87

80.26

76.16

78.61

Japanese-GSD

Trankit[large]

95.25

95.88

95.25

93.66

93.47

95.23

93.44

94.68

86.67

85.56

78.00

76.02

77.62

Trankit[base]

94.57

95.49

94.57

92.86

92.44

94.56

92.42

93.99

84.58

83.38

75.60

73.67

75.14

Stanza

92.67

94.57

92.67

91.16

90.84

92.66

90.84

92.02

81.20

80.16

71.39

69.85

71.01

Kazakh-KTB

Trankit[large]

95.98

81.71

95.37

77.94

77.47

63.01

55.46

50.60

47.46

37.98

36.01

19.00

12.05

Trankit[base]

94.48

90.00

93.62

75.94

75.67

62.28

54.51

49.76

46.42

36.84

34.72

18.65

11.81

Stanza

93.46

88.56

94.16

56.23

56.10

42.73

36.96

52.12

44.33

25.21

20.28

7.63

10.01

Korean-GSD

Trankit[large]

98.57

98.08

98.57

95.71

90.88

98.35

88.90

91.93

89.87

87.22

85.97

83.63

79.65

Trankit[base]

98.63

97.67

98.63

95.63

90.32

98.43

88.26

91.96

88.48

85.77

84.26

81.98

78.08

Stanza

99.88

96.65

99.88

96.18

90.14

99.66

88.00

92.69

87.29

83.53

81.34

79.29

75.31

Korean-Kaist

Trankit[large]

98.70

99.87

98.70

95.13

88.07

98.70

88.07

92.36

90.00

88.22

86.37

83.56

80.16

Trankit[base]

98.79

99.14

98.79

94.99

87.62

98.79

87.62

92.44

88.72

86.96

84.99

81.84

78.90

Stanza

100.00

99.93

100.00

95.45

86.31

100.00

86.31

93.02

88.41

86.38

83.95

80.63

77.57

Kurmanji-MG

Trankit[large]

94.95

91.50

94.63

75.07

74.16

57.15

52.27

57.63

37.12

29.89

25.57

9.04

10.16

Trankit[base]

94.52

80.56

94.20

74.33

73.44

56.54

51.38

57.61

35.65

28.58

25.35

8.88

10.76

Stanza

94.81

87.43

94.49

57.17

55.91

43.02

38.41

56.13

32.01

21.91

16.35

3.84

5.84

Latin-ITTB

Trankit[large]

100.00

94.54

100.00

98.97

97.29

97.98

96.41

99.13

93.25

91.87

90.75

87.86

90.00

Trankit[base]

100.00

94.57

100.00

98.76

96.74

97.54

95.68

99.07

92.42

90.91

89.45

86.12

88.71

Stanza

99.99

80.66

99.99

98.09

95.38

96.43

93.80

98.90

87.61

85.36

84.23

80.28

83.60

Latin-Perseus

Trankit[large]

99.60

97.93

99.60

92.84

83.33

86.79

82.33

70.34

83.50

76.76

73.58

60.70

44.41

Trankit[base]

99.45

97.87

99.45

90.15

77.12

81.12

75.64

69.95

78.01

69.58

65.24

49.58

40.23

Stanza

100.00

98.24

100.00

90.63

78.42

82.42

77.74

83.08

71.94

61.99

57.89

45.28

47.28

Latin-PROIEL

Trankit[large]

99.85

66.10

99.85

97.79

97.75

93.22

92.53

97.21

86.43

83.33

81.62

73.62

79.55

Trankit[base]

99.82

58.16

99.82

96.80

96.83

91.28

90.27

96.88

82.23

78.58

76.36

67.10

74.43

Stanza

100.00

43.04

100.00

96.92

97.10

91.24

90.32

96.78

76.55

72.37

70.06

61.28

68.19

Latvian-LVTB

Trankit[large]

99.73

98.69

99.73

97.61

91.22

95.18

90.72

95.83

93.63

91.25

89.78

82.69

85.58

Trankit[base]

99.71

99.10

99.71

97.16

90.24

94.47

89.62

95.61

92.05

89.44

87.73

79.78

83.52

Stanza

99.82

99.01

99.82

96.03

88.25

93.46

87.73

95.55

87.84

84.44

82.16

73.91

78.25

Lithuanian-ALKSNIS

Trankit[large]

99.84

95.72

99.84

97.45

93.98

94.46

93.30

94.30

90.48

87.67

86.66

79.86

80.20

Trankit[base]

99.82

95.10

99.82

97.03

92.35

93.00

91.54

94.05

88.30

84.96

83.59

75.11

77.35

Stanza

99.87

88.79

99.87

93.37

85.67

87.84

84.84

92.51

78.54

73.11

70.66

60.81

65.53

Lithuanian-HSE

Trankit[large]

97.71

100.00

97.71

90.59

89.85

79.64

75.90

80.02

71.41

62.05

59.15

41.13

44.82

Trankit[base]

98.22

94.55

98.22

90.46

89.71

77.92

74.18

80.07

66.70

58.47

55.18

36.60

40.03

Stanza

97.53

51.11

97.53

81.08

80.04

70.72

66.44

76.90

48.10

37.45

32.37

21.10

24.86

Marathi-UFAL

Trankit[large]

99.20

69.31

97.22

87.79

97.22

70.62

67.47

81.50

72.79

63.36

59.67

36.63

46.50

Trankit[base]

99.20

60.87

95.25

82.83

95.25

69.43

66.02

79.17

60.90

54.08

52.19

28.81

40.50

Stanza

98.00

76.40

92.25

77.24

92.25

60.27

58.55

75.77

66.42

52.64

42.80

24.15

33.90

Norwegian_Nynorsk-Nynorsk

Trankit[large]

99.84

98.97

99.84

98.52

99.84

97.79

97.13

98.01

95.23

93.82

92.35

89.01

89.79

Trankit[base]

99.81

98.71

99.81

98.20

99.81

97.20

96.48

97.89

94.15

92.58

90.70

86.61

88.15

Stanza

99.97

94.85

99.97

97.92

99.97

96.88

96.03

97.90

91.87

89.73

87.28

82.86

84.78

Norwegian_Nynorsk-NynorskLIA

Trankit[large]

99.76

99.53

99.76

96.48

99.76

95.59

93.57

97.49

81.96

77.85

73.57

67.19

71.00

Trankit[base]

99.74

99.53

99.74

96.31

99.74

95.41

93.29

97.50

80.86

76.44

71.96

65.82

69.71

Stanza

100.00

99.69

100.00

95.92

100.00

94.82

92.70

97.72

77.82

72.94

67.56

61.32

65.54

Norwegian-Bokmaal

Trankit[large]

99.88

98.89

99.88

98.85

99.88

98.07

97.61

98.40

95.54

94.33

92.82

90.15

90.84

Trankit[base]

99.88

99.20

99.88

98.66

99.88

97.60

97.02

98.34

94.78

93.47

91.77

88.29

89.72

Stanza

99.99

97.17

99.99

98.29

99.99

97.17

96.41

98.36

92.57

90.69

88.32

84.41

86.33

Old_French-SRCMF

Trankit[large]

99.91

100.00

99.91

96.96

96.83

98.32

96.45

99.91

94.30

90.75

88.69

85.43

88.69

Trankit[base]

99.84

100.00

99.84

96.36

96.21

97.75

95.72

99.84

92.82

88.76

86.12

82.63

86.12

Stanza

100.00

100.00

100.00

96.05

96.09

97.74

95.56

100.00

91.38

86.35

83.39

80.05

83.39

Old_Russian-TOROT

Trankit[large]

98.87

51.91

98.87

94.70

94.63

89.61

88.02

90.87

78.64

74.60

71.82

63.30

66.49

Trankit[base]

98.44

42.22

98.44

92.63

92.66

86.75

84.52

90.00

74.14

68.92

65.57

55.81

60.56

Stanza

100.00

35.69

100.00

93.63

93.83

86.76

84.80

91.35

72.94

67.00

63.60

54.13

59.18

Persian-Seraji

Trankit[large]

99.26

99.25

99.20

97.78

97.67

97.70

97.35

97.35

92.24

89.58

86.86

84.97

84.90

Trankit[base]

99.22

99.25

99.11

97.35

97.24

97.36

96.90

97.29

91.38

88.68

85.92

83.86

84.08

Stanza

100.00

99.25

99.65

97.29

97.30

97.37

96.86

97.73

89.45

86.06

82.78

81.00

81.08

Polish-LFG

Trankit[large]

98.34

99.57

98.34

97.84

95.52

96.00

95.05

95.48

93.90

93.04

92.58

89.18

88.61

Trankit[base]

98.32

99.91

98.32

97.66

94.59

95.05

94.00

95.37

93.31

92.17

91.43

86.88

87.55

Stanza

99.95

99.83

99.95

98.55

94.66

95.84

94.07

96.86

95.80

93.94

92.35

87.62

88.64

Polish-PDB

Trankit[large]

99.93

98.71

99.92

99.16

96.92

97.11

96.54

97.56

96.43

94.88

93.77

89.78

90.53

Trankit[base]

99.91

98.53

99.89

99.06

96.29

96.44

95.77

97.52

95.52

93.86

92.50

87.67

89.34

Stanza

99.87

98.39

99.83

98.31

94.04

94.27

93.13

97.29

92.68

90.40

88.35

81.69

85.42

Portuguese-Bosque

Trankit[large]

99.75

97.18

99.67

97.52

99.67

96.50

95.17

97.97

93.31

90.91

87.75

81.09

85.36

Trankit[base]

99.70

97.48

99.65

97.27

99.65

96.50

94.95

97.89

92.76

90.25

86.96

80.03

84.52

Stanza

99.77

94.30

99.67

97.04

99.67

96.36

94.91

97.80

90.67

87.57

82.59

76.78

80.30

Portuguese-GSD

Trankit[large]

99.81

97.10

99.72

98.43

98.43

99.61

98.41

99.20

95.35

94.37

92.23

90.46

91.47

Trankit[base]

99.82

96.76

99.71

98.30

98.30

99.61

98.28

99.19

94.92

93.95

91.65

89.58

90.89

Stanza

99.96

98.00

99.87

98.18

98.18

99.79

98.17

95.83

92.83

91.36

87.44

85.87

86.75

Romanian-Nonstandard

Trankit[large]

98.74

98.00

98.74

96.23

91.64

90.51

89.07

94.66

90.68

86.89

83.13

70.10

78.50

Trankit[base]

98.68

98.57

98.68

96.04

91.48

90.33

88.89

94.57

90.14

86.40

82.40

69.46

77.93

Stanza

98.96

97.53

98.96

95.40

90.73

89.79

88.19

94.63

87.24

82.71

77.60

65.24

73.52

Romanian-RRT

Trankit[large]

99.60

98.49

99.60

97.90

97.35

97.43

97.11

97.98

93.60

89.50

86.61

82.78

84.60

Trankit[base]

99.72

97.67

99.72

97.87

97.25

97.44

97.01

98.05

93.14

89.04

85.93

82.02

84.01

Stanza

99.77

96.64

99.77

97.54

96.97

97.13

96.75

97.95

90.66

85.85

81.49

77.94

79.84

Russian-GSD

Trankit[large]

99.79

99.25

99.79

98.25

97.89

95.78

94.73

95.75

92.70

89.75

88.71

82.92

83.66

Trankit[base]

99.63

98.25

99.63

97.96

97.65

94.86

93.83

95.50

91.86

88.62

87.41

80.83

82.36

Stanza

99.65

97.16

99.65

97.38

97.18

93.11

92.22

95.34

88.97

84.83

82.37

75.16

77.75

Russian-SynTagRus

Trankit[large]

99.71

99.45

99.71

99.06

99.71

98.20

97.99

97.99

95.66

94.65

93.78

91.68

91.51

Trankit[base]

99.71

99.14

99.71

98.94

99.71

97.85

97.59

97.89

95.19

94.08

93.13

90.59

90.77

Stanza

99.57

98.86

99.57

98.20

99.57

95.91

95.59

97.51

92.38

90.60

89.01

85.04

86.78

Russian-Taiga

Trankit[large]

98.90

92.32

98.90

96.16

96.62

91.02

87.69

91.90

85.25

81.73

80.06

68.48

71.59

Trankit[base]

98.77

92.60

98.77

95.50

97.27

89.42

86.58

91.46

83.08

79.15

76.91

64.25

68.53

Stanza

97.11

85.79

97.11

92.25

94.70

85.76

82.61

89.28

72.09

66.00

61.80

51.94

55.64

Scottish_Gaelic-ARCOSG

Trankit[large]

99.43

57.46

99.42

94.70

88.65

90.71

87.13

95.48

81.60

76.46

70.62

61.71

66.95

Trankit[base]

99.26

54.10

99.25

92.98

85.47

88.25

83.78

95.06

79.48

73.09

66.41

56.27

62.83

Stanza

99.48

55.35

99.47

92.50

84.89

87.99

83.93

95.51

77.90

70.81

62.63

54.00

59.74

Serbian-SET

Trankit[large]

99.91

100.00

99.91

99.06

96.22

96.40

95.84

96.90

95.57

93.13

91.92

85.76

87.71

Trankit[base]

99.91

99.71

99.91

98.97

95.82

95.96

95.32

96.90

95.24

92.94

91.53

84.84

87.46

Stanza

100.00

99.33

100.00

98.44

94.26

94.55

93.86

96.34

91.79

88.78

86.50

79.48

82.38

Simplified_Chinese-GSDSimp

Trankit[large]

97.66

98.20

97.66

94.99

94.78

97.24

94.32

97.66

87.01

84.63

83.23

79.84

83.23

Trankit[base]

96.94

99.70

96.94

94.17

93.98

96.51

93.52

96.94

84.64

81.96

80.14

76.30

80.14

Stanza

92.92

99.10

92.92

89.05

88.84

92.12

88.03

92.92

73.44

70.44

67.69

64.07

67.69

Slovak-SNK

Trankit[large]

99.94

98.49

99.94

97.90

90.82

95.22

90.20

94.79

96.97

95.60

95.04

87.64

88.39

Trankit[base]

99.93

98.07

99.93

97.80

89.02

94.00

88.38

94.66

95.72

93.97

93.19

84.33

86.63

Stanza

99.97

90.93

99.97

96.34

87.15

91.59

86.34

94.73

89.96

86.82

84.74

75.39

79.35

Slovenian-SSJ

Trankit[large]

99.97

100.00

99.97

99.24

97.83

97.99

97.60

97.55

96.91

96.06

94.88

91.78

91.37

Trankit[base]

99.93

99.81

99.93

99.03

96.70

96.97

96.38

97.49

95.94

94.99

93.53

89.09

90.12

Stanza

99.91

91.60

99.91

98.29

95.08

95.37

94.56

97.34

91.63

89.60

87.18

82.35

84.37

Slovenian-SST

Trankit[large]

99.84

33.03

99.84

95.81

91.74

91.86

89.70

88.72

74.86

71.02

66.85

59.58

56.40

Trankit[base]

99.79

31.96

99.79

94.90

90.27

90.37

87.92

88.66

71.15

66.65

61.94

54.19

52.26

Stanza

100.00

26.59

100.00

93.66

88.09

88.06

85.27

94.78

63.13

56.50

51.34

44.81

48.96

Spanish-AnCora

Trankit[large]

99.91

99.39

99.91

99.06

99.00

98.85

98.36

99.15

94.77

93.29

91.10

89.23

90.14

Trankit[base]

99.94

99.13

99.93

99.02

98.94

98.80

98.27

99.17

94.11

92.41

89.66

87.60

88.71

Stanza

99.98

99.07

99.98

98.78

98.67

98.59

97.97

99.19

92.21

90.01

86.05

84.22

85.20

Spanish-GSD

Trankit[large]

99.93

99.30

99.88

97.38

99.88

96.96

95.23

98.59

93.47

91.56

88.58

80.91

86.50

Trankit[base]

99.91

98.94

99.88

97.41

99.88

96.88

95.23

98.62

92.66

90.50

87.01

79.83

85.04

Stanza

99.96

95.97

99.87

96.69

99.87

96.40

94.44

98.44

89.61

86.73

81.22

73.96

79.19

Swedish-LinES

Trankit[large]

99.89

90.64

99.89

98.05

95.77

90.98

88.52

96.94

91.74

88.73

87.89

74.52

83.97

Trankit[base]

99.73

90.57

99.73

97.60

95.23

90.50

87.93

96.72

90.45

87.36

86.11

72.78

82.24

Stanza

99.94

86.99

99.94

96.97

94.58

90.11

87.33

96.79

87.10

83.06

80.76

67.97

77.44

Swedish-Talbanken

Trankit[large]

99.91

99.26

99.91

99.06

98.09

98.04

97.43

97.87

94.41

92.97

92.08

88.67

89.28

Trankit[base]

99.87

99.38

99.87

98.76

97.77

97.73

97.03

97.82

93.61

91.87

90.72

86.97

88.03

Stanza

99.97

98.85

99.97

97.65

96.57

96.70

95.63

97.51

88.96

85.91

83.59

79.17

80.78

Tamil-TTB

Trankit[large]

98.33

100.00

94.44

87.64

83.92

87.19

82.56

88.85

72.34

67.66

66.19

58.16

61.43

Trankit[base]

98.02

100.00

93.64

86.18

82.09

86.43

80.27

88.09

68.37

63.67

61.78

52.73

57.32

Stanza

99.58

95.08

91.42

82.60

78.80

81.89

78.10

85.14

61.23

55.76

53.43

46.40

49.61

Telugu-MTG

Trankit[large]

98.89

98.62

98.89

94.31

94.31

98.33

94.31

98.89

93.19

87.08

84.46

80.65

84.46

Trankit[base]

98.89

98.62

98.89

94.32

94.32

97.92

94.32

98.89

91.97

84.35

81.10

78.44

81.10

Stanza

100.00

97.95

100.00

92.93

92.93

99.17

92.93

100.00

89.32

79.89

74.88

71.25

74.88

Turkish-IMST

Trankit[large]

99.84

98.32

98.91

96.20

95.33

92.78

91.19

96.21

78.94

73.14

71.15

63.81

69.11

Trankit[base]

99.86

98.18

98.68

95.15

94.35

92.02

89.94

95.80

76.59

70.75

68.28

60.61

66.24

Stanza

99.89

97.62

98.07

94.21

93.43

92.08

90.27

94.92

70.78

64.50

61.62

56.04

59.60

Ukrainian-IU

Trankit[large]

99.77

97.55

99.76

98.50

96.32

96.09

95.35

96.99

94.63

93.16

91.78

86.22

88.03

Trankit[base]

99.78

97.72

99.76

98.33

94.96

94.94

93.86

96.98

93.44

91.69

89.89

83.20

86.33

Stanza

99.81

96.65

99.79

96.77

92.49

92.53

91.31

96.49

87.11

83.86

80.51

73.38

77.28

Urdu-UDTB

Trankit[large]

99.75

97.67

99.75

94.43

92.77

81.91

78.35

95.44

88.52

83.26

78.30

57.41

75.87

Trankit[base]

99.66

98.32

99.66

94.15

92.66

83.04

79.29

95.33

87.81

82.51

77.31

57.57

74.83

Stanza

100.00

98.88

100.00

94.42

92.62

84.21

80.36

95.62

88.30

82.78

77.06

59.48

74.75

Uyghur-UDT

Trankit[large]

97.95

88.95

97.95

88.50

91.34

86.56

79.14

94.45

80.19

70.14

66.06

51.63

62.99

Trankit[base]

97.63

88.51

97.63

87.47

90.37

85.31

77.28

94.26

78.36

68.24

63.89

48.42

60.70

Stanza

99.79

86.90

99.79

89.45

91.92

87.92

80.54

96.16

75.55

63.61

57.00

46.06

54.39

Vietnamese-VTB

Trankit[large]

94.88

96.63

94.88

89.70

88.14

94.64

88.07

94.88

71.07

65.37

63.67

60.22

63.67

Trankit[base]

95.22

96.25

95.22

89.40

87.85

95.03

87.82

95.22

70.96

64.76

62.72

58.51

62.72

Stanza

87.25

93.15

87.25

79.50

77.90

87.02

77.87

87.20

53.63

48.16

44.88

42.17

44.85

Performance for Stanza, UDPipe, and spaCy is obtained using their public pretrained models. The overall performance for Trankit and Stanza is computed as the macro-averaged F1 over 90 treebanks using the official evaluation script of the CoNLL 2018 Shared Task.
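As a minimal sketch of how such an overall score is obtained: the macro average weights every treebank equally, regardless of its size. The treebank names and scores below are illustrative stand-ins, not the actual 90-treebank results.

```python
# Sketch: the overall score is the unweighted (macro) average of
# per-treebank F1 scores produced by the CoNLL 2018 shared task
# evaluation script. The entries below are illustrative only.
def macro_f1(per_treebank_f1):
    """Unweighted mean over treebanks: every treebank counts equally."""
    return sum(per_treebank_f1.values()) / len(per_treebank_f1)

scores = {"UD_English-EWT": 85.73, "UD_Czech-PDT": 91.01, "UD_Tamil-TTB": 61.43}
print(round(macro_f1(scores), 2))  # 79.39
```

Because the average is unweighted, small treebanks such as UD_Tamil-TTB pull the overall number down just as much as large ones pull it up.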

Named Entity Recognition

Performance comparison between Trankit large, Trankit base, and Stanza v1.1.1 on the test sets of 11 public NER datasets. Performance is based on entity micro-averaged F1.

| Language | Corpus | Trankit[large] | Trankit[base] | Stanza v1.1.1 |
|---|---|---|---|---|
| Arabic | AQMAR | 76.1 | 74.8 | 74.3 |
| Chinese | OntoNotes | 80.5 | 80.0 | 79.2 |
| Dutch | CoNLL02 | 93.8 | 91.8 | 89.2 |
| | WikiNER | 95.0 | 94.8 | 94.8 |
| English | CoNLL03 | 92.5 | 92.1 | 92.1 |
| | OntoNotes | 90.2 | 89.6 | 88.8 |
| French | WikiNER | 92.9 | 92.3 | 92.9 |
| German | CoNLL03 | 85.3 | 84.6 | 81.9 |
| | GermEval14 | 89.4 | 86.9 | 85.2 |
| Russian | WikiNER | 93.2 | 92.8 | 92.9 |
| Spanish | CoNLL02 | 89.2 | 88.9 | 88.1 |
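The entity micro-averaged F1 used above pools entity counts over the whole test set before computing precision and recall. A minimal sketch, with illustrative (start, end, type) spans rather than real predictions:

```python
# Sketch of entity micro-averaged F1: gold and predicted entities are
# compared as exact (start, end, type) spans; true-positive counts are
# pooled across the entire test set, then precision/recall are computed.
def micro_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                         # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]
print(round(micro_f1(gold, pred), 2))  # tp=1, p=0.5, r=1/3 -> 0.4
```

Note that a span with the right boundaries but the wrong type, such as (5, 6, "ORG") above, counts as both a false positive and a false negative.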

Supported Languages

Trainable Languages

The table below shows the 100 languages for which users with suitable training data can train their own pipelines with Trankit.
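As a sketch of what training a custom pipeline component looks like (the file paths below are placeholders; this follows the training interface described in Trankit's documentation and requires Trankit and training data to be in place):

```python
import trankit

# Placeholder paths; replace with your own training/dev data.
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized',                 # pipeline category
        'task': 'tokenize',                       # component to train
        'save_dir': './save_dir',                 # output directory for the trained model
        'train_txt_fpath': './train.txt',         # raw training text
        'train_conllu_fpath': './train.conllu',   # CoNLL-U training annotations
        'dev_txt_fpath': './dev.txt',
        'dev_conllu_fpath': './dev.conllu',
    }
)
trainer.train()
```

Each component (tokenizer, tagger, parser, lemmatizer, NER) is trained as a separate task; repeating this with a different `task` value builds up a full customized pipeline.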

Trainable Languages

| | | | |
|---|---|---|---|
| Afrikaans | Estonian | Kyrgyz | Sindhi |
| Albanian | Filipino | Lao | Sinhala |
| Amharic | Finnish | Latin | Slovak |
| Arabic | French | Latvian | Slovenian |
| Armenian | Galician | Lithuanian | Somali |
| Assamese | Georgian | Macedonian | Spanish |
| Azerbaijani | German | Malagasy | Sundanese |
| Basque | Greek | Malay | Swahili |
| Belarusian | Gujarati | Malayalam | Swedish |
| Bengali | Hausa | Marathi | Tamil |
| Bengali | Hebrew | Mongolian | Tamil Romanized |
| Bosnian | Hindi | Nepali | Telugu |
| Breton | Hindi Romanized | Norwegian | Telugu Romanized |
| Bulgarian | Hungarian | Oriya | Thai |
| Burmese | Icelandic | Oromo | Turkish |
| Burmese | Indonesian | Pashto | Ukrainian |
| Catalan | Irish | Persian | Urdu |
| Chinese (Simplified) | Italian | Polish | Urdu Romanized |
| Chinese (Traditional) | Japanese | Portuguese | Uyghur |
| Croatian | Javanese | Punjabi | Uzbek |
| Czech | Kannada | Romanian | Vietnamese |
| Danish | Kazakh | Russian | Welsh |
| Dutch | Khmer | Sanskrit | Western Frisian |
| English | Korean | Scottish Gaelic | Xhosa |
| Esperanto | Kurdish (Kurmanji) | Serbian | Yiddish |

Pretrained Languages & Their Code Names

Trankit provides 90 pretrained pipelines for 56 languages. Each pretrained pipeline is associated with a treebank that it is trained on. Below we show the 56 pretrained languages, their corresponding treebanks, and the code names to initialize pretrained pipelines. The pretrained pipelines can be directly downloaded by clicking on their code names in the table below.

Note that the names of the default treebanks are enclosed in brackets [ ]. For example, English has four treebanks: UD_English-EWT, UD_English-GUM, UD_English-LinES, and UD_English-ParTUT. UD_English-EWT appears in brackets, so it is the default treebank for English. From the table below, select the appropriate code name and follow the instructions here to initialize a pipeline. For example, to initialize a pipeline trained on the default treebank UD_English-EWT, use the code name english; to initialize a pipeline trained on a non-default treebank such as UD_English-GUM, use english-gum.
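A minimal sketch of initializing pipelines by code name (the first call downloads the pretrained model files, so this requires Trankit installed and network access):

```python
from trankit import Pipeline

# 'english' loads the pipeline trained on the default treebank,
# UD_English-EWT; 'english-gum' loads the non-default UD_English-GUM one.
p = Pipeline('english')
p_gum = Pipeline('english-gum')

# Process raw text with the full pipeline.
doc = p('Trankit is a light-weight transformer-based toolkit.')
```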

Language

Treebank

Code Name (for pipeline initialization)

Requires MWT expansion?

Treebank License

Treebank Documentation

Afrikaans

[UD_Afrikaans-AfriBooms]

afrikaans

https://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/af_afribooms/index.html

Ancient Greek

[UD_Ancient_Greek-PROIEL]

ancient-greek

http://creativecommons.org/licenses/by-nc-sa/3.0/

http://creativecommons.org/licenses/by-nc-sa/3.0/

UD_Ancient_Greek-Perseus

ancient-greek-perseus

http://creativecommons.org/licenses/by-nc-sa/2.5/

https://universaldependencies.org/treebanks/grc_perseus/index.html

Arabic

[UD_Arabic-PADT]

arabic

Yes

http://creativecommons.org/licenses/by-nc-sa/3.0/

https://universaldependencies.org/treebanks/ar_padt/index.html

Armenian

[UD_Armenian-ArmTDP]

armenian

Yes

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/hy_armtdp/index.html

Basque

[UD_Basque-BDT]

basque

http://creativecommons.org/licenses/by-nc-sa/3.0/

https://universaldependencies.org/treebanks/eu_bdt/index.html

Belarusian

[UD_Belarusian-HSE]

belarusian

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/be_hse/index.html

Bulgarian

[UD_Bulgarian-BTB]

bulgarian

http://creativecommons.org/licenses/by-nc-sa/3.0/

https://universaldependencies.org/treebanks/bg_btb/index.html

Catalan

[UD_Catalan-AnCora]

catalan

Yes

https://www.gnu.org/licenses/gpl-3.0.en.html

https://universaldependencies.org/treebanks/ca_ancora/index.html

Chinese (simplified)

[UD_Simplified_Chinese-GSDSimp]

chinese

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/zhs_gsdsimp/index.html

Chinese (traditional)

[UD_Chinese-GSD]

traditional-chinese

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/zh_gsd/index.html

Chinese (classical)

[UD_Classical_Chinese-Kyoto]

classical-chinese

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/lzh_kyoto/index.html

Croatian

[UD_Croatian-SET]

croatian

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/hr_set/index.html

Czech

UD_Czech-CAC

czech-cac

Yes

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/cs_cac/index.html

UD_Czech-CLTT

czech-cltt

Yes

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/cs_cltt/index.html

UD_Czech-FicTree

czech-fictree

Yes

http://creativecommons.org/licenses/by-nc-sa/4.0/

https://universaldependencies.org/treebanks/cs_fictree/index.html

[UD_Czech-PDT]

czech

Yes

http://creativecommons.org/licenses/by-nc-sa/3.0/

https://universaldependencies.org/treebanks/cs_pdt/index.html

Danish

[UD_Danish-DDT]

danish

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/da_ddt/index.html

Dutch

[UD_Dutch-Alpino]

dutch

http://creativecommons.org/licenses/by-sa/4.0/

https://universaldependencies.org/treebanks/nl_alpino/index.html

| Language | Treebank | Pipeline identifier | Requires MWT expansion? | License | Treebank documentation |
|---|---|---|---|---|---|
| Dutch | UD_Dutch-LassySmall | dutch-lassysmall | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/nl_lassysmall/index.html |
| English | [UD_English-EWT] | english | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/en_ewt/index.html |
| English | UD_English-GUM | english-gum | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/en_gum/index.html |
| English | UD_English-LinES | english-lines | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/en_lines/index.html |
| English | UD_English-ParTUT | english-partut | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/en_partut/index.html |
| Estonian | [UD_Estonian-EDT] | estonian | | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/et_edt/index.html |
| Estonian | UD_Estonian-EWT | estonian-ewt | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/et_ewt/index.html |
| Finnish | UD_Finnish-FTB | finnish-ftb | Yes | http://creativecommons.org/licenses/by/4.0/ | https://universaldependencies.org/treebanks/fi_ftb/index.html |
| Finnish | [UD_Finnish-TDT] | finnish | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/fi_tdt/index.html |
| French | [UD_French-GSD] | french | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/fr_gsd/index.html |
| French | UD_French-ParTUT | french-partut | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/fr_partut/index.html |
| French | UD_French-Sequoia | french-sequoia | Yes | http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html | https://universaldependencies.org/treebanks/fr_sequoia/index.html |
| French | UD_French-Spoken | french-spoken | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/fr_spoken/index.html |
| Galician | [UD_Galician-CTG] | galician | Yes | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/gl_ctg/index.html |
| Galician | UD_Galician-TreeGal | galician-treegal | Yes | http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html | https://universaldependencies.org/treebanks/gl_treegal/index.html |
| German | [UD_German-GSD] | german | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/de_gsd/index.html |
| German | UD_German-HDT | german-hdt | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/de_hdt/index.html |
| Greek | [UD_Greek-GDT] | greek | Yes | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/el_gdt/index.html |
| Hebrew | [UD_Hebrew-HTB] | hebrew | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/he_htb/index.html |
| Hindi | [UD_Hindi-HDTB] | hindi | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/hi_hdtb/index.html |
| Hungarian | [UD_Hungarian-Szeged] | hungarian | | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/hu_szeged/index.html |
| Indonesian | [UD_Indonesian-GSD] | indonesian | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/id_gsd/index.html |
| Irish | [UD_Irish-IDT] | irish | | http://creativecommons.org/licenses/by-sa/3.0/ | https://universaldependencies.org/treebanks/ga_idt/index.html |
| Italian | [UD_Italian-ISDT] | italian | Yes | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/it_isdt/index.html |
| Italian | UD_Italian-ParTUT | italian-partut | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/it_partut/index.html |
| Italian | UD_Italian-PoSTWITA | italian-postwita | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/it_postwita/index.html |
| Italian | UD_Italian-TWITTIRO | italian-twittiro | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/it_twittiro/index.html |
| Italian | UD_Italian-VIT | italian-vit | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/it_vit/index.html |
| Japanese | [UD_Japanese-GSD] | japanese | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ja_gsd/index.html |
| Kazakh | [UD_Kazakh-KTB] | kazakh | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/kk_ktb/index.html |
| Korean | [UD_Korean-GSD] | korean | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ko_gsd/index.html |
| Korean | UD_Korean-Kaist | korean-kaist | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ko_kaist/index.html |
| Kurmanji | [UD_Kurmanji-MG] | kurmanji | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/kmr_mg/index.html |
| Latin | [UD_Latin-ITTB] | latin | | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/la_ittb/index.html |
| Latin | UD_Latin-Perseus | latin-perseus | | http://creativecommons.org/licenses/by-nc-sa/2.5/ | https://universaldependencies.org/treebanks/la_perseus/index.html |
| Latin | UD_Latin-PROIEL | latin-proiel | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/la_proiel/index.html |
| Latvian | [UD_Latvian-LVTB] | latvian | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/lv_lvtb/index.html |
| Lithuanian | [UD_Lithuanian-ALKSNIS] | lithuanian | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/lt_alksnis/index.html |
| Lithuanian | UD_Lithuanian-HSE | lithuanian-hse | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/lt_hse/index.html |
| Marathi | [UD_Marathi-UFAL] | marathi | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/mr_ufal/index.html |
| Norwegian (Bokmaal) | [UD_Norwegian-Bokmaal] | norwegian-bokmaal | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/no_bokmaal/index.html |
| Norwegian (Nynorsk) | [UD_Norwegian_Nynorsk-Nynorsk] | norwegian-nynorsk | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/nn_nynorsk/index.html |
| Norwegian (Nynorsk) | UD_Norwegian_Nynorsk-NynorskLIA | norwegian-nynorsklia | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/nn_nynorsklia/index.html |
| Old French | [UD_Old_French-SRCMF] | old-french | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/fro_srcmf/index.html |
| Old Russian | [UD_Old_Russian-TOROT] | old-russian | | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/orv_torot/index.html |
| Persian | [UD_Persian-Seraji] | persian | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/fa_seraji/index.html |
| Polish | UD_Polish-LFG | polish-lfg | | https://www.gnu.org/licenses/gpl-3.0.en.html | https://universaldependencies.org/treebanks/pl_lfg/index.html |
| Polish | [UD_Polish-PDB] | polish | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/pl_pdb/index.html |
| Portuguese | [UD_Portuguese-Bosque] | portuguese | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/pt_bosque/index.html |
| Portuguese | UD_Portuguese-GSD | portuguese-gsd | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/pt_gsd/index.html |
| Romanian | UD_Romanian-Nonstandard | romanian-nonstandard | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ro_nonstandard/index.html |
| Romanian | [UD_Romanian-RRT] | romanian | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ro_rrt/index.html |
| Russian | UD_Russian-GSD | russian-gsd | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ru_gsd/index.html |
| Russian | [UD_Russian-SynTagRus] | russian | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/ru_syntagrus/index.html |
| Russian | UD_Russian-Taiga | russian-taiga | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ru_taiga/index.html |
| Scottish Gaelic | [UD_Scottish_Gaelic-ARCOSG] | scottish-gaelic | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/gd_arcosg/index.html |
| Serbian | [UD_Serbian-SET] | serbian | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/sr_set/index.html |
| Slovak | [UD_Slovak-SNK] | slovak | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/sk_snk/index.html |
| Slovenian | [UD_Slovenian-SSJ] | slovenian | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/sl_ssj/index.html |
| Slovenian | UD_Slovenian-SST | slovenian-sst | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/sl_sst/index.html |
| Spanish | [UD_Spanish-AnCora] | spanish | Yes | https://www.gnu.org/licenses/gpl-3.0.en.html | https://universaldependencies.org/treebanks/es_ancora/index.html |
| Spanish | UD_Spanish-GSD | spanish-gsd | Yes | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/es_gsd/index.html |
| Swedish | UD_Swedish-LinES | swedish-lines | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/sv_lines/index.html |
| Swedish | [UD_Swedish-Talbanken] | swedish | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/sv_talbanken/index.html |
| Tamil | [UD_Tamil-TTB] | tamil | Yes | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/ta_ttb/index.html |
| Telugu | [UD_Telugu-MTG] | telugu | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/te_mtg/index.html |
| Turkish | [UD_Turkish-IMST] | turkish | Yes | http://creativecommons.org/licenses/by-nc-sa/3.0/ | https://universaldependencies.org/treebanks/tr_imst/index.html |
| Ukrainian | [UD_Ukrainian-IU] | ukrainian | Yes | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/uk_iu/index.html |
| Urdu | [UD_Urdu-UDTB] | urdu | | http://creativecommons.org/licenses/by-nc-sa/4.0/ | https://universaldependencies.org/treebanks/ur_udtb/index.html |
| Uyghur | [UD_Uyghur-UDT] | uyghur | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/ug_udt/index.html |
| Vietnamese | [VLSP + UD_Vietnamese-VTB] | vietnamese | | https://vlsp.org.vn/sites/default/files/2019-06/VLSP2013%20User%20Agreement_0.pdf http://creativecommons.org/licenses/by-sa/4.0/ | https://vlsp.org.vn/resources-vlsp2013 https://universaldependencies.org/treebanks/vi_vtb/index.html |
| Vietnamese | UD_Vietnamese-VTB | vietnamese-vtb | | http://creativecommons.org/licenses/by-sa/4.0/ | https://universaldependencies.org/treebanks/vi_vtb/index.html |

Treebank names in brackets denote the treebank used by the default pipeline for each language (i.e., the pipeline initialized with the bare language name).

Sentence segmentation

NOTE: Quick examples might be helpful for using this function.

The sample code for performing sentence segmentation on a raw text is:

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

sentences = p.ssplit(doc_text)

print(sentences)

The output of the sentence segmentation module is a native Python dictionary with a list of the split sentences. For each sentence, we can access its span, which is handy for retrieving the sentence's location in the original document. The output would look like this:

{
  'text': 'Hello! This is Trankit.',
  'sentences': [
    {
      'id': 1,
      'text': 'Hello!',
      'dspan': (0, 6)
    },
    {
      'id': 2,
      'text': 'This is Trankit.',
      'dspan': (7, 23)
    }
  ]
}
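The 'dspan' field can be used to index back into the original string. Below is a minimal sketch that operates directly on the output dictionary shown above (no pipeline needed):

```python
# The sentence segmentation output shown above, as a plain dictionary.
doc_text = 'Hello! This is Trankit.'
output = {
    'text': doc_text,
    'sentences': [
        {'id': 1, 'text': 'Hello!', 'dspan': (0, 6)},
        {'id': 2, 'text': 'This is Trankit.', 'dspan': (7, 23)},
    ]
}

# Each 'dspan' is a (start, end) character offset into the original document,
# so slicing the document with it recovers the sentence text.
for sent in output['sentences']:
    start, end = sent['dspan']
    assert doc_text[start:end] == sent['text']
```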

Tokenization

NOTE: Quick examples might be helpful for using this function.

Trankit's tokenization module can work with both sentence-level and document-level inputs.

Document-level tokenization

For document inputs, Trankit first performs tokenization and sentence segmentation jointly to obtain a list of tokenized sentences. Below is how we can use this function:

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

tokenized_doc = p.tokenize(doc_text)

print(tokenized_doc)

The returned tokenized_doc should look like this:

{
  'text': 'Hello! This is Trankit.',
  'sentences': [
    {
      'id': 1,
      'text': 'Hello!', 'dspan': (0, 6),
      'tokens': [
        {
          'id': 1,
          'text': 'Hello',
          'dspan': (0, 5),
          'span': (0, 5)
        },
        {
          'id': 2,
          'text': '!',
          'dspan': (5, 6),
          'span': (5, 6)
        }
      ]
    },
    {
      'id': 2,
      'text': 'This is Trankit.', 'dspan': (7, 23),
      'tokens': [
        {
          'id': 1,
          'text': 'This',
          'dspan': (7, 11),
          'span': (0, 4)
        },
        {
          'id': 2,
          'text': 'is',
          'dspan': (12, 14),
          'span': (5, 7)
        },
        {
          'id': 3,
          'text': 'Trankit',
          'dspan': (15, 22),
          'span': (8, 15)
        },
        {
          'id': 4,
          'text': '.',
          'dspan': (22, 23),
          'span': (15, 16)
        }
      ]
    }
  ]
}

For each sentence, Trankit provides its location in the document via the 'dspan' field. For each token, there are two types of spans we can access: (i) the document-level span (via 'dspan') and (ii) the sentence-level span (via 'span'). Check this to know how these fields work.
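The relationship between the two span types can be checked directly: a token's document-level 'dspan' is its sentence-level 'span' shifted by the start offset of the enclosing sentence's 'dspan'. A small sketch using the values from the output above:

```python
# Values taken from the tokenized output above.
sent_dspan = (7, 23)  # document-level span of "This is Trankit."
token = {'text': 'This', 'dspan': (7, 11), 'span': (0, 4)}

# Shifting the sentence-level span by the sentence's start offset
# yields the document-level span of the token.
offset = sent_dspan[0]
shifted = (token['span'][0] + offset, token['span'][1] + offset)
assert shifted == token['dspan']
```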

Sentence-level tokenization

In some cases, we might already have the sentences and only want to perform tokenization on each sentence. This can be achieved by setting is_sent=True when calling .tokenize():

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string assumed to be a single sentence
sent_text = '''This is Trankit.'''

tokens = p.tokenize(sent_text, is_sent=True)

print(tokens)

This returns a dictionary containing a list of tokens. The output will look like this:

{
  'text': 'This is Trankit.',
  'tokens': [
    {
      'id': 1,
      'text': 'This',
      'span': (0, 4)
    },
    {
      'id': 2,
      'text': 'is',
      'span': (5, 7)
    },
    {
      'id': 3,
      'text': 'Trankit',
      'span': (8, 15)
    },
    {
      'id': 4,
      'text': '.',
      'span': (15, 16)
    }
  ]
}

As the input is assumed to be a sentence, we only have the sentence-level span for each token.

Multi-word token expansion

In addition to tokenization, some languages also require multi-word token expansion: a single token can be expanded into multiple syntactic words. This process is helpful for such languages when performing later tasks such as part-of-speech tagging, morphological feature tagging, dependency parsing, and lemmatization. Below is an example for French:

from trankit import Pipeline

p = Pipeline('french')

doc_text = '''Je sens qu'entre ça et les films de médecins et scientifiques fous que nous
avons déjà vus, nous pourrions emprunter un autre chemin pour l'origine. On
pourra toujours parler à propos d'Averroès de "décentrement du Sujet".'''

out_doc = p.tokenize(doc_text)
print(out_doc['sentences'][1])

For illustration purposes, we only show part of the second sentence:

{
    'text': 'Je sens qu\'entre ça et les films de médecins et scientifiques fous que nous\navons déjà vus, nous pourrions emprunter un autre chemin pour l\'origine. On\npourra toujours parler à propos d\'Averroès de "décentrement du Sujet".',
    'sentences': [
      ...
      ,
      {
        'id': 2,
        'text': 'On\npourra toujours parler à propos d\'Averroès de "décentrement du Sujet".',
        'dspan': (149, 222),
        'tokens': [
          ... 
          ,
          {
            'id': 11, 
            'text': 'décentrement',
            'dspan': (199, 211),
            'span': (50, 62)
          },
          {
            'id': (12, 13), # token index
            'text': 'du', # text form
            'expanded': [ # list of syntactic words
              {
                'id': 12, # token index
                'text': 'de' # text form
              },
              {
                'id': 13, # token index 
                'text': 'le' # text form
              }
            ],
            'span': (63, 65),
            'dspan': (212, 214)
          },
          {
            'id': 14,
            'text': 'Sujet',
            'dspan': (215, 220),
            'span': (66, 71)
          },
          {
            'id': 15,
            'text': '"',
            'dspan': (220, 221),
            'span': (71, 72)
          },
          {
            'id': 16,
            'text': '.',
            'dspan': (221, 222),
            'span': (72, 73)
          }
        ]
      }
    ]
}

Expanded tokens always have indexes that are tuples instead of the usual integers. In this example, the expanded token is the one with index (12, 13); the tuple indicates that this token is expanded into the syntactic words with indexes 12 through 13. The syntactic words are stored as a list in the 'expanded' field of the original token. Note that part-of-speech tagging, morphological feature tagging, dependency parsing, and lemmatization always operate on the syntactic words instead of the original token in such cases. That is why additional features are only added to the syntactic words, while the original token remains unchanged with only its text form and spans. By contrast, the Named Entity Recognition (NER) module only works with the original tokens instead of the syntactic words, so we will not see NER tags for the syntactic words.
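When consuming such output programmatically, it is often convenient to flatten each sentence into its syntactic words. A minimal sketch, assuming only the output format shown above (expanded tokens carry a tuple 'id' and an 'expanded' list):

```python
def syntactic_words(tokens):
    """Flatten a token list: replace each expanded token by its syntactic words."""
    words = []
    for token in tokens:
        if isinstance(token['id'], tuple):
            # Expanded multi-word token: keep its syntactic words.
            words.extend(token['expanded'])
        else:
            words.append(token)
    return words

# A fragment of the French example above.
tokens = [
    {'id': 11, 'text': 'décentrement'},
    {'id': (12, 13), 'text': 'du',
     'expanded': [{'id': 12, 'text': 'de'}, {'id': 13, 'text': 'le'}]},
    {'id': 14, 'text': 'Sujet'},
]
print([w['text'] for w in syntactic_words(tokens)])
# ['décentrement', 'de', 'le', 'Sujet']
```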

Part-of-speech, Morphological tagging and Dependency parsing

NOTE: Quick examples might be helpful for using this function.

In Trankit, part-of-speech tagging, morphological feature tagging, and dependency parsing are performed jointly. The module can work with either untokenized or pretokenized inputs, at both sentence and document level.

Document-level processing

Untokenized input

The sample code for this module is:

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

tagged_doc = p.posdep(doc_text)

Trankit first performs tokenization and sentence segmentation on the input document, then performs part-of-speech tagging, morphological feature tagging, and dependency parsing on the tokenized document. The output of the whole process is a native Python dictionary with a list of sentences; each sentence contains a list of tokens with the predicted part-of-speech tag, the morphological features, the index of the head token, and the corresponding dependency relation. The output would look like this:

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # sentence span
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This',  # text form of the token
          'upos': 'PRON',  # UPOS tag of the token
          'xpos': 'DT',    # XPOS tag of the token
          'feats': 'Number=Sing|PronType=Dem', # morphological feature of the token
          'head': 3,  # index of the head token
          'deprel': 'nsubj', # dependency relation for the token
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}
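The 'head' and 'deprel' fields together encode the dependency tree. The sketch below reads the tree out of a tagged sentence; only the first token's fields appear in the output snippet above, so the values for the other tokens are illustrative, not Trankit's actual predictions:

```python
# A tagged version of "This is Trankit." following the format above.
# Only token 1 ('This') is taken from the shown output; the rest is a
# plausible UD analysis used purely for illustration.
tokens = [
    {'id': 1, 'text': 'This',    'head': 3, 'deprel': 'nsubj'},
    {'id': 2, 'text': 'is',      'head': 3, 'deprel': 'cop'},
    {'id': 3, 'text': 'Trankit', 'head': 0, 'deprel': 'root'},
    {'id': 4, 'text': '.',       'head': 3, 'deprel': 'punct'},
]

# 'head' is a 1-based index into the token list; 0 denotes the artificial root.
edges = []
for tok in tokens:
    head = 'ROOT' if tok['head'] == 0 else tokens[tok['head'] - 1]['text']
    edges.append((tok['text'], tok['deprel'], head))
print(edges)
```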

Pretokenized input

In some cases, we might already have a tokenized document and want to use this module. Here is how we can do it:

pretokenized_doc = [
  ['Hello', '!'],
  ['This', 'is', 'Trankit', '.']
]

tagged_doc = p.posdep(pretokenized_doc)

Pretokenized inputs are automatically recognized by Trankit, so we don't have to specify any additional flag when calling .posdep(). The output in this case is the same as in the previous case, except that there is no span information.

Sentence-level processing

Sometimes we want to use this module for sentence inputs. To achieve that, we can simply set is_sent=True when we call the function .posdep():

Untokenized input

sent_text = '''This is Trankit.'''

tagged_sent = p.posdep(sent_text, is_sent=True)

Pretokenized input

pretokenized_sent = ['This', 'is', 'Trankit', '.']

tagged_sent = p.posdep(pretokenized_sent, is_sent=True)

Lemmatization

NOTE: Quick examples might be helpful for using this function.

Trankit supports lemmatization for both untokenized and pretokenized inputs, at both sentence and document level. Here are some examples:

Document-level lemmatization

In this case, the input is assumed to be a document.

Untokenized input

from trankit import Pipeline

p = Pipeline('english')

doc_text = '''Hello! This is Trankit.'''

lemmatized_doc = p.lemmatize(doc_text)

Trankit would first perform tokenization and sentence segmentation for the input document. Next, it assigns a lemma to each token in the sentences. The output would look like this:

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # sentence span
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This',
          'lemma': 'this', # lemma of the token
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}

For illustration purposes, we only show the details of the second sentence.

Pretokenized input

Pretokenized inputs are automatically recognized by Trankit. The following snippet performs lemmatization on a pretokenized document, which is a list of lists of strings:

from trankit import Pipeline

p = Pipeline('english')

pretokenized_doc = [
  ['Hello', '!'],
  ['This', 'is', 'Trankit', '.']
]

lemmatized_doc = p.lemmatize(pretokenized_doc)

The output will look slightly different without the spans of the sentences and the tokens:

{
  'sentences': [
    {
      'id': 1,
      'tokens': [
        {
          'id': 1,
          'text': 'Hello',
          'lemma': 'hello'
        },
        {
          'id': 2,
          'text': '!',
          'lemma': '!'
        }
      ]
    },
    {
      'id': 2,
      'tokens': [
        {
          'id': 1,
          'text': 'This',
          'lemma': 'this'
        },
        {
          'id': 2,
          'text': 'is',
          'lemma': 'be'
        },
        {
          'id': 3,
          'text': 'Trankit',
          'lemma': 'trankit'
        },
        {
          'id': 4,
          'text': '.',
          'lemma': '.'
        }
      ]
    }
  ]
}
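The lemmas can then be collected with simple comprehensions over the output dictionary, for example:

```python
# The lemmatized output shown above, as a plain dictionary.
lemmatized_doc = {
    'sentences': [
        {'id': 1, 'tokens': [
            {'id': 1, 'text': 'Hello', 'lemma': 'hello'},
            {'id': 2, 'text': '!', 'lemma': '!'},
        ]},
        {'id': 2, 'tokens': [
            {'id': 1, 'text': 'This', 'lemma': 'this'},
            {'id': 2, 'text': 'is', 'lemma': 'be'},
            {'id': 3, 'text': 'Trankit', 'lemma': 'trankit'},
            {'id': 4, 'text': '.', 'lemma': '.'},
        ]},
    ]
}

# One list of lemmas per sentence.
lemmas = [[tok['lemma'] for tok in sent['tokens']]
          for sent in lemmatized_doc['sentences']]
print(lemmas)  # [['hello', '!'], ['this', 'be', 'trankit', '.']]
```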

Sentence-level lemmatization

The lemmatization module also accepts inputs as sentences. This can be done by setting is_sent=True. The output would be a dictionary with a list of lemmatized tokens.

Untokenized input

from trankit import Pipeline

p = Pipeline('english')

sent_text = '''This is Trankit.'''

lemmatized_sent = p.lemmatize(sent_text, is_sent=True)

Pretokenized input

from trankit import Pipeline

p = Pipeline('english')

pretokenized_sent = ['This', 'is', 'Trankit', '.']

lemmatized_sent = p.lemmatize(pretokenized_sent, is_sent=True)

Named entity recognition

NOTE: Quick examples might be helpful for using this function.

Currently, Trankit provides a Named Entity Recognition (NER) module for 8 languages: Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The NER module accepts untokenized or pretokenized inputs, at both sentence and document level. Below are some examples:

Document-level

Untokenized input

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

tagged_doc = p.ner(doc_text)

The output would look like this:

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # sentence span
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This',
          'ner': 'O', # ner tag of the token
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}
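The 'ner' field holds BIO-style tags, which can be grouped into entity spans with a small helper. The tokens below are a made-up tagged sentence (the example sentence above contains no entities, and the exact tag inventory depends on the pipeline), so PER and LOC are illustrative:

```python
def extract_entities(tokens):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities, current = [], None
    for tok in tokens:
        tag = tok['ner']
        if tag.startswith('B-'):
            if current:
                entities.append(current)
            current = {'type': tag[2:], 'text': [tok['text']]}
        elif tag.startswith('I-') and current and current['type'] == tag[2:]:
            current['text'].append(tok['text'])
        else:  # 'O' tag, or an I- tag that does not continue the current entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(' '.join(e['text']), e['type']) for e in entities]

# Hypothetical tagged tokens, following the output format above.
tokens = [
    {'text': 'John', 'ner': 'B-PER'},
    {'text': 'Smith', 'ner': 'I-PER'},
    {'text': 'visited', 'ner': 'O'},
    {'text': 'Oregon', 'ner': 'B-LOC'},
]
print(extract_entities(tokens))  # [('John Smith', 'PER'), ('Oregon', 'LOC')]
```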

Pretokenized input

from trankit import Pipeline

p = Pipeline('english')

pretokenized_doc = [
  ['Hello', '!'],
  ['This', 'is', 'Trankit', '.']
]

tagged_doc = p.ner(pretokenized_doc)

The output will look the same as in the untokenized case, except that now we don't have the text form of the input document or the span information for the sentences and tokens.

Sentence-level

To make the NER module work with sentence-level instead of document-level inputs, we can set is_sent=True:

Untokenized input

from trankit import Pipeline

p = Pipeline('english')

sent_text = 'This is Trankit.'

tagged_sent = p.ner(sent_text, is_sent=True)

Pretokenized input

from trankit import Pipeline

p = Pipeline('english')

pretokenized_sent = ['This', 'is', 'Trankit', '.']

tagged_sent = p.ner(pretokenized_sent, is_sent=True)

Building a customized pipeline

NOTE: Please check out the list of supported languages to check if the language of your interest is already supported by Trankit.

Building customized pipelines is easy with Trankit. Training is done via the class TPipeline, and loading is done via the class Pipeline as usual. To achieve this, customized pipelines are currently organized into 4 categories:

  • customized-mwt-ner: a pipeline of this category consists of 5 models: (i) a joint token and sentence splitter, (ii) a multi-word token expander, (iii) a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing, (iv) a lemmatizer, and (v) a named entity recognizer.

  • customized-mwt: a pipeline of this category doesn’t have the (v) named entity recognizer.

  • customized-ner: a pipeline of this category doesn’t have the (ii) multi-word token expander.

  • customized: a pipeline of this category doesn't have the (ii) multi-word token expander or the (v) named entity recognizer.

The category names are used as special identities in the Pipeline class. Thus, we need to choose one of the categories to start training our customized pipeline. Below we show an example of training and loading a customized-mwt-ner pipeline. The training procedure for customized pipelines of the other categories can be obtained by omitting the steps related to the models that those pipelines don't have. Regarding data format, TPipeline accepts training data in CoNLL-U format for the Universal Dependencies tasks, and training data in BIO format for the NER task.

Training

Training a joint token and sentence splitter

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'tokenize', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_txt_fpath': './train.txt', # raw text file
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
    'dev_txt_fpath': './dev.txt', # raw text file
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a multi-word token expander

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'mwt', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format  for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'posdep', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format  for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a lemmatizer

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner',  # pipeline category
    'task': 'lemmatize', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format  for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a named entity recognizer

Training data for a named entity recognizer should be in the BIO format in which the first column contains the words and the last column contains the BIO annotations. Users can refer to this file to get the sample data in the required format.
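Concretely, each line of a BIO file holds one word and its tag, with a blank line separating sentences. The snippet below parses a small illustrative fragment (the words and tags are made up; refer to the sample file linked above for the exact layout TPipeline expects):

```python
# An illustrative BIO fragment: first column is the word, last column is
# the BIO tag; a blank line ends a sentence.
bio_data = """John B-PER
Smith I-PER
visited O
Oregon B-LOC
. O

Hello O
! O
"""

# Parse into sentences of (word, tag) pairs.
sentences, current = [], []
for line in bio_data.splitlines():
    if not line.strip():
        if current:
            sentences.append(current)
            current = []
    else:
        parts = line.split()
        current.append((parts[0], parts[-1]))
if current:
    sentences.append(current)

print(len(sentences))  # 2
```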

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner',  # pipeline category
    'task': 'ner', # task name
    'save_dir': './save_dir', # directory to save the trained model
    'train_bio_fpath': './train.bio', # training data in BIO format
    'dev_bio_fpath': './dev.bio' # development data in BIO format
    }
)

# start training
trainer.train()

A colab tutorial on how to train, evaluate, and use a custom NER model is also available here. Thanks @mrshu for contributing this to Trankit.

Loading

After the training steps, we need to verify that all the models of our customized pipeline are trained. The following code shows how to do it:

import trankit

trankit.verify_customized_pipeline(
    category='customized-mwt-ner', # pipeline category
    save_dir='./save_dir', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base`
)

If the verification succeeds, it will print out the following:

Customized pipeline is ready to use!
It can be initialized as follows:

from trankit import Pipeline
p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir')

From now on, the customized pipeline can be used as a normal pretrained pipeline.

The verification fails if some of the expected model files of the pipeline are missing. This can be solved via the handy function download_missing_files, which borrows model files from the pretrained pipelines provided by Trankit. Suppose the language of your customized pipeline is English; the function can then be used as below:

import trankit

trankit.download_missing_files(
	category='customized-ner', 
	save_dir='./save_dir', 
	embedding_name='xlm-roberta-base', 
	language='english'
)

where category is the category specified for the customized pipeline, save_dir is the directory where the customized models were saved, embedding_name is the embedding used for the customized pipeline (xlm-roberta-base by default if not specified during training), and language is the language whose pretrained models we want to borrow. For example, if we only trained a NER model for the customized pipeline, the snippet above would borrow the trained models for all the other pipeline components and print out the following message:

Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt
http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/english.zip
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47.9M/47.9M [00:00<00:00, 114MiB/s]
Copying ./save_dir/xlm-roberta-base/english/english.tokenizer.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
Copying ./save_dir/xlm-roberta-base/english/english.tagger.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
Copying ./save_dir/xlm-roberta-base/english/english.vocabs.json to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
Copying ./save_dir/xlm-roberta-base/english/english_lemmatizer.pt to ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt

After this, we can go back and run the verification step again.
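For reference, the copying that download_missing_files performs can be sketched in plain Python. This is only an illustration, not Trankit's actual code; the directory layout and file-name suffixes are taken from the messages above:

```python
import shutil
from pathlib import Path

# Model-file suffixes, taken from the "Missing ..." messages above
SUFFIXES = ['.tokenizer.mdl', '.tagger.mdl', '.vocabs.json', '_lemmatizer.pt']

def borrow_missing_files(save_dir, embedding_name, category, language, suffixes=SUFFIXES):
    """Copy pretrained model files (e.g. english.tokenizer.mdl) into the
    customized pipeline's directory under the customized names."""
    base = Path(save_dir) / embedding_name
    src_dir, dst_dir = base / language, base / category
    dst_dir.mkdir(parents=True, exist_ok=True)
    for suffix in suffixes:
        dst = dst_dir / (category + suffix)
        if not dst.exists():  # only borrow files that are actually missing
            shutil.copyfile(src_dir / (language + suffix), dst)
```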

Command-line interface

Starting from version v1.0.0, Trankit supports processing text via a command-line interface, which helps users who are not familiar with the Python programming language use Trankit more easily.

Requirements

Users need to install Trankit via one of the following methods:

Pip:

pip install trankit==1.0.1

From source:

git clone https://github.com/nlp-uoregon/trankit
cd trankit
pip install -e .

Syntax

python -m trankit [OPTIONS] --embedding xlm-roberta-base --cpu --lang english --input path/to/your/input --output_dir path/to/your/output_dir

This command does the following:

  • Forcing Trankit to run on CPU (--cpu). Without --cpu, Trankit runs on GPU if a GPU device is available.

  • Initializing an English pipeline with XLM-Roberta base as the multilingual embedding (--embedding xlm-roberta-base).

  • Performing all tasks on the input stored at path/to/your/input, which can be a single input file or a folder containing multiple input files (--input path/to/your/input).

  • Writing the output to path/to/your/output_dir, where each output is a JSON file whose name is prefixed with the name of the processed input file (--output_dir path/to/your/output_dir).

In this command, more processing options can be specified at [OPTIONS]. Detailed descriptions of the available options:

  • --lang

    Language(s) of the pipeline to be initialized. Check out this page to see the available language names.

    Example use:

    -Monolingual case:

      python -m trankit [other options] --lang english
    

    -Multilingual case with 3 languages:

      python -m trankit [other options] --lang english,chinese,arabic
    

    -Multilingual case with all supported languages:

      python -m trankit [other options] --lang auto
    

    In multilingual mode, trankit automatically detects the language of each input file and uses the corresponding models.

    Note that language detection is done at the file level.

  • --cpu

    Forcing trankit to run on CPU. Default: False.

    Example use:

      python -m trankit [other options] --cpu
    
  • --embedding

    Multilingual embedding for trankit. Default: xlm-roberta-base.

    Example use:

    -XLM-Roberta base:

      python -m trankit [other options] --embedding xlm-roberta-base
    

    -XLM-Roberta large:

      python -m trankit [other options] --embedding xlm-roberta-large
    
  • --cache_dir

    Location to store downloaded model files. Default: “cache/trankit”.

    Example use:

      python -m trankit [other options] --cache_dir your/cache/dir
    
  • --input

    Location of the input.

    If it is a directory, trankit will process each file in the directory, one at a time.

    If it is a file, trankit will process the file only.

    Example use:

    -Input is a directory:

      python -m trankit [other options] --input some_dir_path
    

    -Input is a file:

      python -m trankit [other options] --input some_file_path
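The directory-versus-file behavior can be pictured with a few lines of Python. This is a sketch of the dispatch logic only, not Trankit's actual implementation:

```python
from pathlib import Path

def collect_inputs(input_path):
    """Return the files to process: every file in a directory, or the single file itself."""
    path = Path(input_path)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file())
    return [path]
```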
    
  • --input_format

    Indicating the input format.

    Case 1: Each input file is a single raw DOCUMENT string:

      python -m trankit [other options] --input_format plaindoc
    

    Case 2: Each input file contains multiple raw SENTENCE strings, one per line:

      python -m trankit [other options] --input_format plainsen
    

    Case 3: Each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with a single word per line:

      python -m trankit [other options] --input_format pretok
    

    Sample inputs can be found here:

    plaindoc

    plainsen

    pretok
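To make the difference between plainsen and pretok concrete, here is a small illustration of the two formats and of splitting a pretok string back into sentences (plain Python, not Trankit code):

```python
# plainsen: one raw sentence per line
plainsen = "Hello world!\nThis is Trankit."

# pretok: sentences separated by a blank line ("\n\n"), one word per line
pretok = "Hello\nworld\n!\n\nThis\nis\nTrankit\n."

def parse_pretok(text):
    """Split a pretok-formatted string into a list of word lists."""
    return [sent.split('\n') for sent in text.strip().split('\n\n')]

print(parse_pretok(pretok))  # [['Hello', 'world', '!'], ['This', 'is', 'Trankit', '.']]
```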

  • --output_dir

    Location of the output directory that stores the processed files. Processed files are in JSON format, named as follows:

      processed_file_name = input_file_name + .processed.json
    

    Example use:

      python -m trankit [other options] --output_dir some_dir_path
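The naming convention above can be expressed directly in code (illustrative only; the helper name is not part of Trankit's API):

```python
from pathlib import Path

def processed_file_name(input_file, output_dir):
    """Map an input file to its output path: input_file_name + '.processed.json'."""
    return Path(output_dir) / (Path(input_file).name + '.processed.json')
```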
    
  • --task

    Task to be performed for the provided input.

    Use cases:

    -Sentence segmentation, assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task ssplit
    

    -Sentence segmentation + Tokenization, assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dtokenize
    

    -Tokenization only, assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task stokenize
    

    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

     python -m trankit [other options] --task dposdep
    

    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

     python -m trankit [other options] --task sposdep
    

    -Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with a single word per line (--input_format pretok).

     python -m trankit [other options] --task pposdep
    

    -Sentence segmentation, Tokenization, Lemmatization

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

     python -m trankit [other options] --task dlemmatize
    

    -Tokenization only, Lemmatization

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

     python -m trankit [other options] --task slemmatize
    

    -Lemmatization

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with a single word per line (--input_format pretok).

     python -m trankit [other options] --task plemmatize
    

    -Sentence segmentation, Tokenization, Named Entity Recognition.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

     python -m trankit [other options] --task dner
    

    -Tokenization only, Named Entity Recognition.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

     python -m trankit [other options] --task sner
    

    -Named Entity Recognition.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with a single word per line (--input_format pretok).

     python -m trankit [other options] --task pner
    

    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

     python -m trankit [other options] --task dall
    

    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

     python -m trankit [other options] --task sall
    

    -Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with a single word per line (--input_format pretok).

     python -m trankit [other options] --task pall
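Note the pattern in the task names listed above: the first letter encodes the expected input format (d = plaindoc, s = plainsen, p = pretok) and the rest names the final task, with ssplit as the one exception since sentence splitting always starts from a raw document. A small sanity check derived from that list (not an official Trankit API):

```python
# First letter of a --task value → the --input_format it assumes
PREFIX_TO_FORMAT = {'d': 'plaindoc', 's': 'plainsen', 'p': 'pretok'}

def expected_input_format(task):
    """Infer the --input_format a --task value assumes from its first letter."""
    if task == 'ssplit':  # exception: sentence splitting operates on a raw document
        return 'plaindoc'
    return PREFIX_TO_FORMAT[task[0]]
```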