Part-of-speech, Morphological tagging and Dependency parsing

NOTE: The quick examples may be helpful for getting started with this function.

In Trankit, part-of-speech tagging, morphological tagging, and dependency parsing are performed jointly. The module works with either untokenized or pretokenized input, at both the sentence and the document level.

Document-level processing

Untokenized input

The sample code for this module is:

from trankit import Pipeline
# initialize a pipeline for English
p = Pipeline('english')

# a non-empty string to process, which can be a document or a paragraph with multiple sentences
doc_text = '''Hello! This is Trankit.'''

# perform joint part-of-speech tagging, morphological tagging, and dependency parsing
tagged_doc = p.posdep(doc_text)

Trankit first performs tokenization and sentence segmentation on the input document, then performs part-of-speech tagging, morphological tagging, and dependency parsing on the tokenized document. The output of the whole process is a native Python dictionary containing a list of sentences; each sentence contains a list of tokens with the predicted part-of-speech tag, morphological features, the index of the head token, and the corresponding dependency relation for each token. The output looks like this:

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # document-level span of the sentence
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This',  # text form of the token
          'upos': 'PRON',  # UPOS tag of the token
          'xpos': 'DT',    # XPOS tag of the token
          'feats': 'Number=Sing|PronType=Dem', # morphological features of the token
          'head': 3,  # index of the head token
          'deprel': 'nsubj', # dependency relation for the token
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
        {'id': 2...},
        {'id': 3...},
        {'id': 4...}
      ]
    }
  ]
}
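
To consume this output programmatically, we can simply walk the nested dictionary. Below is a minimal sketch, assuming the tagged_doc variable from the call above and only the keys shown in the example output; it prints one line per token with its index, text, UPOS tag, head index, and dependency relation:

# minimal sketch: iterate over the sentences and tokens in the output
# dictionary and print the predicted annotations for each token
for sentence in tagged_doc['sentences']:
    print(f"Sentence {sentence['id']}: {sentence['text']}")
    for token in sentence['tokens']:
        # .get() is used defensively in case a field is missing for some tokens
        print(f"  {token['id']}\t{token['text']}\t{token.get('upos', '_')}\t"
              f"{token.get('head', '_')}\t{token.get('deprel', '_')}")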

Pretokenized input

In some cases, we might already have a tokenized document and want to use this module. Here is how we can do it:

# a pretokenized document: a list of sentences, each a list of tokens
pretokenized_doc = [
  ['Hello', '!'],
  ['This', 'is', 'Trankit', '.']
]

tagged_doc = p.posdep(pretokenized_doc)

Pretokenized inputs are automatically recognized by Trankit, so we don't need to pass any extra argument when calling .posdep(). The output in this case has the same structure as before, except that it contains no span information, since there are no character offsets for pretokenized input.
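
For example, a quick check on the first token (a sketch, assuming the same per-token keys as above minus the span fields) confirms that no character offsets are attached:

# sketch: the span fields should simply be absent from each token dictionary,
# because pretokenized input carries no character offsets
first_token = tagged_doc['sentences'][0]['tokens'][0]
print('dspan' in first_token, 'span' in first_token)  # expected: False False
print(first_token['text'], first_token.get('upos'), first_token.get('deprel'))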

Sentence-level processing

Sometimes we want to apply this module to individual sentences. To do so, we simply set is_sent=True when calling .posdep():

Untokenized input

sent_text = '''This is Trankit.'''

tagged_sent = p.posdep(sent_text, is_sent=True)

Pretokenized input

pretokenized_sent = ['This', 'is', 'Trankit', '.']

tagged_sent = p.posdep(pretokenized_sent, is_sent=True)
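
The sentence-level result can be read in the same way. The sketch below assumes it exposes a top-level 'tokens' list whose entries carry the same keys as the document-level output; that structure is an assumption here, mirrored from the example above:

# sketch (assuming a sentence-level result with a top-level 'tokens' list,
# mirroring the per-sentence structure of the document-level output):
# print each token with its UPOS tag, head index, and dependency relation
for token in tagged_sent.get('tokens', []):
    print(token['id'], token['text'], token.get('upos'),
          token.get('head'), token.get('deprel'))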