Lemmatization
NOTE: The quick examples may be helpful for getting started with this function.
Trankit supports lemmatization for both untokenized and pretokenized inputs, at both the sentence and the document level. Here are some examples:
Document-level lemmatization
In this case, the input is assumed to be a document.
Untokenized input
from trankit import Pipeline

p = Pipeline('english')  # initialize an English pipeline

doc_text = '''Hello! This is Trankit.'''
lemmatized_doc = p.lemmatize(doc_text)  # tokenize, split sentences, and lemmatize
Trankit first performs tokenization and sentence segmentation on the input document, then assigns a lemma to each token. The output looks like this:
{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [  # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.', 'dspan': (7, 23),  # document-level span of the sentence
      'tokens': [  # list of tokens
        {
          'id': 1,  # token index
          'text': 'This',
          'lemma': 'this',  # lemma of the token
          'dspan': (7, 11),  # document-level span of the token
          'span': (0, 4)  # sentence-level span of the token
        },
        {'id': 2, ...},
        {'id': 3, ...},
        {'id': 4, ...}
      ]
    }
  ]
}
For illustration purposes, we collapse the tokens of the first sentence and show only the first token of the second sentence in full.
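To consume this structure programmatically, a minimal sketch (using only the keys shown in the example above) could look like this:

for sentence in lemmatized_doc['sentences']:
    for token in sentence['tokens']:
        # each token carries its surface form and its assigned lemma
        print(token['text'], '->', token['lemma'])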
Pretokenized input
Pretokenized inputs are automatically recognized by Trankit. The following snippet performs lemmatization on a pretokenized document, which is a list of lists of strings:
from trankit import Pipeline

p = Pipeline('english')

# a pretokenized document: one list of token strings per sentence
pretokenized_doc = [
    ['Hello', '!'],
    ['This', 'is', 'Trankit', '.']
]
lemmatized_doc = p.lemmatize(pretokenized_doc)
Because the input is already tokenized, the output omits the sentence- and token-level spans:
{
  'sentences': [
    {
      'id': 1,
      'tokens': [
        {
          'id': 1,
          'text': 'Hello',
          'lemma': 'hello'
        },
        {
          'id': 2,
          'text': '!',
          'lemma': '!'
        }
      ]
    },
    {
      'id': 2,
      'tokens': [
        {
          'id': 1,
          'text': 'This',
          'lemma': 'this'
        },
        {
          'id': 2,
          'text': 'is',
          'lemma': 'be'
        },
        {
          'id': 3,
          'text': 'Trankit',
          'lemma': 'trankit'
        },
        {
          'id': 4,
          'text': '.',
          'lemma': '.'
        }
      ]
    }
  ]
}
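From this structure you can, for instance, recover a lemmatized copy of each pretokenized sentence; here is a small sketch over the keys shown above:

# one list of lemmas per input sentence
lemmatized_sentences = [
    [token['lemma'] for token in sentence['tokens']]
    for sentence in lemmatized_doc['sentences']
]
print(lemmatized_sentences)
# [['hello', '!'], ['this', 'be', 'trankit', '.']]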
Sentence-level lemmatization
The lemmatization module also accepts individual sentences as input. This is done by setting the flag is_sent=True. The output is a dictionary containing a list of lemmatized tokens.
Untokenized input
from trankit import Pipeline

p = Pipeline('english')

sent_text = '''This is Trankit.'''
# is_sent=True tells the pipeline to treat the input as a single sentence
lemmatized_sent = p.lemmatize(sent_text, is_sent=True)
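Extrapolating from the document-level example above, the returned dictionary should hold the lemmatized tokens along with their sentence-level spans; a sketch of the expected shape (exact keys may differ):

{
  'text': 'This is Trankit.',
  'tokens': [
    {'id': 1, 'text': 'This', 'lemma': 'this', 'span': (0, 4)},
    ...
  ]
}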
Pretokenized input
from trankit import Pipeline

p = Pipeline('english')

# a pretokenized sentence is a flat list of token strings
pretokenized_sent = ['This', 'is', 'Trankit', '.']
lemmatized_sent = p.lemmatize(pretokenized_sent, is_sent=True)
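In either case, the lemma sequence can then be pulled out of the result, assuming the same 'tokens'/'lemma' keys as in the examples above:

lemmas = [token['lemma'] for token in lemmatized_sent['tokens']]
print(lemmas)
# expected: ['this', 'be', 'trankit', '.']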