Lemmatization
NOTE: The quick examples may be helpful for getting started with this function.
Trankit supports lemmatization for both untokenized and pretokenized inputs, at both the sentence and the document level. Here are some examples:
Document-level lemmatization
In this case, the input is assumed to be a document.
Untokenized input
from trankit import Pipeline

p = Pipeline('english')  # initialize an English pipeline

doc_text = '''Hello! This is Trankit.'''
lemmatized_doc = p.lemmatize(doc_text)  # tokenize, split sentences, and lemmatize
Trankit first performs tokenization and sentence segmentation on the input document, then assigns a lemma to each token. The output looks like this:
{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [  # list of sentences
    {
      'id': 1, 'text': 'Hello!', 'dspan': (0, 6), 'tokens': [...]
    },
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.', 'dspan': (7, 23),  # document-level span of the sentence
      'tokens': [  # list of tokens
        {
          'id': 1,  # token index
          'text': 'This',
          'lemma': 'this',  # lemma of the token
          'dspan': (7, 11),  # document-level span of the token
          'span': (0, 4)  # sentence-level span of the token
        },
        {'id': 2, ...},
        {'id': 3, ...},
        {'id': 4, ...}
      ]
    }
  ]
}
For illustration purposes, we collapse the tokens of the first sentence and show only the first token of the second sentence in full.
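To consume this structure programmatically, a minimal sketch (using only the keys shown in the example above) could look like this:

for sentence in lemmatized_doc['sentences']:
    for token in sentence['tokens']:
        # each token carries its surface form and its assigned lemma
        print(token['text'], '->', token['lemma'])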
Pretokenized input
Pretokenized inputs are automatically recognized by Trankit. The following snippet performs lemmatization on a pretokenized document, which is a list of lists of strings:
from trankit import Pipeline

p = Pipeline('english')

# a pretokenized document: one list of token strings per sentence
pretokenized_doc = [
    ['Hello', '!'],
    ['This', 'is', 'Trankit', '.']
]
lemmatized_doc = p.lemmatize(pretokenized_doc)
Because the input is already tokenized, the output omits the sentence- and token-level spans:
{
  'sentences': [
    {
      'id': 1,
      'tokens': [
        {
          'id': 1,
          'text': 'Hello',
          'lemma': 'hello'
        },
        {
          'id': 2,
          'text': '!',
          'lemma': '!'
        }
      ]
    },
    {
      'id': 2,
      'tokens': [
        {
          'id': 1,
          'text': 'This',
          'lemma': 'this'
        },
        {
          'id': 2,
          'text': 'is',
          'lemma': 'be'
        },
        {
          'id': 3,
          'text': 'Trankit',
          'lemma': 'trankit'
        },
        {
          'id': 4,
          'text': '.',
          'lemma': '.'
        }
      ]
    }
  ]
}
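From this structure you can, for instance, recover a lemmatized copy of each pretokenized sentence; here is a small sketch over the keys shown above:

# one list of lemmas per input sentence
lemmatized_sentences = [
    [token['lemma'] for token in sentence['tokens']]
    for sentence in lemmatized_doc['sentences']
]
print(lemmatized_sentences)
# [['hello', '!'], ['this', 'be', 'trankit', '.']]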
Sentence-level lemmatization
The lemmatization module also accepts individual sentences as input. This is done by setting the flag is_sent=True. The output is a dictionary containing a list of lemmatized tokens.
Untokenized input
from trankit import Pipeline

p = Pipeline('english')

sent_text = '''This is Trankit.'''
# is_sent=True tells the pipeline to treat the input as a single sentence
lemmatized_sent = p.lemmatize(sent_text, is_sent=True)
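Extrapolating from the document-level example above, the returned dictionary should hold the lemmatized tokens along with their sentence-level spans; a sketch of the expected shape (exact keys may differ):

{
  'text': 'This is Trankit.',
  'tokens': [
    {'id': 1, 'text': 'This', 'lemma': 'this', 'span': (0, 4)},
    ...
  ]
}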
Pretokenized input
from trankit import Pipeline

p = Pipeline('english')

# a pretokenized sentence is a flat list of token strings
pretokenized_sent = ['This', 'is', 'Trankit', '.']
lemmatized_sent = p.lemmatize(pretokenized_sent, is_sent=True)
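In either case, the lemma sequence can then be pulled out of the result, assuming the same 'tokens'/'lemma' keys as in the examples above:

lemmas = [token['lemma'] for token in lemmatized_sent['tokens']]
print(lemmas)
# expected: ['this', 'be', 'trankit', '.']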