Building a customized pipeline
NOTE: Please check out the list of supported languages to see whether the language you are interested in is already supported by Trankit.
Building customized pipelines is easy with Trankit. Training is done via the class TPipeline
and loading is done via the usual Pipeline
class. Customized pipelines are currently organized into 4 categories:
customized-mwt-ner: a pipeline of this category consists of 5 models: (i) joint token and sentence splitter, (ii) multi-word token expander, (iii) joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing, (iv) lemmatizer, and (v) named entity recognizer.
customized-mwt: a pipeline of this category doesn't have the (v) named entity recognizer.
customized-ner: a pipeline of this category doesn't have the (ii) multi-word token expander.
customized: a pipeline of this category doesn't have the (ii) multi-word token expander or the (v) named entity recognizer.
The category names are used as special identifiers in the Pipeline
class, so we need to choose one of the categories before training a customized pipeline. Below we show how to train and load a customized-mwt-ner
pipeline. The training procedure for pipelines of the other categories can be obtained by omitting the steps related to the models that those pipelines don't have. Regarding data formats, TPipeline
accepts training data in CoNLL-U format for the Universal Dependencies tasks, and training data in BIO format for the NER task.
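As a quick illustration (this toy sentence is made up for this guide, not taken from any treebank), a CoNLL-U sentence consists of token lines with 10 columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC. Columns are shown here separated by spaces for readability; real CoNLL-U files use tabs:
# text = She reads.
1   She     she     PRON    _   _   2   nsubj   _   _
2   reads   read    VERB    _   _   0   root    _   SpaceAfter=No
3   .       .       PUNCT   _   _   2   punct   _   _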
Training
Training a joint token and sentence splitter
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'tokenize', # task name
        'save_dir': './save_dir', # directory for saving trained model
        'train_txt_fpath': './train.txt', # raw text file for training
        'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
        'dev_txt_fpath': './dev.txt', # raw text file for development
        'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()
Training a multi-word token expander
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'mwt', # task name
        'save_dir': './save_dir', # directory for saving trained model
        'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
        'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()
Training a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'posdep', # task name
        'save_dir': './save_dir', # directory for saving trained model
        'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
        'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()
Training a lemmatizer
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'lemmatize', # task name
        'save_dir': './save_dir', # directory for saving trained model
        'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
        'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()
Training a named entity recognizer
Training data for a named entity recognizer should be in the BIO format in which the first column contains the words and the last column contains the BIO annotations. Users can refer to this file to get the sample data in the required format.
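For example, a small made-up sample in this format might look like the following, with one word per line, its tag in the last column, and a blank line separating sentences:
John    B-PER
lives   O
in      O
New     B-LOC
York    I-LOC
.       O

He      O
works   O
for     O
Acme    B-ORG
Corp    I-ORG
.       O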
import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
        'category': 'customized-mwt-ner', # pipeline category
        'task': 'ner', # task name
        'save_dir': './save_dir', # directory to save the trained model
        'train_bio_fpath': './train.bio', # training data in BIO format
        'dev_bio_fpath': './dev.bio' # development data in BIO format
    }
)

# start training
trainer.train()
A colab tutorial on how to train, evaluate, and use a custom NER model is also available here. Thanks @mrshu for contributing this to Trankit.
Loading
After the training steps, we need to verify that all the models of our customized pipeline are trained. The following code shows how to do it:
import trankit

trankit.verify_customized_pipeline(
    category='customized-mwt-ner', # pipeline category
    save_dir='./save_dir', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding used for training the customized pipeline; defaults to `xlm-roberta-base`
)
If the verification succeeds, it prints out the following:
Customized pipeline is ready to use!
It can be initialized as follows:
from trankit import Pipeline
p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir')
From now on, the customized pipeline can be used as a normal pretrained pipeline.
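For example, a minimal usage sketch (the sample text is arbitrary; the result is a dictionary of annotated sentences):
from trankit import Pipeline

# load the customized pipeline from the directory used during training
p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir')

# run the full pipeline on raw text
doc = p('Hello! This is a test sentence for the customized pipeline.')
print(doc) # dictionary with sentence, token, and tag annotations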
The verification will fail if some of the expected model files of the pipeline are missing. This can be solved via the handy function download_missing_files
, which borrows model files from the pretrained pipelines provided by Trankit. Suppose the language of your customized pipeline is English; the function can then be used as below:
import trankit

trankit.download_missing_files(
    category='customized-ner',
    save_dir='./save_dir',
    embedding_name='xlm-roberta-base',
    language='english'
)
where category
is the category that we specified for the customized pipeline, save_dir
is the directory where we saved the customized models, embedding_name
is the embedding that we used for the customized pipeline (xlm-roberta-base
by default if we did not specify it during training), and language
is the language whose pretrained models we want to borrow. For example, if we only trained a NER model for the customized pipeline, the snippet above would borrow the trained models for all the other pipeline components and print out the following message:
# Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
# Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
# Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
# Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt
# http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/english.zip
# Downloading: 100%|█| 47.9M/47.9M [00:00<00:00, 89.2MiB/s]
# Copying ./save_dir/xlm-roberta-base/english/english.tokenizer.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
# Copying ./save_dir/xlm-roberta-base/english/english.tagger.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
# Copying ./save_dir/xlm-roberta-base/english/english.vocabs.json to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
# Copying ./save_dir/xlm-roberta-base/english/english_lemmatizer.pt to ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt
After this, we can go back and run the verification step again.
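A minimal sketch of that follow-up, reusing the functions shown above with the same paths and category as in this example:
import trankit

# re-run the verification now that the missing model files have been copied in
trankit.verify_customized_pipeline(
    category='customized-ner',
    save_dir='./save_dir',
    embedding_name='xlm-roberta-base'
)

# once verification passes, load the pipeline as usual
p = trankit.Pipeline(lang='customized-ner', cache_dir='./save_dir')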