Building a customized pipeline

NOTE: Please check the list of supported languages to see whether the language you are interested in is already supported by Trankit.

Building customized pipelines is easy with Trankit. Training is done via the TPipeline class, and loading is done via the usual Pipeline class. Customized pipelines are currently organized into 4 categories:

  • customized-mwt-ner: a pipeline of this category consists of 5 models: (i) a joint token and sentence splitter, (ii) a multi-word token expander, (iii) a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing, (iv) a lemmatizer, and (v) a named entity recognizer.

  • customized-mwt: a pipeline of this category doesn’t have the (v) named entity recognizer.

  • customized-ner: a pipeline of this category doesn’t have the (ii) multi-word token expander.

  • customized: a pipeline of this category doesn’t have the (ii) multi-word token expander or the (v) named entity recognizer.

The category names serve as the special language identifiers in the Pipeline class, so we need to choose one of these categories before training a customized pipeline. Below we show an example of training and loading a customized-mwt-ner pipeline. The training procedure for pipelines of the other categories can be obtained by omitting the steps for the models that those pipelines don’t have. Regarding data formats, TPipeline accepts training data in CoNLL-U format for the Universal Dependencies tasks and in BIO format for the NER task, as illustrated below.
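
For reference, a CoNLL-U training file contains one tab-separated, 10-column line per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with sentences separated by blank lines. The fragment below is only a made-up illustration of this layout, not data from an actual treebank (columns are aligned with spaces here for readability; real files use tabs):

# text = She reads books.
1   She     she    PRON   PRP   Case=Nom|Number=Sing|Person=3              2   nsubj   _   _
2   reads   read   VERB   VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres   0   root    _   _
3   books   book   NOUN   NNS   Number=Plur                                2   obj     _   SpaceAfter=No
4   .       .      PUNCT  .     _                                          2   punct   _   _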

Training

Training a joint token and sentence splitter

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'tokenize', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_txt_fpath': './train.txt', # raw text file
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
    'dev_txt_fpath': './dev.txt', # raw text file
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a multi-word token expander

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'mwt', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a joint model for part-of-speech tagging, morphological feature tagging, and dependency parsing

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner', # pipeline category
    'task': 'posdep', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a lemmatizer

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner',  # pipeline category
    'task': 'lemmatize', # task name
    'save_dir': './save_dir', # directory for saving trained model
    'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training
    'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development
    }
)

# start training
trainer.train()

Training a named entity recognizer

Training data for a named entity recognizer should be in BIO format, in which the first column contains the words and the last column contains the BIO annotations. Users can refer to this file for sample data in the required format.
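
As a rough illustration (not an excerpt from the linked file), a BIO-formatted file has one word per line with its tag in the last column, and sentences are separated by blank lines:

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

With such files prepared, the trainer for the NER task is initialized as follows: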

import trankit

# initialize a trainer for the task
trainer = trankit.TPipeline(
    training_config={
    'category': 'customized-mwt-ner',  # pipeline category
    'task': 'ner', # task name
    'save_dir': './save_dir', # directory to save the trained model
    'train_bio_fpath': './train.bio', # training data in BIO format
    'dev_bio_fpath': './dev.bio' # development data in BIO format
    }
)

# start training
trainer.train()

A Colab tutorial on how to train, evaluate, and use a custom NER model is also available here. Thanks to @mrshu for contributing this to Trankit.

Loading

After the training steps, we need to verify that all the models of our customized pipeline have been trained. The following code shows how to do it:

import trankit

trankit.verify_customized_pipeline(
    category='customized-mwt-ner', # pipeline category
    save_dir='./save_dir', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding used for training the customized pipeline; by default, it is `xlm-roberta-base`
)

If the verification is successful, it will print out the following:

Customized pipeline is ready to use!
It can be initialized as follows:

from trankit import Pipeline
p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir')

From now on, the customized pipeline can be used as a normal pretrained pipeline.
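
For instance, the loaded pipeline can be called on raw text in the same way as a pretrained pipeline (the sample sentence below is only for illustration):

from trankit import Pipeline

# load the customized pipeline verified above
p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir')

# annotate a raw document; the result is a Python dictionary containing
# the sentence, token, and entity level annotations produced by the customized models
doc = p('Trankit was trained on my own data.')
print(doc)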

The verification will fail if some of the expected model files of the pipeline are missing. This can be solved with the handy function download_missing_files, which borrows model files from the pretrained pipelines provided by Trankit. Suppose the language of your customized pipeline is English; the function can then be used as below:

import trankit

trankit.download_missing_files(
    category='customized-ner', # pipeline category
    save_dir='./save_dir', # directory containing the customized models
    embedding_name='xlm-roberta-base', # embedding used for training
    language='english' # language to borrow pretrained models from
)

Here, category is the category that we specified for the customized pipeline, save_dir is the directory where the customized models were saved, embedding_name is the embedding used for the customized pipeline (xlm-roberta-base by default if it was not specified during training), and language is the language whose pretrained models we want to borrow. For example, if we only trained a NER model for the customized pipeline, the snippet above would borrow the trained models for all the other pipeline components and print out the following message:

Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt
http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/english.zip
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47.9M/47.9M [00:00<00:00, 114MiB/s]
Copying ./save_dir/xlm-roberta-base/english/english.tokenizer.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl
Copying ./save_dir/xlm-roberta-base/english/english.tagger.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl
Copying ./save_dir/xlm-roberta-base/english/english.vocabs.json to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json
Copying ./save_dir/xlm-roberta-base/english/english_lemmatizer.pt to ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt

After this, we can go back and run the verification step again.
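
For example, re-running the verification for this customized-ner pipeline should now report that it is ready to use:

import trankit

trankit.verify_customized_pipeline(
    category='customized-ner', # pipeline category
    save_dir='./save_dir', # directory used for saving models in previous steps
    embedding_name='xlm-roberta-base' # embedding used for training the customized pipeline
)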