# Building a customized pipeline *NOTE*: Please check out the list of [supported languages](https://trankit.readthedocs.io/en/latest/pkgnames.html#trainable-languages) to check if the language of your interest is already supported by Trankit. Building customized pipelines are easy with Trankit. The training is done via the class `TPipeline` and the loading is done via the class `Pipeline` as usual. To achieve this, customized pipelines are currently organized into 4 categories: - `customized-mwt-ner`: a pipeline of this category consists of 5 models: (i) joint token and sentence splitter, (ii) multi-word token expander, (iii) joint model for part-of-speech tagging, morphologicial feature tagging, and dependency parsing, (iv) lemmatizer, and (v) named entity recognizer. - `customized-mwt`: a pipeline of this category doesn't have the (v) named entity recognizer. - `customized-ner`: a pipeline of this category doesn't have the (ii) multi-word token expander. - `customized`: a pipeline of this cateogry doesn't have the (ii) multi-word token expander and the (v) named entity recognizer. The category names are used as the special identities in the `Pipeline` class. Thus, we need to choose one of the categories to start training our customized pipelines. Below we show the example for training and loading a `customized-mwt-ner` pipeline. Training procedure for customized pipelines of other categories can be obtained by obmitting the steps related to the models that those pipelines don't have. For data format, `Tpipeline` accepts training data in [CONLL-U format](https://github.com/UniversalDependencies/UD_English-EWT) for the Universal Dependencies tasks, and training data in [BIO format](https://www.clips.uantwerpen.be/conll2003/ner/) for the NER task. ## Training ### Training a joint token and sentence splitter ```python import trankit # initialize a trainer for the task trainer = trankit.TPipeline( training_config={ 'category': 'customized-mwt-ner', # pipeline category 'task': 'tokenize', # task name 'save_dir': './save_dir', # directory for saving trained model 'train_txt_fpath': './train.txt', # raw text file 'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training 'dev_txt_fpath': './dev.txt', # raw text file 'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development } ) # start training trainer.train() ``` ### Training a multi-word token expander ```python import trankit # initialize a trainer for the task trainer = trankit.TPipeline( training_config={ 'category': 'customized-mwt-ner', # pipeline category 'task': 'mwt', # task name 'save_dir': './save_dir', # directory for saving trained model 'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training 'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development } ) # start training trainer.train() ``` ### Training a joint model for part-of-speech tagging, morphologicial feature tagging, and dependency parsing ```python import trankit # initialize a trainer for the task trainer = trankit.TPipeline( training_config={ 'category': 'customized-mwt-ner', # pipeline category 'task': 'posdep', # task name 'save_dir': './save_dir', # directory for saving trained model 'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training 'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development } ) # start training trainer.train() ``` ### Training a lemmatizer ```python import trankit # initialize a trainer for the task trainer = trankit.TPipeline( training_config={ 'category': 'customized-mwt-ner', # pipeline category 'task': 'lemmatize', # task name 'save_dir': './save_dir', # directory for saving trained model 'train_conllu_fpath': './train.conllu', # annotations file in CONLLU format for training 'dev_conllu_fpath': './dev.conllu' # annotations file in CONLLU format for development } ) # start training trainer.train() ``` ### Training a named entity recognizer Training data for a named entity recognizer should be in the BIO format in which the first column contains the words and the last column contains the BIO annotations. Users can refer to [this file](https://github.com/nlp-uoregon/trankit/tree/master/docs/source/sample-data.bio) to get the sample data in the required format. ```python import trankit # initialize a trainer for the task trainer = trankit.TPipeline( training_config={ 'category': 'customized-mwt-ner', # pipeline category 'task': 'ner', # task name 'save_dir': './save_dir', # directory to save the trained model 'train_bio_fpath': './train.bio', # training data in BIO format 'dev_bio_fpath': './dev.bio' # training data in BIO format } ) # start training trainer.train() ````` A colab tutorial on how to train, evaluate, and use a custom NER model is also available [here](https://github.com/nlp-uoregon/trankit/blob/master/examples/colab/trankit_ner_GermEval14.ipynb). Thanks [@mrshu](https://github.com/mrshu) for contributing this to Trankit. ## Loading After the training steps, we need to verify that all the models of our customized pipeline are trained. The following code shows how to do it: ```python import trankit trankit.verify_customized_pipeline( category='customized-mwt-ner', # pipeline category save_dir='./save_dir', # directory used for saving models in previous steps embedding_name='xlm-roberta-base' # embedding version that we use for training our customized pipeline, by default, it is `xlm-roberta-base` ) ``` If the verification is success, this would printout the following: ```python Customized pipeline is ready to use! It can be initialized as follows: from trankit import Pipeline p = Pipeline(lang='customized-mwt-ner', cache_dir='./save_dir') ``` From now on, the customized pipeline can be used as a normal pretrained pipeline. The verification would fail if some of the expected model files of the pipeline are missing. This can be solved via the handy function `download_missing_files`, which is created for borrowing model files from pretrained pipelines provided by Trankit. Suppose that the language of your customized pipeline is English, the function can be used as below: ``` import trankit trankit.download_missing_files( category='customized-ner', save_dir='./save_dir', embedding_name='xlm-roberta-base', language='english' ) ``` where `category` is the category that we specified for the customized pipeline, `save_dir` is the path to the directory that we saved the customized models, `embedding_name` is the embedding that we used for the customized pipeline (which is `xlm-roberta-base` by default if we did not specify this in the training process), and `language` is the language with the pretrained models that we want to borrow. For example, if we only trained a NER model for the customized pipeline, the snippet above would borrow the trained models for all the other pipeline components and print out the following message: ``` # Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl # Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl # Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json # Missing ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt # http://nlp.uoregon.edu/download/trankit/v1.0.0/xlm-roberta-base/english.zip # Downloading: 100%|█| 47.9M/47.9M [00:00<00:00, 89.2MiB/s] # Copying ./save_dir/xlm-roberta-base/english/english.tokenizer.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tokenizer.mdl # Copying ./save_dir/xlm-roberta-base/english/english.tagger.mdl to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.tagger.mdl # Copying ./save_dir/xlm-roberta-base/english/english.vocabs.json to ./save_dir/xlm-roberta-base/customized-ner/customized-ner.vocabs.json # Copying ./save_dir/xlm-roberta-base/english/english_lemmatizer.pt to ./save_dir/xlm-roberta-base/customized-ner/customized-ner_lemmatizer.pt ``` After this, we can go back to do the verification step again.