Command-line interface

Starting from version v1.0.0, Trankit supports processing text via a command-line interface, which makes it easier for users who are not familiar with the Python programming language to use Trankit.

Requirements

Users need to install Trankit via one of the following methods:

Pip:

pip install trankit==1.1.0

From source:

git clone https://github.com/nlp-uoregon/trankit
cd trankit
pip install -e .

Syntax

python -m trankit [OPTIONS] --embedding xlm-roberta-base --cpu --lang english --input path/to/your/input --output_dir path/to/your/output_dir

This command does the following:

  • Forces Trankit to run on CPU (--cpu). Without --cpu, Trankit runs on GPU if a GPU device is available.

  • Initializes an English pipeline with XLM-Roberta base as the multilingual embedding (--embedding xlm-roberta-base).

  • Performs all tasks on the input at path/to/your/input, which can be a single input file or a folder containing multiple input files (--input path/to/your/input).

  • Writes the output to path/to/your/output_dir; each output file is a JSON file whose name is prefixed with the name of the processed input file (--output_dir path/to/your/output_dir).

Additional processing options can be supplied at [OPTIONS]. The available options are described below:

  • --lang

    Language(s) of the pipeline to be initialized. Check out this page to see the available language names.

    Example use:

    -Monolingual case:

      python -m trankit [other options] --lang english
    

    -Multilingual case with 3 languages:

      python -m trankit [other options] --lang english,chinese,arabic
    

    -Multilingual case with all supported languages:

      python -m trankit [other options] --lang auto
    

    In multilingual mode, Trankit automatically detects the language of each input file and uses the corresponding models.

    Note that language detection is done at the file level.
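    For example, given a pipeline initialized with two languages and a directory of mixed-language files (the directory name below is a placeholder), each file would be routed to the models matching its detected language:

      python -m trankit [other options] --lang english,chinese --input dir_with_mixed_files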

  • --cpu

    Forces Trankit to run on CPU. Default: False.

    Example use:

      python -m trankit [other options] --cpu
    
  • --embedding

    Multilingual embedding for Trankit. Default: xlm-roberta-base.

    Example use:

    -XLM-Roberta base:

      python -m trankit [other options] --embedding xlm-roberta-base
    

    -XLM-Roberta large:

      python -m trankit [other options] --embedding xlm-roberta-large
    
  • --cache_dir

    Location to store downloaded model files. Default: “cache/trankit”.

    Example use:

      python -m trankit [other options] --cache_dir your/cache/dir
    
  • --input

    Location of the input.

    If it is a directory, Trankit processes the files in the directory one at a time.

    If it is a file, Trankit processes only that file.

    Example use:

    -Input is a directory:

      python -m trankit [other options] --input some_dir_path
    

    -Input is a file:

      python -m trankit [other options] --input some_file_path
    
  • --input_format

    Indicates the input format.

    Case 1: Each input file is a single raw DOCUMENT string:

      python -m trankit [other options] --input_format plaindoc
    

    Case 2: Each input file contains multiple raw SENTENCE strings, one per line:

      python -m trankit [other options] --input_format plainsen
    

    Case 3: Each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with one word per line:

      python -m trankit [other options] --input_format pretok
    

    Sample inputs can be found here:

    plaindoc

    plainsen

    pretok
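    For illustration, minimal made-up inputs in each format could look as follows (the sentences are placeholders, not the contents of the sample files above):

    -plaindoc:

      Hello world! This is Trankit.

    -plainsen:

      Hello world!
      This is Trankit.

    -pretok:

      Hello
      world
      !

      This
      is
      Trankit
      .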

  • --output_dir

    Location of the output directory for the processed files. Processed files are in JSON format, named as follows:

      processed_file_name = input_file_name + .processed.json
    

    Example use:

      python -m trankit [other options] --output_dir some_dir_path
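    For instance, processing an input file named doc1.txt (a hypothetical name) with --output_dir some_dir_path would produce:

      some_dir_path/doc1.txt.processed.json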
    
  • --task

    Task to be performed for the provided input.

    Use cases:

    -Sentence segmentation, assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task ssplit
    

    -Sentence segmentation + Tokenization, assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dtokenize
    

    -Tokenization only, assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task stokenize
    

    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dposdep
    

    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task sposdep
    

    -Part-of-speech tagging, Morphological tagging, Dependency parsing.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with one word per line (--input_format pretok).

      python -m trankit [other options] --task pposdep
    

    -Sentence segmentation, Tokenization, Lemmatization.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dlemmatize
    

    -Tokenization only, Lemmatization.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task slemmatize
    

    -Lemmatization.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with one word per line (--input_format pretok).

      python -m trankit [other options] --task plemmatize
    

    -Sentence segmentation, Tokenization, Named Entity Recognition.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dner
    

    -Tokenization only, Named Entity Recognition.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task sner
    

    -Named Entity Recognition.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with one word per line (--input_format pretok).

      python -m trankit [other options] --task pner
    

    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file is a single raw DOCUMENT string (--input_format plaindoc).

      python -m trankit [other options] --task dall
    

    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file contains multiple raw SENTENCE strings, one per line (--input_format plainsen).

      python -m trankit [other options] --task sall
    

    -Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

    Assuming each input file contains pretokenized SENTENCES separated by “\n\n”; each sentence spans multiple lines, with one word per line (--input_format pretok).

      python -m trankit [other options] --task pall
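
Putting it all together, a complete invocation could look like the following; all paths below are placeholders to be replaced with your own:

python -m trankit --lang english --embedding xlm-roberta-base --input ./my_docs --input_format plaindoc --task dall --output_dir ./processed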