# Command-line interface

Starting from version v1.0.0, Trankit supports processing text via command-line interface. This helps users who are not familiar with Python programming language can use Trankit more easily.

## Requirements
Users need to install Trankit via one of the following methods:

Pip:
```
pip install trankit==1.1.0
```

From source:
```
git clone https://github.com/nlp-uoregon/trankit
cd trankit
pip install -e .
```

## Syntax
```
python -m trankit [OPTIONS] --embedding xlm-roberta-base --cpu --lang english --input path/to/your/input --output_dir path/to/your/output_dir
```
What this command does are:
- Forcing Trankit to run on CPU (`--cpu`). Without this `--cpu`, Trankit will run on GPU if a GPU device is available.
- Initializing an English pipeline with XLM-Roberta base as the multilingual embedding (`--embedding xlm-roberta-base`).
- Performing all tasks on the input stored at `path/to/your/input` which can be a single input file or a folder storing multiple input files (`--input path/to/your/input`).
- Writing the output to `path/to/your/output_dir` which stores the output files, each is a json file with the prefix is the file name of the processed input file (`--output_dir path/to/your/output_dir`).

In this command, we can put more processing options at `[OPTIONS]`. Detailed description of the options that can be used:

* `--lang`
    
    Language(s) of the pipeline to be initialized. Check out this [page](https://trankit.readthedocs.io/en/latest/pkgnames.html#pretrained-languages-their-code-names) to see the available language names.
    
    Example use:
    
    -Monolingual case:
    
        python -m trankit [other options] --lang english
    
    -Multilingual case with 3 languages:
    
        python -m trankit [other options] --lang english,chinese,arabic
    
    -Multilingual case with all supported languages:
    
        python -m trankit [other options] --lang auto
    
    In multilingual mode, trankit will automatically detect the language of the input file(s) to use corresponding models.
    
    Note that, language detection is done at file level.
 
* `--cpu`
    
    Forcing trankit to run on CPU. Default: False.Example use:
    
        python -m trankit [other options] --cpu

* `--embedding`
    
    Multilingual embedding for trankit. Default: xlm-roberta-base.
    
    Example use:
    
    -XLM-Roberta base:
        
        python -m trankit [other options] --embedding xlm-roberta-base
    
    -XLM-Roberta large:
        
        python -m trankit [other options] --embedding xlm-roberta-large
    
* `--cache_dir`
    
    Location to store downloaded model files. Default: "cache/trankit".
    
    Example use:
    
        python -m trankit [other options] --cache_dir your/cache/dir

* `--input`
    
    Location of the input.
    
    If it is a directory, trankit will process each file in the input directory at a time.
    
    If it is a file, trankit will process the file only.
    
    Example use:
    
    -Input is a directory:
    
        python -m trankit [other options] --input some_dir_path
    
    -Input is a file:
        
        python -m trankit [other options] --input some_file_path
    
* `--input_format`
    
    Indicating the input format.
    
    Case 1: Each input file is a single raw DOCUMENT string:
        
        python -m trankit [other options] --input_format plaindoc
    
    Case 2: Each input file contains multiple raw SENTENCE strings in each line:
    
        python -m trankit [other options] --input_format plainsen
        
    Case 3: Each input file contains pretokenized SENTENCES separated by "\n\n", each sentence is organized into multiple lines, each line contains only a single word:
    
        python -m trankit [other options] --input_format pretok
    
    
    Sample inputs can be found here:
    
    [plaindoc](https://github.com/nlp-uoregon/trankit/blob/master/examples/commandline/plaindoc.txt)
    
    [plainsen](https://github.com/nlp-uoregon/trankit/blob/master/examples/commandline/plainsen.txt)
    
    [pretok](https://github.com/nlp-uoregon/trankit/blob/master/examples/commandline/pretok.txt)

* `--output_dir`
    
    Location of the output directory to store the processed files. Processed files will be in json format, with the naming convention as follows:
    
        processed_file_name = input_file_name + .processed.json
    
    Example use:
    
        python -m trankit [other options] --output_dir some_dir_path

* `--task`
    
    Task to be performed for the provided input.
    
    Use cases:
    
    -Sentence segmentation, assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
    
        python -m trankit [other options] --task ssplit
    
    -Sentence segmentation + Tokenization, assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
    
        python -m trankit [other options] --task dtokenize
    
    -Tokenization only, assuming each input file contains multiple raw SENTENCE strings in each line (`--input_format plainsen`).
    
        python -m trankit [other options] --task stokenize
    
    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing.
     
     Assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
      
       python -m trankit [other options] --task dposdep
    
    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing.
     
     Assuming each input file contains multiple raw SENTENCE strings in each line (`--input_format plainsen`).
     
       python -m trankit [other options] --task sposdep
    
    -Part-of-speech tagging, Morphological tagging, Dependency parsing.
     
     Assuming each input file contains pretokenized SENTENCES separated by "\n\n", each sentence is organized into multiple lines, each line contains only a single word (`--input_format pretok`).
     
       python -m trankit [other options] --task pposdep
    
    -Sentence segmentation, Tokenization, Lemmatization
     
     Assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
     
       python -m trankit [other options] --task dlemmatize
    
    -Tokenization only, Lemmatization
     
     Assuming each input file contains multiple raw SENTENCE strings in each line (`--input_format plainsen`).
       
       python -m trankit [other options] --task slemmatize
    
    -Lemmatization
     
     Assuming each input file contains pretokenized SENTENCES separated by "\n\n", each sentence is organized into multiple lines, each line contains only a single word (`--input_format pretok`).
       
       python -m trankit [other options] --task plemmatize
    
    -Sentence segmentation, Tokenization, Named Entity Recognition.
     
     Assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
      
       python -m trankit [other options] --task dner
    
    -Tokenization only, Named Entity Recognition.
     
     Assuming each input file contains multiple raw SENTENCE strings in each line (`--input_format plainsen`).
       
       python -m trankit [other options] --task sner
    
    -Named Entity Recognition.
     
     Assuming each input file contains pretokenized SENTENCES separated by "\n\n", each sentence is organized into multiple lines, each line contains only a single word (`--input_format pretok`).
       
       python -m trankit [other options] --task pner
    
    -Sentence segmentation, Tokenization, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.
    
    Assuming each input file is a single raw DOCUMENT string (`--input_format plaindoc`).
       
       python -m trankit [other options] --task dall
    
    -Tokenization only, Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.
     
     Assuming each input file contains multiple raw SENTENCE strings in each line (`--input_format plainsen`).
       
       python -m trankit [other options] --task sall
    
    -Part-of-speech tagging, Morphological tagging, Dependency parsing, Named Entity Recognition.

	 Assuming each input file contains pretokenized SENTENCES separated by "\n\n", each sentence is organized into multiple lines, each line contains only a single word (`--input_format pretok`).
       
       python -m trankit [other options] --task pall