SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problem in neural machine translation. SentencePiece supports two segmentation algorithms: byte-pair-encoding (BPE) and the unigram language model. Note that the BPE algorithm used in WordPiece is slightly different from the original BPE. For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

Here are the high-level differences from other implementations:

- The number of unique tokens is predetermined.
- Language independent: SentencePiece treats sentences just as sequences of Unicode characters. Pre-tokenization (Moses tokenizer / MeCab / KyTea) is not always required.
- Multiple subword algorithms: BPE and the unigram language model are supported.
- Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout, which help to improve the robustness and accuracy of NMT models.
- Fast and lightweight: Segmentation speed is around 50k sentences/sec, and the memory footprint is around 6MB.
- Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
- Direct vocabulary id generation: SentencePiece manages the vocabulary-to-id mapping and can directly generate vocabulary id sequences from raw sentences.
- NFKC-based normalization: SentencePiece performs NFKC-based text normalization.
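To make the BPE algorithm mentioned above concrete, here is a minimal pure-Python sketch of BPE training and segmentation. This is a toy illustration, not SentencePiece's actual implementation: the function names (`learn_bpe`, `segment`) are hypothetical, it operates on pre-split words rather than raw sentences, and it omits the vocabulary-size target, normalization, and whitespace handling that SentencePiece provides.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of every adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(symbols, pair):
    """Merge every occurrence of `pair` in a symbol tuple into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from word frequencies."""
    # Start from individual characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Pick the most frequent pair; break ties deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
print(segment("lowest", merges))  # -> ['low', 'est']
```

Unlike this sketch, SentencePiece treats the whole sentence as a sequence of Unicode characters (encoding whitespace as an ordinary symbol), which is what makes it language independent and removes the need for pre-tokenization.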