tokenizers - PyPI
https://pypi.org/project/tokenizers · 28.02.2022 · Main features: Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions). Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU. Easy to use, but also extremely versatile.
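
A minimal sketch of training one of those pre-made tokenizers (Bert WordPiece) with the tokenizers library; the corpus path and hyperparameters below are illustrative assumptions, not values taken from the page:

    from tokenizers import BertWordPieceTokenizer

    # Train a WordPiece vocabulary from scratch on a plain-text corpus.
    # "corpus.txt" is a placeholder path, not a file shipped with the library.
    tokenizer = BertWordPieceTokenizer(lowercase=True)
    tokenizer.train(files=["corpus.txt"], vocab_size=30_000, min_frequency=2)

    # Tokenize with the freshly trained vocabulary.
    encoding = tokenizer.encode("Tokenization is extremely fast.")
    print(encoding.tokens)  # e.g. ['token', '##ization', 'is', ...]

    # Persist the vocabulary (vocab.txt) for later reuse.
    tokenizer.save_model(".")

The same pattern applies to the BPE variants (ByteLevelBPETokenizer, CharBPETokenizer, SentencePieceBPETokenizer): construct, train on files, then encode.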
text.BertTokenizer | Text - TensorFlow
www.tensorflow.org › python › text · Mar 25, 2022 ·

    tokenizer = BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')
    text_inputs = tf.constant(['greatest'.encode('utf-8')])
    tokenizer.detokenize([[4, 5]])
    # <tf.RaggedTensor [[b'greatest']]>

Returns: A RaggedTensor with dtype string and the same rank as the input token_ids.

split(input): Alias for Tokenizer.tokenize.
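
A self-contained sketch of the tokenize/detokenize round trip implied by that snippet. The contents of the vocabulary file are invented here for illustration (the page does not show what /tmp/tok_vocab.txt contains), so the token ids only match under that assumption:

    import tensorflow as tf
    import tensorflow_text as tf_text

    # Write a tiny illustrative vocabulary; the real /tmp/tok_vocab.txt from
    # the docs is not shown, so this file is an assumption.
    vocab = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', 'great', '##est']
    with open('/tmp/tok_vocab.txt', 'w') as f:
        f.write('\n'.join(vocab))

    tokenizer = tf_text.BertTokenizer(vocab_lookup_table='/tmp/tok_vocab.txt')

    # tokenize() maps strings to wordpiece ids (a RaggedTensor). With this
    # toy vocab, 'greatest' splits into 'great' (id 4) and '##est' (id 5).
    ids = tokenizer.tokenize(tf.constant([b'greatest']))

    # detokenize() inverts the mapping, merging wordpieces back into words.
    words = tokenizer.detokenize([[4, 5]])
    print(words)  # <tf.RaggedTensor [[b'greatest']]>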