SentencePiece — The NLP Architect’s Tool for Building Bridges Between Languages

Divyanshu Dimri
9 min read · Jan 7, 2024


Overview

Building on our previous blog, where we explored the attention mechanism, this article explores SentencePiece, a powerful subword tokenization technique that enables seamless language translation in natural language processing (NLP) tasks.

By delving into its capabilities, we will demonstrate how SentencePiece generates tokens and what functionality it offers, and see how it can help bridge the gap to languages a model has never seen before by aligning it with the language model of our choice. In our case, that model is NLLB (No Language Left Behind), which will later be replaced by one of today’s big talking points, LLMs, for our end-to-end pipeline; we will cover that in our next blog post.

NLLB has shown great tokenization results for Hindi sentences written in English script. Given that both our input (Hindi) and output (Garhwali) are written in English script, NLLB delivers impressive average tokens-per-word ratios of 1.436759 (Hindi) and 1.371028 (Garhwali). These ratios are considered highly favorable for machine translation and are comparable with those of the high-resource languages among the 200 that NLLB supports. While Google’s mT5 is another option in the LM category, NLLB’s slightly superior tokenization for our specific languages, its wider language support, and the fact that our inputs stay within its 512-token limit fit our problem statement well and ultimately led us to select it for the first phase of our language translation pipeline.
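As a rough illustration (not the exact script behind the numbers above), the ratio can be estimated directly with the NLLB tokenizer from Hugging Face; the sample sentence here is just a made-up romanized Hindi line:

from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')

sentence = "mera naam divyanshu ha"       # romanized Hindi
tokens = tokenizer.tokenize(sentence)     # subword pieces from NLLB's SentencePiece model
words = sentence.split()

print(tokens)
print(len(tokens) / len(words))           # tokens-per-word ratio for this one sentence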

Through in-depth exploration and practical examples, we will demonstrate SentencePiece’s overall capability in cross-lingual communication, paving the way for a future where language barriers crumble and cross-linguistic understanding becomes effortless.

Note: We are incorporating NLLB primarily as a supporting element in our exploration of SentencePiece. It’s important to note that for these languages (not currently in their native script, though they will be in the future), given their excellent average tokens-per-word ratios, SentencePiece may not strictly be essential. We can still incorporate it into our project for better results.

Let’s Understand SentencePiece

SentencePiece is a language-independent tokenization model for NLP tasks in deep learning. We can use SentencePiece to train our own custom tokenizer and then make it compatible with any other LM that supports the SentencePiece format. Unlike traditional word-level tokenization, SentencePiece breaks words down into smaller semantic units, enabling finer-grained analysis and better capturing the nuances of language.

It uses the SentencePieceTrainer.train module to train on a text file containing your textual data. It takes all the text from the input .txt file and generates an .spm model file.

This .spm format is nothing but a binary file that stores the trained SentencePiece model. It contains the vocabulary, model parameters (e.g., BPE merge operations), special tokens (e.g., <unk>, <s>, and </s>), and normalization rules, and it is created as the output of the training process.
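A quick way to see this for yourself, assuming you have already trained a model (e.g. the hi-garh.model produced in the training section below), is to load it and list a few pieces and the special-token ids:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="hi-garh.model")   # path from the training example below
print(sp.get_piece_size())                                    # total vocabulary size
print([sp.id_to_piece(i) for i in range(5)])                  # typically starts with <unk>, <s>, </s>, ...
print(sp.unk_id(), sp.bos_id(), sp.eos_id())                  # ids of the special (meta) pieces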

Our aim in using SentencePiece is to make the model understand the tokens of a language that is new to it; in other words, we are trying to expand the model’s vocabulary. This expansion generally comes into play when the new language dataset has out-of-vocabulary characters, or when the average number of tokens per word is greater than 4, or at least not in the 2–3 range.

If that is not the case, then there is no need to expand the model’s vocabulary, since it is already able to extract meaning from the tokens and needs no extra pre-processing steps for tokenization.

So what would be the ideal scenario? Yes, it’s 1 (i.e., 1 token per word).

The tokensPreprocessing class below will help you identify tokenization properties, including:

  1. How many tokens the tokenizer produces for each word, and
  2. The average number of tokens per word for that language corpus.

[For both languages]

import re
import pandas as pd

def word_tokenize(text):
    # a very naive word tokenizer for languages with English-like orthography
    return re.findall(r'(\w+|[^\w\s])', text)


class tokensPreprocessing:
    def avg_tokens_word(self, pairs):
        # `tokenizer` is the pre-trained NLLB tokenizer loaded elsewhere
        tokens = tokenizer.tokenize(pairs[0])
        word_tokens = word_tokenize(pairs[0])

        # map every word to the number of subword tokens it was split into,
        # using the "▁" marker that starts each new word in SentencePiece output
        word2tokens = {}
        for itr_, word in enumerate(tokens):
            if itr_ == 0:
                first_index = itr_
            elif word[0] == "▁":
                word2tokens["".join(tokens[first_index:itr_])[1:]] = itr_ - first_index
                first_index = itr_

        word2tokens["".join(tokens[first_index:])[1:]] = itr_ - first_index + 1

        return len(tokens) / len(word_tokens), word2tokens

    def __call__(self, languages, pairs):
        avg_tokens = {}

        for lang in languages.keys():
            lng_idx = languages[lang]

            sents = [pair[lng_idx] for pair in pairs]
            elem_wise_avg = list(map(self.avg_tokens_word, zip(sents)))
            avg = sum(sublist[0] for sublist in elem_wise_avg) / len(elem_wise_avg)

            # merge the per-sentence word-to-token-count dictionaries
            word2token_U = {}
            for sent in elem_wise_avg:
                word2token = sent[1]
                for word in word2token:
                    word2token_U[word] = word2token[word]

            dict_ = pd.DataFrame({'words': list(word2token_U.keys()),
                                  'tokens': list(word2token_U.values())})

            if lang == "hind":
                dict_.to_csv("hind_tokens.csv", index=False)
                avg_tokens["hind"] = avg
            elif lang == "garh":
                dict_.to_csv("garh_tokens.csv", index=False)
                avg_tokens["garh"] = avg

        pd.DataFrame({'language': list(avg_tokens.keys()),
                      'avg_tokens': list(avg_tokens.values())}).to_csv("avg_tokens.csv", index=False)


# results in 3 files: hind_tokens.csv, garh_tokens.csv, avg_tokens.csv
tokens_preprocess = tokensPreprocessing()

languages = {"hind": 0, "garh": 1}
# pairs = [["hindi sentence", "garhwali sentence"], ...]
tokens_preprocess(languages, pairs)

This function takes languages (a dict) and pairs in the format specified in the code above, which you can change according to your use case. The output is a set of .csv files with the necessary information.

Once we train and load our custom tokenizer, we will add its tokens to the existing NLLB tokenizer, which is the main model we are using for our translation task. Let’s get into the code and see how we do it.

Training SentencePiece

There are actually two ways of training your custom tokenizer.
Using the model_prefix parameter:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='sentences.txt',
    model_prefix='hi-garh',
    model_type="bpe",
    vocab_size=2**6,
)

This gives you a detailed output consisting of .model, .vocab, and .model.spm.id files, where:

  1. .model: the main .spm model file.
  2. .vocab: a text file containing the vocabulary (inspected in the snippet below), and
  3. .model.spm.id: a text file with model metadata.
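Of the three, the .vocab file is the easiest to inspect by hand: it is plain text with one piece per line followed by its score (tab-separated), so a few lines of Python are enough to peek at what the trainer learned (file name taken from the model_prefix example above):

with open("hi-garh.vocab", encoding="utf-8") as f:
    for line in list(f)[:10]:                        # first few vocabulary entries
        piece, score = line.rstrip("\n").split("\t")
        print(piece, score)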

Using model_writer, which accepts two kinds of targets:
An .spm file pointer: writes the model directly to a specified .spm file.

def train_sentence_piece_bytes(size):
    # write the trained model directly into model.spm via a binary file handle
    with open("model.spm", "wb") as model_file:
        spm.SentencePieceTrainer.train(
            input='sentences.txt',
            model_writer=model_file,
            vocab_size=size,
        )

train_sentence_piece_bytes(2**6)

io.BytesIO(): Writes the model to an in-memory buffer for later usage.

import io

model_writer = io.BytesIO()
spm.SentencePieceTrainer.train(
    input='sentences.txt',
    model_writer=model_writer,
    model_type="bpe",
    vocab_size=2**6,
)

# Access model data from model_writer
model_data = model_writer.getvalue()

# Optionally write to a file
with open('my_model.model', 'wb') as f:
    f.write(model_data)
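In recent versions of the sentencepiece package, those in-memory bytes can also be handed straight to the processor without ever touching disk (a small sketch assuming the model_writer buffer from above):

sp = spm.SentencePieceProcessor(model_proto=model_writer.getvalue())
print(sp.encode("aap kese ho", out_type=str))   # segment a sample sentence with the in-memory model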

Here is one interesting thing about the SentencePiece tokenizer: if you look into the vocabulary it creates for you, it contains tokens covering subwords, special characters, individual characters, case sensitivity, the model type used (and its variations), empty lines or whitespace, and so on. Thus the vocab size will always be >= (required_chars_.size() + meta_pieces.size()).

Let’s try to understand this with the help of an example. Suppose our corpus is the list of sentences “mera naam divyanshu ha.”, “aap kese ho!”, and “wail aanu cha”.

If you then run the SentencePiece trainer (say you use method 1) and open the .vocab file, you will see that the minimum vocab size consists of the unique characters present in our corpus (required_chars_ here) plus the meta_pieces (<unk>, <s>, and </s> in our case).

These meta_pieces refer to the special tokens or pieces that serve specific purposes within the model.

Thus, the vocabulary density varies based on the functionality/properties of the model and the size of the custom data corpus.

Wait, functionality/properties?

This refers to how well versed the model is in the different languages it has seen and learned. For example, say that while fine-tuning on our own dataset we give it the word “hello” and the vocab_size is limited, say required_chars_.size() + meta_pieces.size() + 2 (note the 2 here). SentencePiece will first fit in all the unique characters plus the special tokens, then check whether any slots remain (2 in this case) and fill them with pieces such as “lo”, which it might have learned from a Hindi sentence (say “cha lo”, meaning “let’s go”). This is what the model’s properties mean: how well, and on what, it has been trained.
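Here is a small, self-contained sketch of the same idea using the three toy sentences above. The exact pieces you get depend on the sentencepiece version, and vocab_size must sit between required_chars_.size() + meta_pieces.size() and what such a tiny corpus can actually support, so you may need to adjust it:

import io
import sentencepiece as spm

corpus = ["mera naam divyanshu ha.", "aap kese ho!", "wail aanu cha"]
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))

writer = io.BytesIO()
spm.SentencePieceTrainer.train(
    input="toy_corpus.txt",
    model_writer=writer,
    model_type="bpe",
    vocab_size=35,       # a few slots beyond the unique characters + <unk>, <s>, </s>
)

sp = spm.SentencePieceProcessor(model_proto=writer.getvalue())
print([sp.id_to_piece(i) for i in range(sp.get_piece_size())])
# expect the meta pieces, every unique character of the corpus, and a handful of merges such as "aa"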

Hope this makes things clearer.

Loading SentencePiece

SentencePieceProcessor is the key SentencePiece class responsible for loading a model from a file path. Now that we have trained the model, we are ready to use it, and there are two ways to load it.

The first is to create a SentencePieceProcessor object and then call load:

hi_garh_spm_processor = spm.SentencePieceProcessor()
hi_garh_spm_processor.load("model.spm") # Use ".spm" or ".model" extension

also referred to as loading after creation.

OR
pass the model_file argument directly when creating the SentencePieceProcessor object:

hi_garh_spm_processor = spm.SentencePieceProcessor(model_file='./hi-garh.model')

also referred to as loading during creation.
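Either way, once the processor is loaded, a quick smoke test confirms that it segments and restores text as expected (the sample sentence is just a made-up romanized Hindi line):

pieces = hi_garh_spm_processor.encode("mera naam divyanshu ha", out_type=str)
ids = hi_garh_spm_processor.encode("mera naam divyanshu ha")

print(pieces)                              # subword pieces, e.g. ['▁mera', '▁naam', ...] depending on training
print(hi_garh_spm_processor.decode(ids))   # decodes back to the original text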

Vocabulary Expansion

Once we load the trained model, we extract its data, which is in Protobuf format. For this we first create a placeholder object to hold the serialized model data:

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model  # protobuf schema bundled with the sentencepiece package

new_vocab_spm_proto = sp_pb2_model.ModelProto()

Then we extract the trained model’s serialized Protobuf data and parse it into that object:

new_vocab_spm_proto.ParseFromString(hi_garh_spm_processor.serialized_model_proto())

We do the same for our pre-trained NLLB tokenizer’s model.

from transformers import NllbTokenizer

nllb_tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')

base_spm_proto = sp_pb2_model.ModelProto()
base_spm_proto.ParseFromString(nllb_tokenizer.sp_model.serialized_model_proto())
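Before merging anything, it is worth sanity-checking the two protobuf objects; each entry in pieces carries the token text and its score (names as defined above):

print(len(new_vocab_spm_proto.pieces), len(base_spm_proto.pieces))       # custom vs. NLLB vocabulary sizes
print(base_spm_proto.pieces[10].piece, base_spm_proto.pieces[10].score)  # one example piece from the NLLB model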

Once this is done, we are ready to add the missing tokens to the NLLB SentencePiece model using the code below.

What we are doing here is iterating over each token created by our SentencePiece model; if a token already exists in the NLLB protobuf object we simply ignore it, otherwise we add a new piece (pieces here are nothing but the tokens the tokenizer has stored in its vocabulary) to the model.

We also adjust the scores of the new pieces accordingly.

When this is done we have a modified protobuf object with both the new and the old entries in it, which we then save as a new .model file.

base_vocab_tokens = {p.piece for p in base_spm_proto.pieces}
min_score_in_base_vocab = base_spm_proto.pieces[-1].score

new_tokens = []
for p in new_vocab_spm_proto.pieces:
    if p.piece not in base_vocab_tokens:
        new_tokens.append(p.piece)
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = p.score + min_score_in_base_vocab  # lower priority for new tokens
        base_spm_proto.pieces.append(new_piece)

# Save the combined model
combined_spm_model_name = 'hiGarh_nllb.model'
with open(combined_spm_model_name, 'wb') as f:
    f.write(base_spm_proto.SerializeToString())

Two tokenizers are then loaded: the original one linked to the pre-trained model and a custom tokenizer with an expanded vocabulary. The pre-trained model is then loaded. To ensure compatibility between the model’s embedding layer and the custom vocabulary, the embedding layer of the model is resized to align with the length of the custom tokenizer’s vocabulary. This crucial step ensures smooth integration of the new tokens during model operations.

from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

model_name = 'facebook/nllb-200-distilled-600M'

# Load the tokenizer on top of the expanded SentencePiece vocabulary saved above
combined_tokenizer = NllbTokenizer.from_pretrained(model_name, vocab_file=combined_spm_model_name)

# Load the pre-trained model and resize its embedding layer to the new vocabulary size
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.resize_token_embeddings(len(combined_tokenizer))

While resizing accommodates new tokens, their initial embeddings are typically assigned random weights. This can potentially hinder model
performance during early training stages, as the model must learn these embeddings from scratch.

For a smoother adaptation of the pre-trained model to the expanded vocabulary, we can re-initialize the embedding vectors of the new tokens before training. For a better understanding of this, see David Dale’s article on the topic, from which I have taken inspiration for this post.
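As a minimal sketch of that idea (not the exact recipe from David Dale’s article), each newly added token’s embedding can be initialized from the average of the embeddings its text receives under the original NLLB tokenizer (nllb_tokenizer, new_tokens, combined_tokenizer and model as defined earlier), rather than being left random:

import torch

embeddings = model.get_input_embeddings().weight.data

# NOTE: NLLB also appends language-code tokens whose ids shift after vocabulary expansion;
# their embedding rows should be relocated to the new ids before this loop (see David Dale's article).
for token in new_tokens:                                      # pieces added during vocabulary expansion
    new_id = combined_tokenizer.convert_tokens_to_ids(token)
    old_ids = nllb_tokenizer(token.replace("▁", " "), add_special_tokens=False).input_ids
    if old_ids:                                               # skip anything the old tokenizer cannot map
        embeddings[new_id] = embeddings[old_ids].mean(dim=0)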

See the example below of the two tokenizers before and after training on the custom dataset; you can clearly see how “divyanshu” is treated differently by the two tokenizers.

def text2tokens(text, tokenizer=None):
    return {
        "word": text,
        "ids": tokenizer(text).input_ids,
        "tokens": tokenizer.convert_ids_to_tokens(tokenizer(text).input_ids)
    }

Text = "divyanshu"

# base one:
print(text2tokens(Text, tokenizer=tokenizer_old))
{'word': 'divyanshu', 'ids': [256047, 150, 3490, 621, 300, 2], 'tokens': ['eng_Latn', '▁di', 'vy', 'ans', 'hu', '</s>']}

# combined one
print(text2tokens(Text, tokenizer=tokenizer_new))
{'word': 'divyanshu', 'ids': [256053, 256006, 2], 'tokens': ['eng_Latn', '▁divyanshu', '</s>']}

Conclusion

This blog post represents the second stage of our journey in creating a robust, customized language translation model. In our previous post, we explored the attention mechanism, a key component in many neural networks. Now, we delve into SentencePiece, a subword tokenization technique that enhances language representation.

We not only understood SentencePiece but also expanded its vocabulary and integrated it with NLLB, a versatile language model. This combination enables efficient handling of previously unseen languages like Hindi and Garhwali, as reflected in their favorable token-to-word ratios.

The series concludes in Blog Post 3, where we’ll leverage NLLB and our expanded SentencePiece vocabulary to develop a high-performing translation model for Hindi-Garhwali.

I truly appreciate your open communication and feedback. It means a lot to me! Stay tuned for further updates, and feel free to reach out or connect with me via my LinkedIn page.
