Decoding the Secrets of Language Translation with PyTorch: A Hands-on Journey into Seq2Seq Models
Abstract
In this article we explore the inner workings of sequence-to-sequence (seq2seq) models with attention for machine translation, specifically Hindi to Garhwali translation, implemented in PyTorch with one variant of recurrent neural networks (RNNs). It covers key concepts including:
- Encoder-decoder architecture: The fundamental components of seq2seq models and their roles in encoding input sequences and generating output sequences.
- Attention mechanisms: Why attention is crucial for improved translation accuracy, particularly in longer sequences, and how it enables the decoder to focus on relevant parts of the input.
- Output states vs. hidden states: The rationale behind using output states for attention calculation, as they better preserve individual word contributions compared to hidden states.
- Teacher forcing: A training strategy that accelerates model convergence and stability by using ground truth values as input instead of model predictions.
- Transformer models: A brief overview of their motivations, unique architecture, and impact on seq2seq tasks.
The work delves into the implementation of these concepts in PyTorch (especially the sequence model), providing a step-by-step breakdown of information flow and sequence processing within RNNs with attention. We also aim to shed light on the transformative power of attention mechanisms and transformers for language translation.
We will be using the GRU as our RNN variant to explore these concepts in more detail.
Understanding Basic Concept
In machine translation we have two components: an encoder and a decoder. The encoder's job is to encode the sentence (a sentence is a series of words in a sequence) into a vector called the context vector, which is the input to the decoder unit; the decoder in turn decodes this context vector to generate a sequence of words in the other language.
Here I try to dive into the internal workings of encoder-decoder models with attention used for language translation.
An encoder is a sequence model. In language translation applications that employ attention mechanisms, the encoder does not discard its output states. Instead, all of the encoder's outputs are passed to the decoder model, in conjunction with the encoder's final hidden state.
Now the question that might come to your mind is: why not use the intermediate hidden states of the encoder at every time step instead (since they too capture information about each input word)? Or, put differently, why use the output state at every time step?
Output states vs. hidden states
So let's go back a bit. The reason for introducing attention in the first place (especially in a seq2seq model) is to make sure that every word the decoder produces is aware of how much each input word contributed to it, so that the generated word or sentence comes from a more meaningful understanding of the input sequence, especially for longer input sequences. In longer sequences, standard encoder-decoder models, which summarize the entire input into a fixed-length context vector, struggle to retain information about the earlier parts. This leads to inaccurate translations: a plain sequence-to-sequence model uses a fixed-length context vector to carry the understanding of the input sequence forward to the decoder, which must then generate the output sequence from this limited amount of information.
This works well enough for shorter sequences, but it cannot extend its understanding beyond a certain point, because the small, fixed-size context vector has often forgotten the earlier parts of the sequence by the time it has processed the whole thing. This is where attention comes into play. Attention tells the output word, at every decoder time step, which input words to focus on (remember the term word mentioned here), making sure that each word generated in the output sequence reflects the right proportional contribution of each input word. The attention model does this by assigning a weighted score to each input word in the sequence with respect to each output word being generated.
How these scores are generated ?
We will look into this later, but for now imagine a dictionary keyed by output word, where each output word has a list of weight scores, one per input word, telling us how much that input word contributes to that output word.
Hence the answer: the individual input word we mentioned is the reason for using the output states, because each encoder output carries information about that specific word at its time step, and that is exactly what attention focuses on (attention mechanisms are designed to selectively focus on the specific parts of the input that are most relevant to the current task). In other words, attention emphasises individual components rather than a blended group of words, and a blend is exactly what the hidden state stores: cumulative information from its previous time steps. By accumulating information across all previous time steps, the hidden state forms a progressively richer representation of the overall context, but it does not preserve the individual contribution of each word or element in the sequence as cleanly as the output sequence does. Attention, however, often needs to focus on specific parts of the input and make fine-grained decisions about relevance.
Also, as discussed before, by the time it reaches the final state the hidden state has blended information from the entire input sequence, making it difficult for attention to distinguish the specific parts that are most relevant to the current task. The same applies at each encoder time step (except for the first few steps, but here we are talking about longer sequences and deeper networks).
While it’s theoretically possible to use hidden state outputs for attention, the vast majority of research and practical implementations in deep learning frameworks rely on the output sequence for attention weight calculation. This convention has been found to be effective and efficient across various tasks involving attention.
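To make the contrast concrete, here is a minimal, hypothetical sketch (toy numbers, a hidden size of 4 and a 3-word input, all made up) showing how attention builds a per-step context vector as a weighted sum of the encoder's per-word outputs, something a single final hidden state cannot provide:
import torch

# Toy encoder outputs: one 4-dimensional vector per input word (3 words).
encoder_outputs = torch.tensor([[0.9, 0.1, 0.0, 0.2],   # "mera"
                                [0.1, 0.8, 0.3, 0.0],   # "naam"
                                [0.0, 0.2, 0.9, 0.5]])  # "divyanshu"

# Hypothetical attention weights for one decoder step: mostly focused on the third word.
weights = torch.tensor([0.05, 0.15, 0.80])

# Context vector for this decoder step: a weighted sum of the per-word outputs.
context = (weights.unsqueeze(1) * encoder_outputs).sum(dim=0)
print(context)  # a blend dominated by the "divyanshu" vector
A different decoder step would simply use a different set of weights over the same per-word outputs, which is exactly why those outputs need to be kept around.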
In this sequence-to-sequence modelling journey, we’ll delve into the intricacies of using recurrent neural networks with self-attention mechanisms in PyTorch. We’ll dissect the information flow step-by-step, gaining a deep understanding of how sequences are processed and transformed. Diving further, we’ll explore the motivations behind the revolutionary shift towards transformer models and how their unique architecture has reshaped the landscape of sequence-to-sequence tasks. Through key architectural insights, we’ll demystify the transformative power of transformers.
Teacher Forcing
Before we move forward, I would also like to draw your attention to a concept called teacher forcing.
The problem with traditional training is that the output of the current time step is fed as input to the subsequent time step, leading to:
- Slow convergence.
- Model instability
- Poor model skill
Teacher forcing is a strategy that allows RNN/Transformer architectures to use the ground-truth (actual) values as input rather than the model's output from the prior time step, making training faster and more stable. There is, however, a slight difference in how the two families use it.
In recurrent models, teacher forcing means feeding the previous ground-truth word as the input at each step; in transformers, the shifted ground-truth target sequence is fed to the decoder all at once and serves as the queries of its masked self-attention layers.
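As a minimal sketch of the recurrent case (hypothetical names and sizes, a plain GRUCell standing in for the decoder), here is what the two regimes look like inside a training loop; with teacher_forcing=True the ground-truth token is fed back, otherwise the model's own prediction is:
import torch
import torch.nn as nn

vocab_size, hidden_size = 20, 16
embed = nn.Embedding(vocab_size, hidden_size)
cell = nn.GRUCell(hidden_size, hidden_size)
to_vocab = nn.Linear(hidden_size, vocab_size)

target = torch.tensor([3, 7, 5, 1])      # hypothetical ground-truth sequence
hidden = torch.zeros(1, hidden_size)     # decoder hidden state (batch of 1)
token = torch.tensor([0])                # start-of-sequence token
teacher_forcing = True

for t in range(len(target)):
    hidden = cell(embed(token), hidden)
    logits = to_vocab(hidden)            # scores over the output vocabulary
    predicted = logits.argmax(dim=-1)
    # Teacher forcing: feed the true token; otherwise feed the model's own guess.
    token = target[t].unsqueeze(0) if teacher_forcing else predicted.detach()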
The Encoder Module
A recurrent network processes the input sequence sequentially, one time step at a time (in an incremental fashion), and the subscripts track time steps to maintain order, where 0 represents the start of the sequence and subsequent values (1, 2, 3, …) represent subsequent time steps.
The hidden state accumulates information from all previous time steps, forming a progressively richer understanding of the input sequence's overall meaning. We call the last hidden state of the encoder the context vector (ctx); it is responsible for carrying this information forward to the decoder unit. These hidden states enable recurrent networks to capture long-range dependencies within sequences, even when those dependencies span many time steps.
To ensure consistency between input and output dimensions, we pad every input/output sequence to a length of 11, the maximum word-sequence length in our data corpus.
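As a small illustrative sketch (the word-to-index mapping here is made up, and the end-of-sentence index is assumed to be 1 as in the rest of the code), this is what padding a label-encoded sentence to the fixed length of 11 looks like:
MAX_LENGTH = 11
EOS_token = 1  # assumed end-of-sentence index

word2index = {"mera": 2, "naam": 3, "divyanshu": 4, "hai": 5}  # hypothetical vocabulary
sentence = "mera naam divyanshu hai"

ids = [word2index[w] for w in sentence.split()] + [EOS_token]
padded = ids + [0] * (MAX_LENGTH - len(ids))   # pad with zeros up to length 11
print(padded)  # [2, 3, 4, 5, 1, 0, 0, 0, 0, 0, 0]
With the inputs in this shape, the encoder module below can consume them: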
class sequenceEncoder(nn.Module):
    def __init__(self, input_size, hidden_size, drop_out=0.1):
        super(sequenceEncoder, self).__init__()
        self.input_embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(drop_out)

    def forward(self, X):
        # Guard against token ids outside the embedding range before doing the lookup.
        if not (torch.all(X >= 0) and torch.all(X < self.input_embedding.num_embeddings)):
            raise ValueError("Input contains token ids outside the embedding range")
        embed_vector = self.dropout(self.input_embedding(X))
        output, hidden = self.gru(embed_vector)
        return output, hidden
Let's delve into this process with time step 0 in focus. If we zoom in a bit, the network at the 0th time step looks like this; let's try to understand it with an example:
Consider the input sequence "mera naam divyanshu hai." Focusing on time step 0, the word "mera" undergoes word-wise indexing as part of the label-encoding process. In this approach, each word in the sequence is assigned a unique integer identifier that distinguishes it within the training corpus. In our instance, "mera" is labeled 1, "naam" 2, and so on. This is called word-wise indexing, or label encoding.
The labeled word "mera" is then represented as a tensor of size (batch_size, 1) and passed through an embedding layer. This layer transforms the label encoding into a more semantically meaningful vector representation of size (batch_size, 1, 128). Here, the '1' represents the single word in focus, and '128' is the embedding dimension. Applied to the entire sentence, the tensor size becomes (batch_size, 11, 128): 11 words, each with a 128-dimensional embedding.
Subsequently, this embedding vector is fed into the GRU unit to generate an output and a hidden vector. For a single time step (e.g., time step 0) the output is of size (batch_size, 1, 128). Over the entire sequence the output vector becomes (batch_size, 11, 128), while the hidden vector has shape (1, batch_size, 128), since PyTorch keeps the layer dimension first even with batch_first=True. This gives a comprehensive picture of the sequential information encoded in the input sentence.
We also have a special hidden vector called the context vector (the last hidden state of the encoder module). It is one of the inputs to our decoder, since it carries a summarized representation, in vector form, of the entire sequence the encoder has seen across all previous time steps.
Note that there is also an initial hidden vector h0, typically initialized to zeros since the model has not yet read any input at this point in time; it has the same size as the other hidden states, in our case (1, batch_size, 128).
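To tie the shape discussion together, here is a quick, hedged shape check of the encoder above (assuming a vocabulary of 1000 words, hidden_size=128 and the max length of 11; these numbers are illustrative):
import torch

batch_size, vocab_size, hidden_size, max_len = 4, 1000, 128, 11

encoder = sequenceEncoder(vocab_size, hidden_size)
X = torch.randint(0, vocab_size, (batch_size, max_len))   # label-encoded sentences

output, hidden = encoder(X)
print(output.shape)   # torch.Size([4, 11, 128]) - one output vector per word
print(hidden.shape)   # torch.Size([1, 4, 128])  - final hidden state (the context vector)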
Here is how the encoder (a GRU) builds up its understanding of the input sequence "mera naam divyanshu hai", step by step (a small code sketch follows this walkthrough).
Input: "mera"
Hidden State (Initial): A vector of zeros, indicating no prior context.
Recurrent Processes: Combines input and hidden state, capturing basic information about "mera" and its potential role as a determiner.
Output: Represents the GRU's understanding of "mera" in isolation, its potential meanings, and its role as a sentence starter.
Updated Hidden State: Captures the meaning of "mera" and its potential role as a sentence starter.
Input: "naam"
Hidden State: Now contains information about "mera"
GRU Processes: Integrates input with hidden state, understanding "naam" in the context of "mera"
Output: Represents the GRU's understanding of "naam" in the context of "mera"
Updated Hidden State: Captures the phrase "mera naam" and its potential meaning.
Similar process continues for "divyanshu":
* Input is the current word.
* Hidden state carries information from previous words.
* GRU processes input and hidden state, updating both.
* Output represents GRU's understanding at this step.
* Updated hidden state captures accumulated context.
This process repeats for each word in the sentence until it reaches the final word ("hai").
Input: "hai"
Hidden State: Contains information about the entire sentence up to "divyanshu"
GRU Processes: Integrates input with hidden state, understanding "hai" in the full context.
Output: Represents the GRU's final understanding of "hai" in the sentence.
Updated Hidden State: Holds the complete representation of the entire sentence "mera naam divyanshu hai"
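The walkthrough above can be condensed into a tiny, hypothetical sketch (made-up word indices, a GRUCell standing in for the encoder) that shows the hidden state accumulating context while a per-word output is kept at every step:
import torch
import torch.nn as nn

words = ["mera", "naam", "divyanshu", "hai"]
word2index = {w: i + 2 for i, w in enumerate(words)}   # hypothetical indices
hidden_size = 128

embed = nn.Embedding(10, hidden_size)
gru_cell = nn.GRUCell(hidden_size, hidden_size)

hidden = torch.zeros(1, hidden_size)      # h0: no prior context
outputs = []
for w in words:
    x = embed(torch.tensor([word2index[w]]))
    hidden = gru_cell(x, hidden)          # hidden now also reflects word w
    outputs.append(hidden)                # per-word output kept for attention
# `hidden` now summarizes the whole sentence; `outputs` keeps one vector per word.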
The Decoder (Attention) Module
Now, turning our attention to the decoder, it incorporates a mechanism that considers both the outputs from the encoder unit (the "output sequence") and the context vector (ctx), along with the decoder input at the current time step.
The distinctive features of the decoder with attention, as compared to a regular decoder, can be outlined as follows:
- The outputs of the encoder are taken into consideration at every decoder time step; the importance of the output sequence was already explained in detail in the encoder part.
- The context vector (ctx) is not the conventional last hidden state of the encoder. Instead, it is computed as a weighted sum of the encoder's output sequence, where the weights are determined by the attention mechanism with respect to the current decoder time step.
To elaborate further, the attention weights are computed before being fed into the decoder. They are derived from a mechanism that assesses the relevance of each element of the encoder's output sequence to the current decoding step. The calculation considers not only the decoder's hidden state from the previous time step but also the outputs of the encoder.
This approach is particularly effective in capturing and preserving long-term dependencies within the sequential data.
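In equation form (this matches the additive, Bahdanau-style attention implemented in the attentionMechanism class further below, where Wa, Ua and Va are learned linear layers, s is the decoder's current hidden state acting as the query, and h_i are the encoder outputs acting as the keys):

score_i = Va · tanh(Wa·s + Ua·h_i)
weight_i = softmax_i(score_i)
ctx = Σ_i weight_i · h_i

Because the weights sum to 1 over the input positions, ctx is simply a convex blend of the encoder's per-word outputs, re-computed afresh at every decoder time step.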
See the code below for a detailed understanding of the decoder module with attention.
class decoderAttention(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(decoderAttention, self).__init__()
        self.output_size = output_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attention = attentionMechanism(hidden_size).to(device)
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, input_tensor=None):
        # First input formatting: take the batch_size from the encoder output so shapes stay consistent
        # across the network.
        batch_size = encoder_outputs.size(0)
        # decoder_input is the decoder input at the current time step: the label encoding of a single word,
        # of dimension (batch_size, 1). Every sequence starts with the SOS token.
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        # Second input formatting: the encoder's final hidden state initializes the decoder's hidden state.
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []
        # For all time steps (the whole output sequence), process one word per iteration up to MAX_LENGTH tokens.
        for time_seq in range(MAX_LENGTH):
            # <seq_step> performs the evaluation for a single word/time step.
            decoder_output, decoder_hidden, attn_weights = self.seq_step(decoder_input, decoder_hidden, encoder_outputs)
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)
            # Here we apply "teacher forcing" (already explained): the input to the decoder at the next time
            # step is the ground-truth target word rather than the decoder's own prediction.
            # This column of "input_tensor" becomes the "decoder_input" at the following time step; unsqueeze(1)
            # keeps the input dimension at (batch_size, 1) since we are dealing in batches.
            if input_tensor is not None:
                decoder_input = input_tensor[:, time_seq].unsqueeze(1)
            else:
                # Without teacher forcing: use the model's own prediction as the next input.
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input
        # Concatenate the decoder outputs accumulated over the time steps into a tensor of size
        # (batch_size, MAX_LENGTH, output_size); MAX_LENGTH is the time dimension (one predicted word per step)
        # and the last dimension holds the scores over the output vocabulary.
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        # Same for the attention weights.
        attentions = torch.cat(attentions, dim=1)
        # To get the predicted result we apply log_softmax to those scores; the maximum gives us the index of
        # the predicted word at that time step.
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_hidden, attentions

    def seq_step(self, input, hidden, encoder_outputs):
        embedded = self.dropout(self.embedding(input))
        # The recurrent network's hidden state is (1, batch_size, hidden_size); permute(1, 0, 2) swaps the
        # 0 and 1 dimensions to get (batch_size, 1, hidden_size) for mathematical convenience.
        query = hidden.permute(1, 0, 2)
        # Call the attention layer.
        context, attn_weights = self.attention(query, encoder_outputs)
        # As explained before, the context vector here is not the plain encoder context vector but the
        # attention-weighted sum. It is concatenated with the decoder input embedding (e.g. [decoder_input + ctx]),
        # giving an input of size 2*hidden_size that is fed to our recurrent network.
        input_gru = torch.cat((embedded, context), dim=2)
        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)
        return output, hidden, attn_weights
Here is a closer look at what the attention mechanism itself looks like:
class attentionMechanism(nn.Module):
    def __init__(self, hidden_size):
        super(attentionMechanism, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, key):
        # IMPORTANT for training the attention weights.
        # query: the decoder's current hidden state; key: the encoder outputs.
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(key)))
        # Dimension (batch_size, MAX_LENGTH, 1): the softmax score for each input word in the sequence.
        # This softmax score determines how much each word is expressed at this position. Often the word at this
        # position has the highest score, but sometimes it is useful to attend to another word that is relevant
        # to the current word.
        weights = F.softmax(scores, dim=1)
        # Dimension (batch_size, MAX_LENGTH, hidden_size): multiply each softmax score with its corresponding
        # encoder output vector. The intuition is to keep intact the values of the word(s) we want to focus on
        # and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
        # This gives us vectors with relevance scores with respect to the query word.
        context = key * weights
        # Sum the weighted vectors to get the "context vector" of dimension (batch_size, 1, hidden_size), which,
        # together with the decoder input, acts as the input to the decoder.
        context = torch.sum(context, dim=1, keepdim=True)
        return context, weights
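A quick end-to-end shape check of the two modules above. This is a hedged sketch that assumes hidden_size=128, input/output vocabularies of 1000 words, MAX_LENGTH=11, and the SOS_token/device globals defined as in the rest of the code; the numbers are illustrative only:
import torch

batch_size, hidden_size, in_vocab, out_vocab = 4, 128, 1000, 1000

encoder = sequenceEncoder(in_vocab, hidden_size).to(device)
decoder = decoderAttention(hidden_size, out_vocab).to(device)

X = torch.randint(0, in_vocab, (batch_size, MAX_LENGTH), device=device)
Y = torch.randint(0, out_vocab, (batch_size, MAX_LENGTH), device=device)  # targets for teacher forcing

encoder_outputs, encoder_hidden = encoder(X)
dec_out, dec_hidden, attn = decoder(encoder_outputs, encoder_hidden, Y)   # with teacher forcing
print(dec_out.shape)     # torch.Size([4, 11, 1000]) - log-probabilities over the vocabulary per step
print(dec_hidden.shape)  # torch.Size([1, 4, 128])   - final decoder hidden state
Passing Y enables teacher forcing; omitting it makes the decoder feed back its own predictions instead.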
With only 111 training samples for the Garhwali language, which is tiny compared to common training datasets, the model equipped with an attention mechanism still demonstrates promising language understanding. Garhwali is a complex language and the resulting translations are often semantically incorrect, yet the attention mechanism manages to capture phonetic elements of the language to some extent. This offers insightful glimpses into the potential of the network when combined with attention.
With increased training data in the future, we can expect significant improvements in translation quality. By feeding the network with a richer corpus of Garhwali sentences, the attention mechanism will have more information to draw upon, allowing it to:
- Refine semantic understanding: Capture not only phonetics but also the deeper meaning and grammatical structure of the language.
- Reduce semantic errors: Generate more accurate and nuanced translations, minimizing semantic deviations from the source text.
- Enhance model generalizability: Adapt to diverse linguistic patterns and translate even complex sentences effectively.
Therefore, while the current results with limited data are encouraging, they serve as a stepping stone towards achieving high-quality Garhwali translations through further training and leveraging the power of attention mechanisms. The timing helpers and the training/evaluation loops used in the experiments are shown below.
def asMinutes(s):
m = math.floor(s / 60)
s -= m * 60
return '%dm %ds' % (m, s)
def timeSince(since, percent):
now = time.time()
s = now - since
es = s / (percent)
rs = es - s
return '%s (- %s)' % (asMinutes(s), asMinutes(rs))
def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
decoder_optimizer, criterion):
total_loss = 0
for data in dataloader:
input_tensor, target_tensor = data
encoder_optimizer.zero_grad()
decoder_optimizer.zero_grad()
encoder_outputs, encoder_hidden = encoder(input_tensor)
decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
loss = criterion(
decoder_outputs.view(-1, decoder_outputs.size(-1)),
target_tensor.view(-1)
)
loss.backward()
encoder_optimizer.step()
decoder_optimizer.step()
total_loss += loss.item()
return total_loss / len(dataloader)
def test_epoch(dataloader, encoder, decoder, criterion):
total_loss = 0
with torch.no_grad(): # Disable gradient calculation for testing
for data in dataloader:
input_tensor, target_tensor = data
encoder_outputs, encoder_hidden = encoder(input_tensor)
decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
loss = criterion(
decoder_outputs.view(-1, decoder_outputs.size(-1)),
target_tensor.view(-1)
)
total_loss += loss.item()
return total_loss / len(dataloader)
def train(train_dataloader, test_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
          print_every=100, plot_every=100):
start = time.time()
plot_losses = []
print_loss_total = 0 # Reset every print_every
plot_loss_total = 0 # Reset every plot_every
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()
for epoch in range(1, n_epochs + 1):
# Training phase
loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
print_loss_total += loss
plot_loss_total += loss
if epoch % print_every == 0:
print_loss_avg = print_loss_total / print_every
print_loss_total = 0
print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
epoch, epoch / n_epochs * 100, print_loss_avg))
if epoch % plot_every == 0:
plot_loss_avg = plot_loss_total / plot_every
plot_losses.append(plot_loss_avg)
plot_loss_total = 0
# Evaluation phase
test_loss = test_epoch(test_dataloader, encoder, decoder, criterion)
print(f'Epoch {epoch}, Test Loss: {test_loss:.4f}') # Print test loss
return plot_losses
Here are some results we get while training for 1000 epochs with 0.1 dropout and a single training batch, reaching a training and validation loss of 0.005 and 0.0021 respectively.
Since we have so few data samples, only a 5% test split is used; here is the code for the data loaders.
def get_dataloaders(batch_size):
input_lang, output_lang, pairs, MAX_LENGTH = prepareData('hin_eng', 'garh')
n = len(pairs)
input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
for idx, (inp, tgt) in enumerate(pairs):
inp_ids = indexesFromSentence(input_lang, inp)
tgt_ids = indexesFromSentence(output_lang, tgt)
inp_ids.append(EOS_token)
tgt_ids.append(EOS_token)
input_ids[idx, :len(inp_ids)] = inp_ids
target_ids[idx, :len(tgt_ids)] = tgt_ids
dataset = TensorDataset(torch.LongTensor(input_ids).to(device),
torch.LongTensor(target_ids).to(device))
# Split into training and test sets
test_size = int(0.05 * len(dataset)) # 5% for testing
train_size = len(dataset) - test_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])
train_sampler = RandomSampler(train_dataset)
test_sampler = SequentialSampler(test_dataset) # Ensure deterministic order for testing
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=batch_size)
return input_lang, output_lang, train_dataloader, test_dataloader
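Putting the pieces together, here is a hedged sketch of how these functions might be wired up. The hidden size of 128, batch size of 32 and 1000 epochs mirror the experiments described above but are assumptions, and input_lang.n_words / output_lang.n_words follow the vocabulary helper of the standard PyTorch translation tutorial, so adjust them to your own Lang class if it differs:
hidden_size = 128
batch_size = 32
input_lang, output_lang, train_dataloader, test_dataloader = get_dataloaders(batch_size)

encoder = sequenceEncoder(input_lang.n_words, hidden_size).to(device)
decoder = decoderAttention(hidden_size, output_lang.n_words).to(device)

# Train for 1000 epochs, printing/plotting the averaged loss every 100 epochs.
plot_losses = train(train_dataloader, test_dataloader, encoder, decoder,
                    n_epochs=1000, print_every=100, plot_every=100)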
While some test results deviate from expectations, the model demonstrates a partial ability to capture certain relationships. Notably, in Input_4/Output_4 the model correctly maps the words "raaste" and "batu", which convey the same meaning.
The same holds for Input_1/Output_1, where "hu" and "chaun" mean the same thing; Input_2/Output_2 is a correct response, and Input_3/Output_3 is at least semantically close.
test_idx = 0
decoded_words = []
for idx in X_test[test_idx]:
if idx.item() == 0 or idx.item() == 1:
pass
else:
decoded_words.append(input_lang.index2word[idx.item()])
test_input = " ".join(decoded_words)
input_sequence = test_input
sentence = normalizeString(input_sequence)
print(input_sequence)
with torch.no_grad():
input_tensor = tensorFromSentence(input_lang, sentence)
encoder_outputs, encoder_hidden = encoder(input_tensor)
decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)
_, topi = decoder_outputs.topk(1)
decoded_ids = topi.squeeze()
decoded_words = []
for idx in decoded_ids:
if idx.item() == 0 or idx.item() == 1:
pass
else:
decoded_words.append(output_lang.index2word[idx.item()])
print(" ".join(decoded_words))
Input_1 : me theek hu, tum kese ho
Output_1: me theek theek theek chaun theek chaun kai theek theek chaun
Input_2 : tumhara naam kya ha
Output_2: tyar naam kya cha ?
Input_3 : bhaarat ke kis raajy kee janasankhya sabase adhik hai ?
Output_3: kai india rajya ki jaansakha sabsey bandya cha ?
Input_4 : aap kis raaste se aaye ?
Output_4: kai batu aao ? tum kiley aiya ?
Why Transformer?
Although the introduction of the attention mechanism brings a clear improvement to the recurrent network, it is still far from what transformers have delivered since their introduction.
Here’s a breakdown of why transformer models have become increasingly favoured over recurrent seq2seq networks for language translation tasks, even though both utilize attention mechanisms:
Parallel Processing and Speed
- Recurrent networks process sequences sequentially, one element at a time, unlike transformers, which process all positions in parallel; this leads to slower training and inference.
Long-Range Dependencies
- While self-attention mechanisms enhanced recurrent networks’ ability to handle longer sequences, their inherent sequential nature still poses challenges for capturing long-range dependencies effectively, especially as sequence length increases. Transformers, in contrast, revolutionise sequence modelling through multi-head attention, not only maintaining long-range dependencies effortlessly but also introducing multifaceted representations that capture information from diverse perspectives.
Global Context
- Recurrent networks build context incrementally, with each word’s representation influenced by only those that came before it whereas in transformers, each word’s representation is informed by all other words in the input simultaneously, providing a richer, more comprehensive understanding of the global context.
Improved Performance
- Transformers have consistently demonstrated superior performance in various language translation tasks, achieving better accuracy and fluency compared to recurrent seq2seq models.
Easier to Parallelize
- The parallelizable nature of transformers makes them more efficient for training on large datasets and deployment on multi-GPU or TPU systems.
Transformer Architecture
Encoder/Decoder Self-Attention
Query, key, and value all belong to the same source (encoder) or target (decoder) sequence. Each row (representing a query) indicates how relevant that word is with respect to all the other words in the same sequence.
Encoder-Decoder Attention
Query and (key, value) belong to different sequences: the query comes from the target (decoder) side, while the (key, value) pair comes from the source (encoder) side. Each row (representing a target query) indicates how relevant each source word is to that target word.
Multi-Head Attention Mechanism
- It gives the model the ability to focus on different positions with respect to the word in focus. For example, if we are translating the sentence "Divyanshu is a Senior Data Scientist, whose expertise lies in the field of computer vision & NLP Domain", it is useful to know which word "whose" refers to.
- It provides different representational subspaces of the input sequence, giving the model a way to view the input in a diversified way. Here are some benefits:
- Improved generalisation: the model can translate input sequences it has very likely never seen, by keeping these different viewpoints in perspective.
- Enhances expressive power: The model can learn more complex patterns and representations by combining multiple subspaces.
- Capture Diverse relationships: Attention heads can focus on different types of relationships within the input, such as word order, semantic similarity, or syntactic structure.
The decoder is an auto-regressive model that generates the output sequence one word at a time, so we need to prevent it from conditioning on future tokens, which is done by masking (see the short sketch below).
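As a brief, hedged illustration of these ideas (multiple heads attending to one sequence, a causal mask blocking future tokens, and cross-attention between two sequences), using PyTorch's built-in nn.MultiheadAttention with toy, made-up sizes:
import torch
import torch.nn as nn

seq_len, batch_size, d_model, n_heads = 5, 2, 16, 4
x = torch.randn(batch_size, seq_len, d_model)   # a toy target-side sequence

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

# Causal mask: position i may only attend to positions <= i (True marks a blocked position).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Decoder self-attention: query, key and value all come from the same (target) sequence.
out, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(out.shape)           # torch.Size([2, 5, 16])
print(attn_weights.shape)  # torch.Size([2, 5, 5]) - averaged over the 4 heads by default

# Encoder-decoder (cross) attention: queries from the target, keys/values from the source outputs.
memory = torch.randn(batch_size, 7, d_model)     # toy encoder outputs (source length 7)
cross_out, cross_weights = mha(x, memory, memory)
print(cross_weights.shape)  # torch.Size([2, 5, 7])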
Conclusion
This article explored the intricacies of seq2seq models with attention for Hindi to Garhwali (romanized for both) translation using PyTorch and GRU networks. We delved into key concepts like the encoder-decoder architecture, attention mechanisms, and output states, highlighting their vital roles in accurate and context-aware translation. We also walked through the detailed, commented code of the encoder/decoder with attention and its performance on the limited available data corpus. Furthermore, we briefly introduced transformer models and their advances over recurrent networks.
Following our exploration of seq2seq models with attention, the next blog will shift gears to evaluate the performance of fine-tuned LMs (language models) and LLMs (large language models) on the Hindi-to-Garhwali translation task. These transformer-based models have garnered significant attention, and we will put their capabilities to the test, where you will see them giving far better results on the same dataset.
Also do check out my next blog on SentencePiece for a deep dive into tokenization for added vocabulary.
I truly appreciate your open communication and feedback. It means a lot to me! Stay tuned for further updates, and feel free to reach out or connect with me via my LinkedIn page.