BERT and its most important variants

In this short blog, we will look into the basics of BERT and its variants; it should be helpful for reviewing the underlying concepts. The blog is intended for readers who already have some experience in NLP and know concepts like embeddings and vectorization. For each model, we will also see a short implementation using Hugging Face.


BERT

Bidirectional Encoder Representations from Transformers

BERT is a transformer-based model designed for NLP tasks. It uses only the encoder part of the transformer. It is most often pretrained with MLM (Masked Language Modelling), where some words (or sub-words) are masked and the model learns to predict them. There is a second pretraining objective, NSP (Next Sentence Prediction), but to keep this blog concise we only discuss MLM.

The masked word is predicted using the contextual information around it. As in a transformer encoder layer, this context is represented by the word embedding vectors combined with the positional encoding vectors. This makes BERT bidirectional, since the context is drawn from both directions (words occurring before the mask and after it).

The ability of this model to capture the contextual meaning of language makes it a strong choice for most NLP tasks, such as question answering, summarisation and many more.

This bidirectional nature gives BERT an advantage over bidirectional LSTMs. With BERT, an entire sequence (or a large batch of data) can be processed in one go, because the transformer architecture is built around parallel operations (e.g. multi-head self-attention attends to all positions at once), whereas an LSTM processes tokens sequentially and can lose long-range context.

Implementation

from transformers import BertTokenizer, BertModel

# load the pretrained tokenizer and the BERT encoder
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# encode a sample sentence and run it through the model
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
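
To see MLM in action, here is a minimal sketch (an illustration assuming the fill-mask pipeline and the bert-base-uncased checkpoint are available) that asks BERT to fill in a masked word:

from transformers import pipeline

# the fill-mask pipeline runs BERT's MLM head and returns the most likely tokens for [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("The capital of France is [MASK].")

for p in predictions:
    # each prediction has the predicted token and its probability score
    print(p["token_str"], round(p["score"], 3))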

RoBERTa

Robustly Optimized BERT Pretraining Approach

One of the best-known variations of BERT is the RoBERTa architecture. RoBERTa keeps BERT's architecture but is trained longer, on larger batches, with more training data, and it drops the NSP task, which simplifies pretraining. Training over larger batches and more data gives RoBERTa better accuracy than BERT, because with more (and more varied) text to learn from, MLM performance improves.

One more difference between BERT and RoBERTa is static vs dynamic masking. In BERT, the tokens are masked once during preprocessing (15% of the tokens) and the same masked inputs are reused throughout training. This means the model sees the same words masked in the same contexts over and over, which limits what it can learn from each pass. To resolve this, RoBERTa uses dynamic masking, where a new masking pattern is generated each time a sequence is fed to the model.
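
As a rough illustration (this is not RoBERTa's original pretraining code), Hugging Face's DataCollatorForLanguageModeling applies the masking on the fly, so the same sentence can get a different masking pattern every time it is batched:

from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# mask roughly 15% of the tokens, chosen anew for every batch
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

ids = tokenizer("Dynamic masking picks different tokens on every pass.")["input_ids"]

# calling the collator twice on the same example usually yields different masked positions
print(collator([{"input_ids": ids}])["input_ids"])
print(collator([{"input_ids": ids}])["input_ids"])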


Implementation

from transformers import RobertaConfig, RobertaModel

# Initializing a RoBERTa configuration
configuration = RobertaConfig()

# Initializing a model (with random weights) from the configuration
model = RobertaModel(configuration)

# Accessing the model configuration
configuration = model.config
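
The snippet above builds a randomly initialised model from a configuration. To use the pretrained weights instead, the pattern is the same as for BERT (a minimal sketch, assuming the roberta-base checkpoint):

from transformers import RobertaTokenizer, RobertaModel

# load the pretrained tokenizer and encoder weights
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
output = model(**encoded_input)  # last_hidden_state holds the contextual token embeddings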

ELECTRA

PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS

BERT-style models trained with MLM only receive a learning signal from the masked tokens (roughly 15% of each sequence); the unmasked positions contribute nothing to the loss. ELECTRA resolves this by employing a two-model approach. One model, a small generator, functions as an MLM and fills the masked positions with plausible tokens; its output feeds into a second model, the discriminator, a binary classifier that predicts for every token whether it is the original or a replacement produced by the generator. Because the discriminator learns from every token, not just the masked ones, pretraining is far more sample-efficient.

ELECTRA's generator-discriminator architecture (source)

Implementation

from transformers import ElectraConfig, ElectraModel

# Initializing a default ELECTRA (base-style) configuration
configuration = ElectraConfig()

# Initializing a model (with random weights) from that configuration
model = ElectraModel(configuration)

# Accessing the model configuration
configuration = model.config
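
To see the discriminator at work, here is a minimal sketch (assuming the google/electra-small-discriminator checkpoint) that scores each token as original or replaced:

from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# "flew" stands in for a token a generator might have substituted
inputs = tokenizer("The cat flew the fish", return_tensors="pt")
logits = model(**inputs).logits

# a positive logit means the discriminator thinks the token was replaced
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
flags = (logits[0] > 0).long().tolist()
print(list(zip(tokens, flags)))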

ALBERT

A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

The original BERT model has more than 100M parameters to train. As we increase the amount of data and the degree of parallelism during training, we can hit the limits of GPU/CPU memory. ALBERT was created to address this issue. To reduce the number of parameters, it decouples the embedding dimension from the hidden dimension in the network. This allows the embedding dimension to be smaller, making the model easier to train, especially when the vocabulary and corpus are huge.

All the transformer layers in the model share the same parameters (cross-layer parameter sharing), further reducing the number of trainable parameters.

The lighter architecture and the decoupled embedding dimension, together with a sentence-order prediction (SOP) objective that replaces NSP, make ALBERT strong on NLU tasks. Thus, with far fewer parameters than the original model, ALBERT provides similar results.

Implementation

from transformers import AlbertTokenizer, TFAlbertModel

# tokenizer for ALBERT
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

# pretrained model (TensorFlow version)
model = TFAlbertModel.from_pretrained("albert-base-v2")
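
As a quick check of the decoupled embedding described above, the checkpoint's configuration exposes a much smaller embedding dimension than hidden dimension (a small sketch continuing the snippet above):

# the embedding matrix uses a small dimension that is projected up to the hidden size
print(model.config.embedding_size, model.config.hidden_size)  # 128 vs 768 for albert-base-v2

# running a sample sentence through the TensorFlow model
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors="tf")
output = model(encoded_input)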

Why are these models better?

We have already described the answer in the previous sections. In general, BERT is a bigger model designed for MLM and NSP tasks. The variants described above give more flexibility while improving accuracy. To sum it up, here are my one-line observations about each model compared to the original BERT.

  1. RoBERTa: Dynamic masking plus longer training on bigger batches and more data make it better than BERT on MLM and downstream tasks. (I found this is good for self-supervised models as well.)
  2. ELECTRA: The dual-model (generator + discriminator) approach gives a more accurate contextual representation than BERT.
  3. ALBERT: ALBERT uses several techniques to reduce the number of trainable parameters, making it faster to train than BERT while retaining similar accuracy. The paper reports up to an 18x reduction in parameters compared to the original model (see the sketch after this list).
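
To make the size difference concrete, here is a minimal sketch (it downloads both checkpoints) comparing the parameter counts of BERT-base and ALBERT-base:

from transformers import BertModel, AlbertModel

bert = BertModel.from_pretrained("bert-base-uncased")
albert = AlbertModel.from_pretrained("albert-base-v2")

# num_parameters() counts all the weights in each model
print("BERT-base:", bert.num_parameters())      # roughly 110M
print("ALBERT-base:", albert.num_parameters())  # roughly 12M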

References:

  1. For implementations: Hugging Face
  2. RoBERTa overview: RoBERTa: An optimized method for pretraining self-supervised NLP systems
  3. RoBERTa Paper: https://arxiv.org/pdf/1907.11692.pdf
  4. ELECTRA Paper: https://arxiv.org/pdf/2003.10555.pdf
  5. My blog on embeddings: Word2Vec: An Overview – Journey of Curiosity
  6. BERT Overview: What is BERT (Language Model) and How Does It Work?
  7. ALBERT paper: https://arxiv.org/pdf/1909.11942.pdf
  8. Transformer Paper: Attention Is All You Need: https://arxiv.org/abs/1706.03762
  9. YouTube source for transformers and embeddings: AI Language Models & Transformers – Computerphile
