BERT in a nutshell

A beginner’s guide to understanding how BERT works

Karim Omaya
3 min read · May 15, 2021

A few months ago, I started working on a project that involved text classification. I had previously only worked with basic NLP techniques to prepare text data and applied simple ML algorithms for classification. However, I was aware of the state-of-the-art (SOTA) results that Transformer-based NLP models such as BERT, GPT-3, T5, and RoBERTa were achieving.

So I decided to write a quick nutshell that explains how BERT works. It's a very short introduction to BERT, and I hope it will be helpful for people looking for a high-level explanation.

BERT stands for Bidirectional Encoder Representations from Transformers. It was released by Google in October 2018 and achieved state-of-the-art results on many natural language understanding (NLU) tasks. BERT is based on a multi-layer bidirectional Transformer encoder and is pre-trained on plain text for masked word prediction and next sentence prediction. BERT combines the Transformer architecture with bidirectional context by using two pre-training tasks (a short loading sketch follows the list):

  • Masked language model (MLM)
  • Next sentence prediction
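
Before going into those two tasks, here is a minimal sketch of loading a pretrained BERT model. The use of the Hugging Face transformers library (with PyTorch) is my own assumption for all the examples in this post; the model names and sentences are just illustrations.

```python
# A minimal sketch (not the official release): loading pretrained BERT
# with the Hugging Face `transformers` library and PyTorch.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT reads a sentence in both directions.", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```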

Masked Language Model (MLM)

Before feeding word sequences into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token, and the model tries to predict the original value of the masked words based on the context. In technical terms, predicting the output words requires the following steps (see the code sketch after this list):

  • Adding a classification layer on top of the encoder output.
  • Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  • Calculating the probability of each word in the vocabulary with softmax.
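
As an illustration, here is a hedged sketch of those three steps using BertForMaskedLM from Hugging Face transformers, which bundles the classification layer and the projection to the vocabulary dimension into a single MLM head; the example sentence is made up.

```python
# A sketch of masked word prediction. `BertForMaskedLM` adds the MLM head
# (projection to vocabulary size) on top of the encoder; the softmax step
# over the vocabulary is done explicitly below.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = "The man went to the [MASK] to buy a gallon of milk."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

# Find the position of the [MASK] token and softmax over the vocabulary.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = torch.softmax(logits[0, mask_pos], dim=-1)

# Print the five most likely replacements for the masked word.
top = torch.topk(probs[0], k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.convert_ids_to_tokens(int(token_id)):>10}  {float(score):.3f}")
```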

Next Sentence Prediction (NSP)

During pre-training, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document. In 50% of the inputs the second sentence really is the subsequent sentence, while in the other 50% a random sentence from the corpus is chosen instead. To help the model distinguish between the two sentences during training, the input is prepared as follows (a tokenizer sketch follows the list):

  • A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  • A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings.
  • A positional embedding is added to each token to indicate its position in the sequence.
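
The sketch below shows how a BERT tokenizer (again the Hugging Face transformers one, as an assumption) builds exactly that paired input; the two sentences are invented for illustration.

```python
# A small sketch of the paired-input format: [CLS] sentence A [SEP] sentence B [SEP].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("The man went to the store.", "He bought a gallon of milk.")

# Special tokens are inserted automatically.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# token_type_ids play the role of the sentence (segment) embedding:
# 0 marks tokens of Sentence A, 1 marks tokens of Sentence B.
# Positional embeddings are added inside the model itself.
print(encoding["token_type_ids"])
```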

To predict whether the second sentence is indeed connected to the first, the following steps are performed (see the sketch after this list):

  • The entire input sequence goes through the Transformer model.
  • The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
  • The probability of IsNextSequence is calculated with softmax.
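
Putting it together, here is a hedged sketch of that prediction step with BertForNextSentencePrediction from Hugging Face transformers, again on made-up sentences.

```python
# A sketch of next sentence prediction: the [CLS] output is mapped to two
# logits (IsNext vs. NotNext) and softmax turns them into probabilities.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape (1, 2)

probs = torch.softmax(logits, dim=-1)
print(f"P(IsNext)  = {probs[0, 0]:.3f}")
print(f"P(NotNext) = {probs[0, 1]:.3f}")
```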

If you want to go more in-depth, I recommend Chris McCormick's channel. He made a video on this topic that I found very helpful.
