How to implement NER using HuggingFace Models?

In this blog, we will see how to implement a NER (named entity recognition) model using the Hugging Face library. By the end, we will know:

  1. What is NER (Named Entity Recognition)?
  2. How are entities tagged?
  3. How to load a dataset from Hugging Face
  4. How to import any model from Hugging Face
  5. How to train the model with the Trainer API
  6. How to run inference using the Hugging Face pipeline

Feel free to skip,

This blog is not meant to be read in one go; it is a code-along blog. It can get tiring in places, and that is simply the nature of the topic. I would recommend referring only to the parts you are interested in.

What is NER?

Named entities are a special kind of tag given to specific entities (words, subwords or phrases) inside a piece of text. To understand how they work, think of POS tags. POS tags capture the grammatical role of words in a sentence: in POS tagging we mark words as nouns, pronouns and so on (with further subdivisions).

Named entity tags work the same way, but they are often more useful because no grammatical rules are needed to assign them. Here we can detect places, people, names of organisations, values and so on. This type of tagging is helpful when we work on data extraction projects.
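To make this concrete, here is a small, purely illustrative sketch (the sentence and tags are made up; the exact tagging scheme used in real datasets is introduced later):

# A made-up example: each word gets an entity tag, or 'O' if it is not part of any entity
sentence = ["Apple", "hired", "Tim", "Cook", "in", "Cupertino"]
entity_tags = ["ORG", "O", "PER", "PER", "O", "LOC"]

for word, tag in zip(sentence, entity_tags):
    print(f"{word:>10} -> {tag}")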

NER in the Real world

For example, suppose we are working with financial documents and need to extract the spending and returns a specific organisation reported in the last quarter. As you might imagine, these data points will be scattered across the document. One way of doing this is to extract only the numeric values from the text (using regex etc.), but then we cannot tell which value belongs to which data point, e.g. how do we know the amount spent on marketing versus the amount spent on sales?

To tackle this issue, we can train a model on past documents. In these past documents, we mark individual values with their respective tags, e.g. one value gets the tag ‘spent_on_marketing’, another gets ‘spent_on_sales’ and so on. This way our model will figure out where these values appear in a document (most historic documents share a similar structure) and in which sentences they occur.

Since we use a transformer model for this extraction, we also get contextual information about the text, which means better accuracy!

NER implementation

If you don’t like reading and can understand code directly, you can jump directly to the code here,

GitHub

Dataset from Hugging Face

For this blog, as in many other tutorials, we are going to use the conll2003 dataset from Hugging Face. The code below loads it. The blog is not tied to this pre-fetched dataset, though; by the end you will have a general idea of how to handle any text data.

Importing libraries

# visualization libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn 
import torch.optim as optim 

# huggingface's transformers library
from transformers import RobertaForTokenClassification, RobertaTokenizer, pipeline, AutoTokenizer, AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

# huggingface's datasets library
from datasets import load_dataset

# the tqdm library, used to show iteration progress
from tqdm.notebook import tqdm as tqdmn

from seqeval.metrics import f1_score, classification_report

Importing data

dataset = load_dataset("conll2003")

We can peek into the dataset. As we will see, the dataset is not in a tabular format (a data frame); it is a DatasetDict, a dictionary of splits, each of which can be indexed like a list.

Peeking into data

print(dataset) 

# output
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

The Hugging Face dataset comes pre-split. There are three splits, each containing the same columns. To look at a specific example, we can index into a split directly, like below,

print(dataset['train'][0])

# output
{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

The first column is the id, which is self-explanatory. The second column, tokens, holds the words of the sentence. Here the words are already split for us, but even for raw text we could always use the string.split() method to get word-level tokens. We will soon see what the other three columns represent; for now, note that our target column is ner_tags.
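For instance, if we only had a raw sentence instead of pre-split tokens, a naive whitespace split is enough to get started (a real project might use a proper word tokeniser, but the idea is the same):

# The first training sentence, joined back into a raw string for illustration
raw_sentence = "EU rejects German call to boycott British lamb ."
tokens = raw_sentence.split()  # naive whitespace split into word-level tokens
print(tokens)
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']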

Subword Tokenisation

For NER, we are using a BERT-based model (the code later loads a small BERT checkpoint; RoBERTa, a robustly optimised variant of BERT, would work the same way). BERT-based models are MLMs (masked language models). If you don’t know what MLM stands for, there is no need to google it just yet; the only reason for bringing it up is to understand the subword tokenisation these models perform.

So let’s start with the ner_tags column. As we can see, only a few positions have non-zero values. These are the NER tags of specific words, encoded as integers. To get the original tag names we can use the following command.

# Extracting the features from the dataset
dataset['train'].features['ner_tags'].feature

# output
ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None)

The Hugging Face dataset conveniently gives us the tags and their mapping, and they are straightforward to understand: PER stands for person, ORG for organisation, LOC for location and MISC for miscellaneous, while O marks words that are not part of any entity. That part was easy. The interesting bit is that each entity tag comes in two flavours, B and I, which stand for Beginning and Inside (the usual BIO tagging scheme).

BERT-based models use subword tokenisers (even though our dataset only contains whole words), meaning a word may be split into several subword pieces before being converted to ids. The following image gives a better idea.

tokenised output

Here, we have printed the words (tokens) with their respective tags. ‘World Series of Golf’ is marked as miscellaneous: only the first word, ‘World’, is tagged B-MISC, while the following words are tagged I-MISC. The B- prefix marks the beginning of an entity span and the I- prefix marks the words inside (continuing) that span, so the model can tell where one entity ends and the next begins. Words that are not part of any entity are tagged ‘O’ for other.
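If you want to reproduce a view like the image yourself, a quick sketch (shown here for the first training example rather than the ‘World Series of Golf’ sentence) is:

# Print each word next to its decoded NER tag for one example
tags = dataset["train"].features["ner_tags"].feature  # the ClassLabel feature seen above
example = dataset["train"][0]
for word, tag_id in zip(example["tokens"], example["ner_tags"]):
    print(f"{word:>10}  {tags.int2str(tag_id)}")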

Tokenisation And Arranging Labels

Now that we have these tags in hand, we can begin to build our model. First, we need to define the label-to-id and id-to-label mappings so the model config knows what each integer label means. We can use the following code for it (grabbing the ClassLabel feature first so we have the tag names).

labels = dataset["train"].features["ner_tags"].feature # the ClassLabel feature holding the tag names

label2id = { k: labels.str2int(k) for k in labels.names } # Created these for the model config
id2label = { v: k for k, v in label2id.items() } # Created these for the model config

Nothing fancy here, these are simple dictionaries shown below,

print(label2id) 

# output
{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

###############

print(id2label)

# output
{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

As an optional step, if you want to see the tags as strings (like in the image above), you can use the following code.

def create_tag_names(batch):
  return {"ner_tags_str": [labels.int2str(idx) for idx in batch["ner_tags"]]}

dataset_x = dataset.map(create_tag_names)

# This code adds an additional column 'ner_tags_str' to our data.

Tokenisation using tokeniser

We have the NER tags and the tokens (words). First of all, we need to convert those words into input ids the model understands. For this, we can use the following code.

tokenized_inputs = tokenizer(examples["tokens"], truncation=False, is_split_into_words=True)

For the sentence used above, we can have the following output.

tokenizer(dataset['train'][0]['tokens'], truncation=False, is_split_into_words=True)

# output
{
'input_ids': [101, 5439, 1011, 7644, 2012, 2088, 2186, 1997, 5439, 1012, 102], 
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

The input_ids field contains the token ids. We have also disabled truncation and told the tokeniser that the words are already split, via is_split_into_words=True.

Loading a pre-trained model from Hugging Face

We have yet to define our tokeniser. Hugging Face provides a plethora of pre-trained models, and we can choose any of them. For this task, I have chosen the following model; you can import any model using the same syntax.

checkpoint = 'prajjwal1/bert-small' # loading this checkpoint because it is a small pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
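Once the tokeniser is loaded, a quick sanity check is to convert the input_ids back to their string form and see how words get split into subword pieces:

# Map input_ids back to readable tokens to inspect the subword split
encoded = tokenizer(dataset["train"][0]["tokens"], is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))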

Read more about BERT and its variants here,

BERT and its variants

Aligning the tokens

Our dataset’s labels are already aligned with whole words, but after subword tokenisation a single word can become several tokens, so the labels need to be re-aligned with the tokens; this is also the situation you will face with real-world data.

Now that we have the tokens and the respective NER tags, let’s use the following code to align them,

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=False, is_split_into_words=True) 
    
    # New labels for the whole batch
    labels = []
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        
        # New labels for individual example in a batch
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)

            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
        
    return tokenized_inputs

A lot to unpack here. The first line is the same tokenisation we did earlier. Then we iterate over ner_tags to align them with the new tokens. For that we use word_ids, which maps each token back to the index of the word it came from, like below,

print(tokenized_inputs.word_ids())
[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]

In addition, it assigns None to the special tokens added at the beginning and end of the sentence. This is helpful in the next step.

Once we have the word_ids, we can assign a label to every token. If the word id is None, the token is a special token at the beginning or end of the sentence, so we give it the label -100, which the loss function ignores. For each real word, only its first token keeps the word’s original NER tag; any remaining subword tokens of the same word are also set to -100.
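To double-check the alignment, a small sketch like the one below runs the function on the first training example and prints each token next to its aligned label (-100 marks ignored positions):

# Sanity check on one example: tokens vs. aligned labels
sample = tokenize_and_align_labels(dataset["train"][:1])
sample_tokens = tokenizer.convert_ids_to_tokens(sample["input_ids"][0])
for tok, lab in zip(sample_tokens, sample["labels"][0]):
    print(f"{tok:>10}  {lab if lab == -100 else id2label[lab]}")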

Finalising Tokenisation

Let’s process the entire dataset. We have already set up the logic for handling batches, so we just need to map the function over the dataset like below,

batch_size_num = 8
dataset_encoded = dataset.map(tokenize_and_align_labels, batched = True, batch_size = batch_size_num)

If we see the dataset now, we will have the following,

print(dataset_encoded["train"].column_names)

# output
['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels']

As you can see, a few columns have been added to our data: input_ids, token_type_ids and attention_mask come from the tokeniser, and labels is the aligned tag column created by our function.

We can drop unwanted columns as below,

dataset_encoded = dataset_encoded.remove_columns(["tokens","ner_tags", "id", "chunk_tags"])

Setting up Metrics

Before moving to the model, we need to set up a few things:

  1. First, we need to match the model’s outputs to our targets (the labels column in our dataset).
  2. We need to define the evaluation metric.

For the first step, we can use the following code,

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=-1)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []
    
    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(id2label[label_ids[batch_idx][seq_idx]])
                example_preds.append(id2label[preds[batch_idx][seq_idx]])
                
        labels_list.append(example_labels)
        preds_list.append(example_preds)
    return preds_list, labels_list

There are 9 possible tags for our model, so for every token in the sentence the model predicts a probability for each tag (a softmax output). To keep things simple, we use the argmax function, which picks the tag with the highest probability. Then we convert both the predicted and the actual ids back to tag names using the previously defined id2label dictionary, skipping positions labelled -100.

In the end, we have a list of predicted tag sequences and a list of actual tag sequences (one list per sentence).
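As a toy illustration of the argmax step (the logits here are random numbers, not real model output):

# Fake logits for a batch of 1 sentence with 3 tokens and 9 possible tags
fake_logits = np.random.rand(1, 3, 9)        # shape: (batch_size, seq_len, num_tags)
pred_ids = np.argmax(fake_logits, axis=-1)   # highest-scoring tag id per token
print([[id2label[i] for i in row] for row in pred_ids])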

It’s time to define the metrics. Here I am using the F1 score from seqeval, the standard entity-level metric for this kind of multi-class token classification. One way of doing it is as follows,

def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions,eval_pred.label_ids)
    detailed_report = classification_report(y_true, y_pred, output_dict = True)
    detailed_report = pd.DataFrame(detailed_report).T
    print(detailed_report)
    
    return {"f1": f1_score(y_true, y_pred)}

Training the model

Let’s load the model,

num_labels = len(label2id) # number of NER classes (9)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels = num_labels, id2label=id2label, label2id=label2id)

It’s pretty self-explanatory: we pass the checkpoint name, our label-mapping dictionaries, and the number of labels (the number of classes for classification).

Hugging Face provides the Trainer API, which we can use to train our model; it handles all the training loops for us. But before that, let’s set up a few hyper-parameters.

total_epochs = 10 
batch_size_num = 16
gradient_accumulation_steps = 2 # accumulate gradients over 2 batches before each optimiser step
effective_batch_size = batch_size_num * gradient_accumulation_steps # hence the effective batch size becomes twice the batch size

new_checkpoint = checkpoint.split("/")[-1] # 'bert-small', used only for naming the output directory
model_name = f"{new_checkpoint}-ner_{total_epochs}_epochs_{effective_batch_size}_batch_size"

Based on our GPU memory sizes we can tweak these parameters later.

Now there are a few more things we need to define for our Trainer; these are as follows,

training_args = TrainingArguments(
    output_dir = "../models/NER/" + model_name,
    per_device_train_batch_size = batch_size_num,
    gradient_accumulation_steps = gradient_accumulation_steps,
    num_train_epochs = total_epochs,
    learning_rate = 1e-5,
    weight_decay = 0.01,
    evaluation_strategy = "epoch",
    per_device_eval_batch_size = batch_size_num,
    save_strategy = 'epoch',
    logging_strategy = 'epoch',
    log_level = "error"
)

Just one more thing remains: the data collator. It handles padding within each batch for us, making the inputs a consistent length and producing batches ready for the model.

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer) 
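As an optional sanity check (not required for training), you can feed a couple of encoded examples through the collator and confirm that input_ids and labels get padded to the same length:

# Take two encoded examples, keep only the fields the collator needs,
# and let it pad them into a single batch of tensors
features = [
    {k: dataset_encoded["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
padded_batch = data_collator(features)
print(padded_batch["input_ids"].shape, padded_batch["labels"].shape)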

Done! Now let’s train our model using the Trainer.

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = dataset_encoded['train'],
    eval_dataset = dataset_encoded['validation'],
    compute_metrics = compute_metrics,
    tokenizer = tokenizer
)
result = trainer.train()

We can also evaluate the model with the call below; the only difference from training is that no back-propagation happens.

trainer.evaluate()

For demo purposes, I trained the model for just one epoch, but it still achieves decent accuracy.

               precision    recall  f1-score  support
LOC            0.846708  0.896026  0.870669   1837.0
MISC           0.707751  0.604121  0.651843    922.0
ORG            0.643186  0.728561  0.683217   1341.0
PER            0.922513  0.956569  0.939232   1842.0
micro avg      0.802273  0.831706  0.816725   5942.0
macro avg      0.780040  0.796319  0.786240   5942.0
weighted avg   0.802715  0.831706  0.815664   5942.0

{'eval_loss': 0.1240999847650528,
 'eval_f1': 0.8167245083457281,
 'eval_runtime': 5.812,
 'eval_samples_per_second': 559.187,
 'eval_steps_per_second': 139.883,
 'epoch': 1.0}

Model Inference using the Hugging Face pipeline

Once we have trained our model, Hugging Face provides a pipeline that handles all the preprocessing and postprocessing above and gives the output directly.

from transformers import pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Arshad and I live in Mumbai"

ner_results = nlp(example)
print(ner_results)

# output
[{'entity': 'B-PER', 'score': 0.84195936, 'index': 4, 'word': 'arshad', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.9583987, 'index': 9, 'word': 'mumbai', 'start': 34, 'end': 40}]

Looks good!

You can save the model to your local machine as below,

trainer.save_model("PATH/my_model")
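Later you can load the saved model back (the path below is just the placeholder used above) and run the same pipeline for inference; since we passed the tokeniser to the Trainer, it is saved alongside the model:

# Reload the fine-tuned model and tokeniser from the saved directory
loaded_model = AutoModelForTokenClassification.from_pretrained("PATH/my_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("PATH/my_model")

nlp = pipeline("ner", model=loaded_model, tokenizer=loaded_tokenizer)
print(nlp("My name is Arshad and I live in Mumbai"))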

Conclusion

I hope this was helpful. We saw how to use various built-in Hugging Face tools such as the Trainer and the pipeline, and we loaded a pre-trained model and fine-tuned it on text data. This approach should work for any text data.

If you have any queries or doubts or improvements, please hit me anytime at [email protected]!
