Sarcasm Detection Using GRU, LSTM

Yes, we are going to build a machine learning model for sarcasm detection. If you are just getting started in machine learning, this is a solid starting point for you.

Before starting this tutorial, you should have a basic understanding of neural networks and be familiar with the pandas library. Don’t worry if you are not that comfortable with those libraries yet; this tutorial is exactly for you.

Also, I would highly recommend doing this course on Coursera if you want to learn deep learning.

Let’s go!


Here is the outline for this tutorial:

  1. Collect the dataset from Kaggle and set up Colab
  2. Tokenise each word in the sentences
  3. Pad the token sequences
  4. Train a model using TensorFlow
  5. Save the model so that we can use it later

Use Google Colab for this tutorial; it’s a free platform (a Jupyter notebook) which saves us the hassle of installing all those dependencies.

Downloading the Kaggle dataset into Colab

First of all, open a Colab notebook; if you are familiar with Jupyter notebooks, you will feel right at home. Let’s first download a dataset from Kaggle into Colab.

For that, you need to download an API key from your Kaggle profile. You will find it just under the ‘Account’ menu.

Kaggle API

Now, run the following code,

! pip install kaggle

from google.colab import files
files.upload() # opens a file picker so you can upload kaggle.json

It will give you the option to upload a file into Colab. Grab the kaggle.json API key you just downloaded and upload it here. You can then see it in the Files pane in Colab.
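If you would rather confirm from code that the upload landed (just a sanity check, not in the original), list the file:

! ls kaggle.json # should print the file name if the upload worked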

Now that’s done, it’s time to download the dataset from Kaggle.

Run the following command in the next cell,

! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection
! unzip news-headlines-dataset-for-sarcasm-detection.zip -d sarcasm

The first two lines copy your API key into the kaggle directory. The third line actually downloads the data (a zip file) from Kaggle; in this case, we are using this dataset. You can get this command for any Kaggle dataset by simply going to that dataset’s page and copying the API command. Look at the following image…

API command

The fourth line unzips the archive into a directory with the name we provide, in this case, sarcasm.
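One small tip (my addition, not in the original): the Kaggle CLI warns if your API key file is readable by other users, so you may want to tighten its permissions right after copying it:

! chmod 600 ~/.kaggle/kaggle.json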

Peeking at the data

Now that we have downloaded the dataset into Colab, let’s look at the dataset itself. You will see that there are two .json files in it (v1 and v2). You can see more details about the dataset here. Let’s load them one by one.

First of all, let’s import all the libraries. Run the following,

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, Bidirectional, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

We will be making a pandas dataframe out of those json files. You can do this using the following commands,

data = pd.read_json("/content/sarcasm/Sarcasm_Headlines_Dataset.json", lines = True)
data2 = pd.read_json("/content/sarcasm/Sarcasm_Headlines_Dataset_v2.json", lines = True)

To look at either dataframe, use the following command. data.head() will give you the top rows of the dataframe.

print(data.head(10)) # prints the top 10 rows of data

Here is our output,

data

As you can see, there are three columns: article_link, headline and is_sarcastic. The headline column contains the actual headline, and is_sarcastic contains the label for that headline (1 or 0).

If you look at data2.head(10), you will find that the arrangement of the columns is different, but the dataset is of the same type. We can combine these two datasets, i.e. data and data2, into one dataframe which we can use for training.


To rearrange the columns in data2, use the following command,

data2 = data2[["article_link","headline", "is_sarcastic"]]

To concatenate two dataframes, use the following command,

data = pd.concat([data, data2])

Don’t forget to reset the index.

data.reset_index(drop=True, inplace=True)

While we’re at it, let’s remove the unwanted column, article_link. We don’t need it for sarcasm detection.

data.drop(['article_link'], axis=1, inplace=True)

Finally, we have a dataframe with 55,328 rows and 2 columns. Use the following code to view the data and its shape.

data.head(10)
data.shape # prints shape of dataframe (55328, 2)

Tokenisation of words

We are using text data to train our sarcasm detection model, but computers don’t understand words, just numbers. So we tokenise the text, assigning each word a number. (For example, ‘I want to eat an apple’ could become [10, 45, 2, 6, 3, 27], with each distinct word getting its own number.)

Don’t worry, we don’t have to assign them manually; Keras has a built-in Tokenizer which can do this for us. Just use the following code,

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['headline']) # will tokenise all the words in headlines
tokenizer.word_index # will print all the words and their tokens
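The model we build later relies on the vocabulary size. The original defines it off-screen, so as an assumption I am introducing the name total_words here, taken straight from word_index:

total_words = len(tokenizer.word_index) # number of distinct words in the corpus
print(total_words) # over 30,000 for this dataset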

So far we have tokenised the individual words; there are more than 30,000 distinct words in our dataset (we call this body of text the corpus). Now let’s turn each headline into its sequence of tokens.

def applyToken(s): # function to turn a sentence into its token sequence
  tokens = tokenizer.texts_to_sequences(s)[0]
  return tokens

# creating a new column in data which contains tokens for each headline
data['token'] = [applyToken([x]) for x in data['headline']]
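As a side note, the same column can be built in a single call, since texts_to_sequences accepts a whole list of texts; the result is equivalent:

data['token'] = tokenizer.texts_to_sequences(data['headline'].tolist())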

For a better understanding, see the illustration below,

tokenisation

Padding the token sequences

Our headlines have variable lengths, but we need sequences of constant length to train our model. We can achieve this by padding every sentence to the length of the longest sentence in the data. For example, in our data the longest headline has 157 tokens, so we pad every sequence to length 157 by adding zeros.

As shown below,

padding
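The padding code itself appears only as an image in the original, so here is a minimal sketch of what that step most likely looks like, using the pad_sequences import from earlier (max_len and padded are the names the later cells rely on):

max_len = max(len(t) for t in data['token']) # length of the longest headline, 157 in this data
padded = pad_sequences(data['token'].tolist(), maxlen=max_len, padding='pre') # zero-pad at the front
print(padded.shape) # (55328, 157)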

Train test split

Let’s split the dataset into training and validation sets. One caveat: pd.concat stacked all the v1 rows before the v2 rows, so a straight 80/20 slice is not a random split. The optional sketch below shuffles the rows first; then comes the split itself.
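An optional, hedged sketch (my addition, not part of the original notebook): shuffle the padded sequences and the dataframe together so headlines and labels stay aligned.

rng = np.random.RandomState(42) # fixed seed so the shuffle is reproducible
perm = rng.permutation(len(padded))
padded = padded[perm] # reorder the padded sequences
data = data.iloc[perm].reset_index(drop=True) # reorder the labels the same way

With that done, the slice-based split behaves like a random split,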

split_train = int(0.8 * len(padded)) # 0.8 is 80%
train_x = padded[:split_train] # 80% train data
val_x = padded[split_train:] # 20% validation data
train_y = data['is_sarcastic'][:split_train] # same for the labels
val_y = data['is_sarcastic'][split_train:]

Building a Model

Now that we have all the padded sequences, it’s time to build a model for sarcasm detection,

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words + 1, 16),          # 16-dimensional word embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),  # bidirectional GRU, 32 units each way
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')           # outputs the probability of sarcasm
])

Embedding Layer

We have an Embedding layer as the first layer. To understand embeddings, you can read this blog. In simple terms, embeddings capture the similarity between words. For example, the words ‘cat’ and ‘cats’ are far more similar than ‘cat’ and ‘movie’. Similarly, ‘beautiful’ and ‘spectacular’ are much more similar than ‘beautiful’ and ‘depressing’. In short, we capture the context in which words are used. This is done using vectors: the more similar two words are, the closer their word vectors will be.

We pass total_words (the total number of words in the corpus) plus one. The extra slot is needed because the Tokenizer numbers words starting from 1, keeping index 0 reserved (that is what the zero-padding uses), so the Embedding layer must accept total_words + 1 distinct indices.
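To make this concrete, here is a tiny standalone sketch (toy numbers, not our actual model) showing that an Embedding layer simply turns integer tokens into dense vectors:

emb = tf.keras.layers.Embedding(input_dim=100, output_dim=16) # toy vocabulary of 100 tokens
out = emb(tf.constant([[4, 7, 0]])) # one sequence of three token ids
print(out.shape) # (1, 3, 16): each token became a 16-dimensional vector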

GRU Layer

The second layer is a bidirectional GRU; to understand it, read this blog.
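As a quick standalone shape check (illustrative, not part of the training code): the Bidirectional wrapper runs the GRU forward and backward over the sequence and concatenates the two results, so 32 units become 64 output features.

bi = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32))
print(bi(tf.random.normal((1, 10, 16))).shape) # (1, 64): 32 forward + 32 backward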

Then we have a couple of dense layers with ReLU activation. The last layer has just 1 unit with sigmoid activation, so it outputs the probability that a headline is sarcastic.

Compile and fit the model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
history = model.fit(train_x, train_y, epochs=5,
                    validation_data=(val_x, val_y), verbose=1)

We are using binary_crossentropy as we are doing binary classification (0 or 1).

The model summary looks like this,

model

If run for just 5 epochs, we can achieve an accuracy of 98%, which is pretty good.
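If you want to see how accuracy evolved over those 5 epochs, here is a quick optional sketch using matplotlib; with metrics=['accuracy'] set, history.history holds the per-epoch values:

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='train') # training accuracy per epoch
plt.plot(history.history['val_accuracy'], label='validation') # validation accuracy per epoch
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()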

Save the model

Once we have trained our sarcasm detection model, we can save it so that we won’t have to retrain it every time we want to use it.

model.save('my_checkpoint.h5') # saves the model with given name 
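Later, in a fresh session, you can reload it without retraining:

model = tf.keras.models.load_model('my_checkpoint.h5') # restores architecture and weights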

For testing purposes, let’s try it out on a sarcastic headline,

s1 = 'Cows lose their jobs as milk prices drop'
s1 = tokenizer.texts_to_sequences([s1]) # reuse the already-fitted tokenizer; never refit at inference time
s1 = pad_sequences(s1, maxlen = max_len, padding = 'pre')
if model.predict(s1) >= 0.75:
  print("Sarcastic headline!")
else:
  print('Not sarcastic')

# output 
Sarcastic headline!

Done!

In the end, feel free to experiment with all of this. I would highly recommend reading the external sources I have provided here. I hope this inspires you to build more models like it. There are many things you can try; this is just a starting point!

You can see this entire notebook on my GitHub.

Do you have any questions or suggestions? Please feel free to reach out at [email protected], or ping me anytime on Twitter or LinkedIn!
