Sarcasm Detection Using GRU, LSTM
Yes, we are gonna build a machine learning model for sarcasm detection. If you are just getting started in machine learning, this is a solid starting point for you.
Before starting this tutorial, you should have a basic understanding of neural networks and be familiar with the pandas library. Don't worry if you are not that comfortable with these libraries yet; this tutorial is exactly for you.
Also, I would highly recommend doing this course on Coursera if you want to learn deep learning.
Let’s go!
Here is the outline for this tutorial:
- Collecting the dataset from Kaggle and setting up Colab
- Tokenising each word in the sentences
- Padding the token sequences
- Training a model using TensorFlow
- Saving the model so that we can use it later
Use Google Colab for this tutorial; it's a free platform (a Jupyter notebook) which saves us the hassle of installing all those dependencies.
Downloading the Kaggle dataset into Colab
First of all, open a Colab notebook; if you are familiar with Jupyter notebooks, you will feel right at home. Let's first download a dataset from Kaggle into Colab.
For that, you will need to download an API key from your Kaggle profile. You will find it just under your 'Account' menu.
Now, run the following code,
! pip install kaggle
from google.colab import files
files.upload()
It will give you the option to upload a file into Colab. Grab the API key (kaggle.json) you have just downloaded and upload it here. You can see it in the Files panel on Colab.
Now that that's done, it's time to download the dataset from Kaggle.
Run the following command in the next cell,
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! kaggle datasets download -d rmisra/news-headlines-dataset-for-sarcasm-detection
! unzip news-headlines-dataset-for-sarcasm-detection.zip -d sarcasm
The first two lines create the ~/.kaggle directory and copy your API key into it. The third line actually downloads the data (a zip file) from Kaggle. In this case, we are using this dataset from Kaggle. You can get this command for any Kaggle dataset by simply going to that dataset's page and copying the API command.
The fourth line unzips the archive into the directory whose name we provide, in this case sarcasm.
Peeking at the data
Now that we have downloaded the dataset into Colab, let's look at the dataset itself. You will see that there are two .json files in the dataset (v1 and v2). You can find more details about the dataset here. Let's load them one by one.
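One quick way to confirm those file names is simply to list the folder we unzipped into (a minimal check, nothing tutorial-specific):
! ls sarcasm
# Sarcasm_Headlines_Dataset.json  Sarcasm_Headlines_Dataset_v2.json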
First of all, let's import all the libraries. Run the following:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, Bidirectional, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
We will be making pandas dataframes out of those JSON files. You can do this by using the following commands:
data = pd.read_json("/content/sarcasm/Sarcasm_Headlines_Dataset.json", lines = True)
data2 = pd.read_json("/content/sarcasm/Sarcasm_Headlines_Dataset_v2.json", lines = True)
To look at either dataframe, use the following command; data.head() will give you the top rows of the dataframe.
print(data.head(10)) # prints top 10 rows from data
Here is our output.
As you can see, there are three columns: article_link, headline and is_sarcastic. The headline column contains the actual headline and is_sarcastic contains the label for the headline (1 or 0).
If you look at data2.head(10), you will find that the arrangement of the columns is different, but the dataset is of the same type. We can combine these two datasets, i.e. data and data2, into one dataframe which we can use for training.
Some useful pandas commands for data scientists
To rearrange the columns in data2, use the following command:
data2 = data2[["article_link","headline", "is_sarcastic"]]
To concatenate the two dataframes, use the following command:
data = pd.concat([data, data2])
Don’t forget to reset the index.
data.reset_index(drop = True, inplace =True)
While we are at it, let's remove the unwanted column article_link; we don't need it for sarcasm detection.
data.drop(['article_link'], axis= 1, inplace = True)
Finally, we have a dataframe with 55328 rows and 2 columns. Use the following code to view the data and its shape:
data.head(10)
data.shape # prints shape of dataframe (55328, 2)
Tokenisation of words
We are using text data to train our sarcasm detection model. Computers don't understand words, just numbers, so we should tokenise each word into a number. (For example, 'I want to eat an apple' could be written as [10, 11, 2, 6, 4, 15], where each word gets its own number.)
Don't worry, we don't have to assign these numbers manually; Keras has a built-in Tokenizer which can do this for us. Just use the following code:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['headline']) # will tokenise all the words in headlines
tokenizer.word_index # will print all the words and their tokens
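For instance, once the tokenizer has been fitted, you can convert any sentence into its token ids (the sentence below is just an example; with the default Tokenizer settings, words that are not in the corpus are simply dropped):
tokenizer.texts_to_sequences(['i want to eat an apple'])  # returns a list containing one list of integer ids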
Until now, we have tokenised the individual words; we have more than 30,000 words in our dataset (we call it the corpus). Now let's turn the sentences themselves into sequences of these tokens.
def applyToken(s): # function to make token sequences for sentences
    tokens = tokenizer.texts_to_sequences(s)[0]
    return tokens

# creating a new column in data which contains tokens for each headline
data['token'] = [applyToken([x]) for x in data['headline']]
For a better understanding, we can look at the headlines alongside their new token column.
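One quick way to do that, using nothing beyond the dataframe we already have:
data[['headline', 'token']].head()  # each headline now has a matching list of integer tokens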
Padding the token sequences
Our headlines have variable lengths. We need lists of constant length so that we can use them to train our model. We can do this by setting the length of each sentence to the length of the largest sentence in the data. For example, in our data, the largest sentence has length 157, so we will pad all the sentences to length 157 by adding zeros to them, as shown below.
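Here is a minimal sketch of that step; the names max_len, padded and total_words are the ones the later cells use:
max_len = max(len(t) for t in data['token'])  # length of the longest headline
padded = pad_sequences(data['token'].tolist(), maxlen = max_len, padding = 'pre')  # zeros are added at the front
total_words = len(tokenizer.word_index)  # vocabulary size, needed by the Embedding layer later
print(padded.shape)  # (55328, max_len)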
Train test split
Let’s split the dataset into training and testing,
split_train = int(0.8* len(padded)) # 0.8 is 80%
train_x = padded[:split_train] # 80% train data
val_x = padded[split_train:] # 20% validation data
train_y = data['is_sarcastic'][:split_train] # same with labels
val_y = data['is_sarcastic'][split_train:]
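A quick check of the split sizes (just to confirm the 80/20 split):
print(train_x.shape, val_x.shape)  # roughly 80% and 20% of the 55328 rows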
Building a Model
Now that we have all the padded sentences, it's time to build a model for sarcasm detection:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words + 1, 16),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Embedding Layer
We have an Embedding layer as the first layer. To understand embeddings, you can read this blog. But in simple terms, embeddings let us capture the similarity between words. For example, the words 'cat' and 'cats' are a lot more similar than 'cat' and 'movie'. Similarly, 'beautiful' and 'spectacular' are much more similar than 'beautiful' and 'depressing'. In short, we get the context in which words are being used. This is done using vectors: the more similar two words are, the closer their word vectors will be.
We are passing total_words (the total number of words in the corpus) plus one. The Tokenizer assigns word indices starting from 1, and index 0 is reserved (it is what the padding uses), so the Embedding layer needs total_words + 1 rows to cover every possible token.
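As a rough illustration of the "closer vectors" idea above, once the model has been trained (further below) you can pull the learned embedding matrix out of the first layer and compare word vectors; the example words here may or may not be in the corpus, so check tokenizer.word_index first:
embedding_matrix = model.layers[0].get_weights()[0]  # shape: (total_words + 1, 16)

def word_vector(word):  # look up the learned vector for a word
    return embedding_matrix[tokenizer.word_index[word]]

def cosine_similarity(a, b):  # closer to 1 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(word_vector('man'), word_vector('woman')))
print(cosine_similarity(word_vector('man'), word_vector('milk')))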
GRU Layer
The second layer is a bidirectional GRU; to understand this, read this blog.
Then we have a Dense layer with ReLU activation, followed by the output layer. The last layer has just 1 unit with sigmoid activation, since we are doing binary classification.
Compile and train the model
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history = model.fit(train_x, train_y, epochs=5,
                    validation_data=(val_x, val_y), verbose=1)
We are using binary_crossentropy as we are doing binary classification (0 or 1).
We can see the model architecture printed by model.summary().
Running for just 5 epochs, we can achieve an accuracy of about 98%, which is pretty good.
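As a quick check, the accuracies are also available from the history object returned by model.fit() (in recent TensorFlow versions the keys are 'accuracy' and 'val_accuracy'):
print(history.history['accuracy'][-1])  # training accuracy after the last epoch
print(history.history['val_accuracy'][-1])  # validation accuracy after the last epoch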
Save the model
Once we have trained our model for sarcasm detection, we can save it so that we won't have to train it again when we want to use it.
model.save('my_checkpoint.h5') # saves the model with given name
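Later, instead of retraining, you can load it back; a minimal sketch:
loaded_model = tf.keras.models.load_model('my_checkpoint.h5')  # restores the architecture and the weights
loaded_model.summary()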
For testing purposes, let's try it out on a sarcastic headline:
s1 = 'Cows lose their jobs as milk prices drop'
# use the tokenizer already fitted on the training corpus; do not call fit_on_texts again here
s1 = tokenizer.texts_to_sequences([s1])
s1 = pad_sequences(s1, maxlen = max_len, padding = 'pre')
if model.predict(s1)[0][0] >= 0.75:
    print("Sarcastic headline!")
else:
    print('Not sarcastic')
# output
Sarcastic headline!
Done!
In the end, feel free to try these things out. I would highly recommend reading the external sources I have provided here. I hope you got inspired to build more models like this. There are many things you can try; this is just a starting point!
You can see this entire notebook on my GitHub.
Do you have any questions or suggestions? Please feel free to reach out on [email protected] or hit me up anytime on Twitter or LinkedIn!