NLP in TensorFlow — All You Need for a Kickstart

Ever heard of Natural Language Processing? Whether your answer is yes or no, this blogpost is for you!

NLP, or Natural Language Processing, is a growing field in Artificial Intelligence and holds a lot of value in the tech world. Even after years of work, getting results close to human-level performance remains a challenge. Because the field is currently a center of attention in Deep Learning, it offers great opportunities for researchers and developers at all levels to dive in.

In this blogpost we’ll cover the following topics to get a kickstart on doing NLP in TensorFlow:

  • A tiny bit of History
  • Natural Language Processing
  • TensorFlow Overview
  • Creating a Sarcasm Detector in TensorFlow
  • Embedding Projector — Visualize the Embedding Learnt
  • Takeaways

So let’s begin

A tiny bit of History

The very first chatbot, ELIZA, was developed in the 1960s at MIT by Joseph Weizenbaum. As perhaps the world’s first chatbot, and a direct ancestor of the likes of Alexa and Siri, ELIZA could only communicate through text. It couldn’t talk like Alexa and was not capable of learning from its conversations with humans, but it did pave the way for later efforts in the field of Natural Language Processing. But what is this field? What do we mean when we say Natural Language Processing?

Natural Language Processing

Natural Language Processing, usually shortened to NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.

The main objective of NLP is to read, decipher, understand, and make sense of natural human language in a way that is valuable. But unlike a computer language, which is highly structured, human speech is not always precise: it is ambiguous, and its linguistic structure can depend on complex variables such as slang, regional dialects and social context.

NLP is used in Google Translate, to translate between different languages. It’s also used in Personal Assistant Applications such as Siri, Alexa, Bixby — to make conversations and understand speech.

Because computers understand everything in numbers, there’s a bit of processing involved in how words are represented. We’ll come to that soon. Let’s first look a bit into TensorFlow.

TensorFlow Overview

TensorFlow is an open-source Machine Learning library. Released by Google in 2015, it is now one of the most widely used tools in Machine Learning. Its higher-level API, Keras, is easy to use and has a smooth learning curve.

In this blogpost we’ll see how to create a classifier for text. When we are dealing with images, the pixel values are numbers, and computers can understand numbers. But what do we do with text?

We could use a character encoding, such as ASCII values, for each letter of each word, but it’s not the best choice. Say we have the word LISTEN; ASCII encodes its letters as 76, 73, 83, 84, 69, 78.

The problem is that the semantics of a word are not encoded in its individual letters. This can be seen from the word SILENT, which ASCII encodes as 83, 73, 76, 69, 78, 84: the same six values in a different order.

Silent has a very different, almost opposite, meaning to Listen, but it uses exactly the same letters, so training a neural network on just the encoded letters would be a hard task.
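
To make the anagram problem concrete, here is a small sketch in plain Python (no TensorFlow needed) that encodes both words with the built-in ord() function. The function and variable names are only illustrative.

```python
# Encode each letter of a word as its ASCII code.
def ascii_encode(word):
    return [ord(ch) for ch in word]

listen_codes = ascii_encode("LISTEN")
silent_codes = ascii_encode("SILENT")

print(listen_codes)  # [76, 73, 83, 84, 69, 78]
print(silent_codes)  # [83, 73, 76, 69, 78, 84]

# Both words contain exactly the same codes, only in a different order,
# so the codes alone say nothing about the difference in meaning.
print(sorted(listen_codes) == sorted(silent_codes))  # True
```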

Now let’s see what happens when we use complete words and assign each word a number. For example, the sentence I love my dog can be encoded as:

  • I : 1
  • love : 2
  • my : 3
  • dog: 4

So you give each word a unique identifier (number). Now if I have the sentence I love my cat, that should look like:

  • I : 1
  • love : 2
  • my : 3
  • cat: 5

As cat is an additional word, it will be encoded as 5 and the rest of the already existing words follow their unique identifier number.

The two sentences will hence be encoded as:

  • I love my dog : [1,2,3,4]
  • I love my cat : [1,2,3,5]
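
The same idea can be sketched by hand in a few lines of plain Python. This is only an illustration of the concept (numbering words in the order they first appear), not how Keras does it internally; build_vocab and encode are hypothetical helper names.

```python
def build_vocab(sentences):
    """Give every new word the next free integer, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(sentence, vocab):
    """Replace each word with its integer identifier."""
    return [vocab[word] for word in sentence.split()]

sentences = ["I love my dog", "I love my cat"]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]

print(vocab)    # {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```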

I hope you are following so far. Let’s move forward. Fortunately, TensorFlow and Keras provide an API that assigns numbers to the words in your text. Let’s see how it can be used.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'I love my dog',
'I love my cat'
]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

We import tensorflow and keras, and then the Tokenizer API, which creates word encodings. It takes text as input and assigns a unique integer to every word in it. We can use num_words to limit the vocabulary and ask the tokenizer to encode only the most frequent words. In the above example we set num_words=100; Keras then keeps only the most frequent num_words - 1 words (here 99) when encoding text, although word_index itself still lists every word seen.

We first initialize the Tokenizer with the num_words parameter. We then call the fit_on_texts method, which reads the sentences and builds the vocabulary from them.

To retrieve the mapping we can use word_index. The word_index property of the tokenizer is a dictionary of key-value pairs, where the keys are words and the values are the integers assigned to them. Note that the tokenizer lowercases the text by default, so I is stored as i. This is what word_index looks like for the above example:

word_index = {
'i' : 1,
'love' : 2,
'my' : 3,
'dog' : 4,
'cat' : 5
}

If I add another sentence, “I like my garden”, we have two new words, like and garden, and our sentences will look like this:

sentences = [
'I love my dog',
'I love my cat',
'I like my garden'
]

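If we fit our tokenizer on this, the word_index will look like this. (To keep the walkthrough easy to follow, words are numbered here in order of first appearance. Keras actually ranks words by frequency, so a real run can number some words differently.)

word_index = {
'i' : 1,
'love' : 2,
'my' : 3,
'dog' : 4,
'cat' : 5,
'like' : 6,
'garden' : 7
}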

You can see that the new words have also been given unique identifiers in the word_index. Now that our tokenizer has been fit, we can use it to encode the sentences. It simply converts each sentence into the corresponding list of integers, one per word. As a result ‘I love my dog’ will become [1,2,3,4] and ‘I like my garden’ will become [1,6,3,7].

This is easily done with just one line of code:

sequences = tokenizer.texts_to_sequences(sentences)

For the above collection of sentences, this is what the encoded sequences will look like:

[[1,2,3,4], [1,2,3,5], [1,6,3,7]]
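
One detail worth knowing: the Keras Tokenizer lowercases the text and assigns indices by how often each word occurs, most frequent first. The sketch below imitates that ranking in plain Python so the effect is easy to see. It is a simplified stand-in (it skips the punctuation filtering Keras also does), and fit_vocab is a hypothetical name. With the three sentences above, my occurs as often as i and more often than love, so a frequency-based numbering differs slightly from the first-appearance numbering used in this post.

```python
from collections import Counter

def fit_vocab(sentences):
    """Rank words by frequency (most frequent gets index 1)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: index for index, word in enumerate(ranked, start=1)}

sentences = ["I love my dog", "I love my cat", "I like my garden"]
freq_index = fit_vocab(sentences)
freq_sequences = [[freq_index[w] for w in s.lower().split()] for s in sentences]

print(freq_index)
# {'i': 1, 'my': 2, 'love': 3, 'dog': 4, 'cat': 5, 'like': 6, 'garden': 7}
print(freq_sequences)
# [[1, 3, 2, 4], [1, 3, 2, 5], [1, 6, 2, 7]]
```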

If we now have test data as follows:

test_sentences = ['I love my sister', 
'my cat loves my dog'
]

When we convert these to sequences using the tokenizer we fit on the training data, this is what we’ll get:

[[1,2,3], [3,5,3,4]]

You can see that it only encoded the words already present in the tokenizer’s word_index and ignored the unseen words (sister and loves). To handle this we use something called an ‘out of vocabulary token’: whenever an unseen word is encountered, a special value stands in for it instead of the word being dropped. We still need a large training set to build a good vocabulary, otherwise we’ll have a lot of out-of-vocabulary words. We can set the token as a property of the tokenizer like this:

tokenizer = Tokenizer(num_words=100, oov_token= '<OOV>')

You can use whatever you like in place of ‘<OOV>’, but make sure it is unique and cannot be confused with a real word. With this adjustment, our test sentences will look like this:

[[2,3,4,1], [4,6,1,4,5]]

Note that oov_token will take the integer 1 as its encoded value so our word_index will be as follows:

{'<OOV>' : 1,'i':2, 'love':3, 'my':4, 'dog':5, 'cat':6, 'like':7, 'garden':8}  

Our training sentences will now look like the following when encoded:

[[2,3,4,5], [2,3,4,6], [2,7,4,8]]
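
Conceptually, the out-of-vocabulary token is just a fallback during lookup. Here is a minimal sketch in plain Python, reusing the word_index shown above (encode_with_oov is a hypothetical helper, not a Keras function):

```python
word_index = {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5,
              'cat': 6, 'like': 7, 'garden': 8}

def encode_with_oov(sentence, word_index, oov_token='<OOV>'):
    """Map known words to their index and unknown words to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in sentence.lower().split()]

test_sentences = ['I love my sister', 'my cat loves my dog']
test_sequences = [encode_with_oov(s, word_index) for s in test_sentences]

print(test_sequences)  # [[2, 3, 4, 1], [4, 6, 1, 4, 5]]
```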

Now we’ll need to manipulate these sequences so that all sentences have the same length; otherwise it will be hard to train a neural network with them. For this we’ll use padding. We define a maximum length and then add 0s at the start or end of each sequence (if required) to bring it up to that length. This way all our sequences have the same length, max_length. Fortunately, TensorFlow includes an API that handles this.

In order to use the padding function, we’ll have to import pad_sequences as follows:

from tensorflow.keras.preprocessing.sequence import pad_sequences

Once our sentences have been passed through the tokenizer to create sequences, we can pad them by passing them to the pad_sequences function:

padded_sequences = pad_sequences(sequences, maxlen=6)

As a result, sequences shorter than 6 will be padded with zeros at the beginning. By default the padding is ‘pre’; we can change this to ‘post’ by passing the argument padding='post' to pad_sequences.

To understand padding we’ll add another sentence to our train data.

sentences = [
'I love my dog',
'I love my cat',
'I like my garden',
'My cat plays in my garden'
]

With the default setting we get the following results:

[[0,0,2,3,4,5], [0,0,2,3,4,6], [0,0,2,7,4,8], [4,6,9,10,4,8]]

Here’s the word_index for reference:

{'<OOV>' : 1,'i':2, 'love':3, 'my':4, 'dog':5, 'cat':6, 'like':7, 'garden':8, 'plays':9, 'in':10}
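
To see what padding does without any library, here is a simplified plain-Python version of the idea. It is a sketch, not the Keras implementation: pad is a hypothetical helper, and like the Keras default it drops values from the front when a sequence is too long.

```python
def pad(sequences, maxlen, padding='pre'):
    """Pad with 0s (or truncate from the front) so every row has maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                  # keep the last maxlen values
        zeros = [0] * (maxlen - len(seq))
        padded.append(zeros + seq if padding == 'pre' else seq + zeros)
    return padded

sequences = [[2, 3, 4, 5], [2, 3, 4, 6], [2, 7, 4, 8], [4, 6, 9, 10, 4, 8]]

pre_padded = pad(sequences, maxlen=6)
post_padded = pad(sequences, maxlen=6, padding='post')

print(pre_padded[0])   # [0, 0, 2, 3, 4, 5]
print(post_padded[0])  # [2, 3, 4, 5, 0, 0]
print(pre_padded[3])   # [4, 6, 9, 10, 4, 8]
```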

Now that we have padded sequences, we are all set to train our model :D

So far, this is our final code:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
'I love my dog',
'I love my cat',
'I like my garden',
'My cat plays in my garden'
]

# Reserve a token for words that were not seen during fitting
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

# Convert the sentences to integer sequences, then pad them to equal length
sequences = tokenizer.texts_to_sequences(sentences)
padded_sequences = pad_sequences(sequences, maxlen=6)
print(padded_sequences)

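Output (shown with words numbered in order of first appearance, as in the examples above; Keras ranks words by frequency, so a real run numbers some words differently):

{'<OOV>' : 1,'i':2, 'love':3, 'my':4, 'dog':5, 'cat':6, 'like':7, 'garden':8, 'plays':9, 'in':10}
[[0,0,2,3,4,5], [0,0,2,3,4,6], [0,0,2,7,4,8], [4,6,9,10,4,8]]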

Now, with this News Headlines Dataset, we’ll look at how to create a classifier that labels text as sarcastic or not_sarcastic.

So let’s get digging!

Creating a Sarcasm Detector in TensorFlow

So we’ll follow similar steps to classify our sarcasm dataset.

  • Fit a tokenizer
  • Convert train sentences to sequences (pass the sentences through the tokenizer)
  • Pad the sequences to have uniform length sequences

We won’t go through it all again, but I’ll provide the link to the code at the end of the blogpost. When we test, we’ll pass our testing data through the same tokenizer (fit on the training sentences) and then pad the sequences. For now, what’s important to remember is that after tokenization and padding we’ll have:

  • training data: padded_sequences, training_labels_final
  • testing data: testing_padded, testing_labels_final

Once the above is done, we’ll define our model architecture as follows:

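model = tf.keras.Sequential([
    # Learns an embedding_dim-sized vector for each of the vocab_size words
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Averages the word vectors across the sentence
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    # Single sigmoid unit: an output near 1 means sarcastic
    tf.keras.layers.Dense(1, activation='sigmoid')
])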

If you have seen a neural network architecture before, the above may look familiar apart from the first layer, Embedding. This layer is the key to text sentiment analysis in TensorFlow; in short, this is where all the magic really happens.

An embedding holds a vector for each word, and those vectors come to reflect the sentiment associated with the word. What does this really mean? Say we have a movie review dataset. Depending on our labels, the embedding layer will learn the ‘meaning’ of different words in such a way that similar words have similar vectors: good and nice will have similar vectors, as will bad and awful. The embedding_dim defines the dimension of the vector representing each word. If it’s 16, each word is a point in a 16-dimensional space. It can be 64, 150, or even 300. The result of the embedding for one sentence is a 2D array whose size is the length of the sentence by the embedding dimension, for example 32 by 16.
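
At its core, an embedding layer is a lookup table: row i holds the vector for the word with index i. The sketch below shows only that lookup in plain Python, with random numbers standing in for the weights that training would learn. The sizes and names are assumptions chosen for illustration.

```python
import random

vocab_size = 10      # number of rows in the table (assumed for the demo)
embedding_dim = 4    # length of each word vector (assumed for the demo)

random.seed(0)
# One row of embedding_dim numbers per word index; training would adjust these.
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Replace every word index with its vector."""
    return [embedding_table[index] for index in sequence]

padded_sentence = [0, 0, 2, 3, 4, 5]
embedded = embed(padded_sentence)

# The result is 2D: sentence length x embedding dimension.
print(len(embedded), len(embedded[0]))  # 6 4
```

Note that the same index always maps to the same vector, which is why the two padding zeros at the start of the sentence produce identical rows.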

Our last layer is a Dense layer with a single sigmoid node that outputs a value between 0 and 1; values close to 1 mean sarcastic and values close to 0 mean not_sarcastic.

Because we only have two classes (sarcastic and not_sarcastic), we’ll use binary_crossentropy for the loss. We’ll use the Adam optimizer.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
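
For intuition about the loss: binary cross-entropy for one example with label y and predicted probability p is -(y*log(p) + (1-y)*log(1-p)). Here is a small plain-Python sketch with made-up predictions (the function name is only illustrative, and Keras additionally averages the value over a batch):

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Loss for one example: low when the prediction agrees with the label."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# A sarcastic headline (label 1) with two hypothetical model outputs.
confident_and_right = binary_crossentropy(1, 0.9)
confident_and_wrong = binary_crossentropy(1, 0.1)

print(round(confident_and_right, 3))  # 0.105
print(round(confident_and_wrong, 3))  # 2.303
```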

We can use model.summary() to print the model architecture details:

Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_8 (Embedding) (None, 32, 16) 160000
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16) 0
_________________________________________________________________
dense_12 (Dense) (None, 24) 408
_________________________________________________________________
dense_13 (Dense) (None, 1) 25
=================================================================
Total params: 160,433
Trainable params: 160,433
Non-trainable params: 0
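
The parameter counts in the summary can be checked by hand. The sketch below assumes a vocabulary of 10,000 words and 16-dimensional embeddings, which is what the 160,000 embedding parameters and the (None, 32, 16) output shape imply.

```python
vocab_size, embedding_dim = 10000, 16

embedding_params = vocab_size * embedding_dim   # one vector per word
pooling_params = 0                              # averaging has no weights
dense_params = embedding_dim * 24 + 24          # weights + biases, 24 units
output_params = 24 * 1 + 1                      # weights + bias, 1 unit

total_params = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
# 160000 408 25 160433
```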

Let’s train the model for 10 epochs by calling model.fit(). We’ll pass in our padded_sequences as inputs and the labels as targets.

num_epochs = 10
model.fit(
    padded_sequences,
    training_labels_final,
    epochs=num_epochs,
    validation_data=(testing_padded, testing_labels_final)
)


Embedding Projector — Visualize Embedding Learnt

We have trained our model and learned an embedding for our text corpus. There’s a good way to visualize the embedding, to understand the sentiments learned and to see, to some extent, whether we have learned them right.

For visualization we’ll use the TensorFlow Embedding Projector, which takes files in TSV format. We’ll need a meta data file and a vector file.

  • meta data file : includes all the words from word_index
  • vector file : includes corresponding vectors

First let’s get our learnt embedding layer after training

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape = (vocab_size, embedding_dim)

Output: (10000,16)

To save words into our meta data file, we first invert our word_index so that it maps indices back to words:

reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])

Now we’ll traverse our vocabulary, storing the words in our meta file and their corresponding vectors in the vectors file, as follows:

import io

# Open the two TSV files expected by the Embedding Projector
out_v = io.open('data/vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('data/meta.tsv', 'w', encoding='utf-8')
# Index 0 is reserved for padding, so start from 1
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + '\n')
    out_v.write('\t'.join(str(x) for x in embeddings) + '\n')
out_m.close()
out_v.close()

Now we are all set to visualize our embedding, so let’s open TensorFlow embedding projector, this is how it’ll look:

On the left side, click on the button Load

Choose/upload file for vectors and meta data accordingly

Select Sphereize data to normalize the vectors for better visualization

You can type a word into the search box to find its neighbors, that is, words with similar vectors.

You can see in the bottom right corner, a list of words that have similar vectors in the embedding.
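
Finding neighbors boils down to comparing vectors. One common measure is cosine similarity; the sketch below ranks words by it using tiny made-up vectors (not real trained embeddings). The helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, hand-picked for the demo.
toy_vectors = {
    'boring':    [-0.9, 0.1, 0.2],
    'dreadful':  [-0.8, 0.2, 0.1],
    'exciting':  [0.9, 0.3, -0.1],
    'adventure': [0.7, 0.4, 0.0],
}

def nearest_neighbors(word, vectors, top_n=2):
    """Return the top_n other words, most similar first."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine_similarity(vectors[word], vectors[w]),
                reverse=True)
    return others[:top_n]

print(nearest_neighbors('boring', toy_vectors, top_n=1))    # ['dreadful']
print(nearest_neighbors('exciting', toy_vectors, top_n=1))  # ['adventure']
```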

Here’s another example

Extras — bonus visualization for better understanding

I also tried the TensorFlow Embedding Projector on the famous IMDB dataset. IMDB is a movie review dataset with reviews and labels (positive review or negative review). We would expect words that signal a positive review to be clustered together in the embedding, and similarly words that signal a negative review to have similar vectors. Here are the results for the embedding learnt:

Let’s first search the word ‘boring’ :

As you can see boring has neighbors — dreadful, pointless etc.

Let’s search a positive word ‘exciting’ :

You can see exciting occurring close to words like adventure, lovable etc.

I came across this tool only recently, and not many people know about it. But I think it’s a very useful tool when you are working in NLP, and it’s very easy to use. So what are the takeaways from this blogpost?

  • Familiarity with NLP
  • How sentences are preprocessed before they are fed into a Neural Network of any kind
  • Deeper dive into how TensorFlow can be used to preprocess data
  • Creating a sarcasm detector in TensorFlow
  • Visualizing embedding learnt using TensorFlow Embedding Projector

And as promised, here is the complete code:

Hope this was a good read. Thank you for reading till the end. Cheers!
