Dealing with Small Datasets — Get More From Less — TensorFlow 2.0 — Part 1
There are a lot of huge datasets available on the internet for building machine learning models. But often, we come across situations where we have little data. With a small dataset, it becomes very easy to overfit while trying to achieve good accuracy. We end up with a model that does not generalize well to unseen data. If you have come across such situations, then this post is for you!
The best way to achieve good accuracy without overfitting your model is to get more data. But getting more data is not always possible or feasible. In such cases, there are other approaches that can be used to deal with small datasets. We’ll explore a few of them in this post to get you started:
- Data Augmentation using Image Generators in TensorFlow 2.0
- Transfer learning in TensorFlow 2.0
In statistics, overfitting is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.
The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data, and in the process of fitting the training set it overlooks the general trend of the data. Hence, it is likely to have a higher error rate on unseen data. On the other hand, the black line misclassifies a few training points, but it has picked up the general trend well and is more likely to perform better on unseen data.
In the above example, noisy (but roughly linear) data is fitted to a linear function and a polynomial function. By looking at the graph we can see that the polynomial function fits the data perfectly, but it is the linear function that would generalize better: if the two functions were used to extrapolate beyond the fitted data, the linear function should make better predictions since it has picked up the general trend of the dataset.
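The linear-versus-polynomial comparison above can be reproduced with a small sketch. The data here is a made-up toy example (a line with slope 2 and intercept 1 plus Gaussian noise), and the degree-9 polynomial is just one assumed choice of an overly flexible model; none of these values come from the original figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy but roughly linear training data: y = 2x + 1 + noise (assumed toy example).
x_train = np.linspace(-1.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.3, size=x_train.shape)

# Degree 1 captures the trend; degree 9 has enough freedom to pass through all 10 points.
linear_fit = Polynomial.fit(x_train, y_train, deg=1)
poly_fit = Polynomial.fit(x_train, y_train, deg=9)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Points outside the fitted range, taken from the same underlying line (no noise).
x_new = np.linspace(1.2, 2.0, 5)
y_new = 2.0 * x_new + 1.0

linear_train_mse = mse(linear_fit, x_train, y_train)
poly_train_mse = mse(poly_fit, x_train, y_train)
linear_new_mse = mse(linear_fit, x_new, y_new)
poly_new_mse = mse(poly_fit, x_new, y_new)

print(f"train MSE   linear={linear_train_mse:.4f}  degree-9={poly_train_mse:.4f}")
print(f"unseen MSE  linear={linear_new_mse:.4f}  degree-9={poly_new_mse:.4f}")
```

The degree-9 fit drives the training error to essentially zero because it memorizes the noise. Comparing the two "unseen" numbers shows how each model behaves once it has to extrapolate.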
The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
An overfitted model extracts features and trends from the dataset that do not actually represent the general structure of the data. As a result, when unseen data does not fit the expected structure, the model’s predictions fail. Think of it as being very good at spotting something from a limited dataset, but getting confused when you see something that doesn’t match your expectations.
For example, you have a dataset of different objects and one of these is shoes. In your dataset, the shoes are as follows:
You train your model on the training set and you are classifying input into ‘shoe’ and ‘not shoe’.
Now when you input a test case as follows:
You are most likely to classify this as ‘not shoe’. That is because you have overfit in your understanding of what a shoe looks like. You weren’t flexible enough to see this high heel as a shoe because all of your training and all of your experience of what shoes look like consists of these hiking boots. This is a common problem in training classifiers, particularly when you have limited data.
If you think about it, you would need an infinite dataset to build a perfect classifier; to create a perfect understanding of what something looks like. But that is not possible.
The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons:
- model is not powerful enough
- model is over-regularized
- model has not been trained long enough (the network was unable to learn relevant patterns in the training data)
If you train for too long though, the model may start to overfit and learn patterns from the training data that don’t generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs, as we’ll explore below, is a useful skill.
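One common way to pick the number of epochs is to watch the validation loss and stop once it has not improved for a few epochs. The sketch below is a plain-Python illustration of that rule, not code from this post: `early_stopping_epoch`, the `patience` value, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never ran out of patience: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then rise as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stopped_epoch)  # 2 4
```

In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback, which takes `monitor`, `patience` and `restore_best_weights` arguments and is passed to `model.fit` through `callbacks`.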
To prevent overfitting, the best solution is to use more training data. A model trained on more diverse data will naturally generalize better. When that is not possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well. In simpler terms, compare larger and smaller models: a larger model has more capacity to learn patterns, including noise, whereas a smaller model has less capacity and so has to focus on the most prominent patterns as it optimizes itself during training.
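As a rough illustration of what "placing constraints" can look like in Keras, the sketch below builds the same small network with and without two common regularizers, an L2 weight penalty and dropout. The layer sizes, the 20-feature input, the `1e-3` penalty and the `0.5` dropout rate are arbitrary assumptions chosen for the example, and `build_model` is a hypothetical helper.

```python
import tensorflow as tf


def build_model(regularized: bool) -> tf.keras.Model:
    """Small binary classifier; optionally adds an L2 penalty and dropout."""
    l2 = tf.keras.regularizers.l2(1e-3) if regularized else None
    layers = [
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(16, activation="relu", kernel_regularizer=l2),
    ]
    if regularized:
        # Dropout randomly zeroes half of the activations during training only.
        layers.append(tf.keras.layers.Dropout(0.5))
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    return tf.keras.Sequential(layers)


plain_model = build_model(regularized=False)
regularized_model = build_model(regularized=True)

# Both models have the same number of weights; only the training constraints differ.
print(plain_model.count_params(), regularized_model.count_params())  # 353 353
```

Reducing capacity directly, for example by using fewer units in the `Dense` layer, is the other lever described above.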
What is data augmentation?
Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data.
- We will see how we can use TensorFlow’s ImageDataGenerator to significantly augment our data
- We’ll look into how the Image Generator works
Image augmentation is one of the most widely used concepts in deep learning to increase your dataset size and make your neural networks perform better.
So for example, if we have a training dataset containing images of cats, and in all or most of the images the cat is upright with its ears up, the model may not be able to spot a cat that’s lying down.
But with augmentation, we can rotate or skew the image, or perform other transforms, to effectively generate data that covers these gaps in the training data. So you rotate or skew the image and toss that into the training set. There’s an important trick to how you do this in TensorFlow, though: you don’t want to take an image, warp it, skew it, and then blow up the memory requirements. We’ll come to this trick later.
When using convolutional neural networks, we pass convolutions over an image to learn particular features. These can be pointy ears for a cat, two legs for a human, or shapes and colors for fruits. Convolutions are good at spotting these features if they’re clear and distinct. We can use data augmentation to transform an upright cat image in such a way that it matches other pictures of cats where the ears are oriented differently.
Image augmentation is a very simple but very powerful tool to help you avoid overfitting your data. The concept is straightforward: if you have limited data, then the chances of you having data that matches potential future predictions are also limited, and logically, the less data you have, the less chance you have of getting accurate predictions for data that your model hasn’t yet seen. To put it simply, if you are training a model to spot cats, and your model has never seen what a cat looks like when lying down, it might not recognize one in the future.
So if the network was never trained on an image of a reclining cat like the one in Figure 5, it may not recognize it. If you don’t have data for a reclining cat, you could end up in an overfitting situation. But if your images are fed into training through an image generator with augmentation such as rotation, the feature might still be spotted: even without a reclining cat in your dataset, an upright cat image could end up looking similar once rotated.
Augmentation simply amends your images on-the-fly while training using transforms like rotation. So, it could ‘simulate’ an image of a cat lying down by rotating a ‘standing’ cat by 90 or 180 degrees. As such you get a cheap way of extending your dataset beyond what you have already.
By rotating one image, we have captured the orientation of the ears in another image. This broadens the set of feature locations and helps the model understand that a certain feature (pointed ears in this case) can be present in different locations of the image and in different orientations. As a result, the model will not overfit to a specific feature location or orientation.
Let’s look at some examples of how augmenting data helps to identify unseen scenarios:
So we can see that by applying shear on the left image, we have ended up in a similar pose to the right image.
By horizontally flipping the image, we can make the left image more structurally similar to the right image. As a result, our model will not overfit to right-arm raisers and has a better chance of correctly classifying the unseen image on the right, a left-arm raiser.
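To make the effect of these transforms concrete, here is a minimal NumPy sketch, assuming a tiny 3x3 array as a stand-in for an image with a single "feature" pixel. It only illustrates how a flip or a rotation moves a feature to a new location; it is not how the image generator is implemented.

```python
import numpy as np

# A tiny 3x3 stand-in for an image with one bright "feature" in the top-left corner.
image = np.zeros((3, 3), dtype=np.uint8)
image[0, 0] = 255

flipped = np.fliplr(image)  # horizontal flip: the feature moves to the top-right
rotated = np.rot90(image)   # 90-degree counterclockwise rotation: it moves to the bottom-left

print(np.argwhere(flipped == 255))  # [[0 2]]
print(np.argwhere(rotated == 255))  # [[2 0]]
```

The pixel values are unchanged; only the position of the feature differs, which is exactly the kind of variation the model needs to see.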
Let’s have a look at how we can do data augmentation using image generators.
A great tool in TensorFlow is ImageDataGenerator. Let’s dive in to see what it really is:
- A tool in TensorFlow that can be used for data augmentation, easy loading of images, converting images to batches of tensors, memory optimization, etc.
- Generates batches of tensor image data with real-time data augmentation
- Can load images directly from a directory with automatic labeling and augmentation
For example, say you want to create a classifier that labels cell images as malaria-infected or uninfected. This is how the directory should look:
One feature of the image generator is that you can point it to a directory and the sub-directories of that directory will automatically generate labels for you. For example, consider this directory structure: there’s a main directory containing two folders, Training and Testing. Each of these has the sub-directories Infected and Uninfected. When you place the corresponding images in these sub-directories, the generator can create a feeder for these images and auto-label them for you. So if you point the generator to the Training directory, images will be loaded and labelled as Infected or Uninfected depending on which folder they are in.
Note: You have to point the generator to its train/test/validation directories. It’s a common mistake to point the generator to the sub-directories. You should always point it at the directory that contains the sub-directories.
You can import ImageDataGenerator from the Keras preprocessing module:
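The layout described above can be sketched with the standard library. The root folder name `cell_images` and the temporary location are assumptions for illustration; only the Training/Testing and Infected/Uninfected names come from the text.

```python
import tempfile
from pathlib import Path

# Hypothetical dataset root; the folder names mirror the layout described above.
root = Path(tempfile.mkdtemp()) / "cell_images"
for split in ("Training", "Testing"):
    for label in ("Infected", "Uninfected"):
        (root / split / label).mkdir(parents=True)

train_dir = root / "Training"               # correct: contains one sub-directory per class
wrong_dir = root / "Training" / "Infected"  # common mistake: a single class folder

# The class labels are the names of the sub-directories of the directory you point at.
class_names = sorted(p.name for p in train_dir.iterdir() if p.is_dir())
print(class_names)  # ['Infected', 'Uninfected']
```

Pointing at `wrong_dir` would leave the generator with no class sub-directories to derive labels from.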
from tensorflow.keras.preprocessing.image import ImageDataGenerator
You can then instantiate the image generator by passing rescale to it. This rescaling helps normalize the images. You can also define other augmentation parameters such as shear_range, zoom_range, etc.
These are just a few of the options available (for more, see the TensorFlow documentation). Let’s quickly go over these options:
- rescale is a factor each pixel value is multiplied by, used to normalize images (e.g. 1/255 maps pixel values to the 0–1 range).
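A minimal sketch of such an instantiation might look like the following. The parameter names are real `ImageDataGenerator` arguments, but every value here is an illustrative assumption rather than a tuned or recommended setting, and `train_datagen` is just a chosen variable name.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All values below are illustrative assumptions, not tuned settings.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values from [0, 255] to [0, 1]
    rotation_range=40,       # rotate by up to 40 degrees in either direction
    width_shift_range=0.2,   # shift horizontally by up to 20% of the width
    height_shift_range=0.2,  # shift vertically by up to 20% of the height
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",     # fill newly exposed pixels with the nearest pixel value
)
```

Nothing is loaded or transformed at this point; the object only records which random transforms to apply once images start flowing through it.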
- rotation_range is a value in degrees (0–180), a range within which to randomly rotate pictures.
- width_shift_range and height_shift_range are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
- shear_range is for randomly applying shearing transformations.
- zoom_range is for randomly zooming inside pictures.
- horizontal_flip is for randomly flipping half of the images horizontally. This is relevant when there are no assumptions of horizontal asymmetry (e.g. real-world pictures).
- fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
Once ImageDataGenerator has been initialized, you can call the flow_from_directory method on it to load images. You pass the directory path and can define the image size and batch size. You can also define whether it’s a binary or a categorical problem. The main arguments are:
- train_dir: the training source directory you want to point your image generator to
- target_size: all images will be resized to this size
- class_mode: binary if there are two classes, categorical if there are multiple classes
- batch_size: size of batch to load images
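Putting these arguments together, a call could look like the sketch below. To keep it runnable on its own, it first writes a few random images into a temporary Training folder. That throwaway dataset, the 64x64 target size and the batch size of 4 are all assumptions for illustration, and it assumes Pillow and SciPy are installed alongside TensorFlow.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Build a throwaway dataset of random images so the sketch is runnable.
train_dir = Path(tempfile.mkdtemp()) / "Training"
rng = np.random.default_rng(0)
for label in ("Infected", "Uninfected"):
    class_dir = train_dir / label
    class_dir.mkdir(parents=True)
    for i in range(3):
        pixels = rng.integers(0, 256, size=(32, 32, 3)).astype("uint8")
        Image.fromarray(pixels).save(class_dir / f"cell_{i}.png")

train_datagen = ImageDataGenerator(rescale=1.0 / 255)
train_generator = train_datagen.flow_from_directory(
    str(train_dir),        # the directory that contains the class sub-directories
    target_size=(64, 64),  # every image is resized to 64x64 on load
    batch_size=4,
    class_mode="binary",   # two classes, so labels are 0 or 1
)

images, labels = next(train_generator)
print(train_generator.class_indices)  # {'Infected': 0, 'Uninfected': 1}
print(images.shape, labels.shape)     # (4, 64, 64, 3) (4,)
```

The generator can then be passed to `model.fit` in place of in-memory arrays.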
The names of the sub-directories will be the labels for the images. For training a neural network, the images need to be the same size, so you can define a target size for the images while loading. This should match the input size of your model. This way you won’t need to preprocess all your images beforehand on your file system. The advantage of doing this at runtime is that you can experiment with different sizes without impacting your original data.
It’s more efficient to load images in batches for training rather than one by one. So you can define batch_size while loading images through image generator and can experiment with different batch sizes easily as well.
As the images flow from the directory, the augmentation takes place in memory as they’re being loaded into the neural network for training. So if you’re working with a dataset and want to experiment with different augmentations, you’re not overwriting the data on disk. The image generator loads the images into memory, processes them there, and then streams them as the training set to the neural network that will ultimately learn from them.
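The in-memory behaviour can be checked with a small sketch that uses the generator’s `flow` method on a NumPy array instead of a directory. The random 8x8 "images" and the choice of `horizontal_flip` as the only augmentation are assumptions made to keep the example tiny; it assumes SciPy is available, which the generator’s transforms may require.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Four random 8x8 RGB "images" standing in for a real dataset.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 8, 8, 3)).astype("float32")
original = images.copy()

datagen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True)
batches = datagen.flow(images, batch_size=4, shuffle=False)

first_pass = next(batches)   # one randomly augmented view of the four images
second_pass = next(batches)  # another, independently augmented view

print(first_pass.shape)                  # (4, 8, 8, 3)
print(np.array_equal(images, original))  # True: the source array is untouched
```

Each pass over the data yields freshly transformed copies, so changing the augmentation settings never requires regenerating or rewriting the source images.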
That’s about it for this post. For a detailed walk through of the code to use image generators, you can have a look at my article “Malaria Detection — From Kaggle to TensorFlow in Google Colab — All in One Place”. I’ll cover using Transfer Learning to get more out of less data, in detail in the next post. Cheers!
- You can have a look at Malaria Detection Article which uses TensorFlow and image generators
- You can also run the code for Malaria Detection from the respective git repo.
- Images for data augmentation scenarios taken from Laurence Moroney’s course Convolutional Neural Networks in TensorFlow on Coursera.