Solving text-based CAPTCHA with Machine Learning — part 1


People on the Internet these days are more or less familiar with the term CAPTCHA, which stands for ‘Completely Automated Public Turing test to tell Computers and Humans Apart’. The main purpose of a CAPTCHA is to prevent spam from bots. In the early 2000s, a CAPTCHA was just plain text with some noise, and maybe some fancy colours, to make it harder for a computer to recognise. On average, it takes us 10 to 15 seconds to solve a CAPTCHA.

Spending a couple of seconds to authorise yourself doesn’t sound too bad until it becomes a bit counterintuitive.

Some of these really make us question ourselves.

It is obvious that we trade usability for security, but let’s admit it: these too-difficult-for-users CAPTCHAs do not really help by any means. Moreover, computers can already solve them with ease!

Note: This article is divided into two parts. In this first part, we will cover the basic definition of CAPTCHA, how machine learning can tackle the problem, the intuition behind CNNs, and an outline of the approach we will take. The second part will cover the implementation in Python using TensorFlow and Keras, the experimental results, and a conclusion on whether text-based CAPTCHA is still a good idea.

Machine learning — one way to tackle CAPTCHA

No matter how much CAPTCHA evolves, there are always people who come up with a tool to break it. One of the best-known methods is machine learning, and our main focus today is a specific type of neural network called the Convolutional Neural Network (CNN).

I like to think of CNNs as mimicking the brain. They work similarly to how our brain is able to recognise objects and differentiate one from another. To provide a better intuition: given the picture below, you can immediately tell that these two animals are not the same species! But how?

It comes from the fact that we have seen possibly a million pictures of dogs and cats, as well as seeing them in real life. When we were kids, we were told that they are different. Our brain then slowly learned the distinctions between the two, which gives us the ability to correctly recognise which one is a dog and which one is a cat.

We are going to apply the same concept to our neural network, and that is what a CNN does. Well, not exactly the same, because a computer does not perceive a picture the way we do. It sees a bunch of numbers that indicate the intensity of colour at each pixel. If we have an RGB image, one way to represent each pixel is as an array [R, G, B] (or [R, G, B, A] when the image also carries an alpha, i.e. transparency, channel).
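To make "bunches of numbers" concrete, here is a minimal sketch in plain Python. The 2x2 image and the `to_grayscale` helper are made up for illustration, and averaging the channels is just one simple way to get a single intensity per pixel.

```python
# A tiny 2x2 RGB "image": each pixel is [R, G, B], each value in 0-255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # red,  green
    [[0, 0, 255], [255, 255, 255]],  # blue, white
]


def to_grayscale(rgb_image):
    """Collapse each [R, G, B] pixel to one intensity by averaging its channels."""
    return [[sum(pixel) // len(pixel) for pixel in row] for row in rgb_image]


# Height, width and depth (number of colour channels) of the image.
height, width, depth = len(image), len(image[0]), len(image[0][0])
gray = to_grayscale(image)

print(height, width, depth)  # 2 2 3
print(gray)                  # [[85, 85], [85, 255]]
```

The three numbers printed first are the width, height and depth that a CNN layer works with.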

Layers in CNNs are special in that they are organised in 3 dimensions: width, height and depth. This allows us to feed a picture into the network. The final layer, which is the fully connected layer, is the one that tells us what the network predicts.
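How does the final layer "tell us" its prediction? A common choice, which is an assumption here and not something this article has fixed yet, is to turn the layer's raw scores into probabilities with a softmax and pick the most likely class. The scores and class names below are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw outputs of the fully connected layer for three classes.
classes = ["A", "B", "C"]
scores = [2.0, 0.5, -1.0]

probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]
print(predicted)  # A
```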

When we feed a picture to a CNN, it uses filters and pooling to extract features by applying mathematical operations to the picture matrices. This is the part where our machine learns the components of the picture, starting from something simple like edge detection and building up to the patterns we are looking for.

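A filter is simply another, smaller matrix (in a CNN, its values are learned during training). We slide it across the picture matrix and, at each position, do an element-wise multiplication with the patch underneath and sum the results into a single number. By applying this across the whole picture we have done a convolution, combining two pieces of information into one.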
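The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is an illustration only: the `convolve2d` helper, the 4x4 image and the hand-picked vertical-edge filter are assumptions made for the example, and like most deep learning libraries it does not flip the filter before sliding it.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the patch under the kernel, then sum.
            total = 0
            for di in range(kernel_h):
                for dj in range(kernel_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# 4x4 greyscale image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The large values in the middle column mark exactly where the dark region meets the bright one, which is the edge-detection behaviour described earlier.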

Pooling is when we downsample an input representation by picking either the max or the average of each sub-matrix. Doing this helps us reduce the dimensionality of the data and helps prevent overfitting, since it makes the data more abstract.
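As a rough sketch, max and average pooling over non-overlapping 2x2 blocks could look like this. The `pool2d` helper and the sample values are invented for the example.

```python
def pool2d(matrix, size=2, mode="max"):
    """Downsample matrix by reducing each non-overlapping size x size block."""
    if mode == "max":
        reducer = max
    elif mode == "average":
        def reducer(values):
            return sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    pooled = []
    for i in range(0, len(matrix) - size + 1, size):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            # Flatten the current block into a list of its values.
            block = [matrix[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(reducer(block))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
```

Either way, the 4x4 input shrinks to 2x2, so the next layer has a quarter as many values to deal with.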

Solving the CAPTCHA

Now that we have a basic understanding of what CNNs do, we will use this method to break down a CAPTCHA and see how accurately a machine can solve it.

Let’s look at the CAPTCHA again. For the sake of simplicity, let’s assume that it is made up of the 26 letters of the English alphabet and the digits 0–9. We are going to fix our CAPTCHA’s length at 6.

We are going to clean our data a bit before feeding it into our network by breaking it down into individual characters, using overlapping windows or edge detection to find each character. This is because if we asked the CNN to predict the whole string at once, drawn from the 26 English letters (both uppercase and lowercase) and the digits 0–9, we would have 62 x 62 x 62 x 62 x 62 x 62 = 62^6, or roughly 56.8 billion, possible outputs. Our poor machine is going to choke itself to death. Predicting one character at a time needs only 62 classes.
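The arithmetic, and the overlapping-window idea, can be sketched as follows. This is only an illustration: the `overlapping_windows` helper, the window width and the step size are hypothetical, and the dummy 2x12 "image" stands in for a real CAPTCHA picture.

```python
import string

# 26 uppercase + 26 lowercase letters + 10 digits = 62 symbols.
CHARSET = string.ascii_uppercase + string.ascii_lowercase + string.digits
CAPTCHA_LENGTH = 6

# Classes needed to predict the whole string at once vs. one character.
whole_string_classes = len(CHARSET) ** CAPTCHA_LENGTH
per_character_classes = len(CHARSET)
print(whole_string_classes)   # 56800235584
print(per_character_classes)  # 62


def overlapping_windows(image, window_width, step):
    """Cut image (a list of pixel rows) into vertical slices.

    Each slice is window_width columns wide and starts step columns after
    the previous one, so slices overlap whenever step < window_width.
    """
    total_width = len(image[0])
    windows = []
    for start in range(0, total_width - window_width + 1, step):
        windows.append([row[start:start + window_width] for row in image])
    return windows


# Dummy 2-row, 12-column image whose pixel values are just their indices.
image = [list(range(12)), list(range(12, 24))]
windows = overlapping_windows(image, window_width=4, step=2)
print(len(windows))  # 5
print(windows[1])    # [[2, 3, 4, 5], [14, 15, 16, 17]]
```

Each window would then be passed to the network as a candidate single character.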

As for the noise, we are going to leave it in, because we really want to test how accurate our machine is.

Now that we have the recipe, let’s plug in the CAPTCHA picture... wait, we can’t do that yet! We have to train our network first!

Yes, without training, our network is nothing but a baby! In this scenario, we are going to train our network on the famous MNIST dataset of handwritten digits and on EMNIST, an extension of MNIST that includes handwritten letters.

After we have chosen the dataset to train on, we will need a model. What most people do nowadays is use a pre-trained model and fine-tune the network. This is called transfer learning. However, it is sometimes overkill when your task does not match what the pre-trained model was trained on. In this case, we are just trying to predict numbers and letters, not animals or faces.

That is it for this part. I believe we now have a good understanding of what our approach looks like. Next will be an implementation of CNNs in Python using TensorFlow and Keras. We will go through it step by step, and conclude whether using text-based CAPTCHA in 2018 really helps prevent spam. Thank you for giving this article a chance!
