The Revolution of Behaviour Analysis: The Past Two Decades of CAPTCHA and AI
With the rapid development of artificial intelligence (AI), especially deep learning, text-based CAPTCHAs had reached a dead end by the early 2010s. A study by Stanford University showed that 99% of text-based CAPTCHAs could be solved effortlessly. Back in 2012, the most popular CAPTCHA of the time, Google reCAPTCHA, still relied on text-based challenges and used them for a crowdsourcing project to help digitize old books. This left a clear opening for a game-changing CAPTCHA product that truly combined security and user experience.
Stanford University, 2011
01 The blueprint for behavior analysis CAPTCHA
Given the rapid progress of deep learning and AI, GeeTest believed that the traditional text-based CAPTCHA had already passed its prime. The company was convinced that AI should be at the core of future CAPTCHAs: since AI would eventually play a crucial role in cracking CAPTCHAs, only an AI-powered CAPTCHA able to withstand AI attacks would survive.
If the text-based CAPTCHA was bound to become obsolete, what kind of CAPTCHA would replace it, and how could a CAPTCHA product be designed with a built-in AI model?
The founding team of GeeTest realized that the traditional CAPTCHA was an identity verification that relied solely on the user's keyboard input. In the age of computers, people interact with machines mainly through the keyboard and the mouse. Since the traditional CAPTCHA was based on keyboard input, the GeeTest team concluded that a CAPTCHA based on the user's mouse trajectory was what they were looking for. The team looked into mouse-tracking data and found that it revealed users' biological behavior as they moved and clicked the mouse. They believed it was feasible to build an AI model on that data to distinguish humans from machines. The team therefore decided to create a CAPTCHA challenge that only required users to move and click their mouse. The challenge might seem easy, but the mouse trajectory reflected whether the user was a human or a bot. Based on this vision, GeeTest designed a slide puzzle that was easy for humans to solve, and that became the original form of the GeeTest behavior analysis CAPTCHA.
Draft of GeeTest slide puzzle in 2012
From the moment the concept of a CAPTCHA based on behavior analysis was first proposed, the GeeTest founding team studied it in depth. In its 2012 patent application for behavior analysis CAPTCHA, GeeTest defined it as more than just slide puzzles.
02 The development of behavior analysis CAPTCHA
Even with a detailed blueprint for behavior analysis, it is impossible to create a perfect CAPTCHA product all at once. During the development of GeeTest CAPTCHA, through rapid iterations, the team finally made it happen.
2.1 Launch of the demo
The first step was to collect and analyze users' behavior data via the slide puzzle. At this stage, the main focus was on the form of GeeTest CAPTCHA and the communication between the front end and the back end. After several months of development, the demo of GeeTest CAPTCHA was launched on Wuhan University's BBS. Students were amazed by its new and different form. Instead of identifying twisted letters and words, users completed the CAPTCHA challenge through a slide-to-solve puzzle. That was undoubtedly a huge change.
Wuhan University BBS in 2012
2.2 Machine learning models in behavior analysis
After the launch of the demo, the strong user experience of GeeTest's slide puzzle had been proven. The next step for the GeeTest team was therefore to build a security model to analyze the collected user behavior.
According to the concept of behavior analysis defined by GeeTest, the mouse trajectory contained the most essential biological characteristics of the user, and these were the key to distinguishing humans from machines. Based on academic papers and assumptions about human behavior, the team defined basic indexes such as dragging speed, acceleration, speed uniformity, and vertical axis deviation, and tried to build a model from those data. By the time the demo launched, the GeeTest team had collected negative data (generated by legitimate users), but they still needed positive data generated by attackers. The team decided to create positive data on their own by writing automated scripts to crack the puzzle. Analyzing the data, they found that the negative and positive samples were highly distinguishable, which verified the feasibility of the behavior analysis model.
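As an illustration of how such indexes might be computed, here is a minimal Python sketch. It is not GeeTest's implementation: the function name `trajectory_features`, the `(x, y, t)` sample format, and the exact formula chosen for each index (for example, the coefficient of variation as a measure of speed uniformity) are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def trajectory_features(points):
    """Compute simple indexes from a drag trajectory.

    `points` is a list of (x, y, t) samples with strictly increasing
    timestamps t in seconds. The sample format and the formulas below
    are illustrative assumptions, not GeeTest's actual definitions.
    """
    if len(points) < 3:
        raise ValueError("need at least three samples")

    # Speed on each segment between consecutive samples.
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speeds.append(sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2) / dt)

    # Acceleration between neighbouring segments, using the time
    # elapsed between the two segment midpoints.
    accels = []
    for i in range(1, len(speeds)):
        dt = (points[i + 1][2] - points[i - 1][2]) / 2
        accels.append((speeds[i] - speeds[i - 1]) / dt)

    mean_speed = mean(speeds)
    return {
        "mean_speed": mean_speed,
        "mean_abs_acceleration": mean(abs(a) for a in accels),
        # Speed uniformity: 0.0 means perfectly constant speed.
        "speed_cv": pstdev(speeds) / mean_speed if mean_speed > 0 else 0.0,
        # Vertical axis deviation: spread of the y coordinate.
        "vertical_deviation": pstdev(p[1] for p in points),
    }


# A naive script dragging at constant speed along a straight line.
scripted = [(0, 0, 0.0), (10, 0, 0.5), (20, 0, 1.0), (30, 0, 1.5)]
# A hand-made, human-like drag with uneven speed and vertical wobble.
human_like = [(0, 0, 0.0), (5, 1, 0.5), (20, 3, 1.0), (30, 2, 1.5)]

bot_features = trajectory_features(scripted)
human_features = trajectory_features(human_like)
```

In this toy data the scripted drag has zero speed variation and zero vertical deviation, while the human-like drag does not, which is the kind of separation the paragraph above describes.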
With feasibility verified, the next step was to prove that a machine learning model could distinguish humans from machines. The team used the SPSS statistical analysis software to train a classic decision tree model. The result showed that GeeTest was capable of designing a CAPTCHA product with a workable behavior analysis model.
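To make the idea of a decision tree concrete, the following Python sketch fits the simplest possible tree, a single split on one feature (often called a decision stump). It is an illustration only and not the model GeeTest trained in SPSS: the feature (a speed-uniformity value), the sample values, and the helper names `fit_stump` and `predict` are all hypothetical. Labels follow the convention used above: 1 for positive (attacker) samples, 0 for negative (legitimate user) samples.

```python
def fit_stump(values, labels):
    """Fit a one-level decision tree on a single numeric feature.

    Tries every midpoint between neighbouring feature values as a
    threshold, in both directions, and keeps the split with the
    fewest misclassified samples. Returns (threshold, bot_below),
    where bot_below=True means "predict bot when value <= threshold".
    """
    candidates = sorted(set(values))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct feature values")

    best = None
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        for bot_below in (True, False):
            errors = sum(
                ((v <= threshold) == bot_below) != bool(label)
                for v, label in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, bot_below)

    _, threshold, bot_below = best
    return threshold, bot_below


def predict(stump, value):
    """Return 1 (bot) or 0 (human) for a single feature value."""
    threshold, bot_below = stump
    return int((value <= threshold) == bot_below)


# Hypothetical speed-uniformity values: humans vary a lot, scripts barely.
speed_cv = [0.42, 0.35, 0.58, 0.47, 0.02, 0.05, 0.01, 0.08]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = human (negative), 1 = bot (positive)

stump = fit_stump(speed_cv, labels)
```

A full decision tree repeats this kind of split recursively over several features; the stump only shows the basic mechanism of learning a threshold from labeled data.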
2.3 Machine Learning Engineering
After proving that their machine learning model was feasible, the team decided to replace the SPSS statistical analysis software with scikit-learn (sklearn), a machine learning library for the Python programming language, to optimize the model's performance and to write native code for model training.
As CAPTCHAs based on behavior analysis became popular, attempts to crack GeeTest CAPTCHA also increased, and the decision tree model was no longer enough. The team replaced it with more advanced machine learning models, including random forest and SVM, and defined more indexes for new behavior features, continuously optimizing the models and improving their security capability through an ongoing machine learning engineering process.
2.4 AI learning models
By adding new indexes for behavior features and optimizing constantly, the team kept improving the performance of GeeTest CAPTCHA. But as GeeTest CAPTCHA grew popular among customers from various industries, the threats against it grew as well. The learning process of GeeTest's models at the time relied too much on human experience, which was a serious disadvantage for the team, so they turned to an end-to-end deep learning model that could provide lasting security for customers.
However, the trajectory data of user behavior collected by GeeTest CAPTCHA was unfamiliar territory for deep learning and a huge challenge for it: the data length was not fixed, and the data contained two completely different dimensions (coordinates and time). Although the data did form a sequence, it was still quite different from an audio or text sequence. Aiming to build an AI-based CAPTCHA, the GeeTest team first adopted Convolutional Neural Networks (widely applied in image processing), and later clustering and hash models. Those AI models would keep evolving by themselves as more positive data (from attackers) and negative data (from users) were collected and, in turn, help GeeTest CAPTCHA achieve lasting automated updates.
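One common way to handle variable-length trajectories before feeding them to a network with a fixed input size is to resample each one onto a fixed number of evenly spaced time steps. The source does not say how GeeTest handled this, so the Python sketch below is only an assumption about one possible preprocessing step; the function name `resample` and the `(x, y, t)` sample format are hypothetical.

```python
def resample(points, n=8):
    """Linearly interpolate a trajectory onto n evenly spaced time steps.

    `points` is a list of (x, y, t) samples with strictly increasing
    timestamps. Returns a list of n (x, y) pairs, so trajectories of
    any length and duration end up with the same shape.
    """
    if len(points) < 2 or n < 2:
        raise ValueError("need at least two samples and n >= 2")
    t_start, t_end = points[0][2], points[-1][2]
    if t_end <= t_start:
        raise ValueError("timestamps must be strictly increasing")

    resampled = []
    j = 0  # index of the segment [points[j], points[j + 1]] containing t
    for k in range(n):
        t = t_start + (t_end - t_start) * k / (n - 1)
        while j < len(points) - 2 and points[j + 1][2] < t:
            j += 1
        x0, y0, t0 = points[j]
        x1, y1, t1 = points[j + 1]
        # Interpolation weight, clamped to guard against rounding error.
        w = min(1.0, max(0.0, (t - t0) / (t1 - t0)))
        resampled.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
    return resampled


short_drag = [(0, 0, 0.0), (10, 2, 1.0), (30, 2, 2.0)]
long_drag = [(0, 0, 0.0), (2, 0, 0.1), (7, 1, 0.3), (15, 1, 0.6),
             (24, 2, 0.9), (30, 2, 1.1)]

fixed_short = resample(short_drag, n=5)
fixed_long = resample(long_drag, n=5)
```

Both trajectories come out as five (x, y) pairs even though they were recorded with different numbers of samples and different durations. Folding the time dimension into the sampling grid is a design choice of this sketch; other encodings, such as keeping per-step time deltas as an extra channel, are equally plausible.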
It was through one iteration after another that behavior analysis CAPTCHA went from blueprint to reality. In the confrontation between CAPTCHAs and attackers, behavior analysis provides completely new and highly effective countermeasures. Through this continuous cycle of attack and defense, GeeTest CAPTCHA's AI models have kept training and getting smarter. Since the decline of text-based CAPTCHA in 2012, GeeTest behavior analysis CAPTCHA has become the most innovative and successful product in the industry. It is even recognized as the only feasible successor to text-based CAPTCHA. And that is the story of the revolution of behavior analysis CAPTCHA.