26 May 2023 • 10 min read
26 May 2023 • 10 min read
Since GeeTest pioneered the new generation of intelligent CAPTCHA in 2013, we've been at the forefront of evolving and enhancing bot detection & mitigation technologies. Adopted by nearly 400,000 developers worldwide over the past decade, our ground-breaking "Behavioral Verification" has not only improved user experience but has also fortified the battle against bot-generated fraudulent activities. Leading companies such as Binance, Imperva, miHoYo, and Agoda have integrated our CAPTCHA into their systems, transitioning away from traditional verification methods.
In this ever-advancing cybersecurity landscape, we at GeeTest have consistently pushed boundaries, exploring and integrating cutting-edge technologies. From the early adoption of Neural Style Transfer technology in 2016, we've sought the perfect balance between user experience and security.
The working mechanism of Neural Style Transfer (Image Source)
GeeTest's innovation integrating "semantic understanding" with "Neural Style Transfer", protected by patent ZL 201830130077.X (Source: GeeTest)
Today, as the realm of AI-Generated Content (AIGC) - including Text-to-Image models, Large-scale Text-to-Image Generation Models (LTGMs), Large Language Models (LLMs), and beyond - continues to evolve, we seize every opportunity to learn, experiment, and assimilate these advanced technologies to further enhance our bot protection strategies.
This article takes you behind the scenes of our exploration into these generative AI technologies, offering insights into our process, progress, and intriguing experimental results.
Text-to-image models represent a pioneering approach within Multimodal Deep Learning. They can produce images that semantically correspond to text descriptions. Through the process of connecting visual features and linguistic information, these models bridge the gap between textual prompts and their visual counterparts.
Trained on extensive datasets containing paired natural language descriptions and corresponding images, these models extract key features from the text and map them onto the visual representation in the images. The training process involves sophisticated semantic comprehension and image synthesis.
With this training, the models can generate images from previously unseen textual descriptions. The text is first encoded into a feature vector, which then facilitates the synthesis of a corresponding image through a generator network.
The applications of these models span across several domains, covering the creation of realistic product images for e-commerce websites, development of visual aids for individuals with disabilities, generation of images for virtual and augmented reality applications, and production of CAPTCHA images.
In my work with AI image generation, I've found that there are several challenges to be addressed in terms of accuracy, controllability, and scalability.
One of the main issues with AI image generation processes relates to accuracy. It's clear that most pre-trained models are designed around English, which often leads to translation-induced ambiguities and mismatches between text and images. A word with multiple meanings can cause significant confusion.
For example, the images shown below represent a classic case of translation ambiguity. The English term "crane" can refer to either a construction machine or a bird, hence the model generates both types of images.
Here, we illustrate another challenge faced when generating images based on our existing prompt library. For instance, when the prompt is "electric mouse", the term "mouse" — which could signify an animal or a computer peripheral in English — was mistranslated to signify an actual "mouse", resulting in generated images depicting only the rodent. Furthermore, a conspicuous mismatch can be noted in the image at coordinates [1,2], where the generated image bears no relation to the provided prompt.
Similarly, this issue of semantic inconsistency persists with the prompt "can". While typically referring to an aluminum container, it unexpectedly generates a variety of unrelated images, including a cat, an aluminum suitcase, an aluminum beverage can, and a barn-like structure, among others.
Next, we present an instance where the model, switched to a cartoon style, and encountered both semantic ambiguity and sensitivity. Using the same prompt "electric mouse", we observed a variety of responses. Remarkably, it produced images ranging from mice (the animal) morphed into the form of computer mice, to anime characters exhibiting mouse-like features. This encounter demonstrates that the challenges persist across styles and contexts, reaffirming the need for careful attention and innovative solutions in AI image generation tasks.
These instances underline the need for models to better handle language nuances and improve the accuracy of the images generated. An avenue to explore is the open-source project like Fengshenbang-ML, constructed based on Chinese by the IDEA-CCNL. Given its localized structure, Fengshenbang could serve as a potential solution to accuracy-related issues in the context of Chinese language applications. Yet, in large-scale applications, the blend of advanced AI and human supervision remains critical to ensuring appropriate image generation.
Addressing sensitive materials and issues of fairness and bias can be partially resolved with safety-checker. However, implementing additional measures can significantly enhance these mitigation efforts:
In the sphere of AI image generation, scalability plays a critical role, especially in the context of mass content production. Achieving scalability requires careful computational resource planning and GPU resource allocation. This involves addressing three core demands:
Responding to these pressing needs, we've built a model service architecture leveraging Ray and K8s. This system enables a minimalist approach to code volume in model service deployment while providing flexibility for easy horizontal scaling.
A model device based on Ray and K8s (Source: GeeTest)
As illustrated above, the unique architecture enables us to deploy a model service with the least possible code volume.
These strategic improvements significantly bolster scalability and agility – both of which are vital for the large-scale application of AIGC in bot mitigation strategies.
Real-world applications of GeeTest CAPTCHAs generated by Generative AI Models (Source: GeeTest)
As we've explored, the integration of Generative AI technologies into GeeTest's bot mitigation strategies brings promising possibilities, yet also presents intriguing challenges. Our proactive approach to resolving accuracy, controllability, and scalability issues has paved the way for significant advancements in CAPTCHA technology. As we look to the future, our focus remains on developing advanced AI image generation techniques to combat threats like "ML model cracking" and "CAPTCHA solving farms". These forms of attacks often work in conjunction, where manual labor initially solves CAPTCHA for bots, followed by the utilization of AI/ML algorithms to train models for automatic cracking. Therefore, our focus lies in enhancing the updating speed of images and their resistance against model cracking.
Currently, GeeTest processes billions of daily API calls and maintains a stringent service solution, backed by our extensive experience in image security. We ensure automated updates of our entire online image resources every hour, incorporating 50,000 images across 200 categories. For clients with higher instantaneous requirements, we offer automated updates of 10,000 images in 50 categories every 10 minutes.
I look forward to sharing more about our ongoing explorations and trials in the world of AI-generated content (AIGC). We'll also dive into more real-world adversarial examples using LTGMs, discussing in detail how our models offer anti-cracking advantages in CAPTCHA verification and bot prevention. The journey continues, and I'm excited for what lies ahead.
Senior Software Engineer | GeeTest
Subscribe to our newsletter