13 Jul 2020 • 10 min read
Ever encountered a situation where material from your website ended up somewhere else without your permission? Or maybe your contact information was acquired by someone you don’t recall giving it to? Outrageous, isn’t it? How do such things happen when we have plagiarism and personal data protection laws? It’s simple. These activities are conducted in an automated fashion by web scraping bots that invade your website, steal targeted information, and use it for fraudulent purposes such as competitive content and price scraping, spam, or black-hat SEO. Let’s look deeper into it.
Web Scraping, aka Screen Scraping, Web Data Extraction, or Web Harvesting, is an automated bot operation aimed at acquiring, extracting, and stealing large amounts of content and data from websites. To put it simply, web scraping is a form of data mining. Programmers deploy sophisticated bots trained to scrape information from websites and public profiles of real users on sites and social networks.
The purpose of web scraping depends on the culprits behind the bad bots. While your competitors want to gain an unfair competitive advantage through price and content scraping, fraudsters intend to use the information in various money-making schemes such as illegal web services and spam libraries.
First, web scraping can be dangerous because the data from your website is leaking into the wrong hands and can be used in various ways, including malicious ones. It is especially critical when it comes to user data.
Fraudsters scraping for users’ data are the most dangerous ones. The scandal involving Facebook and Cambridge Analytica in 2018 proved that the malicious use of personal data remains a burning issue nowadays.
The web scraping process is quite simple by nature and consists of two parts - acquiring the page and extracting data from it.
To acquire the page, web scrapers use crawlers, or as many people call them, “spiders”. Spiders are the same kind of algorithms Google uses to crawl and index new websites for its search results. The only difference is that web scraping crawlers use the same crawling methods for different purposes. After crawling and downloading the page, hackers proceed to extract the website’s data.
Extracting data involves scanning the page and collecting the necessary information from it. After receiving all the required data, web scrapers organize it into spreadsheets or databases for storage and further use. The data can be anything: names, phone numbers, companies, websites, etc. Fraudsters use various methods to scrape this data, which we will look at shortly; first, the sketch below illustrates the basic acquire-and-extract flow.
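A minimal Python sketch of that flow, assuming only a placeholder URL and the standard library; the title extraction is deliberately naive and only meant to show the two phases plus storage.

```python
import csv
from urllib.request import urlopen

# Phase 1: acquire the page (placeholder URL, for illustration only).
url = "https://example.com"
html = urlopen(url).read().decode("utf-8")

# Phase 2: extract a piece of data -- here, whatever sits in the <title> tag.
start = html.find("<title>") + len("<title>")
end = html.find("</title>", start)
title = html[start:end].strip()

# Store the result in a spreadsheet-friendly CSV file.
with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])
    writer.writerow([url, title])
```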
Web scraping can be conducted using various techniques. However, it is deployed in two main ways: manual and automated.
The copy-paste method is as old as computers themselves. It relies on human labor and takes time, but it offers a better guarantee that the collected data will be relevant.
Many people in the past copy-pasted from Wikipedia and other sources. Of course, students and scientists today get penalized for it because plagiarism-checking tools were introduced, but there was a time when this method was popular.
Despite requiring human resources and time, the copy-pasting method has its advantages - websites with bot management systems will have a hard time stopping it since a human, not a bot, conducts the process.
HTML parsing is a web scraping technique that involves taking a page's HTML code and extracting the relevant information from it. Web scrapers often use JavaScript to conduct HTML parsing. It is a quick and efficient way of scraping.
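As a hedged illustration, here is what basic HTML parsing looks like using Python's built-in html.parser module (Python rather than JavaScript, since the Python tooling is covered later in this article); the sample markup is made up.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag encountered in the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="https://example.com/page">Read more</a></body></html>')
print(parser.links)  # ['https://example.com/page']
```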
The DOM, or Document Object Model, represents and analyzes the structure of an XML or HTML page in the form of a tree. Fraudsters use DOM parsers to get a deeper view of page structure and find easy paths to the information they are looking for. Hackers use the XPath query language, along with browsers such as Internet Explorer and Firefox, to conduct web scraping with the DOM parsing method.
Vertical aggregation is the aggregation of websites from specific verticals - niches with vendors who serve the needs of a particular audience. This web scraping method requires large-scale computing power and dedicated platforms. The platforms control the scraping bots while they search specific verticals. The method doesn’t require much manpower because the bots are trained to detect verticals on their own; the better the quality of the data they aggregate, the more efficient the bots are considered.
Text pattern matching is a technique based on regular expression matching in programming languages such as Perl or Python, or on the UNIX grep command.
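A rough sketch of text pattern matching in Python: regular expressions pull email addresses and phone numbers out of raw page text, much like a grep pattern would. The sample text and patterns are illustrative only.

```python
import re

# Stand-in for raw HTML or text that has already been downloaded.
page = """
Contact us: sales@example.com or call +1-202-555-0147.
Support: support@example.com
"""

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", page)
phones = re.findall(r"\+\d{1,3}-\d{3}-\d{3}-\d{4}", page)
print(emails)  # ['sales@example.com', 'support@example.com']
print(phones)  # ['+1-202-555-0147']
```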
Bots look for metadata, semantic markup, and annotations and use these as parameters to locate specific data snippets. If the annotations are embedded in the web pages themselves, the technique can be considered a form of DOM parsing.
Computer vision analysis allows bots to identify the parts of a website worth scraping and then extract all the necessary data. It analyzes pages the way humans see them. Computer vision can analyze images and videos (real-time and recorded), read text in pictures, and recognize handwriting.
To find out how exactly your site is being scraped, you need to know what you are dealing with. Here are the most widespread tools and bots used by hackers for web scraping websites.
Python is one of the most popular languages used for AI programming, and it is no secret that it is also used to build web scraping bots. To create such a bot, scrapers use the Requests package to make HTTP requests and the Beautiful Soup package to handle the HTML processing.
Beautiful Soup is a library for extracting data from HTML or XML files. It works in combination with parsers to navigate, search, and modify the parse tree, saving programmers a lot of time when extracting data.
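A minimal sketch of the Requests-plus-Beautiful-Soup combination described above; the URL and the CSS class are hypothetical and stand in for a real target.

```python
import requests
from bs4 import BeautifulSoup

# Acquire the page (hypothetical catalog URL).
response = requests.get("https://example.com/catalog")

# Hand the HTML to Beautiful Soup, which builds a navigable parse tree.
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)  # the page title

# Find every element matching an assumed "product-name" class.
for item in soup.find_all("h2", class_="product-name"):
    print(item.get_text(strip=True))
```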
XPath, or XML Path Language, is a query language for XML documents. It navigates their tree-like structure, which XPath can analyze very well, and selects data using specific parameters. That is why XPath is most widely used in DOM parsing.
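Here is a hedged example of DOM parsing with XPath using the Python lxml library; the HTML snippet and class names are invented for illustration.

```python
from lxml import html

# Made-up markup standing in for a real page's DOM.
page = html.fromstring("""
<html><body>
  <div class="listing"><span class="price">$19.99</span></div>
  <div class="listing"><span class="price">$24.50</span></div>
</body></html>
""")

# Walk the document tree and select every price node with an XPath query.
prices = page.xpath('//div[@class="listing"]/span[@class="price"]/text()')
print(prices)  # ['$19.99', '$24.50']
```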
As strange as it may seem, Google Sheets also has a web scraping capability through a function named IMPORTXML. However, this method is only useful when hackers can specify the exact data or patterns they need from a website. Otherwise, when annotations are organized into a semantic layer that is stored and managed separately from the pages, scrapers can retrieve data from that layer before scraping the pages themselves.
The Selenium browser automation tool allows performing practically any automated activity in a browser while making it look as if a human performed it. That’s why it is a popular tool among hackers and fraudsters who deploy bots for malicious activities, including web scraping. Moreover, Selenium won’t just scrape a website for you; it will also give you a deep knowledge of how websites work.
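A small Selenium sketch in Python, assuming a local ChromeDriver is installed; the URL is a placeholder, and the point is simply that the page is loaded by a real browser, so the visit resembles a human one.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser instance (requires ChromeDriver on the machine).
driver = webdriver.Chrome()
driver.get("https://example.com")

# Read content from the rendered page the way a person would see it.
heading = driver.find_element(By.TAG_NAME, "h1").text
print(heading)

driver.quit()
```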
The Boilerpipe Java library uses algorithms to detect and remove unnecessary clutter, such as HTML tags, from a page, providing its user with clean text. Its high processing speed and data accuracy attract a lot of hackers and make it one of their favorite tools.
Nutch is a sophisticated open-source web crawler that is well suited to batch processing. Once programmed, the tool crawls, extracts data remarkably fast, and stores it. However, Nutch requires manual coding of the target websites into its interface.
Watir is a publicly available Ruby library powered by Selenium, created for automated procedures such as clicking links, filling out forms, and validating text. It is a powerful tool for detecting and extracting data from websites and is popular with hackers.
Another hacker-friendly website automation tool is a JRuby wrapper built on top of HtmlUnit, a GUI-less browser for Java programs. It models HTML documents and provides an API to navigate pages, fill out forms, click links, and perform many other automated operations, including data extraction from websites.
Scrapy is a high-speed, high-level web scraping framework based on Python. It is used to crawl and scrape websites and to extract data through APIs, and it can serve various purposes, from data mining to automated testing.
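A minimal Scrapy spider, targeting the public practice site quotes.toscrape.com, shows the crawl-extract-follow pattern the framework is built around; the selectors match that site's markup and would differ elsewhere.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: run with `scrapy runspider quotes_spider.py -o quotes.json`."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract each quote and its author with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link so the crawl continues automatically.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```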
Website creators generally hate web scraping because they cannot control how the information will be used; that depends on whether the bot that crawled it was good or bad. But is scraping legal or not? Both yes and no, depending on the circumstances.
In September 2019, LinkedIn lost a case in the US Court of Appeals against hiQ, a small data analytics company that regularly scraped data from LinkedIn profiles. The court ruled in favor of hiQ simply because the data on the profiles was public anyway. Moreover, the court prohibited LinkedIn from taking countermeasures against the company’s web scraping.
The case above means that bots can scrape any information available for public use. What is more, US businesses cannot oppose it.
Does that mean that any scraper in the US can freely scrape websites? Yes and no. Because copyright still applies, data aggregated by automated tools can’t be used for commercial purposes. To sum up, intellectual property rights still stand above any computerized scraping, so make sure you copyright everything you create on your website.
Several other cases, however, have ruled against automated web scraping. In other words, the court decision in the LinkedIn vs. hiQ case is not the final word on the matter.
Bad bots acquire content with the intent of using it for purposes outside the site owner’s control, and they make up 20% of all web traffic. They use the data for malicious, fraudulent activities such as data mining, online fraud, account takeover, data theft, plagiarism, spam, digital ad fraud, and others, including commercial use of the data.
All these activities involving scraped data remain illegal and are considered a violation of the CFAA (Computer Fraud and Abuse Act).
Here is the list of web scraping activities for businesses that are considered legal:
Many real estate companies prefer to fill their property databases using web scraping because it is a fast and efficient way to get new clients.
If someone needs a statistical analysis of the companies working in a particular industry or niche, the fastest and most efficient way is to scrape information about those companies. Industries include dozens or even hundreds of companies, and human resources alone won’t be enough to cover them all.
Price comparison websites are convenient and useful for consumers. They scrape prices from different stores and help users find the best deals.
Web scraping is used by businesses one way or another. Some companies use it for their own growth: listings, industry statistics, price comparison, competitor analysis, etc. Others use it for fraud: data mining, online fraud, account takeover, data theft, plagiarism, different forms of spam, digital ad fraud, and more. Your competitors could be using web scraping against you as well.
To detect web scraping, you need to pay attention to the following aspects of your website activity:
IP tracking is an old and reliable way of spotting harmful activity on your website. However, many tools make automated visits look like human ones. How can you tell whether visits are automated or genuinely human? One telltale sign is request volume per IP address, as the sketch below shows.
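A simple, hedged sketch of IP tracking: count the requests each IP address makes in your access log and flag addresses whose volume no human browsing session would plausibly reach. The log format and threshold below are assumptions, not a universal rule.

```python
from collections import Counter

THRESHOLD = 300  # assumed hourly request count a real visitor is unlikely to exceed

hits = Counter()
with open("access.log") as log:   # common log formats start with the client IP
    for line in log:
        ip = line.split()[0]
        hits[ip] += 1

for ip, count in hits.most_common():
    if count > THRESHOLD:
        print(f"Possible scraper: {ip} made {count} requests")
```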
One way to stop scrapers from their activity is to set up a sign-up process with email confirmation. This might slow them down. However, since bots can learn how to click links and fill out forms, this solution may not be as effective as it used to be.
Putting your important information into media objects such as images, videos, podcasts, and other media formats makes the web scrapers’ task harder, because the main type of content they normally scrape is text.
The User Agent (UA) is an identifier that reveals the browser, operating system, and other features when visitors enter a website. UA checks can expose scripts and automated tools that don’t disguise themselves. Nevertheless, many bots have learned to hide their automated nature and spoof the User Agent.
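As a rough illustration, a User-Agent screen can be as simple as matching the header against markers of common automation tools; the marker list below is a hypothetical sample, and since bots can spoof this header it should only ever be one signal among many.

```python
# Hypothetical markers of common automation tools seen in User-Agent strings.
KNOWN_BOT_MARKERS = ("python-requests", "scrapy", "curl", "httpclient")

def looks_automated(user_agent: str) -> bool:
    """Return True when the User-Agent string is empty or matches a known tool."""
    ua = (user_agent or "").lower()
    return ua == "" or any(marker in ua for marker in KNOWN_BOT_MARKERS)

print(looks_automated("python-requests/2.31.0"))                      # True
print(looks_automated("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))   # False
```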
AJAX means “Asynchronous JavaScript And XML” - a set of web development techniques that make web applications work asynchronously, meaning the website doesn’t need to reload when something new is added to it.
This can get in the way of web scrapers since they are trained to scrape websites with certain fixed parameters; handling such sites requires deeper training of the bots.
Spider honeypots place trap links on a website that are visible only to crawlers, not humans. They both detect and block web scraping bots.
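A toy honeypot sketch using Flask (Flask is not mentioned in this article; it is only a convenient way to illustrate the idea): the page contains a link humans never see, and any client that requests it gets added to a blocklist.

```python
from flask import Flask, abort, request

app = Flask(__name__)
blocked_ips = set()

@app.route("/")
def index():
    if request.remote_addr in blocked_ips:
        abort(403)
    # The hidden link is invisible to humans but followed by crawlers
    # that blindly harvest every href on the page.
    return '<a href="/trap" style="display:none">do not follow</a>Welcome!'

@app.route("/trap")
def trap():
    # Reaching this URL reveals the client as a bot; block it from now on.
    blocked_ips.add(request.remote_addr)
    abort(403)
```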
Placing CAPTCHA at key engagement points such as sign-up, purchase, and comment pages will help prevent bots from scraping the important information on your website. However, remember that legacy CAPTCHAs are no longer effective against sophisticated bad bots. To effectively mitigate the advanced threats the modern internet faces, an AI-powered CAPTCHA such as GeeTest CAPTCHA is the better option.
Web scraping is a constant headache for major businesses these days. Protecting your information from web scrapers is crucial for your business and intellectual rights. Depending on the scale of your business, you can use various means to fight scraping bots.
For a small website, the best option will be simple and free solutions such as IP monitoring and email sign-up confirmation.
For a growing website, your best option could be a free CAPTCHA such as reCAPTCHA. However, you need to understand that free isn’t always fully effective.
If your website has commercial value and considerable traffic, your best option will be a full-stack bot management solution that helps preserve your business’s security and competitiveness against not only web scrapers but all bad bot threats.