What is a crawler and how does it work?

Concepts

Search robots (also called crawlers, bots, or web spiders) are programs that discover and index site pages by following links from pages that are already indexed.

Bot operation scheme:



  1. Scanning – collecting all data from a page, including images, text, and video. This happens repeatedly, because the page can change.
  2. Indexing – adding the collected information to the search engine database.
  3. Search results – looking up information in the index and ranking pages by relevance to the query.

How search robots work and their functions

Search results are formed in three stages:

  • Scanning – bots collect all data from web pages, including text, pictures, and videos. This happens regularly, at a rate that depends on how often the resource is updated.
  • Indexing – the collected information is entered into the search engine's database and organized into an index for fast lookup. On major news portals, content is indexed almost immediately after publication.
  • Delivery of results – the index is searched and pages are ranked by their relevance to the query.
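The three stages above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine works: the PAGES dictionary stands in for actual page downloads, the URLs are made up, and the ranking (counting how often query words occur) is an assumption chosen for simplicity.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for real HTTP fetches).
PAGES = {
    "https://example.com/a": "Crawlers index web pages",
    "https://example.com/b": "Search bots crawl pages and follow links to pages",
    "https://example.com/c": "Favicons are small site icons",
}

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def scan(pages):
    """Stage 1 (scanning): collect the content of every page as a word list."""
    return {url: tokenize(text) for url, text in pages.items()}

def build_index(scanned):
    """Stage 2 (indexing): inverted index, word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, words in scanned.items():
        for word in words:
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Stage 3 (results): rank pages by how often the query words occur."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(scan(PAGES))
results = search(index, "pages")
print(results)
```

Here page b ranks above page a for the query "pages" because the word occurs twice on it, and page c is not returned at all.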

Sometimes a page is indexed without being scanned first. The robots.txt file specifies rules for crawling pages, not for indexing them. So if a search robot discovers the page another way, for example through links from third-party resources, it can still add it to the database.



What bots do Google and Yandex have?

Each search engine has its own search bots. Let’s take a look at Google and Yandex as examples.

Google

  • Googlebot – the main bot. It crawls both the desktop and mobile versions of standard sites. Since July 2019, Google has prioritized crawling the mobile versions of sites, so most requests come from the mobile crawler.
  • Googlebot Images – the bot that indexes images.
  • Googlebot News – the bot that adds content to Google News.
  • Google Favicon – the crawler that collects site favicons (icons).

Impressive, huh? Yandex is no worse off: it has plenty of bots too.

Yandex

  • The main robot that indexes pages – YandexBot/3.0.
  • The bot that downloads pages to check their availability – YandexAccessibilityBot/3.0.
  • The robot that detects site mirrors – YandexBot/3.0; MirrorDetector.
  • The bot that indexes images – YandexImages/3.0.
  • The bot that downloads site favicons – YandexFavicons/1.0.
  • The crawler that indexes multimedia content – YandexMedia/3.0.
  • The bot that collects materials for Yandex.News – YandexNews/4.0.
  • The Yandex.Metrica crawlers – YandexMetrika/2.0, YandexMetrika/3.0.
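Since each bot announces itself in the User-Agent header, a site can guess which robot is visiting by looking for these names. Below is a minimal sketch of that idea. The Yandex tokens come from the list above; Googlebot-Image and Googlebot-News are the tokens Google's image and news crawlers use; the descriptive labels and the sample User-Agent string in the usage lines are illustrative assumptions. Keep in mind that a User-Agent string is easy to fake, so this check alone does not prove who the visitor is.

```python
# Token -> human-readable label. The labels are illustrative, not official names.
BOT_TOKENS = {
    "Googlebot": "Google main crawler",
    "Googlebot-Image": "Google image crawler",
    "Googlebot-News": "Google News crawler",
    "YandexBot": "Yandex main crawler",
    "YandexAccessibilityBot": "Yandex availability checker",
    "YandexImages": "Yandex image crawler",
    "YandexFavicons": "Yandex favicon crawler",
    "YandexMedia": "Yandex multimedia crawler",
    "YandexNews": "Yandex.News crawler",
    "YandexMetrika": "Yandex.Metrica crawler",
}

def identify_bot(user_agent):
    """Return a label for a known bot token found in the header, else None."""
    ua = user_agent.lower()
    # Try longer tokens first so "Googlebot-Image" wins over plain "Googlebot".
    for token in sorted(BOT_TOKENS, key=len, reverse=True):
        if token.lower() in ua:
            return BOT_TOKENS[token]
    return None

# Sample header values, for illustration only.
print(identify_bot("Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)"))
print(identify_bot("Googlebot-Image/1.0"))
print(identify_bot("Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0"))
```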

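Managing search robots

Site owners can tell robots what they may and may not do. For example, the rule below in the robots.txt file tells the Yandex.Images robot not to crawl anything on the site, which keeps all of its images out of that robot's reach.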

User-agent: YandexImages
Disallow: /
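To see how a well-behaved robot would interpret this rule, you can feed it to the robots.txt parser from Python's standard library. This is a small sketch: the rule is written out with one directive per line, and the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The rule from the text, one directive per line.
ROBOTS_TXT = """\
User-agent: YandexImages
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site gives the same answer for this rule.
url = "https://example.com/photo.jpg"

images_allowed = parser.can_fetch("YandexImages", url)  # the blocked robot
main_allowed = parser.can_fetch("YandexBot", url)       # a robot with no rule

print("YandexImages may crawl:", images_allowed)
print("YandexBot may crawl:", main_allowed)
```

The rule applies only to the robot named in the User-agent line, so the image robot is refused while the main robot, which has no matching rule, is still allowed.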

And this tag prohibits Google's main robot from indexing the page on which it is located:

<meta name="googlebot" content="noindex">

What’s on the dark side?

It is undoubtedly cool that you can find the information you need through a search in a couple of seconds. But let’s see how this can be used for evil purposes:

  • OSINT – personal information is not hard to find through search, which makes it easy to build up a collection of compromising material on a target.
  • Inability to delete – many people think it is easy to have personal information removed, but they are mistaken. Search engines such as Google are not always willing to act on removal requests.

Summary

Bots process different types of content in different orders, which allows huge amounts of data to be handled in parallel. Thanks to crawlers, we can search for the information we need every day. The robots find pages on their own, so nobody has to be paid to do this by hand. But there are also dark sides, such as OSINT through search and refused requests to delete information.

It is better to block information from indexing with the robots meta tag or the X-Robots-Tag HTTP header, since the robots.txt file contains only crawling recommendations, not direct commands.
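To check whether a page actually carries one of these signals, you can inspect both the response headers and the HTML. The sketch below uses only Python's standard library. The function name is_noindex and the sample inputs are made up for illustration, and the header handling is simplified: it ignores the form of X-Robots-Tag that names a specific robot before the directive.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self, names=("robots", "googlebot")):
        super().__init__()
        self.names = names
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without a value.
        if (attr.get("name") or "").lower() in self.names:
            for directive in (attr.get("content") or "").split(","):
                self.directives.add(directive.strip().lower())

def is_noindex(headers, html):
    """Return True if the header or a robots meta tag asks not to index the page."""
    header_value = ""
    for key, value in headers.items():
        if key.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = {d.strip().lower() for d in header_value.split(",")}

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives

# Illustrative inputs: one page blocked by header, one by meta tag, one not blocked.
print(is_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))
print(is_noindex({}, '<head><meta name="robots" content="noindex"></head>'))
print(is_noindex({"Content-Type": "text/html"}, "<head><title>Hi</title></head>"))
```

Note that a robot can only see either signal if it is allowed to fetch the page, so a page that is also blocked in robots.txt may never have its noindex read.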

