Googlebot is the general name for Google's spider, also known as a web crawler. A web crawler is software that follows links, gathers information from the web pages it visits, and sends that information back to the search engine.
Googlebot comes in two types: a desktop crawler (which simulates a user on a desktop) and a mobile crawler (which simulates a user on a mobile device). In either form, Googlebot visits a website, gathers information about the content of its pages, and sends that information to Google.
Every search engine has its own spider. Besides Googlebot, other crawlers include Bingbot for Bing, Slurp for Yahoo, Baiduspider for the Chinese search engine Baidu, and YandexBot for the Russian search engine Yandex.
The information Googlebot sends back after crawling a site is used by Google to maintain and update the Google Index (GI).
The GI is a huge database in which Google stores data about the web pages its crawlers report back to it. For a website to appear in search engine results, it needs to be crawled and indexed.
A site is typically crawled by Googlebot frequently, at short intervals. To crawl a website, Googlebot first accesses its robots.txt file. This file contains the rules for crawling the site, including a list of what is disallowed from being crawled. Googlebot also uses the sitemap.xml file to learn which parts of the site the owner wants crawled and indexed.
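For illustration, a very small robots.txt file might look like the following sketch; the paths and the sitemap URL are only placeholders, not recommendations for any real site:

    # Rules for Google's crawler only
    User-agent: Googlebot
    Disallow: /admin/

    # Rules for all other crawlers
    User-agent: *
    Disallow: /private/

    # Where the sitemap lives
    Sitemap: https://www.example.com/sitemap.xml

Each "User-agent" group applies to the named crawler, "Disallow" lists the paths that crawler should not fetch, and the "Sitemap" line points crawlers to the site's sitemap file.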
How does Googlebot work?
To begin the crawling process, the crawler starts with a list of addresses collected from past crawls and from the sitemaps provided by website owners.
These pages may contain links to other pages (on the same site or on other sites), and the crawler uses such links to discover further pages. The crawler pays special attention to new sites, changes to existing sites, and dead links.
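To make the link-following idea concrete, here is a toy crawler sketch in Python using only the standard library. It is not how Googlebot actually works (Google's scheduling, rendering, and politeness rules are far more sophisticated, and a real crawler would also respect robots.txt); it only illustrates the basic loop of fetching a page, extracting its links, and queueing the new ones, with example.com used as a placeholder seed:

    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        # Collects the href values of <a> tags while the HTML is parsed.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_urls, max_pages=10):
        # Breadth-first traversal: fetch a page, extract its links, queue new ones.
        queue = deque(seed_urls)
        seen = set(seed_urls)
        crawled = 0
        while queue and crawled < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # dead link; a real crawler would record it
            crawled += 1
            print("crawled:", url)
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)  # resolve relative links
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    crawl(["https://www.example.com/"])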
Thus, by following links, crawlers visit pages all over the internet. Google creates a cache of each page it visits. This cache is a snapshot of the webpage and is kept on Google's servers; it is what Google refers to when someone performs a related search.
A website's ranking is based on its cached content rather than its real-time content, so it may take some time for rankings to change after the website has been updated.
Unless a robots.txt file gives the crawler explicit instructions about what not to follow, the crawler will follow links to every page and catalog whatever it finds.
Googlebot also uses a sitemap to find pages. A sitemap is an XML or text file listing all the web pages on a particular site that one wants Google to know about. Sitemaps can be submitted to Google via the GSC (Google Search Console). Once a sitemap is submitted, it helps Google work through the desired pages on the website.
Because Google relies on complex algorithms to schedule crawling, a sitemap does not guarantee that every item listed in it will be crawled or indexed. However, having a sitemap certainly helps with crawling and indexing.
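For illustration, a minimal XML sitemap following the sitemaps.org protocol might look like the sketch below; the URLs and dates are placeholders:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2023-01-15</lastmod>
      </url>
      <url>
        <loc>https://www.example.com/about</loc>
        <lastmod>2023-01-10</lastmod>
      </url>
    </urlset>

Each "url" entry names one page ("loc") and, optionally, when it last changed ("lastmod"), which gives the crawler a hint about what may need to be revisited.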
To check whether a website is crawlable, one can use the GSC. The Crawl Errors page in the GSC provides a detailed report of issues that are working against a site with regard to crawling.
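Alongside the GSC report, one can also do a quick local check of whether a specific URL is blocked by a site's robots.txt rules, for example with Python's standard urllib.robotparser module. This only tests robots.txt rules, not indexing status, and the site and URLs below are placeholders:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # can_fetch() answers: is this user agent allowed to crawl this URL?
    print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))
    print(rp.can_fetch("Googlebot", "https://www.example.com/blog/post.html"))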
A website cannot be kept entirely uncrawlable, because if any other website links to it, its URL is exposed through that link. Since a spider follows links, it can reach the site through the link provided elsewhere.