
Your question is "How do I block an entire directory from being crawled?" Let's discuss.

Search engines use web crawlers to navigate through pages and links and to index the information they gather. This information helps search engines respond to user queries and present appropriate search results.

Web crawlers move through a website and follow the links on it to reach other pages on that site as well as on other websites. A website can have a file named robots.txt that contains instructions for a search engine's user agents, or bots.

These instructions are rules to follow while moving through the pages and directories of that website. With the exception of malicious bots, a search engine's user agents usually follow such instructions while they crawl a website's pages.

Robots.txt can contain instructions that allow or disallow a web crawler from visiting pages or sections of a website, including rules that block a crawler from a particular page or directory entirely. These instructions can be issued in a variety of ways and combinations.

To understand how to block an entire directory from being crawled, consider the following examples.

In robots.txt, if one wants to allow all web crawlers to access all content, these lines could be used:

User-agent: *
Disallow:

If one wants to disallow a specific web crawler, say Googlebot, from accessing any content on the site, these instructions could be used:

User-agent: Googlebot
Disallow: /

If one wants to disallow a specific web crawler from a specific directory or folder, these instructions could be used:

User-agent: Googlebot
Disallow: /sample-subfolder/

If one wants to disallow all web crawlers from a specific directory, simply use the wildcard character, the asterisk, to indicate all crawlers in the User-agent line, like this:

User-agent: *
Disallow: /sample-subfolder/
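
To sanity-check rules like these before publishing them, one could feed them to Python's built-in urllib.robotparser module. The following is a minimal sketch using the example above; the page names are just placeholders:

from urllib import robotparser

# The example rules above, supplied as a list of lines
rules = [
    "User-agent: *",
    "Disallow: /sample-subfolder/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Everything under the blocked directory is disallowed; other paths remain allowed
print(parser.can_fetch("Googlebot", "/sample-subfolder/sample_page.html"))  # False
print(parser.can_fetch("Googlebot", "/another-page.html"))                  # True

Because the asterisk group applies to every crawler, any user-agent name could be substituted for Googlebot here with the same result.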

Similarly, multiple directories can be disallowed at a time, whether for a single user agent or for all web crawlers. The following example disallows one web crawler from accessing multiple directories:

User-agent: Googlebot
Disallow: /sample-folder-one/
Disallow: /sample-folder-two/
Disallow: /sample-folder-three/

A simple modification to the above example bars all web crawlers from accessing these three directories:

User-agent: *
Disallow: /sample-folder-one/
Disallow: /sample-folder-two/
Disallow: /sample-folder-three/

The same can be done for a specific web page, as shown in the following examples.

If one wants to disallow a specific web crawler from a specific web page, these instructions could be used:

User-agent: Googlebot
Disallow: /sample-subfolder/sample_page.html

If one wants to disallow all web crawlers from a specific web page, simply use the asterisk, as in the directory-blocking case:

User-agent: *
Disallow: /sample-subfolder/sample_page.html
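
The same urllib.robotparser sketch can confirm that only the named page is affected while the rest of the folder stays crawlable (again, the file names are placeholders):

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /sample-subfolder/sample_page.html",
]
parser = robotparser.RobotFileParser()
parser.parse(rules)
print(parser.can_fetch("Googlebot", "/sample-subfolder/sample_page.html"))   # False
print(parser.can_fetch("Googlebot", "/sample-subfolder/another_page.html"))  # True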

If the instructions mention only specific crawlers, any crawler that is not mentioned is free to crawl the pages. To make sure every possible user agent is covered, use the wildcard character (the asterisk, as demonstrated in the examples above) wisely, in whatever combinations the situation requires.
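
As an illustration, a file along the following lines (the folder names are only placeholders) gives Googlebot its own group of rules while every other crawler falls back to the asterisk group. A crawler obeys only the group that matches it most specifically, so here Googlebot would follow its own group and ignore the asterisk group:

User-agent: Googlebot
Disallow: /sample-folder-one/

User-agent: *
Disallow: /sample-folder-two/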

From the above examples, we can see how robots.txt and the instructions it contains can block a web crawler as our needs require. Keep in mind that these crawling instructions are only directives; a user agent could choose to ignore them. Also, if a website does not have a robots.txt file, then by default all of its pages and elements may be crawled.
