What is robots.txt?
Robots.txt refers to the Robots Exclusion Protocol (REP). It is a text file that website developers create to tell robots how to crawl and index the pages of a website. With it you can:
- Disallow all search engine robots from crawling any page on the website
- Disallow a particular robot from a particular directory
- Disallow a particular robot from certain web pages
The file with the instructions for robots has to be placed in the top-level folder of the website server (the root). Example: http://www.seomall.net/robots.txt
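For illustration, the three cases listed above could be written as robots.txt groups like these (the robot name "BadBot" and all paths are hypothetical):

```
# 1. Disallow all robots from the whole site
User-agent: *
Disallow: /

# 2. Disallow a particular robot from a particular directory
User-agent: BadBot
Disallow: /private/

# 3. Disallow a particular robot from certain pages
User-agent: Googlebot
Disallow: /tmp/page1.html
Disallow: /tmp/page2.html
```

Note that a compliant crawler obeys only the most specific group that matches its user-agent, so in a combined file like this BadBot would follow its own group rather than the catch-all `*` group.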
The initial version of REP, from 1994 and updated in 1997, defines crawler directives and is reflected in the structure of robots.txt. Some major search engines also honor extensions such as URI patterns (also known as wildcards, a URL-like notation used to match URLs). An add-on dating back to 1996 defines indexer directives (REP tags) applied in the robots meta element; it is called the "robots meta tag." Search engines also honor REP tags supplied in an X-Robots-Tag HTTP header, which lets SEO specialists apply them to files that are not based on HTML, such as PDFs or images. The rel-nofollow attribute defines how search engines should treat links whose A element's REL attribute contains the value "nofollow."
REP tags such as noindex, nofollow, and unavailable_after give indexers specific instructions; others (nosnippet, noarchive, noodp) are applied by engines at query time. Crawler directives work differently, and each search engine handles REP tags in its own way. For example, Google may remove URL-only listings and ODP references from its results pages when content carries "noindex." Not all engines are as strict: Bing, for instance, sometimes still lists such blocked URLs as external references on its results pages. Since REP tags can be supplied both in META elements of X/HTML content and in the HTTP headers of web files, it was agreed that directives in X-Robots-Tag headers should override conflicting directives in META elements.
Indexer directives embedded as microformats override the page-level settings for particular HTML elements. For instance, if a page's X-Robots-Tag carries a "follow" directive (and no "nofollow" directive), the rel-nofollow attribute of an individual A element (link) still takes precedence for that link.
The robots.txt file does not support indexer directives. However, it is still possible to set indexer directives for classes of URIs with server-side scripts acting at site level, which add X-Robots-Tag headers to the requested objects. This approach requires basic web development skills and an understanding of web servers and the HTTP protocol.
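As a sketch of this approach, on an Apache server with mod_headers enabled, a configuration fragment like the following (the file extensions are illustrative) attaches REP tags to non-HTML files at site level:

```apache
# Send an X-Robots-Tag header with every matching PDF or image response,
# keeping those files out of the index without touching their content
<FilesMatch "\.(pdf|png|jpe?g)$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```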
Dealing with patterns
Google and Bing both honor two wildcard characters that can be used to identify the parts of a website an SEO specialist wants to exclude from crawling: the asterisk (*) and the dollar sign ($).
- * – a wildcard that matches any sequence of characters
- $ – matches the end of the URL
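To make the matching behavior concrete, here is a small Python sketch (not any search engine's actual implementation) that translates a robots.txt pattern containing * and $ into a regular expression:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/private/*.pdf$")
print(bool(rule.match("/private/reports/q1.pdf")))      # True: ends in .pdf
print(bool(rule.match("/private/reports/q1.pdf?x=1")))  # False: '$' anchors the end
```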
A robots.txt file is open to the public: anybody can find out what content the webmaster has hidden from Google. This means that if an SEO specialist has strictly internal data that must be kept from public view, they have to use a more trustworthy method, such as password protection, to keep users away from confidential pages.
- It is worth mentioning that crawlers with bad intentions will likely ignore robots.txt instructions, so this protocol cannot be considered a security mechanism.
- Only one "Disallow:" line is allowed for each URL.
- Each subdomain uses its own robots.txt file.
- The robots.txt filename is case sensitive: use "robots.txt", not "Robots.TXT".
- Do not use spacing to separate path parameters: a rule such as "/category/ /product-page" will not be understood by robots.txt parsers.
SEO best practices
There are several methods to prevent search engines from accessing a certain URL:
Using Robots.txt

This instructs robots to ignore the given URL, but they are still allowed to keep the page in the index and show it in the SERP (screenshot of a Google SERP below).
Using Meta NoIndex
This informs engines that they may access the page but must not show the URL in their results. This is the recommended technique.
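For example, a page can carry the directive in its head element (a minimal illustration; the title and body are placeholders):

```html
<!doctype html>
<html>
<head>
  <!-- Tells compliant engines not to show this page in results,
       while still allowing them to follow its links -->
  <meta name="robots" content="noindex, follow">
  <title>Internal page</title>
</head>
<body>...</body>
</html>
```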
Using Nofollowing Links
Most specialists consider this a bad method. Choosing this technique does not guarantee that engines will not reach the pages by other routes: through browser toolbars, links from other pages, analytics, and more.
Why you should consider Meta Robots instead of Robots.txt
Take a look at about.com's robots.txt. Notice that they disallow the directory /library/nosearch/.
Now look at what happens in Google when you search for that URL.
Google returns several thousand pages from that "disallowed" directory. Because Googlebot has not crawled these pages, each appears as a bare URL instead of a traditional search listing.
This becomes an issue when these pages accumulate links. They can become very valuable from a ranking point of view (measures of how popular and trustworthy they are), but they cannot pass those benefits on to any other pages, because the links on them never get crawled.
Invisible links (for Google)
If you want to remove certain pages from a search engine's index, the noindex meta tag <meta name="robots" content="noindex"> overrides robots.txt.
- 10 ways to use robots.txt for purposes other than its original one
- Robots Exclusion Protocol
The official source of information about the Robots Exclusion Protocol
- W3 and Robots Exclusion Protocol
W3’s official documentation on the Robots Exclusion Protocol.
- Robots.txt on WordPress