Whenever we talk about SEO for blogs, the WordPress robots.txt file plays a major role in search engine ranking. It advises search engines how to crawl your website. Placing a robots.txt file in the root of your domain lets you ask search engines not to crawl sensitive files and directories. For example, you could stop a search engine from crawling your images folder or a PDF file that is located in a private folder.
In this article, I will show you how to optimize your WordPress robots.txt for SEO and help you understand the importance of the robots.txt file.
What is WordPress Robots.txt?
The robots.txt file tells search engine robots which parts of your site to crawl and which parts to avoid. When a search engine's bot or spider comes to your website to index it, it reads the robots.txt file first and follows its directions about which pages to crawl and which to skip.
Do I Really Need a Robots.txt File?
The absence of a robots.txt file will not stop search engines from crawling and indexing your website. However, it is highly recommended that you create one. If you want search engines to find your site’s XML sitemap, this is where they will look for it, unless you have already specified it in Google Webmaster Tools.
I highly recommend that if you do not have a robots.txt file on your site, then you immediately create one.
How to Create a Robots.txt File?
If you are using WordPress, you will find the robots.txt file in the root of your WordPress installation. For static websites, if you or your developer created one, you will find it in your root folder.
If you do not have a robots.txt file in your site’s root directory, then you can always create one. A robots.txt file can be created in seconds. All you need to do is create a new text file on your computer and save it as robots.txt. Next, simply upload it to your site’s root folder using an FTP client. I recommend file permissions of 644 for the file.
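As a rough sketch of those steps in Python (the article itself uses a text editor and an FTP client), the snippet below writes a minimal robots.txt and sets its permissions to 644. The function name `write_robots_txt`, the placeholder rules, and the temporary directory standing in for your site root are all assumptions for illustration; uploading the file is still left to your FTP client.

```python
import os
import stat
import tempfile
from pathlib import Path

# Placeholder rules (an empty Disallow blocks nothing); replace with your own.
DEFAULT_RULES = "User-agent: *\nDisallow:\n"


def write_robots_txt(site_root: str, rules: str = DEFAULT_RULES) -> Path:
    """Write robots.txt into site_root and set its permissions to 644."""
    path = Path(site_root) / "robots.txt"
    path.write_text(rules, encoding="utf-8")
    # 0o644: the owner can read and write, everyone else can only read.
    os.chmod(path, 0o644)
    return path


# A temporary directory stands in for a local copy of the site root.
with tempfile.TemporaryDirectory() as site_root:
    robots = write_robots_txt(site_root)
    print(robots.name, oct(stat.S_IMODE(robots.stat().st_mode)))
```

The permission bits behave as described on Unix-like systems; on Windows, `os.chmod` only toggles the read-only flag.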
You should also check out the WordPress plugins WP Robots Txt or WordPress SEO by Yoast, which allow you to modify the robots.txt file directly through the WordPress admin area. This saves you from having to re-upload your robots.txt file via FTP every time you modify it.
How to Use Robots.txt File?
It does not take long to get a full understanding of the robots exclusion standard, as there are only a few rules to learn. These rules are usually referred to as directives.
The two main directives of the standard are:
- User-agent – Defines the search engine that a rule applies to.
- Disallow – Advises a search engine not to crawl and index a file, page, or directory.
An asterisk (*) can be used as a wildcard with User-agent to refer to all search engines. See a sample robots.txt file:
User-Agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
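To see how a crawler might interpret these sample rules, here is a minimal sketch using Python's standard `urllib.robotparser` module. The test paths and the helper name `build_parser` are made up for illustration, and real search engines may resolve overlapping rules differently from this parser, which simply applies the first rule whose path prefix matches.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Hypothetical paths, chosen to hit each rule plus one unlisted path.
for path in (
    "/wp-content/uploads/photo.jpg",
    "/wp-content/plugins/some-plugin/file.php",
    "/readme.html",
    "/about/",
):
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")
```

Paths that match no rule, such as `/about/` here, are treated as allowed.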
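User-agent and Disallow are supported by all crawlers, though a few more directives are available. These have traditionally been called non-standard because not every crawler supports them. In practice, support varies: Allow and Sitemap are recognized by most major search engines, while Crawl-delay and Host are honored by only some of them (Google, for example, ignores Crawl-delay).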
- Allow – Advises a search engine that it can index a file or directory.
- Sitemap – Defines the location of your website sitemap.
- Crawl-delay – Defines the number of seconds between requests to your server.
- Host – Advises the search engine of your preferred domain if you are using mirrors.
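As an illustration of how a well-behaved crawler could read one of these extra directives, the sketch below uses `urllib.robotparser` from Python's standard library to look up a Crawl-delay value. The 10-second delay and the bot name `ExampleBot` are assumptions made up for the example. This parser has no equivalent method for Host, so that directive is not shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every bot to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /wp-content/plugins/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds for the given user agent,
# or None when no Crawl-delay applies to it.
delay = parser.crawl_delay("ExampleBot")
print(f"Seconds to wait between requests: {delay}")
```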
You could add the following rules to your website robots.txt file to block search engines from crawling your whole website.
User-agent: *
Disallow: /
It’s useful if you are developing a new website and do not want search engines to index your incomplete website.
The following robots.txt file for WordPress tells all bots that they may index our image upload directory.
User-Agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /readme.html
The two Disallow lines then tell them not to index our WordPress plugins directory or the readme.html file.
Defining your sitemap will help search engines locate your sitemaps quicker. This, in turn, helps them locate your website content and index it. You can use the Sitemap directive to define multiple sitemaps in your robots.txt file.
A Sitemap directive can be placed anywhere in your robots.txt file. Generally, website owners list their sitemaps at the beginning or near the end of the file.
Sitemap: http://www.yourwebsite.com/sitemap-index.xml
Sitemap: http://www.yourwebsite.com/category-sitemap.xml
Sitemap: http://www.yourwebsite.com/page-sitemap.xml
Sitemap: http://www.yourwebsite.com/post-sitemap.xml
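To check that your Sitemap lines are picked up no matter where they sit in the file, a short sketch with Python's `urllib.robotparser` can list them. It assumes Python 3.8 or later, where the `site_maps()` method is available; the rules around the Sitemap lines are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file: one rule group followed by two Sitemap lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/plugins/

Sitemap: http://www.yourwebsite.com/sitemap-index.xml
Sitemap: http://www.yourwebsite.com/post-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print(url)
```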
Optimizing Your Robots.txt File for SEO
In its guidelines for webmasters, Google advises against using the robots.txt file to hide low-quality content. If you were thinking about using robots.txt to stop Google from indexing your category, date, and other archive pages, that may not be a wise choice.
Remember, the purpose of robots.txt is to advise bots which parts of your site to crawl. It is a request, not an enforcement mechanism, so it cannot actually stop a bot from crawling your website.
It is recommended that you disallow the readme.html file in your robots.txt file. This readme file can be used by someone who is trying to figure out which version of WordPress you are using. An individual can still access the file simply by browsing to it.
On the other hand, if someone is running a malicious query to locate WordPress sites using a specific version, then this disallow rule may help protect you from those mass attacks.
You can also disallow your WordPress plugin directory. This will strengthen your website’s protection if someone is looking for a specific vulnerable plugin to exploit for a mass attack.
The Maximum Size of a Robots.txt File
According to an article on AskApache, you should never use more than 200 disallow lines in your robots.txt file. In 2006, some members of Webmaster World reported seeing a message from Google that the robots.txt file should be no more than 5,000 characters.
This works out to around 200 lines if we assume an average of 25 characters per line (5,000 / 25 = 200), which is probably where AskApache got the figure of 200 disallow lines.
A few years later, Google’s John Mueller clarified the issue:
If you have a giant robots.txt file, remember that Googlebot will only read the first 500kB. If your robots.txt is longer, it can result in a line being truncated in an unwanted way. The simple solution is to limit your robots.txt files to a reasonable size.
Be sure to check the size of your WordPress robots.txt file if it has more than a couple of hundred lines of text. If the file is larger than 500 KB, you will have to reduce its size, or you may end up with an incomplete rule being applied.
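A quick way to run this check is to measure the file in bytes. The sketch below assumes the limit is 500 × 1024 bytes, which is one reading of the "500kB" figure quoted above; the constant and function names are made up for illustration.

```python
# Assumption: the quoted "500kB" limit is treated as 500 * 1024 bytes.
GOOGLEBOT_LIMIT_BYTES = 500 * 1024


def bytes_over_limit(content: bytes, limit: int = GOOGLEBOT_LIMIT_BYTES) -> int:
    """Return how many bytes the content exceeds the limit by (0 if it fits)."""
    return max(0, len(content) - limit)


# Build an example body: 200 rules of roughly 25 characters each.
small_file = "\n".join(f"Disallow: /private/{i:04d}/" for i in range(200))
over = bytes_over_limit(small_file.encode("utf-8"))
print(f"{len(small_file)} bytes, {over} bytes over the limit")
```

To check a real file, read it in binary mode and pass the bytes to `bytes_over_limit`.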
The WordPress robots.txt file is a powerful tool for advising search engines what to crawl and what not to crawl. It does not take long to understand the basics of creating a robots.txt file. However, if you need to block a series of URLs using wildcards, it can get a little confusing, so be sure to use a robots.txt analyzer to check that the rules are set up the way you want them.
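If you want a feel for how wildcard rules behave before reaching for an analyzer, the sketch below translates a rule path into a regular expression. It assumes the common wildcard semantics in which `*` matches any run of characters and a trailing `$` anchors the end of the URL; the function name and sample rules are hypothetical, and this is a simplification rather than any search engine's actual matcher.

```python
import re


def rule_matches(rule_path: str, url_path: str) -> bool:
    """Check whether a robots.txt rule path matches a URL path."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape regex metacharacters, then turn each escaped "*" into ".*".
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    # Rules always match from the start of the path.
    return re.match(pattern, url_path) is not None


# Hypothetical rules and paths for illustration.
print(rule_matches("/*.pdf$", "/docs/report.pdf"))
print(rule_matches("/*.pdf$", "/docs/report.pdf?download=1"))
print(rule_matches("/*?replytocom=", "/my-post/?replytocom=5"))
```

Without the trailing `$`, a rule behaves as a prefix match, so `/wp-content/plugins/` also covers everything beneath that directory.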
Also remember to upload robots.txt to the root of your domain, and be sure to adjust the rules in your own robots.txt file accordingly if WordPress has been installed in a subdirectory. For example, if you installed WordPress at www.yourwebsite.com/blog/, you would disallow the path /blog/wp-admin/ instead of /wp-admin/.
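The subdirectory adjustment can be sketched as a small helper that prefixes every Allow and Disallow path. The function name `prefix_rules` and the sample rules are assumptions for illustration; review the result by hand before uploading it.

```python
def prefix_rules(robots_txt: str, subdirectory: str) -> str:
    """Prefix Allow/Disallow paths with the WordPress install subdirectory."""
    cleaned = subdirectory.strip("/")
    if not cleaned:
        return robots_txt  # WordPress lives at the domain root; nothing to do.
    prefix = "/" + cleaned
    adjusted = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        path = value.strip()
        is_rule = field.strip().lower() in ("allow", "disallow")
        if sep and is_rule and path.startswith("/"):
            adjusted.append(f"{field.strip()}: {prefix}{path}")
        else:
            # User-agent, Sitemap, blank, and empty-rule lines stay untouched.
            adjusted.append(line)
    return "\n".join(adjusted)


rules = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /readme.html"
print(prefix_rules(rules, "blog"))
```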