Writing a basic robots.txt file can be extremely simple, but it can also lead to major headaches if you don’t know what you’re doing.
One of the most common mistakes with robots.txt files is making assumptions about how search engine crawlers actually interact with your website. This article walks you through six common robots.txt issues and how to fix them, so your site can be properly crawled and indexed, improving your site optimization and user experience at the same time.
The file isn’t found
Robots.txt is a simple text file that tells search engines like Google and Bing which parts of your website they may and may not crawl (read). Crawlers only look for it in one place: the root of your host (for example, https://example.com/robots.txt). If you put it in a subdirectory, it simply won’t be found. Also be careful which directives you set: blocking resources your pages need to render, such as JavaScript and style sheets, can prevent Google from understanding those pages, so avoid disallowing them unless you have a specific reason.
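As a minimal sketch (example.com and the paths here are placeholders, not recommendations), a valid robots.txt served from the root of the host might look like this:

```
# Served at https://example.com/robots.txt
User-agent: *          # rules below apply to all crawlers
Disallow: /admin/      # keep crawlers out of the admin area

Sitemap: https://example.com/sitemap.xml
```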
01. Getting blocked by Google
It’s common for website owners to use robots.txt to keep Google out of certain pages on their site, but misconfigured rules can block far more than you intended. The two biggest mistakes people make are blocking whole directories when they only meant to block specific URLs, and forgetting that Disallow rules match URL prefixes: Disallow: /shop blocks not just the /shop/ directory but every URL that starts with /shop. If you mean the directory, write /shop/ with the trailing slash. Avoid these mistakes and you gain much more precise control over your site with robots.txt.
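A short illustration, with hypothetical paths:

```
User-agent: *
# Too broad: Disallow rules are prefix matches, so this would
# also block /shopping-guide and /shop-news:
# Disallow: /shop
#
# Intended: block only the /shop/ directory itself:
Disallow: /shop/
```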
02. Getting blocked by Bing
Almost everyone is familiar with Bing (they even have a funny commercial about how search is easier for people named Bing), but it’s easy to forget about Bingbot when writing crawl rules. If you block Bingbot in your robots.txt file, your pages can drop out of Bing’s index entirely, and with it out of search engines that draw on Bing’s results, such as Yahoo. That costs you rankings and traffic in all of them, so double-check any user-agent-specific rules before shutting out a major crawler.
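The classic accidental lock-out looks like this (a sketch with a placeholder path):

```
# This group blocks Bingbot from the entire site:
User-agent: Bingbot
Disallow: /

# If you only meant to keep Bingbot out of one area,
# scope the rule instead:
# User-agent: Bingbot
# Disallow: /search-results/
```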
03. Noindex in robots.txt doesn’t work
If a page is listed in your XML sitemap but blocked by robots.txt, you’re sending Google mixed signals: the sitemap says “index this” while robots.txt says “don’t crawl this.” Worse, blocking crawling doesn’t block indexing; a page disallowed in robots.txt can still appear in search results (usually with just its URL and no snippet) if other pages link to it. And note that Google stopped supporting a noindex directive inside robots.txt in 2019. If you want a page out of the index, use a meta robots noindex tag or an X-Robots-Tag HTTP header, and leave the page crawlable so Google, Bing, and other search engines can actually see that instruction.
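If you do need a page kept out of the index, a minimal sketch of the supported mechanism looks like this:

```
<meta name="robots" content="noindex">
```

That tag goes in the page’s <head>. For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header; in both cases the URL must not be disallowed in robots.txt, or crawlers will never see the instruction.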
04. Nofollow in robots.txt doesn’t work either
There is no nofollow directive in robots.txt, and XML sitemaps don’t carry one either: nofollow is a link-level attribute (rel="nofollow") or a meta robots value that belongs in the page itself, so putting it in robots.txt does nothing.

Two related points are worth keeping in mind. First, robots.txt rules are applied per user agent, and not every crawler identifies itself honestly or respects the file at all; it’s a voluntary standard, not access control, so if you need to keep bad bots out, do it at the server level. Second, robots.txt is per host: each subdomain (for example, blog.example.com versus www.example.com) needs its own robots.txt file at its own root. If you want certain content cleanly excluded, hosting it on its own subdomain with its own robots.txt makes it easy to update those rules later without affecting anything else on your domain.
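If you do need different rules for different crawlers, groups are keyed by user-agent token. A sketch (Googlebot is a real crawler token; the paths are placeholders):

```
# Rules for Google's main crawler
User-agent: Googlebot
Disallow: /drafts/

# Rules for every other crawler
User-agent: *
Disallow: /drafts/
Disallow: /staging/
```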
05. Keyword disallowed by robots.txt
Remember that Disallow rules match URL paths, not page content, and major crawlers like Googlebot and Bingbot support * as a wildcard. That means a pattern built around a keyword can match far more URLs than you expect: a rule like Disallow: /*sale can block every URL whose path contains “sale” anywhere, including pages you very much want to rank. Not many searchers go beyond page 1 if they can help it, so if pages you care about have vanished from results, check whether a keyword-style pattern in robots.txt is swallowing them, tighten the rule to the exact paths you mean, and wait for the pages to be recrawled.
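For example (the keyword and paths here are made up):

```
User-agent: *
# Too aggressive: * is a wildcard, so this blocks ANY URL whose
# path contains "sale" (e.g. /wholesale/, /blog/summer-sale-recap):
# Disallow: /*sale

# Tighter: block only the specific section you intended
Disallow: /internal-sale/
```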
06. What the heck is User-agent: *?
Robots.txt isn’t something every web server (which is what hosts your website) automatically has; it’s a plain text file you create yourself to tell search engines how to crawl your site. It isn’t for blocking bots, per se; think of it as hints about which pages you want indexed and which ones you don’t, which well-behaved crawlers follow voluntarily. The User-agent line names the crawler a group of rules applies to, and User-agent: * means the rules apply to every crawler that doesn’t have a more specific group of its own. Note that user agents identify crawlers, not the software running your site: adding User-agent: WordPress does nothing. If you want to keep bots out of WordPress’s admin area, disallow its directory instead.
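A minimal sketch for a WordPress site, a common community pattern rather than anything official:

```
User-agent: *                     # applies to every crawler
Disallow: /wp-admin/              # keep bots out of the admin area
Allow: /wp-admin/admin-ajax.php   # front-end features often depend on this
```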
Conclusion and resources
A robots.txt file is a powerful tool for controlling how search engines access your website’s content, but it’s easy to make mistakes that keep Google from crawling your site as effectively as possible (or, in some cases, let Google crawl parts of your site you never intended it to). To recap, here are the six common issues covered above and tips on how you can fix them:
1. File doesn’t exist or isn’t being used: the first thing to do when checking your robots.txt file is to verify that it actually exists at your site’s root and that it actually affects how your website appears in search results.
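A quick way to verify the file exists and is being served (example.com is a placeholder):

```
curl -I https://example.com/robots.txt    # expect HTTP 200, not 404
```

Tools like Google Search Console can also show you how Googlebot actually fetches and interprets the file.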