Google Announces End of Support for Robots.txt Noindex Directives
Google announced early this month that they will be discontinuing their unofficial support of noindex directives within robots.txt files. As of 1 September 2019, publishers who are currently relying on robots.txt crawl-delay, nofollow, and noindex directives will need to find another way to instruct search engine robots how to crawl their sites’ pages.
This policy update means many web publishers will need to act quickly to preserve their approach to search engine optimisation. Are you one of them? Continue reading to learn more about Google's major policy update and what it will mean moving forward.
What are robots.txt files?
Webmasters create robots.txt text files to tell user agents (web crawlers) how to crawl the pages on their websites. These text files allow or disallow specified behaviour by web robots such as search engine crawlers.
Robots.txt files can be as short as two lines. The first line names a user agent to which the directive applies; the second gives that user agent a specific instruction, such as allow or disallow. Web crawlers disregard robots.txt directives that are not addressed to them, and some ignore even those that are. For directives such as noindex, Google will soon count themselves among the latter group.
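For illustration, a minimal two-line robots.txt might look like this (the directory name here is a hypothetical example):

```
User-agent: Googlebot
Disallow: /private/
```

This file asks Google's crawler not to crawl anything under /private/; crawlers other than Googlebot are unaffected by this rule.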
Google has long discouraged publishers from using crawl-delay, nofollow, and noindex directives within robots.txt files, but has followed most of these directives despite having no standardised policy toward them.
Why is Google ditching robots.txt noindex now?
Google has spent years trying to standardise the Robots Exclusion Protocol, and this change is part of that effort. It is also why Google has long encouraged publishers to find alternatives to unsupported robots.txt directives.
In their announcement, Google said they are working to make the Robots Exclusion Protocol (REP) an internet standard. To that end, they have open-sourced the C++ library they use to parse and match rules in robots.txt files. The 20-plus-year-old library, along with a testing tool offered by Google, can help developers create the parsing tools of the future.
How can you keep controlling crawling on your site?
Google disregarding robots.txt noindex does not leave publishers without means to control how their sites are crawled and indexed.
Use robots meta tags to noindex
The noindex directive is supported in robots meta tags within a page's HTML, as well as in the X-Robots-Tag HTTP response header.
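For example, a page can request exclusion from the index with a meta tag in its HTML head:

```html
<!-- Ask all crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the server can send the equivalent HTTP response header instead:

```
X-Robots-Tag: noindex
```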
Use HTTP status codes 404 and 410
These status codes tell crawlers that a page no longer exists; once the page has been crawled and processed, it is dropped from Google's index.
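As an illustrative sketch (the paths and handler below are hypothetical examples, not part of Google's announcement), a server can answer 410 Gone for pages that have been permanently removed:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical set of paths that have been permanently removed.
REMOVED_PATHS = {"/old-page", "/discontinued-product"}

def status_for(path):
    """Return 410 (Gone) for removed paths, 200 otherwise."""
    return 410 if path in REMOVED_PATHS else 200

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Send 410 for removed pages so crawlers drop them from the index.
        code = status_for(self.path)
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Gone" if code == 410 else b"OK")

# To try it locally:
# HTTPServer(("", 8000), Handler).serve_forever()
```

A 410 signals a deliberate, permanent removal, whereas a 404 only says the page was not found; both will eventually drop the page from the index.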
Hide content behind password protections
Unless you have marked it up as subscription or paywall content, placing content behind a login page will generally remove it from Google's index.
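Google's structured-data guidelines let you mark up paywalled content so that it is not mistaken for cloaking. A sketch of the JSON-LD, where the .paywall selector is a hypothetical class name for the gated section, might look like:

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "isAccessibleForFree": "False",
  "hasPart": {
    "@type": "WebPageElement",
    "isAccessibleForFree": "False",
    "cssSelector": ".paywall"
  }
}
```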
Disallow in robots.txt
If search engines are disallowed from crawling a page, that content cannot be indexed, though the bare URL may still appear in results if other pages link to it.
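Blocking a whole section from every crawler takes only a broader rule, again with a hypothetical directory name:

```
User-agent: *
Disallow: /members-only/
```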
If your sites have relied on robots.txt noindex directives to avoid search engine indexing, you have until 1 September to make the necessary changes. If you're not sure whether your site uses noindex directives, it would be wise to double check. Having pages indexed that you want concealed can cost your website in search rankings.
For help auditing your site’s indexing and search performance, continue reading the Pure SEO blog for more information, or contact us today for the SEO support you need.
Rollan Schott is a copywriter with Pure SEO. Rollan was born and raised in the United States, having moved to New Zealand after 4 years teaching and writing in Asia. When he's not churning out quality content at breakneck speed, Rollan is probably busy writing the next great American novel. He may also be idly watching true crime documentaries in his Auckland Central apartment with his wife, Lauren. The latter is more likely than the former.