A solid SEO strategy relies on search engines crawling your website and indexing its pages so that they can appear on results pages. If that sentence may as well have been written in code, let us explain. Below, we’ll explore what crawlability and indexing mean, outline some common issues websites encounter, and provide simple fixes to ensure your website gets a foot in the search engine door.
Crawlability & Indexing in SEO might sound scary, but they don’t have to be. Our experts are on the case to get your SEO up to scratch.
What is Crawlability?
Search engines are essentially the know-it-alls of the internet. When users enter questions into the search bar, search engines sift through their indexes of billions of web pages to find what is most relevant to those queries. But how do web pages make their way into these indexes?
Web pages are discovered through a process known as crawling. Search engine bots (sometimes called spiders or crawlers) follow links from site to site, 24/7, viewing pages, reading their content and code, assessing their quality and intent, and adding them to their indexes.
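To make the idea of following links more concrete, here is a minimal sketch of a crawl over a tiny site held in memory. The SITE dictionary, its paths and the crawl function are all hypothetical, invented for illustration; real search engine bots are far more sophisticated than this.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical site: each path maps to that page's HTML.
SITE = {
    "/": '<a href="/blog/">Blog</a> <a href="/contact/">Contact</a>',
    "/blog/": '<a href="/blog/post-1/">Post 1</a> <a href="/">Home</a>',
    "/blog/post-1/": '<a href="/blog/">Back to blog</a>',
    "/contact/": "<p>Get in touch</p>",
    "/hidden/": "<p>No other page links here</p>",
}


def crawl(start, site):
    """Breadth-first crawl: visit a page, queue every new link found on it."""
    discovered = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in extract_links(site.get(page, "")):
            if link in site and link not in discovered:
                discovered.append(link)
                queue.append(link)
    return discovered


print(crawl("/", SITE))
# ['/', '/blog/', '/contact/', '/blog/post-1/']
```

Notice that /hidden/ never shows up in the result: no page links to it, so a crawl that starts at the homepage has no way of finding it.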
Why a Page Might Not Be Indexed
Search engine bots may not have indexed a web page (yet) for several reasons. These include:
Noindex meta tags
Noindex directives are applied at the page level and instruct search engines not to index the page. These tags are often applied to pages like login screens, internal search results, and ‘thank you’ pages. Learn more with our Ultimate Guide to Noindex Directives.
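If you want to check whether a page carries a noindex directive, a short script can read its robots meta tag. The sketch below uses only Python's standard library; the THANK_YOU_PAGE markup and the has_noindex helper are made up for illustration, and the check only covers meta tags, not directives sent in HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page's robots meta tag includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical 'thank you' page, used only as an example.
THANK_YOU_PAGE = """
<html><head>
<meta name="robots" content="noindex, follow">
<title>Thanks for your order</title>
</head><body>Thank you!</body></html>
"""

print(has_noindex(THANK_YOU_PAGE))  # True
```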
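Robots.txt files
Robots.txt files help to manage a website’s crawling traffic by designating which pages bots should or should not crawl. Blocking a page in robots.txt stops bots from reading it, which usually keeps it out of the index, but a blocked URL can still be indexed if other pages link to it, so use a noindex directive when a page must stay out of search results. Learn more in our Ultimate Guide to Robots.txt Files.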
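As a rough illustration, Python's standard library includes a robots.txt parser that can tell you whether a given URL is open to crawling. The rules and URLs below are hypothetical examples, not recommendations for your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep bots out of internal search and account pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /account/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

print(rules.can_fetch("*", "https://websiteurl.com/search/?q=shoes"))  # False
print(rules.can_fetch("*", "https://websiteurl.com/blog/seo-basics"))  # True
```

Running a check like this over your most important URLs is a quick way to spot pages that have been blocked by mistake.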
Duplicate content
Search engines may not index a page containing duplicate content because they have determined that a different page is the source of that content. Search engines can determine the source page in a few different ways:
Canonical tags – Canonical tags are used on pages to tell search engines whether this page or another page is the original source of the content. These can be user-declared (the site owner or web dev manually adds canonical tags to pages with duplicate content) or search-engine-declared (the search engine determines on its own which page is the original).
Regional content – A domain may contain the same content on multiple pages spread across a collection of regional sites. The relationship between these regional versions is declared with hreflang annotations. Search engines may treat regional pages as duplicates and leave them unindexed if they cannot read or validate those annotations.
Learn more about this issue with our Ultimate Guide to Duplicate Content.
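To see which page a user-declared canonical tag points to, you can read the link element from the page's HTML. This is a minimal sketch using Python's standard library; the PRODUCT_PAGE markup, the URLs and both helper functions are hypothetical, and the comparison ignores only a trailing slash.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if "canonical" in attr_map.get("rel", "").lower().split():
            self.canonical = attr_map.get("href") or None


def canonical_url(html):
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def points_to_another_page(page_url, html):
    """True if the page declares a canonical URL different from its own."""
    target = canonical_url(html)
    return target is not None and target.rstrip("/") != page_url.rstrip("/")


# Hypothetical product page whose colour variants share one canonical URL.
PRODUCT_PAGE = """
<html><head>
<link rel="canonical" href="https://websiteurl.com/shoes/">
</head><body>Red shoes</body></html>
"""

print(points_to_another_page("https://websiteurl.com/shoes/?colour=red", PRODUCT_PAGE))  # True
print(points_to_another_page("https://websiteurl.com/shoes/", PRODUCT_PAGE))  # False
```

A page that points to another URL is telling search engines to index that other URL instead, which is exactly what you want for variants and exactly what you don't want for a page that should rank in its own right.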
Not yet crawled
Newer pages—published in the last several days or even weeks—may not be indexed simply because search bots haven’t crawled them yet. If bots can reach the page (if a link to the page exists from another page or a sitemap), they will eventually crawl and index it, usually within a month.
Orphan pages
Search engine bots rely on links to make their way across the world wide web. An orphan page has no inbound links from other pages, so unless it is listed in a sitemap, bots have no path to reach it and it stays out of the index.
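If you have a list of your pages and the internal links each one contains (for example, from a site audit export), finding orphan pages is a simple set comparison. The page list, link map and find_orphans function below are hypothetical examples.

```python
def find_orphans(all_pages, links):
    """Return pages that no other page links to.

    all_pages: every URL you expect to be on the site.
    links: mapping of page -> list of pages it links to.
    """
    linked_to = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself doesn't count
    }
    return sorted(set(all_pages) - linked_to)


# Hypothetical site data.
ALL_PAGES = ["/", "/blog/", "/blog/post-1/", "/old-landing-page/"]
LINKS = {
    "/": ["/blog/"],
    "/blog/": ["/", "/blog/post-1/"],
    "/blog/post-1/": ["/blog/"],
}

print(find_orphans(ALL_PAGES, LINKS))  # ['/old-landing-page/']
```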
Crawl Errors and Fixes
Search engines constantly crawl through links and content, seeking public pages to serve searchers seeking answers. If errors occur as bots attempt to access web pages, they can hinder the site’s ability to be indexed or found, keeping rankable content off the search engine results page (SERP).
Even if the content is optimised with an SEO strategy, crawlability issues can still occur.
One of the most common issues in both crawlability and indexing is the pesky 404 error. A 404, or ‘Page Not Found’, error means the server could not locate the requested web page. Neither users nor bots can reach the content, which eventually leads to a decline in user experience, traffic and ranking.
There could be a multitude of reasons why 404 errors are happening on your site. Here are a few of them, alongside some solutions.
Broken links are roads that lead nowhere, so ensuring all your links have destinations is key to optimising your crawlability. Soft 404 errors happen when a non-existent URL returns a response code other than 404 or 410, typically a 200 alongside a ‘not found’ message. Search engine bots then waste time crawling and indexing URLs that no longer exist rather than your live URLs. Make sure your non-existent URLs return standard 404s and let your live pages do the talking on the SERP.
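One rough way to spot soft 404s is to compare the status code a URL returns with what the page actually says. The sketch below is a simple heuristic, not how search engines detect them: the phrase list and the classify_response function are assumptions made for illustration, and you would supply the status code and body from your own crawl data.

```python
# Hypothetical phrases that suggest a page is really an error page.
NOT_FOUND_PHRASES = ("page not found", "no longer exists")


def classify_response(status_code, body):
    """Label a response as a hard 404, a likely soft 404, or ok."""
    if status_code in (404, 410):
        return "hard 404"  # the server correctly reports a missing page
    looks_missing = any(phrase in body.lower() for phrase in NOT_FOUND_PHRASES)
    if status_code == 200 and looks_missing:
        return "soft 404"  # says 'not found' but claims success
    return "ok"


print(classify_response(404, "<h1>Page not found</h1>"))  # hard 404
print(classify_response(200, "<h1>Page not found</h1>"))  # soft 404
print(classify_response(200, "<h1>Welcome to our shop</h1>"))  # ok
```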
Crawlability depends on your robots.txt file, as this tells bots which parts of your site you do and don’t want crawled. If a bot cannot reach your robots.txt file, for example because the server returns an error, it may postpone crawling, reduce your crawl budget or not index your site.
Make sure your site’s robots.txt is always available at the root of the domain (https://websiteurl.com/robots.txt). Ensure each domain and subdomain has its own robots.txt file if it contains sections you don’t want crawled, because the rules apply only to the host that serves the file. A reachable and up-to-date robots.txt file will enhance crawlability and indexing, so it’s a good investment of resources.
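Because robots.txt always lives at the root of a host, you can work out where it should be from any page URL. Here is a small sketch using Python's standard library; the robots_txt_url helper and the shop subdomain are hypothetical examples.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://websiteurl.com/blog/post?ref=home"))
# https://websiteurl.com/robots.txt
print(robots_txt_url("https://shop.websiteurl.com/cart"))
# https://shop.websiteurl.com/robots.txt
```

The second result shows why a subdomain needs its own file: bots look for it on that host, not on the main domain.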
Content Hidden Behind Login or Paywall Screens
While it’s tempting to lock your content behind a login or paywall, it may be stopping search engine bots from crawling your site. The more obstacles a bot encounters on its crawl, the more likely it is to turn back, lowering the number of pages seen and diminishing your crawl budget.
The best practice is to ensure at least some content on each page is freely accessible, so bots have something to crawl and index. Optimising that visible content remains key to ensuring rankings and visibility.
Indexing in SEO
Indexing occurs after crawling. Search engines store what their bots find in vast digital libraries called indexes. From these indexes, they rank pages by their relevance to search queries. Search engines must index your web pages before they have a chance to rank, and there are several ways to help this happen.
Ensuring that your website is indexed is easy if you know your way around its ins and outs. Here are a few tips to optimise the indexing of your site.
Avoid Under and Over Indexation
The issues you most want to avoid with indexation are:
When pages you don’t want to be indexed are indexed (over-indexation). Examples include:
Different URLs for the same product based on variations like colour, size, etc.
Dynamic URLs generated by on-site search
Dynamic URLs generated for wish lists and successful orders
When pages you do want to be indexed are not indexed (under-indexation). Examples include:
Canonicalizing product pages to category pages
Canonicalizing paginated pages to the parent page
Mistakenly blocking important pages through robots.txt or by applying the Noindex meta tag
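To get a feel for over-indexation caused by product variations, you can group URLs that differ only by variant parameters. This is a minimal sketch: the parameter names in VARIANT_PARAMS and the example URLs are assumptions, so swap in whatever your own site uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only change the product variant.
VARIANT_PARAMS = {"colour", "size"}


def strip_variant_params(url):
    """Remove variant-only parameters so variants share one base URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in VARIANT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_variants(urls):
    """Map each base URL to its variants, keeping only groups of two or more."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_variant_params(url)].append(url)
    return {base: members for base, members in groups.items() if len(members) > 1}


URLS = [
    "https://websiteurl.com/shoes?colour=red",
    "https://websiteurl.com/shoes?colour=blue&size=9",
    "https://websiteurl.com/hats",
]

print(group_variants(URLS))
```

Each group in the result is a set of URLs that would usually share a single canonical page.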
Use Internal Linking
Internal linking helps to reinforce the hierarchy of pages within your site and ensures search engine bots can reach all valuable pages and establish correlations between them. This is the best way to avoid orphan pages that search engines would overlook.
Strategic Site Mapping
Search engine bots favour efficiency, so making sure your site is easy to navigate will help both crawlability and indexing. Clear your site of any broken or outdated links and make sure your meta directives are unambiguous. Meta directives tell search engines whether and how to index each page, which keeps irrelevant pages out of results and supports the good user experience that builds your ranking.
Submit New Pages or Sitemaps to Search Engines Directly
You can wait for search engine bots to find and crawl your web pages; if they have inbound links, this will happen eventually. However, the quickest, surest way to have your site crawled and indexed is to submit it to search engines directly. Google Search Console, Bing Webmaster Tools, and other search engine hubs let you submit sitemaps and individual URLs, and also help you analyse your search engine performance.
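If your CMS doesn't produce a sitemap for you, a basic one is easy to generate. The sketch below builds a minimal XML sitemap in the standard sitemaps.org format; the build_sitemap helper and the URLs are hypothetical, and optional fields such as last-modified dates are left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages to include.
PAGES = [
    "https://websiteurl.com/",
    "https://websiteurl.com/blog/seo-basics",
]

print(build_sitemap(PAGES))
```

Save the output as sitemap.xml at the root of your site, then submit its URL through the search engine's webmaster tools.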