Thursday, August 11, 2022

The 4 stages of search all SEOs need to know

“What’s the difference between crawling, rendering, indexing and ranking?”

Lily Ray recently shared that she asks prospective employees this question when hiring for the Amsive Digital SEO team. Google’s Danny Sullivan thinks it’s an excellent one.

As foundational as it may seem, it isn’t uncommon for some practitioners to confuse the basic stages of search and conflate the process entirely.

In this article, we’ll get a refresher on how search engines work and go over each stage of the process.   

Why knowing the difference matters

I recently worked as an expert witness on a trademark infringement case where the opposing witness got the stages of search wrong.

Two small companies declared they each had the right to use similar brand names.

The opposing party’s “expert” erroneously concluded that my client conducted improper or hostile SEO to outrank the plaintiff’s website.

He also made several critical mistakes in describing Google’s processes in his expert report, where he asserted that:

  • Indexing was web crawling.
  • The search bots would instruct the search engine how to rank pages in search results. 
  • The search bots could also be “trained” to index pages for certain keywords.

An essential defense in litigation is to attempt to exclude a testifying expert’s findings – which can happen if one can demonstrate to the court that they lack the basic qualifications necessary to be taken seriously.

As their expert was clearly not qualified to testify on SEO matters whatsoever, I presented his erroneous descriptions of Google’s process as evidence supporting the contention that he lacked proper qualifications. 

This might sound harsh, but this unqualified expert made many elementary and apparent mistakes in presenting information to the court. He falsely presented my client as somehow conducting unfair trade practices via SEO, while ignoring questionable behavior on the part of the plaintiff (who was blatantly using black hat SEO, whereas my client was not).

The opposing expert in my legal case is not alone in this misapprehension of the stages of search used by the leading search engines. 

There are prominent search marketers who have likewise conflated the stages of search engine processes, leading to incorrect diagnoses of underperformance in the SERPs.

I have heard some state, “I think Google has penalized us, so we can’t be in search results!” – when in fact they had missed a key setting on their web servers that made their site content inaccessible to Google. 

An automated penalty, had there been one, would belong to the ranking stage. In reality, these websites had issues in the crawling and rendering stages that made indexing and ranking problematic.

When there are no notifications in the Google Search Console of a manual action, one should first focus on common issues in each of the four stages that determine how search works.

It’s not just semantics

Not everyone agreed with Ray and Sullivan’s emphasis on the importance of understanding the differences between crawling, rendering, indexing and ranking.

I noticed some practitioners consider such concerns to be mere semantics or unnecessary “gatekeeping” by elitist SEOs. 

To a degree, some SEO veterans may indeed have very loosely conflated the meanings of these terms. This can happen in all disciplines when those steeped in the knowledge are bandying jargon around with a shared understanding of what they are referring to. There is nothing inherently wrong with that. 

We also tend to anthropomorphize search engines and their processes because interpreting things by describing them as having familiar characteristics makes comprehension easier. There is nothing wrong with that either. 

But this imprecision when talking about technical processes can be confusing, and it makes learning more challenging for those trying to understand the discipline of SEO.

One can use the terms casually and imprecisely only to a degree or as shorthand in conversation. That said, it is always best to know and understand the precise definitions of the stages of search engine technology.

Many different processes are involved in bringing the web’s content into your search results. In some ways, it can be a gross oversimplification to say there are only a handful of discrete stages to make it happen. 

Each of the four stages I cover here has several subprocesses that can occur within them. 

Even beyond that, there are significant processes that can be asynchronous to these, such as:

  • Types of spam policing.
  • Incorporation of elements into the Knowledge Graph and updating of knowledge panels with the information.
  • Processing of optical character recognition in images.
  • Audio-to-text processing in audio and video files.
  • Assessing and application of PageSpeed data.
  • And more.

What follows are the primary stages of search required for getting webpages to appear in the search results. 

Crawling

Crawling occurs when a search engine requests webpages from websites’ servers.

Imagine that Google and Microsoft Bing are sitting at a computer, typing in or clicking on a link to a webpage in their browser window. 

Thus, the search engines’ machines visit webpages similar to how you do. Each time the search engine visits a webpage, it collects a copy of that page and notes all the links found on that page. After the search engine collects that webpage, it will visit the next link in its list of links yet to be visited.
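The visit-copy-queue loop described above can be sketched in a few lines. This is only a toy illustration, not how Googlebot or Bingbot is actually built: the `TOY_WEB` dictionary is a hypothetical stand-in for real HTTP requests, and the names `crawl` and `LinkExtractor` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the web: URL -> HTML.
# A real crawler would request each URL from the site's server instead.
TOY_WEB = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about">About</a> <a href="/blog/post-1">Post</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
}


def crawl(seed, fetch):
    """Visit pages breadth-first, keeping a copy of each and queuing new links."""
    frontier = deque([seed])  # links yet to be visited
    seen = {seed}             # links already queued, so none is visited twice
    copies = {}               # URL -> stored copy of the page
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be retrieved
            continue
        copies[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return copies


pages = crawl("https://example.com/", TOY_WEB.get)
print(list(pages))
```

Starting from the home page alone, the loop reaches all four pages because each one is linked from a page that was already visited.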

This is referred to as “crawling” or “spidering” which is apt since the web is metaphorically a giant, virtual web of interconnected links. 

The data-gathering programs used by search engines are called “spiders,” “bots” or “crawlers.” 

Google’s primary crawling program is “Googlebot,” while Microsoft Bing has “Bingbot.” Each has other specialized bots for visiting ads (e.g., AdsBot-Google and AdIdxBot), mobile pages and more.

This stage of the search engines’ processing of webpages seems straightforward, but there is a lot of complexity in what goes on, just in this stage alone. 

Think about how many web server systems there can be, running different operating systems of different versions, along with varying content management systems (e.g., WordPress, Wix, Squarespace), and then each website’s unique customizations.

Many issues can keep search engines’ crawlers from crawling pages, which is an excellent reason to study the details involved in this stage. 

First, the search engine must find a link to the page at some point before it can request the page and visit it. (Under certain configurations, the search engines have been known to suspect there could be other, undisclosed links, such as one step up in the link hierarchy at a subdirectory level or via some limited website internal search forms.) 

Search engines can discover webpages’ links through the following methods:

  • When a website operator submits the link directly or discloses a sitemap to the search engine.
  • When other websites link to the page. 
  • Through links to the page from within its own website, assuming the website already has some pages indexed. 
  • Social media posts.
  • Links found in documents.
  • URLs found in written text and not hyperlinked.
  • Via the metadata of various kinds of files.
  • And more.
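Of the discovery methods listed above, the sitemap is the easiest to show concretely. The sketch below reads the page URLs out of a small XML sitemap using Python's standard library. The sitemap content and the `sitemap_urls` helper are invented for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap a website operator might publish.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url>
    <loc>https://example.com/products</loc>
    <lastmod>2022-08-01</lastmod>
  </url>
</urlset>"""

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_urls(SITEMAP_XML))
```

Each URL returned this way would simply be added to the crawler's list of links yet to be visited.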

In some instances, a website will instruct the search engines not to crawl one or more webpages through its robots.txt file, which is located at the base level of the domain and web server. 

Robots.txt files can contain multiple directives within them, instructing search engines that the website disallows crawling of specific pages, subdirectories or the entire website. 

Instructing search engines not to crawl a page or section of a website does not mean that those pages cannot appear in search results. However, keeping them from being crawled in this way can severely impact their ability to rank well for their keywords.
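To make the robots.txt mechanics concrete, here is a small check using Python's standard `urllib.robotparser` module. The robots.txt content and the example URLs are made up, and real search engines apply their own parsers, which may treat edge cases differently from this library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might be served from https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A whole subdirectory is disallowed.
blocked_dir = parser.can_fetch("Googlebot", "https://example.com/private/report.html")
# Rules match by path prefix, so internal search result URLs are disallowed too.
blocked_search = parser.can_fetch("Googlebot", "https://example.com/search?q=shoes")
# Anything not matched by a Disallow rule remains crawlable.
allowed_blog = parser.can_fetch("Googlebot", "https://example.com/blog/")

print(blocked_dir, blocked_search, allowed_blog)
```

Note that this only answers "may the bot request this URL?" It says nothing about whether the URL can still show up in results, which is the distinction drawn above.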

In yet other cases, search engines can struggle to crawl a website if the site automatically blocks the bots. This can happen when the website’s systems have detected that:

  • The bot is requesting more pages within a time period than a human could.
  • The bot requests multiple pages simultaneously.
  • A bot’s server IP address is geolocated within a zone that the website has been configured to exclude. 
  • The bot’s requests and/or other users’ requests for pages overwhelm the server’s resources, causing the serving of pages to slow down or error out. 

However, search engine bots are programmed to automatically change delay rates between requests when they detect that the server is struggling to keep up with demand.
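The search engines do not publish the exact logic they use to throttle themselves, so the following is only a guess at the general shape of such a policy: back off when the server signals distress, and speed up again once it recovers. The function name, thresholds and doubling/halving rule are all assumptions made for this sketch.

```python
def next_delay(current_delay, status_code, response_seconds,
               min_delay=0.5, max_delay=60.0, slow_threshold=2.0):
    """Return the pause, in seconds, to use before the next request.

    Hypothetical policy: double the delay when the server looks overloaded
    (HTTP 429 or 503, or a slow response), otherwise halve it, always
    staying within [min_delay, max_delay].
    """
    overloaded = status_code in (429, 503) or response_seconds > slow_threshold
    if overloaded:
        return min(current_delay * 2, max_delay)
    return max(current_delay / 2, min_delay)


delay = 1.0
delay = next_delay(delay, 503, 0.2)   # server error: back off to 2.0
delay = next_delay(delay, 200, 3.5)   # slow response: back off to 4.0
delay = next_delay(delay, 200, 0.1)   # healthy again: ease back to 2.0
print(delay)
```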

For larger websites and websites with frequently changing content on their pages, “crawl budget” can become a factor in whether search bots will get around to crawling all of the pages. 

Essentially, the web is something of an infinite space of webpages with varying update frequency. The search engines might not get around to visiting every single page out there, so they prioritize the pages they will crawl. 

Websites that have huge numbers of pages, or that respond slowly, might use up their available crawl budget before all of their pages are crawled, particularly if they have relatively lower ranking weight compared with other websites.
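One way to picture crawl budget is as a fixed number of requests spent on the highest-priority URLs first. The sketch below assumes each known URL already carries a numeric priority score. How the search engines actually compute such priorities is not public, so the scores and the `pick_pages_to_crawl` helper are purely illustrative.

```python
import heapq


def pick_pages_to_crawl(candidates, budget):
    """Return the URLs of the `budget` highest-priority candidates.

    `candidates` is a list of (url, priority) pairs; a higher priority
    means the page is considered more worth crawling.
    """
    top = heapq.nlargest(budget, candidates, key=lambda pair: pair[1])
    return [url for url, _ in top]


# Hypothetical priorities for five known URLs on one website.
known_pages = [
    ("https://example.com/", 0.95),
    ("https://example.com/news", 0.80),
    ("https://example.com/products", 0.60),
    ("https://example.com/archive/2009", 0.10),
    ("https://example.com/archive/2008", 0.05),
]

print(pick_pages_to_crawl(known_pages, budget=3))
```

With a budget of three, the two archive pages are never requested, which mirrors how deep or low-value pages on a large site can go uncrawled.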

It is worth mentioning that search engines also request all the files that go into composing the webpage, such as images, CSS and JavaScript.

Just as with the webpage itself, if the additional resources that contribute to composing the webpage are inaccessible to the search engine, it can affect how the search engine interprets the webpage.

Rendering

When the search engine crawls a webpage, it will then “render” the page. This involves taking the HTML, JavaScript and cascading stylesheet (CSS) information to generate how the page will appear to desktop and/or mobile users. 

This is important in order for the search engine to be able to understand how the webpage content is displayed in context. Processing the JavaScript helps ensure the search engine has all the content that a human user would see when visiting the page.

The search engines categorize the rendering step as a subprocess within the crawling stage. I listed it here as a separate step in the process because fetching a webpage and then parsing the content in order to understand how it would appear composed in a browser are two distinct processes. 

Google renders pages with its Web Rendering Service, which runs a current version of Chromium, the open-source browser project that the Google Chrome browser is also built on.

Bingbot uses Microsoft Edge as its engine to run JavaScript and render webpages. Edge is also now built upon Chromium, so it renders webpages in essentially the same way that Googlebot does.

Google stores copies of the pages in their repository in a compressed format. It seems likely that Microsoft Bing does so as well (but I have not found documentation confirming this). Some search engines may store a shorthand version of webpages in terms of just the visible text, stripped of all the formatting.

Rendering mostly becomes an issue in SEO for pages that have key portions of content dependent upon JavaScript/AJAX. 

Both Google and Microsoft Bing will execute JavaScript in order to see all the content on the page, but more complex JavaScript constructs can be challenging for the search engines to execute.

I have seen JavaScript-constructed webpages that were essentially invisible to the search engines, resulting in severely nonoptimal webpages that would not be able to rank for their search terms. 
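A quick way to see why such pages can look empty is to extract the text from the raw HTML without running any JavaScript. The sketch below does that with Python's standard HTML parser. The sample page and the helper names are invented, and this is not a description of how any search engine's renderer works. It only shows what is missing when the script never executes.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page whose product listing is injected by JavaScript.
RAW_HTML = """
<html><body>
  <h1>Product catalog</h1>
  <div id="products"></div>
  <script>
    document.getElementById("products").innerHTML = "<p>Blue widget, $19</p>";
  </script>
</body></html>
"""


def text_without_rendering(html):
    """Return the text a non-rendering fetch would see."""
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)


print(text_without_rendering(RAW_HTML))
```

Only the heading survives. The product text exists solely inside the script, so it becomes visible only if the script is executed successfully during rendering.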

I have also seen instances where infinite-scrolling category pages on ecommerce websites did not perform well on search engines because the search engine could not see as many of the products’ links.

Other conditions can also interfere with rendering. For instance, when one or more JavaScript or CSS files are inaccessible to the search engine bots because they sit in subdirectories disallowed by robots.txt, it will be impossible to fully process the page.

Googlebot and Bingbot largely will not index pages that require cookies. Pages that conditionally deliver some key elements based on cookies might also not get rendered fully or properly. 

Indexing

Once a page has been crawled and rendered, the search engines further process the page to determine if it will be stored in the index or not, and to understand what the page is about. 

The search engine index is functionally similar to an index of words found at the end of a book. 

A book’s index will list all the important words and topics found in the book, listing each word alphabetically, along with a list of the page numbers where the words/topics will be found. 

A search engine index contains many keywords and keyword sequences, associated with a list of all the webpages where the keywords are found. 

The index bears some conceptual resemblance to a database lookup table, which may have originally been the structure used for search engines. But the major search engines likely now use something a couple of generations more sophisticated to accomplish the purpose of looking up a keyword and returning all the URLs relevant to the word. 

Using an index to look up all pages associated with a keyword is a time-saving architecture, as it would take unworkable amounts of time to search all webpages for a keyword in real time, each time someone searches for it.
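The book-index analogy maps directly onto a data structure usually called an inverted index. Here is a minimal version, assuming three made-up pages and a deliberately naive tokenizer. As noted above, production search engines likely use something far more sophisticated.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted text.
DOCS = {
    "https://example.com/coffee": "How to brew coffee at home",
    "https://example.com/tea": "How to brew green tea",
    "https://example.com/espresso": "Espresso is concentrated coffee",
}


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(DOCS)
print(lookup(index, "coffee"))
print(lookup(index, "brew coffee"))
```

The expensive work of reading every page happens once, in `build_index`. Each search afterward touches only the entries for the words in the query.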

Not all crawled pages will be kept in the search index, for various reasons. For instance, if a page includes a robots meta tag with a “noindex” directive, it instructs the search engine to not include the page in the index.

Similarly, a webpage may include an X-Robots-Tag in its HTTP header that instructs the search engines not to index the page.

In yet other instances, a webpage’s canonical tag may instruct a search engine that a different page from the present one is to be considered the main version of the page, resulting in other, non-canonical versions of the page being dropped from the index.
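The three signals just described (the robots meta tag, the X-Robots-Tag header and the canonical tag) can be checked mechanically when diagnosing an indexing problem. The sketch below is a simplified diagnostic, not a model of how a search engine decides: the function names are invented, the header lookup assumes the exact key "X-Robots-Tag", and the canonical tag is treated as decisive even though search engines may not always follow it.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Pulls the robots meta directives and canonical URL out of a page."""

    def __init__(self):
        super().__init__()
        self.meta_robots = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.meta_robots += [d.strip().lower() for d in content.split(",")]
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")


def indexing_decision(url, html, headers):
    """Return a short label describing what the page's own signals request."""
    hints = IndexingHints()
    hints.feed(html)
    header_value = headers.get("X-Robots-Tag", "")
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    if "noindex" in hints.meta_robots or "noindex" in header_directives:
        return "excluded: noindex"
    if hints.canonical and hints.canonical != url:
        return "deferred to canonical: " + hints.canonical
    return "eligible"


page = '<head><link rel="canonical" href="https://example.com/shoes"></head>'
print(indexing_decision("https://example.com/shoes?color=red", page, {}))
```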

Google has also stated that webpages may not be kept in the index if they are of low quality (duplicate content pages, thin content pages, and pages containing all or too much irrelevant content). 

There has also been a long history that suggests that websites with insufficient collective PageRank may not have all of their webpages indexed – suggesting that larger websites with insufficient external links may not get indexed thoroughly. 

Insufficient crawl budget may also result in a website not having all of its pages indexed.

A major component of SEO is diagnosing and correcting when pages do not get indexed. Because of this, it is a good idea to thoroughly study all the various issues that can impair the indexing of webpages.

Ranking

Ranking of webpages is the stage of search engine processing that is probably the most focused upon. 

Once a search engine has a list of all the webpages associated with a particular keyword or keyword phrase, it then must determine how it will order those pages when a search is conducted for the keyword. 

If you work in the SEO industry, you likely will already be pretty familiar with some of what the ranking process involves. The search engine’s ranking process is also referred to as an “algorithm”. 

The complexity involved with the ranking stage of search is so huge that it alone merits multiple articles and books to describe. 

There are a great many criteria that can affect a webpage’s rank in the search results. Google has said there are more than 200 ranking factors used by its algorithm.

Within many of those factors, there can also be up to 50 “vectors” – things that can influence a single ranking signal’s impact on rankings. 

PageRank, invented in 1996, is the earliest version of Google’s ranking algorithm. It was built off a concept that links to a webpage – and the relative importance of the sources of the links pointing to that webpage – could be calculated to determine the page’s ranking strength relative to all other pages.

A metaphor for this is that links are somewhat treated as votes, and pages with the most votes will win out in ranking higher than other pages with fewer links/votes. 

Fast forward to 2022 and a lot of the old PageRank algorithm’s DNA is still embedded in Google’s ranking algorithm. That link analysis algorithm also influenced many other search engines that developed similar types of methods. 

The old Google algorithm method had to process over the links of the web iteratively, passing the PageRank value around among pages dozens of times before the ranking process was complete. This iterative calculation sequence across many millions of pages could take nearly a month to complete. 
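The iterative passing-around of value can be illustrated with the textbook form of the PageRank calculation. This is a simplified sketch on a four-page toy graph, not Google's production system: the graph is invented, every link target is assumed to appear as a key, and the 0.85 damping factor is the value commonly cited from the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by repeatedly passing rank along links.

    `links` maps each page to the list of pages it links to. Every link
    target is assumed to also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank across all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D; nothing links to D.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, which collects the most "votes," ends up with the highest score, while D, which nothing links to, keeps only the baseline share. On a graph of millions of pages, each of those iterations is a full pass over every link, which is why the original process was so slow.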

Nowadays, new page links are introduced every day, and Google calculates rankings in a sort of drip method – allowing for pages and changes to be factored in much more rapidly without necessitating a month-long link calculation process.

Additionally, links are assessed in a sophisticated manner – revoking or reducing the ranking power of paid links, traded links, spammed links, non-editorially endorsed links and more. 

Broad categories of factors beyond links influence the rankings as well.

Conclusion

Understanding the key stages of search is a table-stakes item for becoming a professional in the SEO industry. 

Some personalities on social media think that not hiring a candidate just because they don’t know the differences between crawling, rendering, indexing and ranking is “going too far” or “gate-keeping”.

It’s a good idea to know the distinctions between these processes. However, I would not consider having a blurry understanding of such terms to be a deal-breaker.

SEO professionals come from a variety of backgrounds and experience levels. What’s important is that they are trainable enough to learn and reach a foundational level of understanding.

The post The 4 stages of search all SEOs need to know appeared first on Search Engine Land.



from Search Engine Land https://ift.tt/JjNo2OL