How Google's search engine works

Why you care – because Google is the best search engine out there

Of course no one knows exactly how Google's search engine works. It is a closely guarded secret.

What is known is that Google works far better than any other search engine out there. It simply finds relevant and useful resources for users much more quickly and easily than its competitors.

Having worked in the field of search engines and Search Engine Optimisation (SEO) since 2000, my opinion is that Google has achieved this by concentrating on giving users the quickest and easiest search interface, and providing them with the most relevant results.

For the most part Google has relegated conventional business concerns, like revenue streams, to a secondary status when compared to the effort it has invested in search. In doing so it has dominated the market by providing the best product.

Google is continually improving its search. Sometimes it makes mistakes, and more recently its search has to an extent been compromised by the revenue streams produced by Google Ads.

Search engines work by creating huge databases of the world wide web. When you put a search query into Google, Google is not searching the web; it is searching one of many databases it has compiled that index the web.
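The idea of searching a pre-built database rather than the live web can be illustrated with a toy inverted index, which maps each word to the pages that contain it. This is only a minimal sketch: the example URLs, the sample text, and the function names tokenize, build_index and search are all invented for illustration, and nothing here reflects how Google actually stores its data.

```python
import re
from collections import defaultdict

# Hypothetical pages that a robot has already fetched (URL -> page text).
PAGES = {
    "http://example.com/cats": "All about cats and kittens",
    "http://example.com/books": "A book shop selling books about cats",
    "http://example.com/dogs": "All about dogs",
}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# The index is built once, ahead of time; each query only consults the index.
INDEX = build_index(PAGES)
print(search(INDEX, "cats"))
print(search(INDEX, "about cats"))
```

The point of the design is that the expensive work (reading every page) happens before anyone searches, so answering a query is just a lookup.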

The Googlebot, other robots, and spidering the web

Search robots spider the web, following links and indexing the content they crawl over. These robots are essentially specialised web browsers. The earliest version of the Googlebot, Google's search robot, was a modified Lynx browser. (Lynx was an early, text-only web browser.)

To compile these databases Google sends out search robots, called Googlebots, to browse ('spider') the web. They have a start point and from there these robots simply follow links, leaving behind them a trail, like spider's silk, of URLs (web addresses) and the related web content and metadata they have passed over. This is how Google indexes the web.
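The spidering process described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Googlebot: the three-page FAKE_WEB dictionary stands in for real HTTP requests, and the names LinkExtractor and crawl are hypothetical. A real robot would also need to handle relative URLs, robots.txt, politeness delays and much more.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny in-memory "web" (URL -> HTML) standing in for real HTTP requests.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
    "http://c.example/": "<p>No links here</p>",
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url; return {url: html} for every page reached."""
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])  # URLs waiting to be visited
    trail = {}                  # the "silk": each visited URL and its content
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        trail[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return trail


trail = crawl("http://a.example/", FAKE_WEB.get)
print(list(trail))
```

Note that a page nothing links to would never be reached from the start point, which is one reason inbound links matter so much to a robot.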

The data from this indexing is then evaluated using an algorithm, or rather a set of algorithms, that categorises pages (for example as a book shop, or as a site about cats) and scores them.

The Google algorithms are of course a closely guarded secret. There are some aspects of how Google algorithms work to evaluate your website that are known, or that can be inferred. These are:

  • Google uses a scoring system called PageRank. There are browser add-ons that allow you to view the PageRank of any URL.
  • Google search results will bold words and phrases in the search engine results page (SERP) that relate to the entered search query. The bold words can be in:
    • The page URL: For Google to do this usually, but not always, requires word separators (hyphens or underscores) between the words in the URL.
    • The page title: Google often, but not always, uses the page's title as the hyperlink.
    • The metadata description: The page metadata description is sometimes displayed as the two line description of the page that appears in the Google results.
    • Content in heading tags, particularly H1 and H2 tags: Sometimes Google will display this content in the result description.
    • Page content: A sample of page content is sometimes displayed as the two line description of the page that appears in the Google results.
    • Link text: A sample of link text taken from the title attribute of a page linking to the target page is sometimes displayed as the two line description of the page that appears in the Google results.
  • Directory listing: A directory listing description of the site is sometimes displayed as the two line description of the page that appears in the Google results. If it is a search directory listing it is most likely taken from DMOZ.org, from the Google Directory (which is a version of DMOZ.org), and possibly from the Yahoo Directory (though I think this was more likely in the past).
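The bolding behaviour in the list above, including why word separators in a URL matter, can be imitated with a simple pattern match. This is a rough sketch only: the function name bold_query_terms and the example URLs are invented, and matching whole words between non-alphanumeric characters is my assumption about the visible behaviour, not a description of Google's code.

```python
import re


def bold_query_terms(text, query):
    """Wrap each whole-word, case-insensitive match of a query word in <b> tags."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return text
    # A match must not be glued to a letter or digit on either side, so hyphens,
    # underscores, slashes and spaces all act as word separators.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + "|".join(words) + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub(r"<b>\1</b>", text)


print(bold_query_terms("Cat food and cat toys", "cat toys"))
print(bold_query_terms("example.com/cat-food", "cat food"))
print(bold_query_terms("example.com/cat_food", "cat food"))
print(bold_query_terms("example.com/catfood", "cat food"))
```

In this sketch the hyphenated and underscored URLs both get their words bolded, while the run-together 'catfood' URL is left untouched because neither query word stands on its own.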

It is known that there is also some manual, meaning human, evaluation used by Google in compiling its search results.

How Google PageRank works

Google PageRank has its origins in the early conception of what the internet was for, and how it would be used.

Hyper Text Mark-up Language (HTML) was designed to mark up academic papers, quite naturally, as universities were the earliest adopters of the civilian internet.

PageRank was predicated on the assumption that the more a page, like an academic paper, was cited/linked to, the greater importance and weight that paper/webpage carried. You could view this as an early application of crowd sourcing. In academic terms, it is simply peer review. Just as Darwin is cited more often than one's own university dissertation, so Darwin is, by a process of peer review, regarded as the more important and reliable source of information.

The upshot was that SEO practitioners very quickly concentrated their efforts on getting as many inbound links to their website as possible in order to raise a page's PageRank, and improve its placing in natural search results.

Not all inbound links to a page are equal though.

  • The value of an inbound link is known to be affected by the PageRank score of the originating page.
  • Having an accurate brief description of the target page in the 'link phrase' is thought to improve the value of the link. (The link phrase 'click here' does not describe what the user can expect when they click on the link, so is of low value).
  • Some inbound links may actually be harmful to a site's rating. Links from 'link farming' sites may imply that a site hangs out in bad company.
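The citation idea, and the point that a link from a highly ranked page is worth more, can be seen in a simplified version of the publicly described PageRank calculation. This is a sketch, not the algorithm Google runs today, which is not public. The damping value of 0.85 is the figure commonly quoted, the page names in GRAPH are made up, and sharing the score of a page with no outgoing links across all pages is one common convention that I have assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages in a link graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with every page equal

    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its own score equally among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its score over every page.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: three pages link to "darwin", one of them also links to "thesis".
GRAPH = {
    "a": ["darwin"],
    "b": ["darwin"],
    "c": ["darwin", "thesis"],
    "darwin": ["a"],
    "thesis": [],
}
ranks = pagerank(GRAPH)
print(max(ranks, key=ranks.get))
```

In this toy graph 'darwin' ends up with the highest score because it is cited most. Page 'a' and page 'thesis' each have only one inbound link, but 'a' scores higher because its link comes from the highly ranked 'darwin' page, while the link to 'thesis' comes from a low-ranked page that also splits its vote two ways.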

In addition, what counts as a link may not be as simple as it first appears.

  • Links within a domain also count. A page that is prominently linked to from the website homepage, for example in the global navigation, is inferred by the search engine to be a more important page than one that isn't linked to, or is linked to from a less prominent area of the page, such as a link included in the footer.
  • An inclusion in the index of another search engine may also be counted as an inbound link.
  • An inclusion in the index of a different international variant of a search engine may also be counted as an inbound link. Therefore a UK company may rank higher in the UK Google index by also being included in the United States Google index.

Farmer sites and Google's search quality dilemma

Google human reviews

It is known from a leaked document that there is a manual search review procedure at Google.

The manual review does not review websites. Rather it reviews the results of search queries.

The purpose of the reviews is to 'sanity check' the search results Google's automated algorithms are producing. In certain instances this results in Google overriding the automatically calculated results of its various algorithms for top positions in natural search. This is in effect a fixed search.

This can be demonstrated quite clearly for the website of London based artist Lucy Harrison. This website is not optimised at all. Yet if you input the search query 'Lucy Harrison' into Google UK, pages from her website occupy the top two natural search results. If you input 'Lucy Harrison artist' into Google UK, Lucy Harrison's website occupies the first three natural search results, and information about Lucy Harrison on other gallery and artist websites completes the rest of the first page of results.

This phenomenon can also be viewed on Google UK when searching for the Guardian newspaper site, using the search query 'Guardian'. It isn't as explicit an example as the Lucy Harrison one, but there was a time, a couple of years ago, when the Guardian didn't come first in the natural search. That place was taken by a similarly monikered insurance company.

I think it's logical to infer that Google has done this, at least in part, as a response to how many users use Google now, which is not to search as such. Rather, users are using Google as a shortcut to access a site they know they want. Many users have Google set as their browser homepage, and it's quicker to type into Google than to type in a site URL. User testing has shown that it is now rare for users to type URLs directly into the address bar on their browsers, whereas in the early days of mass internet use, it was common.

Another aspect of this is that the review process likely contributes to how Google generates predicted search queries as users begin typing.

Search robots and Accessibility

It is important to remember that Googlebots, and other search robots, are not only mindless automatons, they are also blind.

This means that accessible websites have a head start in SEO.

The most well known aspect of accessibility benefiting SEO is that images should include alternative descriptions, because search spiders cannot read the information imparted by images. Accessibility also requires well-formatted HTML, and well-formatted HTML allows search robots to spider sites more efficiently, and crucially with more semantic (meaningful) information that assists in the evaluation of the websites.
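A quick way to see a page the way a blind robot does is to list the images that give it nothing to read. The following is a minimal sketch of such a check. The class name MissingAltFinder, the helper images_missing_alt and the sample markup are all hypothetical, and a proper accessibility audit covers far more than this one attribute.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Record the src of every <img> tag that has no non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            alt = attributes.get("alt")
            # Treat an absent, valueless or whitespace-only alt as missing.
            if alt is None or not alt.strip():
                self.missing.append(attributes.get("src", ""))


def images_missing_alt(html):
    """Return the src values of images in the HTML that lack alternative text."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


SAMPLE = '<h1>Cats</h1><img src="tabby.jpg" alt="A tabby cat asleep"><img src="logo.png">'
print(images_missing_alt(SAMPLE))
```

One caveat: an empty alt attribute is the accepted way to mark a purely decorative image, so anything this check reports should be reviewed by a person rather than fixed automatically.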


Ross Holloway Web Consultant | UX web designer | business analyst | web content | project manager