If you follow me on Twitter, then you know I sometimes complain about the current state of the industry – most notably centered around what passes for research and discussion these days. It feels like people want to be handed the fish – with little interest in learning how the person with the fish caught it. Seeking out debates and experience seems to have been replaced by wanting to be spoon-fed blog posts – often laced with assumptions and misinformation hidden within a single-case graph or a slick graphic coupled with an impressive looking byline.
Separating fact from fiction becomes increasingly hard as the next generation of our industry raises themselves up by following rather than exploring.
At SMX West, Google engineer Paul Haahr gave a presentation offering some insight into how Google works. In my opinion, it's the most transparent and useful information we've gotten from Google in years.
I spent this morning taking a full set of notes on it – which I always do with something like this because I believe it helps me better retain the information. Within my notes, I make notations about questions I have, theories I come up with and conclusions I draw – right or wrong. This isn't the sexy work, but it's the necessary work to be a formidable opponent in the game.
As I looked at the notes, I realized I missed the discussions, debates, and sharing of experience that used to surround analyzing information like this.
Some of what I feel limits that kind of discussion these days is needing to be seen as an infallible expert on all things Google. The industry has become so consumed with being an expert that it's afraid to ask questions or challenge assumptions for fear of being proven wrong. Unlike many of the names within this industry, I'm not afraid to be wrong. I welcome it. Being proven wrong means I have one more piece of concrete knowledge needed to win the game. Being questioned or challenged on a theory I hold gives me another theory to test.
So I'm publishing my notes – and personal notations – and am making a call for a real, exploratory – fuck it if I'm right or wrong – search discussion. Whether you're old school with massive experience or new school with untested theories and ideas – bring it to the table and let's see what we can all walk away from it with.
Notes from the How Google Works presentation – SMX West 16
To be clear, these are my notes from the presentation and not a transcription (notations in orange are comments made by me and not the speaker).
General opening remarks
- Google is all about the mobile first web
- Your location matters a lot when searching on mobile
- Auto complete plays bigger role
- His presentation centers mostly around classic search
Life of a query
Timestamp: 3:38 – Link to timestamp
Haahr frames this next bit of information as a 20-minute, secret-sauce-stripped version of the half-day class attended by every new Google engineer.
He starts by explaining the two main parts of the search engine:
1. What happens ahead of time (before query):
- Analyzing crawled pages: links, render contents, annotating semantics
- Build the index: think of it like the index of a book
- Made up of Shards. Shards segment groups of millions of pages.
- There are thousands of Shards in the Google index
- Per-document metadata
2. And query processing:
- Query understanding – What does the query mean: are there known entities? Useful synonyms? Specifies that context matters for queries.
- Retrieval and scoring
- Send the query to all the shards
- Find matching pages within each Shard
- Compute a score for A. the query (relevance) and B. the page (quality)
- Send back the top pages from each Shard by score
- Combine all the top pages from each Shard
- Sort combined top Shard results by score
- Post-retrieval adjustments
- Host clustering (notation – does this mean using a dedicated server can be a bonus? Should you check shared hosts for sites with similar topics? Confirms need for separate hosts for networked or related sites? This has been clarified by a former Googler – see this comment for more detail. The tl;dr is that host clustering is synonymous with domain clustering (the more widely used term in the industry) and site clustering, and does not refer to host as in hosting.)
- Are sitelinks appropriate?
- Is there too much duplication?
- Spam demotions and manual actions get applied
- Snippets get pulled
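To make the retrieval-and-scoring flow above concrete, here is a minimal sketch of the scatter-gather pattern Haahr describes: the query goes to every shard, each shard scores its own matches and returns its top pages, and those are combined and re-sorted by score. All names and the scoring function here are mine for illustration – they're not Google's.

```python
# Illustrative sketch of sharded retrieval: query every shard, take each
# shard's top pages by score, then combine and re-sort. Not Google's code.

class Shard:
    """One segment of the index, holding a group of pages."""
    def __init__(self, pages):
        self.pages = pages  # {url: page text}

    def find_matching(self, query):
        return [(url, text) for url, text in self.pages.items()
                if query in text]

def score(query, text):
    # Toy stand-in for the real combined relevance + quality scoring.
    return text.count(query)

def search(query, shards, k=3):
    candidates = []
    for shard in shards:  # "send the query to all the shards"
        matches = shard.find_matching(query)
        matches.sort(key=lambda m: score(query, m[1]), reverse=True)
        candidates.extend(matches[:k])  # top pages from each shard
    candidates.sort(key=lambda m: score(query, m[1]), reverse=True)
    return [url for url, _ in candidates[:k]]  # combined, sorted top results
```

The point of the sketch is the shape of the system – per-shard top-k followed by a global merge – not the scoring itself, which the rest of the talk is about.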
What engineers do
Timestamp: 8:49 – Link to timestamp
- Write code
- Write formulas to compute scoring numbers to find the best match between a query and a page based on scoring signals
- Query independent scoring factors – Features of the page alone, like PageRank, language, mobile friendliness
- Query dependent scoring factors – Features of the page and the query, such as keyword hits, synonyms, proximity, etc. (notation – in relation to the proximity of the keyword within the page or of the user locale or of the site's presumed locale?)
- Combine signals to produce new algorithms or filters and improve results
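As a toy illustration of those two signal families, a combined score might look like the sketch below. The specific features, formulas, and weights are my own invention – nothing here is disclosed by Haahr beyond the split into query-independent and query-dependent factors.

```python
# Hypothetical combination of query-independent and query-dependent
# signals. Features and weights are invented for illustration only.

def query_independent_score(page):
    # Features of the page alone, e.g. PageRank-style authority and
    # mobile friendliness (both assumed pre-computed, before the query).
    return 0.7 * page["authority"] + 0.3 * page["mobile_friendly"]

def query_dependent_score(query, page):
    # Features of the page *and* the query, e.g. keyword hits.
    terms = query.lower().split()
    return sum(page["text"].lower().count(t) for t in terms) / len(terms)

def combined_score(query, page, w=0.5):
    # "Combine signals" – here just a weighted sum of the two families.
    return (w * query_independent_score(page)
            + (1 - w) * query_dependent_score(query, page))
```

The useful takeaway is structural: one family of numbers can be computed ahead of time per page, the other only once the query arrives.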
Key metrics for rankings
Timestamp: 10:10 – Link to timestamp
- Relevance – Does the page answer the user query in context – this is the front of the line metric
- Quality – How good are the results they show in regard to answering the user query? How good are the individual pages? (notation – Emphasis on individual is mine)
- Time to result (faster is better) (notation – Time for site to render? Or for the user to be able to find the answer on the ranking page? Or a combination? Site render time could be a subfactor of time for the user to be able to find the answer on the ranking page? Edit > Asked Haahr for clarification on Twitter – he is unable to elaborate. However, there is some probable elaboration found via Amit Singhal in this comment.)
- More metrics not listed
- Offers that he “should mention” the metrics are based on looking at the SERP as a whole and not at one result at a time.
- Uses the convention that higher results matter
- Positions are weighted
- Reciprocally ranked metrics
- Position 1 is worth the most, position 2 is worth half of what number 1 is, position 3 is worth 1/3 of number 1, etc. (notation – The premise of reciprocally ranked metrics went over my head and I welcome simplified clarifications on what he's talking about here.)
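As best I can tell, the reciprocal weighting works like this – a minimal sketch of my own interpretation, not Google's actual formula: a result's contribution to a whole-SERP metric is divided by its position, so position 1 gets weight 1, position 2 gets 1/2, position 3 gets 1/3, and so on.

```python
# My interpretation of reciprocally ranked metrics: each result's rating
# is weighted by 1/position, so improvements near the top of the SERP
# move the metric far more than the same improvement lower down.
# The per-result scores are hypothetical ratings, not real data.

def reciprocal_rank_metric(result_scores):
    """result_scores[i] is a rating for the result at position i + 1."""
    return sum(s / (i + 1) for i, s in enumerate(result_scores))

# Moving a good result (rated 1.0) from position 3 up to position 1
# triples its contribution to the SERP metric: 1/3 -> 1.0.
```

If that reading is right, it explains why the metrics reward pushing good pages toward the top rather than merely getting them onto page one.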
Timestamp: 12:00 – Link to timestamp
Metric optimization ideas and strategies are developed through an internal evaluation process that analyzes results from various experiments:
Timestamp: 12:33 – Link to timestamp
- Split testing experiments on real traffic
- Looking for changes in click patterns (notation – There has been a long-time debate as to whether click through rates are counted or taken into account in the rankings. I took his comments here to mean that he is asserting that click through rates are analyzed from a perspective of the quality of the SERP as a whole and judging context for the query vs. benefitting a specific site getting more clicks. Whether or not I agree with that I'm still arguing internally about.)
- Google runs a lot of experiments
- Almost all queries are in at least one live experiment
- Example experiment – Google tested 41 hues of blue for their result links trying to determine which one performed best
Example given for interpreting live experiments: Page 1 vs. Page 2
- Both pages P1 and P2 answer the user's need
- For P1 the answer only appears on the page
- For P2 the answer appears both on the page and in the snippet (pulled by the snippeting algorithm – resource on the snippet algorithm)
- Algorithm A puts P1 before P2; the user clicks on P1; from an algorithmic standpoint this looks like a “good” result in their live experiment analysis
- Algorithm B puts P2 before P1; but no click is generated because the user sees answer in the snippet; purely from an algorithmic standpoint this looks like a “bad” result
But in that scenario, was Algorithm A better than Algorithm B? The second scenario should be a “good” result because the user got a good answer – faster – from the snippet. But it's hard for the algorithm to evaluate whether the user left the SERP because the answer they needed wasn't there or because they got their answer from the snippet.
This scenario is one of the reasons they also use human quality-raters.
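A toy simulation makes the trap concrete – all data here is invented, purely to show why a naive click metric misreads the snippet case:

```python
# Toy illustration of the evaluation trap above: a naive click-through
# metric scores the snippet-answering ranking as worse, even though its
# users were satisfied faster. All sessions are invented.

def click_rate(sessions):
    # Naive live-experiment metric: fraction of sessions with a click.
    return sum(1 for s in sessions if s["clicked"]) / len(sessions)

# Algorithm A ranks P1 first: the answer is only on the page, so every
# satisfied user clicks.
algo_a_sessions = [{"clicked": True, "satisfied": True}] * 10
# Algorithm B ranks P2 first: the answer appears in the snippet, so
# satisfied users leave without clicking at all.
algo_b_sessions = [{"clicked": False, "satisfied": True}] * 10

# click_rate(algo_a_sessions) -> 1.0; click_rate(algo_b_sessions) -> 0.0,
# yet every user in both groups got their answer – hence human raters.
```

Click data alone can't distinguish "left satisfied via the snippet" from "left because the results were bad", which is exactly the gap the rater program fills.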
Human quality-rater experiments
Timestamp: 15:21 – Link to timestamp
- Show real people experimental search results
- Ask them to rate how good the results are
- Human ratings averaged across raters
- Published guidelines explaining criteria for quality-raters to use when rating a site
- Tools support doing this in an automated way
- States they do human quality-rater experiments on large query sets to obtain statistical significance and cites the process as being similar to Mechanical Turk-like processes
- Mentions that the published rater guidelines are Google's intentions for the types of results they want to produce (notation – this is very different from a user rating a query based on personal satisfaction – instead, raters are told to identify whether the results for the query meet Google's satisfaction requirements and include the kind of results Google believes should be included – or not included. The quality rater guidelines are the results produced by the Google dream algorithm.)
- He says if you're ever wondering why Google is doing something, it is most often them trying to make their results look more like the rater guidelines. (notation – Haahr reiterated to me on Twitter how important he believes reading the guidelines is for SEOs.)
- Slides showing human rater tools: slides 33, 34
- Re mobile first – more mobile queries in samples (2x)
- Raters are told to pay attention to the user's location when assessing results
- Tools display mobile user experience
- Raters visit websites on smartphones, not on a desktop computer
Timestamp: 19:04 – Link to timestamp
Are the needs as defined by Google met?
- Instructions tell raters to think about mobile user needs and think about how satisfying the result is for mobile user
- Rater scales include: fully meets, highly meets, moderately meets, slightly meets, fails to meet
- Slider bars are available to further sub classify a “meets” level
- Example: a result can be classified highly meets and the slider bar allows the rater to subclassify that “highly meets” result as very highly meets, more highly meets, etc.
- There are two sliders for rating results – one for the “needs met” (relevancy) rating and one for the “page quality”
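One way to picture how the labels, sliders, and cross-rater averaging might combine numerically – the numeric scale below is entirely my own invention, used only to make the mechanics concrete:

```python
# Hypothetical numeric encoding of the "needs met" labels plus the
# slider sub-classification, averaged across raters as described.
# The numbers assigned to each label are my own invention.

NEEDS_MET = {
    "fails to meet": 0,
    "slightly meets": 1,
    "moderately meets": 2,
    "highly meets": 3,
    "fully meets": 4,
}

def rating_value(label, slider=0.0):
    # slider in [0, 1) nudges a rating within its "meets" band,
    # e.g. "very highly meets" vs. plain "highly meets".
    return NEEDS_MET[label] + slider

def average_rating(ratings):
    """ratings: list of (label, slider) pairs, one per rater."""
    return sum(rating_value(l, s) for l, s in ratings) / len(ratings)
```

However Google actually encodes it, the mechanism Haahr describes is the same shape: a coarse label, a fine-grained slider within that label, and an average across many raters.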
- Examples of fully meets in slides – slide 41:
- Query CNN – cnn.com result – fully meets
- Search for yelp and you have yelp app installed on phone so google will serve the app – fully meets
- To be rated fully meets, they want an unambiguous query and a result that wholly satisfies the user's needs for that query
- Examples of highly meets in slides – slides 42 – 44 showing varying subclassifications of highly meets queries
- Informational query and the result is a great source of information
- Site is authoritative
- Author has expertise on the topic being discussed
- Comprehensive for the query in question
- Showing pictures where the user is likely looking for pictures
- Examples of moderately meets in slides – slide 45
- Result has good information
- Interesting and useful information, though not all encompassing for the query or super authoritative
- Not worthy of being a number one answer, but might be good to have on the first page of results
- Slightly meets
- Result contains less good information
- Example: a search for Honda Odyssey might bring up the page for the 2010 Odyssey on KBB. It slightly meets because the topic is correct and there is good information, but the ranking page is outdated. The user didn't specify the 2010 model, so the user is likely looking for newer models. He cites this result as “acceptable but not great”
- Fails to meet
- Example: A search for german cars that returns the Subaru website (Subarus are manufactured in Japan)
- Example: A search for rodent removal company brings up a result half a world away (notation – They want to geo-locate specific query types that are likely to be geo-centric in need – ex. Local service businesses. Using quality raters can help them identify what these service types are and add to the standard geo-need list like plumbers, electricians, etc.)
Assessing page quality:
Timestamp: 23:58 – Link to timestamp
The three most important concepts for quality:
- Is the author an expert on the topic?
- Is the webpage authoritative about the topic?
- Can you trust it?
- Gives example categories where trustworthiness would be most important to assessing the overall page quality – medical, financial, buying a product
The rating scale is high quality to low quality:
- Does the page exhibit signals of high quality as defined in part by:
- Satisfying amount of high quality main content
- The website shows expertise, authority and trustworthiness for the topic of the page
- The website has a good reputation for the topic of the page
- Does the page exhibit signals of low quality as defined in part by:
- The quality of content is low
- Unsatisfactory amount of main content
- Author does not have expertise or is not authoritative or trustworthy on the topic – “on the topic” is bolded in his presentation (notation – The concept behind Author rank lives on, in my opinion. We are the ones who taught them how to connect the dots with Authorship markup. They can no doubt now do this algorithmically and no longer need us manually connecting those dots.)
- The website has an explicit negative reputation
- The secondary content is unhelpful – ads, etc. (notation – Human input giving them a roadmap to how they're calculating and shaping the Above the Fold algorithm? Likely also refers to the affiliate notations in search rater guidelines starting on page 10 of the Google quality rater guidelines.)
Optimizing the metrics – the experiments
Timestamp: 25:28 – Link to timestamp
- Someone has an idea for how to improve the results via metrics and signals or solve a problem in the results
- Repeat development of and testing on the idea until the feature is ready: code, data, experiments, analyzing results of experiments – which can take weeks or months
- If the idea pans out, some final experiments are run and a launch report is written and undergoes a quantitative analysis
- He feels this process is objective because the review comes from outside the team that was working on – and is emotionally invested in – the idea
- Launch review process is held
- Every Thursday morning there is a meeting where the leads in the area hear about project ideas, summaries, reports or experiments, etc.
- Debates surround whether it's good for users and for the system architecture, and whether the system can continue to be improved if this change is made. (notation – He makes a reference to them having published a launch review meeting a few years ago. I believe he is referring to this.)
- If approved it goes into production
- Might ship same week
- Sometimes it takes a long time in rewriting code to make it fast enough, clean enough, suitable for their architecture, etc. and can take months
- One time it took almost two years to ship something
The primary goal for all features and experiments is to move pages with good ratings up and pages with bad ratings down. (notation – I believe he means human ratings, but that was not clarified.)
Two of the core problems they face in building the algorithm
Timestamp: 28:50 – Link to timestamp
Systematically bad ratings:
- Gives bad rating example, texas farm fertilizer
- User is looking for a brand of fertilizer
- Showed a 3 pack of local results and a map at the top position
- It's unlikely the user doing the search wants to go to the company's headquarters since the product is sold in local home improvement stores
- But raters on average cited the result with the map of the headquarters as almost highly meets
- Looked successful due to raters ratings
- But in reality they noted what Google describes as a pattern of losses
- In a series of experiments that were increasing the triggering of maps, human raters were rating them highly
- Google disagreed, so they amended their rater guidelines, creating more examples of these queries and explaining that such results should be rated fails to meet – see slide 61 of the presentation
- The new examples told raters that if they didn't think the user would go there, maps are a bad result for the query, citing examples like:
- radio stations
- lottery office
- When Google sees patterns of losses, they look for things that are bad in results and create examples for rater guidelines to correct them
Metrics don't capture things they care about AKA missing metrics
- Shows Salon.com article on slide with the headline Google News Gets Gamed by a Crappy Content Farm
- From 2009–2011 they received lots of complaints about low quality content
- But human ratings were going up
- Sometimes low quality content can be very relevant
- He cites this as an example of what they consider content farms
- They weren't measuring what they needed to
- So they defined an explicit quality metric – which is not the same as relevance – and this is why relevance and quality each have their own sliders for human raters now
- Determined quality is not the same as relevant
- They were able to develop quality signals separate of relevance signals
- Now they can work on improving the definitions of both separately in the algorithm
Quality signals became separate from relevancy signals (notation – Emphasis is mine. I think most of the search industry sees these as one metric, and it is important to emphasize that they are not and have not been for a long while now.)
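A minimal sketch of that separation – all scores and features invented by me: keeping relevance and quality as distinct axes means a page can score high on one and low on the other, which is exactly the content-farm failure mode above.

```python
# Illustrative only: relevance and quality as two separate scores, so a
# page can be highly relevant yet low quality (the content-farm case).
# All page data and the "expertise" field are invented.

def relevance(query, page):
    # Does the page answer this query at all?
    return 1.0 if query.lower() in page["text"].lower() else 0.0

def quality(page):
    # Separate axis: a stand-in for expertise/authority/trust.
    return page["expertise"]

content_farm = {"text": "How to tie a tie fast", "expertise": 0.1}
expert_page  = {"text": "How to tie a tie properly", "expertise": 0.9}

# Both pages are equally relevant to "tie a tie" (both score 1.0),
# but they differ sharply on the quality axis.
```

With one blended metric the content farm looks fine; with two separate signals it can be demoted on quality without pretending it isn't relevant.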
So what now?
Contribute. What insights did you take away from the presentation? What were your thoughts on the things I notated? Were there things I didn't notate that you have a comment on or had a theory spurred from? Do you disagree with any of Haahr's assertions? Do you disagree with mine? Did anything in his presentation surprise you? Did anything get confirmed for you? Whatever thoughts you had on his presentation, drop them in the comments below.