Looking closely at Google’s search results can be informative – at least, if you take some inductive leaps, and apply knowledge learned in other activities. Take a look at the graphic, showing these early April 2010 search results for Matt Cutts’ web site. Notice that the same articles appear several times, with slightly different URLs? Now look further down the results. See any more duplications? Take, for example, that mini-review of the iPad: I can’t see that article listed anywhere else in the search results.
Why are the articles shown twice or more? Because they have different URLs. They are the same page content, but on different paths. Duplicates. Notice that Matt’s blog is not penalised for these duplicates – because they aren’t what Google considers to be duplicate content. They are the same content, reached by different paths, and in this case the different paths are tracking parameters for Google Analytics. It’d be more than a little annoying if Google handled “duplicate” pages, caused by using their own tracking parameters, as if they were different resources!
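To make "same content, different paths" concrete, here is a minimal sketch of how tagged URLs could be collapsed to one form by dropping the tracking parameters. It assumes the Google Analytics campaign tags all start with "utm_", and the example.com URLs are hypothetical. It illustrates the idea only; I'm not claiming this is how Google does it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: Google Analytics campaign tags share the "utm_" prefix.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Return the URL with tracking parameters removed, other parameters kept in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical URLs: a tagged feed link and the plain link to the same post.
tagged = "http://example.com/blog/post/?utm_source=feedburner&utm_medium=feed"
plain = "http://example.com/blog/post/"
print(strip_tracking(tagged) == plain)  # True
```

Parameters that aren't tracking tags are kept, so a URL such as "http://example.com/?p=123&utm_source=feedburner" would still point at post 123 after stripping.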
Why aren’t other articles on the blog shown twice or more? What magic means that two-day-old articles get three showings, but articles five days old and older show just once?
Interesting, isn’t it?
Now, instead of sorting by date order, sort by relevance. Let’s look for those duplications again:
So these duplicate pages from the last few days appear as the lowest relevance pages in the last year of articles on Matt’s site. Yet they have the same text as the main article… or do they? Might there be something magic that the cache shows us?
The cached versions show us that they were taken up to a week previously – and reveal that these were found from Feedburner. Google is taking different URLs, determined by the tracking tags, and including them in results. The weight for the tagged pages is lower – Matt’s blog doesn’t refer to these URLs, and nor does anything else that Googlebot will crawl. So these additional pages rank lower, despite having the same content – the difference lies in the backlinks, and perhaps other factors.
But if there are no external backlinks to these articles, what makes them appear in results, at all, in any position? This, I think, is where we drag in another factor that seems to apply, especially to blogs. Trust. If a blog is well trusted by Google, then articles posted will tend to rank highly in search results, in as little as a few minutes for highly relevant searches (relevant, that is, to what Google thinks the article is about, which mostly means the title). Amongst the two hundred-or-so factors that Google is looking at is whether it trusts that your pages are likely to satisfy searchers. If your articles have enough hits from satisfied (long reading) users, you leap up the rankings within minutes of posting. Less trusted, lower weight postings won’t appear on page one, if they appear at all. That trust is extended to articles with no backlinks – these pages appear in results because they’re coming from a trusted site.
Why do these duplicate pages disappear, and when, and what does that tell us?
At some point in the last week, Google has removed the extra results for other page references with additional parameters. If we look again next week, we’ll see that the extra results for the current recent articles have also disappeared, but if Matt makes more posts, we may see those articles with extra tracking parameters for a brief period. Why do these “duplicate” pages disappear?
Well, one reason is that Google detests spam. A bad page of search results would be a page that contained different sites with the same article – because that wouldn’t reflect the diversity of opinion and solution on the web. Google would prefer to have ten different answers to the search query, rather than one answer on ten highly ranked sites. There’s a conscious effort at Google to compare articles.
How does Google identify that? Well, we can take some guesses by looking at the search results. Notice that bit that says “cached”? Taking several shots of a page (cached pages), over time, lets Google see that page content is evolving. User-generated content accretes to a page – and that’s visible in successive cached snapshots.
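As a rough illustration of what successive snapshots reveal, here is a small sketch using Python's standard difflib. It compares an older and a newer copy of a page, line by line, and separates what stayed the same from what was added. The snapshot text is made up, and this is my guess at the general idea, not a description of Google's method.

```python
from difflib import SequenceMatcher


def split_stable_and_added(old_snapshot: str, new_snapshot: str):
    """Compare two snapshots line by line.

    Returns (stable, added): lines present in both snapshots, and lines that
    only appear in the newer one (candidate user-generated content).
    """
    old_lines = old_snapshot.splitlines()
    new_lines = new_snapshot.splitlines()
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    stable, added = [], []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            stable.extend(new_lines[j1:j2])
        elif tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return stable, added


# Hypothetical snapshots of one article, taken a few days apart.
monday = "Article title\nArticle body text\nComment from a reader"
friday = "Article title\nArticle body text\nComment from a reader\nA second comment"

stable, added = split_stable_and_added(monday, friday)
print(stable)
print(added)
```

A line-based comparison is crude next to anything that understands page structure, but it is enough to show that the article core stays put while the comments accumulate.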
Google can therefore see which parts of a page are static and which are likely to be UGC, even without any effort to understand the page structure. Matt’s articles get a lot of commentary, so it’d be pretty easy for Google to determine that a common core of the page is unchanging – and identical. How easy? Well, there have been tools to perform that kind of textual analysis with computers since at least the 1970s to my certain knowledge – I used them back then. More recent techniques compare the compression rates of samples, using highly optimised data compression algorithms – and Google has smart people who probably do more complex stuff than that.
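For the curious, one well-known form of the compression trick is the normalised compression distance: compress two samples separately and then together, and see how much the combined version saves. The sketch below uses zlib from the Python standard library and invented sample text. It shows the general technique, and I have no evidence that Google uses this particular measure.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance: near 0 for near-identical text, closer to 1 for unrelated text."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented samples: an "article", the same article with a comment added,
# and an unrelated page.
article = " ".join(f"word{i}" for i in range(200))
with_comment = article + " nice post, thanks for sharing"
unrelated = " ".join(f"term{i * 7}" for i in range(200))

print(ncd(article, with_comment))  # expected to be small
print(ncd(article, unrelated))     # expected to be noticeably larger
```

The attraction is that the compressor does all the work: no parsing, no knowledge of the page layout, just two byte strings and a size comparison.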
Canonical Link References
One more datum to collect! Canonical Link References – does Matt’s blog use the canonical link ref to make sure that Google knows the best URL for this article?
The answer is very much affirmative – Matt likes the canonical link reference so much that it is added twice in the header! Hmm – some problem with conflicting plugins, perhaps? I don’t think this duplication of the canonical link reference is intentional. Fortunately they are the same, or we’d get into some discussions about whether Google believes the first or second canonical link reference!
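If you want to check your own pages for the same doubled-up canonical link reference, a small sketch along these lines would do it. It uses Python's standard html.parser, and the HTML fragment and URL are hypothetical. It reports every canonical link it finds and whether any of them disagree.

```python
from html.parser import HTMLParser


class CanonicalCollector(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "canonical" in rel_values and href:
            self.canonicals.append(href)


def canonical_urls(html: str) -> list:
    collector = CanonicalCollector()
    collector.feed(html)
    return collector.canonicals


def canonicals_conflict(html: str) -> bool:
    """True when a page declares more than one distinct canonical URL."""
    return len(set(canonical_urls(html))) > 1


# Hypothetical page header with the same canonical link declared twice.
page = """
<head>
  <link rel="canonical" href="http://example.com/blog/post/" />
  <link rel="canonical" href="http://example.com/blog/post/" />
</head>
"""
print(canonical_urls(page))       # two identical entries
print(canonicals_conflict(page))  # False: duplicated, but consistent
```

A repeated but identical reference is merely untidy. It is the conflicting case, where the function returns True, that would be worth fixing in a hurry.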
What *should* the canonical link reference do for Google, when we see the variant forms with tracking tags? It should tell Google that the preferred form is the one without the tracking tags. We should end up with just the preferred form showing in search results. And that’s what we see – but the canonical link reference isn’t the only way that Google looks for probably duplicated data. You can tell, if you look for other blogs that use FeedBurner, but that don’t use canonical link references. Those blogs still get deduplicated listings – showing that other components to deduplicate are still working, but still take days to do so.
What does the transitory presence of these pages tell us?
Thinking solely about what Matt’s results are showing us – not taking any evidence from other experiments and tests into account…
I think the presence of these pages in the search results says that Google is reading FeedBurner feeds for Matt’s blog, and getting tagged data, tagged for Google Analytics. I’m guessing that the tags are not set as ignored in Matt’s Webmaster Console, so Google sees the tagged pages as different. Because users are not linking to them, and the blog itself doesn’t refer to them, the usage eventually dies out – where “eventually” means less than a week. And the canonical link reference probably also helps.
Note that the alternate pages *are* present. Not highly ranked, but present. So… the content is the same, the server is the same, but the links to these pages are different; the untagged page is referenced within the blog (by any category or blog tag and the archive), and after a day or so, there are probably some links to these articles, all probably pointing to the untagged page. So that tells us that backlinks are important – or there’d be no reason why these tagged pages shouldn’t rank as highly – the user experience is likely to be the same, the content was (at the time of first snapshot/cache) the same. So if backlinks within Matt’s blog weren’t important, then the pages should rank with equal weight… and they clearly don’t.
This observation also says something important to us – if we watch our web server log files carefully! You may have read some of my other articles here about web server log file analysis for search engine optimisation – I think the web server log files tell us important things about how Google perceives our sites. If we see Googlebot requesting a tagged resource, then that tells us something about what resources we have out there, our state of canonicalisation, and Google’s speed of change. We want to know the speed of change, because when we’re trying to improve performance, we’ll want to see when we might expect to see results, at earliest.
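Here is a minimal sketch of that kind of log check. It assumes your server writes the common "combined" log format, that the tracking tags start with "utm_", and that a user agent containing "Googlebot" really is Googlebot (user agents can be faked, so treat the results as a first pass). The sample log line is invented.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Assumption: logs are in the "combined" format used by many Apache and nginx setups.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_tagged_requests(lines):
    """Return (timestamp, path) for each Googlebot request whose URL carries utm_ tags."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        query = urlsplit(match.group("path")).query
        params = parse_qsl(query, keep_blank_values=True)
        if any(key.lower().startswith("utm_") for key, _ in params):
            hits.append((match.group("time"), match.group("path")))
    return hits


# Invented sample line for illustration.
sample = [
    '66.249.66.1 - - [05/Apr/2010:10:12:01 +0000] '
    '"GET /blog/post/?utm_source=feedburner&utm_medium=feed HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
for when, path in googlebot_tagged_requests(sample):
    print(when, path)
```

Run over a few weeks of logs, the timestamps show when Googlebot first fetched a tagged URL and when it stopped asking for it, which is the speed of change we're after.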
We don’t, unfortunately, have access to Matt’s web server log files… but you have access to yours… What do they tell you, under similar conditions? I may return to this topic, as I think the research is pretty interesting for what it tells us about how Google goes about making decisions.
Summary
Google sees the same blog article under different URLs over time. Google collapses the references to a single page URL – probably using intrinsic information (page content) and supplied information (canonical link references), and backlinks, and the Webmaster Tools mechanism to instruct Google to ignore certain parameters.
URLs for alternate page presentations get into the results for trusted sites, even if these URLs have no weighty backlinks. Not highly ranked, but they do appear. They appear quickly and disappear within a few days. This suggests that whatever it is at Google that evaluates these URLs takes a few days to do so.
I don’t think this points to content being the key, but to content with backlinks and user-preferred results being the most important. And the presence of these duplicates is a hint that a site is trusted. Not a great hint, though: if you don’t use FeedBurner or another similar tagging resource, you won’t see this effect.
I’ve not considered the backlinks to Matt’s blog in detail… there are a lot and the signal is messy. I’ve left that analysis out of this article. Besides, discovering the timeline of backlinks is itself a pretty tricky exercise, unless you, as the experimenter, control them; that’s definitely not the case for Matt’s blog!
What’s especially interesting is that an article, without significant backlinks to it yet, can appear high in search results, but that this tends to be for the canonical representation of the page, even though Google is probably learning of the page via FeedBurner/pingomatic notification (in other words, via a URL that is likely to be tagged). So the first URL that Google is notified of is probably *NOT* the canonical form – yet the canonical form is listed highest, within minutes. That suggests some fast processes at Google for identifying and evaluating a page’s relevance, and then some slower processes to determine whether the page’s continuing presence can be justified.
And that behaviour of high ranking without external justification, in turn, has some implications for the weight that will flow from a blog article… It is initially likely to be low, and if the article “sticks” in the results because it helps users, then it gains some kind of value. Otherwise, the weight of the page will remain low – though, strictly, understanding that evolution of the weight of the article requires looking at the impact of links from articles in a blog. Another day, perhaps ;)
I hope you found this close look at search results amusing, if not educational.







Kim Clink wrote,
Great article Jeremy!
Link | May 18th, 2010 at 4:03 pm
Jeremy Chatfield wrote,
Thanks, Kim – from a prolific blogger like yourself, over on AdWords Help Experts, that’s a real compliment!
Link | May 19th, 2010 at 9:36 am