I’ve started another burst of postings about web server log file analysis and what it tells search engine optimisers about search engine spiders. Web spider behaviour often lies behind issues that I find on other blogs. For example, Dave Naylor has a couple of recent articles that are interesting. A good one to read is about using the “motion charts” in Google Analytics to find opportunities. But there’s an odder one about Anchor Text. Some of that article confirms things Matt Cutts has written about – the first link being the one that carries anchor text value, for example, or how anchor text interacts with nofollow – and some of it is a delayed echo of Rand Fishkin’s recent article on Anchor Text.
Apart from the validation of Matt Cutts’s statements, there’s one result that appears blindingly obvious. Malformed URLs don’t pass anchor text – and by implication, weight. In the context of the example in the article, adding a space to a URL in the anchor destroys the value. Googlebot changes spaces (which aren’t valid characters in a URL) into “%20” escape sequences. In Dave Naylor’s article, that means that Googlebot will do a DNS lookup for a domain that doesn’t and can’t exist – spaces are not allowed in domain names. If the URL in the anchor’s href had been a fully pathed URL, then the space would land at the end of the path and be converted to a “%20”.
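To make the two cases concrete, here is a minimal sketch in Python. It only approximates the escaping step with the standard library's `urllib.parse.quote`; it is not Googlebot's actual code, and `www.example.com`, `escape_href` and `where_space_lands` are hypothetical names used purely for illustration.

```python
from urllib.parse import quote, urlsplit

# Characters that are legal URL delimiters and should be left untouched.
# Anything else that is not a letter, digit or one of "_.-~" gets percent-encoded.
URL_DELIMITERS = ":/?#[]@!$&'()*+,;=%"


def escape_href(raw_href):
    """Percent-encode invalid characters (such as a space) in an href value."""
    return quote(raw_href, safe=URL_DELIMITERS)


def where_space_lands(raw_href):
    """Report whether an escaped space ends up in the host name or in the path."""
    parts = urlsplit(escape_href(raw_href))
    if "%20" in parts.netloc:
        return "host"   # no such domain can exist, so DNS lookup fails
    if "%20" in parts.path:
        return "path"   # the request reaches the server, typically as a 404
    return "none"


# Bare domain with a trailing space: the escape lands in the host name.
print(escape_href("http://www.example.com "))            # http://www.example.com%20
print(where_space_lands("http://www.example.com "))      # host

# Fully pathed URL with a trailing space: the escape lands in the path.
print(escape_href("http://www.example.com/page.html "))  # http://www.example.com/page.html%20
print(where_space_lands("http://www.example.com/page.html "))  # path
```

The distinction matters for log analysis: only the "path" case can ever produce a line in the target site's web server log.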
That full URL, with an appended “%20”, won’t be found on the site. It should appear, at some point, in web server log files as a 404 for a Googlebot visit. 404s don’t pass weight. So why the surprise that a malformed URL would fail?
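As a rough sketch of what to look for, assuming the logs are in Apache's combined log format, the following Python picks out Googlebot requests that were answered with a 404 and whose path ends in “%20”. The sample log lines, IP addresses and the function name are invented for illustration, and matching on the user agent string alone is an assumption: user agents can be spoofed, so a real audit would also verify the requesting IP.

```python
import re

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_404s_with_trailing_space(lines):
    """Return the requested paths of Googlebot 404s that end in an escaped space."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if (
            match["status"] == "404"
            and "Googlebot" in match["agent"]
            and match["path"].endswith("%20")
        ):
            hits.append(match["path"])
    return hits


# Hypothetical sample lines (documentation-range IP addresses, made-up paths).
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Apr/2010:08:58:00 +0000] "GET /page.html%20 HTTP/1.1" 404 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [10/Apr/2010:08:59:00 +0000] "GET /page.html HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.77 - - [10/Apr/2010:09:00:00 +0000] "GET /other.html%20 HTTP/1.1" 404 512 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)"',
]

print(googlebot_404s_with_trailing_space(SAMPLE_LOG))  # ['/page.html%20']
```

Each path this turns up is a candidate for a malformed inbound link that is being crawled but is passing no value.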
I think the real point, not cleanly spelled out in the article, is that web browsers don’t parse pages the way that search engine spiders parse pages. A browser will cope with the embedded space. That ability of a browser to infer the useful thing to do doesn’t make the space into a valid character in URLs – not without being escaped, anyway. And the consequence of appending the space will be that a web spider makes a request for a resource that will usually be 404’ed, unless the administrator has used Apache mod_speling or an equivalent typo-correction tool (which should yield a 301 redirect to the correct resource).
Attempting to infer the SEO value of browser-interpreted behaviour, without understanding Googlebot behaviour, will produce confusing and misleading results.


Bill Slawski wrote,
Hi Jeremy,
On the “first link” issue, there are some problems with the sources of information. The video from Matt Cutts doesn’t actually answer the question that was asked. I believe Matt may have answered the question elsewhere, stating that the answer was more complicated than whether or not the anchor text from the first link was what counted, and that if “the anchor text was the same” in both links, they would likely drop one of the links.
The experiment from SEOmoz doesn’t take into account other possibilities that may skew the result of their testing.
For instance, if Google is using the Phrase-Based indexing process described in the Anna Lynn Patterson patent filings, then that could influence how much weight from anchor text might be passed along by a specific link. See the section labeled “Document Annotation for Improved Ranking,” in Phrase-based indexing in an information retrieval system.
There are also other possibilities that may influence the weight given to anchor text (if any) when more than one link to the same URL appears upon a page, and Google creates an Anchor Map for the URLs on the page crawled. The idea of Google using a “first link” approach is too simplistic.
Regarding the space issue, it is likely that Google is escaping the additional space in the URL with a %20, and that means that, for the example shown in David Naylor’s post, Google’s request would never reach the web server or its log files. Instead of the domain being “http://www.davidnaylor.co.uk” it would be “http://www.davidnaylor.co.uk%20”, which isn’t a valid domain.
Link | April 10th, 2010 at 8:58 am
Jeremy Chatfield wrote,
Hi Bill – thanks.
I’d intended this article as a quick note, mostly surrounding the “web server log files and robot behaviour” theme I’m working on. I’ll make *that* more explicit in the article. :)
I’m entirely with you on the added space issue. Dave Naylor’s specific example would lead to a failed DNS lookup, so there could be no record in web server log files of an attempt to follow the link. A pathed URL with an appended space would lead to a 404 being recorded on the web server for that domain. I’ll make that clearer in the article. Good catch.
Yes, the SEOmoz article misses out other cases. Much as the Dave Naylor article does – for example, the impact of using meta-refreshes on target pages instead of 301 redirects. I’ve also seen examples of duplicate anchor text being used in main navigation with different HREFs, resulting in weight being split between two resources – making the home page more authoritative for the anchor text than the two other pages specifically linked.
And, yes – in my haste to push this out, I inserted the wrong link to Matt on the topic of first links. I know the article that I meant. I’ll find the right reference shortly and embed that instead of the inappropriate YouTube link; I didn’t tag the article well in my delicious bookmark. :(
Link | April 10th, 2010 at 9:16 am
Bill Slawski wrote,
Thank you, Jeremy,
It was an interesting note, and it sounds like it should be a very interesting article. I’ll definitely be looking forward to it.
Matt’s statement about first links may have been in a comment on someone’s blog rather than a video or article that he wrote.
Link | April 10th, 2010 at 1:55 pm
Neales wrote,
I find the impact of using meta-refreshes on target pages instead of 301 redirects is a debatable issue in itself.
[[Edited to change the user name from a keyword to something slightly more sane]]
Link | April 25th, 2010 at 1:20 pm
Knobs wrote,
I think Google must ping the owner of the malformed URL and get it corrected rather than its bot adding something on its own.
Link | April 27th, 2010 at 4:50 am
Jeremy Chatfield wrote,
@Neales – debatable? In what way? Google will show a Title and a Description for a meta-refreshed page and doesn’t appear to transfer weight to the redirected resource. That’s completely different from a 301’ed resource. The debate, I suppose, is about what was intended, and which is the right technique to achieve that effect?
However, since you’re posting here from India, on behalf of a UK company, I suspect that you are just a spammer. Shame. The discussion might have been interesting. Except I’m unlikely to put much credence onto someone leaving spammy comments in a nofollowed UGC section. Just destroys my confidence in your competence.
Link | April 27th, 2010 at 6:10 am
Jeremy Chatfield wrote,
@Knobs – the malformed URL in the example above was on the blog of the person who malformed the URL. So, no, Google’s not going around telling blog owners that they have the wrong URL.
However, web server log files, *when they can*, do reveal that Google attempts to fetch malformed URLs. The log files can’t, of course, tell you when someone has referred to a domain that doesn’t exist, like the “.com ” (dot com space) domain. There is no such domain, and DNS won’t resolve that request (or at least, none of the DNS resolvers that I know of will resolve it).
Link | April 27th, 2010 at 6:13 am