<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Merjis Internet Marketing Blog &#187; spiders</title>
	<atom:link href="http://blog.merjis.com/category/spiders/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.merjis.com</link>
	<description>Effective Internet Marketing Strategy and Tactics Through Test</description>
	<lastBuildDate>Thu, 12 Jan 2012 09:18:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>SEO: Close Reading Of Search Results</title>
		<link>http://blog.merjis.com/2010/05/17/seo-close-reading-of-search-results/</link>
		<comments>http://blog.merjis.com/2010/05/17/seo-close-reading-of-search-results/#comments</comments>
		<pubDate>Mon, 17 May 2010 12:12:01 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spiders]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/?p=400</guid>
		<description><![CDATA[Looking closely at Google&#8217;s search results can be informative &#8211; at least, if you take some inductive leaps, and apply knowledge learned in other activities. Take a look at the graphic, showing these early April 2010 search results for Matt Cutts&#8217; web site. Notice that the same articles appear several times, with slightly different URLs? [...]]]></description>
			<content:encoded><![CDATA[<p>Looking closely at Google&#8217;s search results can be informative &#8211; at least, if you take some inductive leaps, and apply knowledge learned in other activities. Take a look at the graphic, showing these early April 2010 search results for Matt Cutts&#8217; web site. Notice that the same articles appear several times, with slightly different URLs? Now look further down the results. See any more duplications? For example, that mini-review of the iPad. I can&#8217;t see that article listed anywhere else in the search results. </p>
<div id="attachment_401" class="wp-caption alignnone" style="width: 610px"><a href="http://blog.merjis.com/wp-content/uploads/2010/04/site_mattcutts.com-Google-Search.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/site_mattcutts.com-Google-Search.gif" alt="Search results for Matt Cutts site, for the last year, sorted by date not relevance" title="site_mattcutts.com - Google Search" width="600" height="545" class="size-full wp-image-401" /></a><p class="wp-caption-text">Notice that recent articles are sometimes shown twice or more, with different URLs</p></div>
<p>Why are the articles shown twice or more? Because they have different URLs. They are the same page content, but on different paths. Duplicates. Notice that Matt&#8217;s blog is not penalised for these duplicates &#8211; because they aren&#8217;t what Google considers to be duplicate content. They are the same content, reached by different paths, and in this case the different paths are tracking parameters for Google Analytics. It&#8217;d be more than a little annoying if Google handled &#8220;duplicate&#8221; pages, caused by using their own tracking parameters, as if they were different resources!</p>
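<p>To make the idea of &#8220;same content, different path&#8221; concrete, here is a minimal Python sketch that collapses tagged URLs onto one form. It assumes the tracking parameters are the Google Analytics campaign tags, which start with &#8220;utm_&#8221;; the function name and the prefix list are illustrative, and this is not a description of how Google itself does it.</p>

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: Google Analytics campaign tags all start with "utm_".
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Return the URL with tracking query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter that is not a tracking tag.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

<p>Two URLs that differ only in their tracking tags come out identical, which is the sense in which they are one resource reached by several paths.</p>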
<p>Why aren&#8217;t other articles on the blog shown twice or more times? What magic makes it that two day old articles get three showings, but articles five days old and older show just once?</p>
<p>Interesting, isn&#8217;t it?</p>
<p>Now, instead of sorting by date order, sort by relevance. Let&#8217;s look for those duplications again:<br />
<div id="attachment_402" class="wp-caption alignnone" style="width: 610px"><a href="http://blog.merjis.com/wp-content/uploads/2010/04/site_mattcutts.com-Google-Search-2.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/site_mattcutts.com-Google-Search-2.gif" alt="Search for the site of matt cutts, look at the last page of results to see the duplicates" title="site_mattcutts.com - Google Search-2" width="600" height="666" class="size-full wp-image-402" /></a><p class="wp-caption-text">The duplicate pages, with different URLs appear lowest for relevance.</p></div></p>
<p>So these duplicate pages of the last few days, appear as the lowest relevance pages in the last year of articles on Matt&#8217;s site. Yet they have the same text as the main article&#8230; or do they? Might there be something magic that the cache shows us?</p>
<p><a href="http://blog.merjis.com/wp-content/uploads/2010/04/Things-to-do-in-Japan-and-Thailand.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/Things-to-do-in-Japan-and-Thailand-300x235.gif" alt="" title="Things to do in Japan and Thailand?" width="600" height="470" class="alignnone size-medium wp-image-403" /></a></p>
<p><a href="http://blog.merjis.com/wp-content/uploads/2010/04/Things-to-do-in-Japan-and-Thailand-2.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/Things-to-do-in-Japan-and-Thailand-2-300x200.gif" alt="" title="Things to do in Japan and Thailand?-2" width="600" height="400" class="alignnone size-medium wp-image-406" /></a></p>
<p>The cached versions show us that they were taken up to a week previously &#8211; and reveal that these were found from Feedburner. <i>Google is taking different URLs, determined by the tracking tags, and including them in results.</i> The weight for the tagged pages is lower &#8211; Matt&#8217;s blog doesn&#8217;t refer to these URLs, and nor does anything else that Googlebot will crawl. So these additional pages rank lower, despite having the same content &#8211; the difference lies in the backlinks, and perhaps other factors. </p>
<p>But if there are no external backlinks to these articles, what makes them appear in results, at all, in any position? This, I think, is where we drag in another factor that seems to apply, especially to blogs. Trust. If a blog is well trusted by Google, then articles posted will tend to rank highly in search results, in as little as a few minutes for highly relevant searches (relevant, that is, to what Google thinks the article is about, which mostly means the title). Amongst the two hundred-or-so factors that Google is looking at, is whether they trust your results are likely to satisfy searchers. If your articles have enough hits from satisfied (long reading) users, you leap up the rankings within minutes of posting. Less trusted, lower weight postings won&#8217;t appear on page one, if they appear at all. That trust is extended to articles with no backlinks &#8211; these pages appear in results because they&#8217;re coming from a trusted site. </p>
<h3>Why do these duplicate pages disappear, and when, and what does that tell us?</h3>
<p>At some point in the last week, Google has removed the extra results for other page references with additional parameters. If we look again next week, we&#8217;ll see that the extra results for the current recent articles have also disappeared, but if Matt makes more posts, we may see those articles with extra tracking parameters for a brief period. Why do these &#8220;duplicate&#8221; pages disappear?</p>
<p>Well, one reason is that Google detests spam. A bad page of search results would be a page that contained different sites with the same article &#8211; because that wouldn&#8217;t reflect the diversity of opinion and solution on the web. Google would prefer to have ten different answers to the search query, than one answer on ten highly ranked sites. There&#8217;s a conscious effort at Google to compare articles. </p>
<p>How does Google identify that? Well, we can take some guesses by looking at the search results. Notice that bit that says &#8220;cached&#8221;? Taking several shots of a page (cached pages), over time, lets Google see that page content is evolving. User generated content accretes to a page &#8211; and that&#8217;s visible in successive cached snapshots. </p>
<p>Google can therefore see which parts of a page are static and which are likely to be UGC, <i>even without any effort to understand the page structure</i>. Matt&#8217;s articles get a lot of commentary, so it&#8217;d be pretty easy for Google to determine that a common core of the page is unchanging &#8211; and identical. How easy? Well, there have been tools to perform that kind of textual analysis with computers since at least the 1970s to my certain knowledge &#8211; I used them back then. More recent approaches compare how well samples compress together, using the highly optimised algorithms for good data compression &#8211; and Google has smart people who probably do more complex stuff than that. </p>
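<p>One well-known form of the compression comparison is the normalised compression distance. The sketch below is only an illustration of that general technique using Python&#8217;s standard zlib module; the function names are mine, and nothing here is claimed to be what Google runs.</p>

```python
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(a, b):
    """Normalised compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they have little in common.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

<p>Two snapshots of the same article with different comments appended should score much closer to 0 than two unrelated pages. Note that zlib only looks back about 32KB, so this sketch is only meaningful for fairly short pages.</p>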
<h2>Canonical Link References</h2>
<p>One more datum to collect! Canonical Link References &#8211; does Matt&#8217;s blog use the canonical link ref to make sure that Google knows the best URL for this article? </p>
<p><a href="http://blog.merjis.com/wp-content/uploads/2010/04/Source-of-http___www.mattcutts.com_blog_site-speed_.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/Source-of-http___www.mattcutts.com_blog_site-speed_.gif" alt="" title="Source of http___www.mattcutts.com_blog_site-speed_" width="600" height="450" class="alignnone size-full wp-image-409" /></a></p>
<p>The answer is very much affirmative &#8211; Matt so much likes the canonical link reference, it is added twice in the header! Hmm &#8211; some problem with conflicting plugins, perhaps? I don&#8217;t think this duplication of the canonical link reference is intentional. Fortunately they are the same, or we&#8217;d get into some discussions about whether Google believes the first or second canonical link reference! </p>
<p>What *should* the canonical link reference do for Google, when we see the variant forms with tracking tags? It should tell Google that the preferred form is the one without the tracking tags. We should end up with just the preferred form showing in search results. And that&#8217;s what we see &#8211; but the canonical link reference isn&#8217;t the only way that Google looks for probably duplicated data. You can tell, if you look for other blogs that use FeedBurner, but that don&#8217;t use canonical link references. Those blogs still get deduplicated listings &#8211; showing that other components to deduplicate are still working, but still take days to do so. </p>
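<p>Checking a page for canonical link references, including the accidental duplicate seen above, is easy to script. This is a minimal sketch using Python&#8217;s standard html.parser module; the class and function names are made up for illustration.</p>

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every link element marked rel="canonical"."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical":
            self.canonicals.append(attributes.get("href"))


def canonical_links(html):
    """Return all canonical hrefs found in the page source, in order."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals
```

<p>A result with more than one entry points at the kind of plugin conflict guessed at above; if the entries differ from each other, the page is giving search engines conflicting advice.</p>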
<h2>What does the transitory presence of these pages tell us? </h2>
<p>Thinking solely about what Matt&#8217;s results are showing us &#8211; not taking any evidence from other experiments and tests into account&#8230;</p>
<p>I think the presence of these pages in the search results says that Google is reading FeedBurner feeds for Matt&#8217;s blog, and getting tagged data, tagged for Google Analytics. I&#8217;m guessing that the tags are not set as ignored in Matt&#8217;s Webmaster Console, so Google sees the tagged pages as different. Because users are not linking to them, and the blog itself doesn&#8217;t refer to them, the usage eventually dies out &#8211; where &#8220;eventually&#8221; means less than a week. And the canonical link reference probably also helps. </p>
<p>Note that the alternate pages *are* present. Not highly ranked, but present. So&#8230; the content is the same, the server is the same, but the links to these pages are different; the untagged page is referenced within the blog (by any category or blog tag and the archive), and after a day or so, there are probably some links to these articles, all probably pointing to the untagged page. So that tells us that backlinks are important &#8211; or there&#8217;d be no reason why these tagged pages shouldn&#8217;t rank as highly &#8211; the user experience is likely to be the same, the content was (at the time of first snapshot/cache) the same. So if backlinks within Matt&#8217;s blog weren&#8217;t important, then the pages should rank with equal weight&#8230; and they clearly don&#8217;t. </p>
<p>This observation also says something important to us &#8211; if we watch our web server log files carefully! You may have read some of my other articles here about web server log file analysis for search engine optimisation &#8211; I think the web server log files tell us important things about how Google perceives our sites. If we see Googlebot requesting a tagged resource, then that tells us something about what resources we have out there, our state of canonicalisation, and Google&#8217;s speed of change. We want to know the speed of change, because when we&#8217;re trying to improve performance, we&#8217;ll want to see when we might expect to see results, at earliest. </p>
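<p>As a rough sketch of that kind of check, the Python below pulls out Googlebot requests for tagged URLs from a web server log. It assumes the Apache &#8220;combined&#8221; log format and Google Analytics &#8220;utm_&#8221; tags, and it trusts the user agent string, which can be spoofed; the function name is hypothetical.</p>

```python
import re

# Assumption: Apache/NCSA "combined" log format. The groups are
# host, timestamp, method, path, status and user agent.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)


def googlebot_tagged_requests(lines):
    """Yield (timestamp, path) for Googlebot requests whose URL carries utm_ tags."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        _host, timestamp, _method, path, _status, agent = match.groups()
        if "Googlebot" in agent and "utm_" in path:
            yield timestamp, path
```

<p>Feeding it an open log file, one line at a time, gives the dates on which Googlebot asked for tagged forms of a page; comparing those dates with when the tagged forms drop out of results gives an estimate of the speed of change.</p>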
<p>We don&#8217;t, unfortunately, have access to Matt&#8217;s web server log files&#8230; but you have access to yours&#8230; What do they tell you, under similar conditions? I may return to this topic, as I think the research is pretty interesting for what it tells us about how Google goes about making decisions. </p>
<h2>Summary</h2>
<p>Google sees the same blog article under different URLs over time. Google collapses the references to a single page URL &#8211; probably using intrinsic information (page content) and supplied information (canonical link references), and backlinks, and the Webmaster Tools mechanism to instruct Google to ignore certain parameters. </p>
<p>URLs for alternate page presentations get into the results for trusted sites, even if these URLs have no weighty backlinks. Not highly ranked, but they do appear. They appear quickly and disappear within a few days. This suggests that whatever it is at Google that evaluates, takes a few days to do so. </p>
<p>I don&#8217;t think this points to content being the key, but to content with backlinks and user-preferred results, as being the most important. And the presence of these duplicates is a hint that a site is trusted. Not a great hint, as if you don&#8217;t use FeedBurner or another similar tagging resource, you won&#8217;t see this effect. </p>
<p>I&#8217;ve not considered the backlinks to Matt&#8217;s blog in detail&#8230; there are a lot and the signal is messy. I&#8217;ve left that analysis out of this article. Besides, discovering the timeline of backlinks is itself a pretty tricky exercise, unless you, as the experimenter, control them; that&#8217;s definitely not the case for Matt&#8217;s blog!</p>
<p>What&#8217;s especially interesting is that an article, without significant backlinks to it yet, can appear high in search results, but that this tends to be for the canonical representation of the page, even though Google is probably learning of the page via FeedBurner/pingomatic notification (IOW, likely to be tagged). So the first mechanism that notifies Google of the URL, is probably *NOT* the canonical form &#8211; yet the canonical form is listed highest, within minutes. That suggests some fast processes at Google for identifying and evaluating a page&#8217;s relevance, and then some slower processes to determine whether the page can be justified for continuing presence. </p>
<p>And that behaviour of high ranking without external justification, in turn, has some implications for the weight that will flow from a blog article&#8230; It is initially likely to be low, and if the article &#8220;sticks&#8221; in the results because it helps users, then it gains some kind of value. Otherwise, the weight of the page will remain low &#8211; though, strictly, understanding that evolution of the weight of the article requires looking at the impact of links from articles in a blog. Another day, perhaps ;)</p>
<p>I hope you found this close look at search results amusing, if not educational. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2010/05/17/seo-close-reading-of-search-results/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Non-news: Malformed URLs don&#8217;t pass Anchor Text.</title>
		<link>http://blog.merjis.com/2010/04/09/non-news-malformed-urls-dont-pass-anchor-text/</link>
		<comments>http://blog.merjis.com/2010/04/09/non-news-malformed-urls-dont-pass-anchor-text/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 22:47:08 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[SEO]]></category>
		<category><![CDATA[spiders]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/?p=385</guid>
		<description><![CDATA[I&#8217;ve started another burst of postings about web server log file analysis and what it tells search engine optimisers about search engine spiders. Web spider behaviour often lies behind issues that I find on other blogs. For example, Dave Naylor has a couple of recent articles that are interesting. A good one to read is [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve started another burst of postings about web server log file analysis and what it tells search engine optimisers about search engine spiders. Web spider behaviour often lies behind issues that I find on other blogs. For example, Dave Naylor has a couple of recent articles that are interesting. A good one to read is about <a href="http://www.davidnaylor.co.uk/increasing-sales-with-google-analytics-motion-charts.html">using the &#8220;motion charts&#8221; in Google Analytics to find opportunities</a>. But there&#8217;s an odder one about Anchor Text. Some of that article is confirmation of stuff Matt Cutts has written about &#8211; the first link being the one that carries anchor text value, for example, or <a href="http://www.mattcutts.com/blog/pagerank-sculpting/">anchor text and nofollow</a>, or delayed echoes of <a href="http://www.seomoz.org/blog/results-of-google-experimentation-only-the-first-anchor-text-counts">Rand Fishkin&#8217;s recent article on Anchor Text</a>.</p>
<p>Apart from the validation of Matt Cutts&#8217; statements, there&#8217;s one result that appears blindingly obvious. Malformed URLs don&#8217;t pass anchor text &#8211; and by implication, weight. In the context of the example in the article, adding a space to a URL in the anchor destroys the value. Googlebot changes spaces (which aren&#8217;t valid characters in a URL) into &#8220;%20&#8221; symbols. In Dave Naylor&#8217;s article, that means that the Googlebot will do a DNS lookup for a domain that doesn&#8217;t and can&#8217;t exist &#8211; spaces are not allowed in domain names. If the URL in the anchor&#8217;s href had been a fully pathed URL, then a space would be added to the end and converted to a &#8220;%20&#8221;. </p>
<p>That full URL, with an appended &#8220;%20&#8221;, won&#8217;t be found on the site. It should appear, at some point, in web server log files as a 404 for a Googlebot visit. 404s don&#8217;t pass weight. So why the surprise that a malformed URL would fail? </p>
<p>I think the real point, not cleanly spelled out in the article, is that <b><i>web browsers don&#8217;t parse pages the way that search engine spiders parse pages</i></b>. A browser will cope with the embedded space. That ability of a browser to infer the useful thing to do <a href="http://www.w3.org/Addressing/URL/url-spec.txt">doesn&#8217;t make the space into a valid character in URLs</a> &#8211; not without being escaped, anyway. And the consequence of appending the space will be that a web spider makes a request for a resource that will usually be 404&#8217;ed, unless the administrator has used <a href="http://httpd.apache.org/docs/1.3/mod/mod_speling.html">Apache</a> <a href="http://httpd.apache.org/docs/2.0/mod/mod_speling.html">mod_speling</a> or an equivalent typo-correction tool (which should yield a 301 redirect to the correct resource).</p>
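<p>The difference between the two readings of the same href can be shown in a few lines of Python. This is only an illustration: the function names are mine, the spider behaviour is modelled as plain percent-encoding of the space as described above, and the browser behaviour is modelled as trimming surrounding whitespace, which is what browsers typically do.</p>

```python
from urllib.parse import quote


def spider_view(href):
    """Model of a spider that escapes invalid characters instead of dropping them."""
    return quote(href, safe=":/")


def browser_view(href):
    """Model of a browser that silently trims whitespace around the href."""
    return href.strip()
```

<p>Given a fully pathed URL with a trailing space, spider_view returns the URL with &#8220;%20&#8221; appended, which is a different resource and probably a 404, while browser_view returns the URL the author meant.</p>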
<p>Attempting to infer the SEO value of browser interpreted behaviour, without understanding Googlebot behaviour, will create confusing and misleading problems. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2010/04/09/non-news-malformed-urls-dont-pass-anchor-text/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Googlebot and Search Visitors</title>
		<link>http://blog.merjis.com/2010/04/08/googlebot-and-search-visitors/</link>
		<comments>http://blog.merjis.com/2010/04/08/googlebot-and-search-visitors/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 14:28:25 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spiders]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/?p=362</guid>
		<description><![CDATA[I&#8217;ve been interested in the behaviour of Googlebot, the robot that Google uses to crawl the web, for years. It&#8217;s a topic that seems largely unaddressed by search engine optimisers, yet the behaviour of Googlebot should be extremely important. After all, uncrawled sites tend to have problems with ranking many pages &#8211; the best you [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been interested in the behaviour of Googlebot, the robot that Google uses to crawl the web, for years. It&#8217;s a topic that seems largely unaddressed by search engine optimisers, yet the behaviour of Googlebot should be extremely important. After all, uncrawled sites tend to have problems with ranking many pages &#8211; the best you can get is to have pages ranked that other people are pointing to, which, for most businesses, tends to be just the home page. </p>
<p>I&#8217;ve fairly recently had discussions with a few web site managers who&#8217;d made what appears to me to be the most peculiar decision &#8211; to block Googlebot because of the traffic impact. This resonated with a previous short article that I&#8217;d posted, about a problem identified by a Google staffer who was running his own blog. He&#8217;d seen his <a href="http://blog.merjis.com/2010/02/25/google-hates-me-im-being-penalised-or-not/">blog dropped from search results</a> and was looking for why that might be happening. </p>
<p>There&#8217;s certainly a potential problem &#8211; low bandwidth sites may suffer if Googlebot consumes the available bandwidth. But if you don&#8217;t have Googlebot crawling, then how are you going to appear, anyway? </p>
<p>You could use the Webmaster Tools to request that Google slows the crawl for your site. This should still result in having the crawling and indexing, and minimal damage to the traffic. But just disabling the crawl, by using robots.txt to block all crawling, or to block crawling of large sections of the site that should have user interaction, is probably a mistake.</p>
<p>There is also the legitimate concern that Googlebot&#8217;s visits might be draining server resources at peak traffic periods. That&#8217;s moderately difficult for non-technical site owners to work out. Google Analytics (and the other JavaScript page bug based web analytics packages, such as CoreMetrics, Omniture, Webtrends, etc) measure user visits, not Googlebot and other bot visits. Verifying that Googlebot isn&#8217;t interfering with and slowing down visitors, is pretty much impossible to understand without going to web server log file analysis.</p>
<h2>Web Server Log File Analysis</h2>
<p>I like web server log files. There are things I can find out from them, in a few hours, that I simply can&#8217;t find from Google Analytics, CoreMetrics and Omniture. Look at this graph, for example. I&#8217;ve taken web server log files from a UK-targeted business, and extracted Google-inspired visits and Googlebot visits, by hour. </p>
<p><a href="http://blog.merjis.com/wp-content/uploads/2010/04/Googlebot-crawl-rate.gif"><img src="http://blog.merjis.com/wp-content/uploads/2010/04/Googlebot-crawl-rate.gif" alt="Graph shows that Googlebot is more active when visitors aren&#039;t present" title="Googlebot crawl rate" width="600" height="506" class="alignnone size-medium wp-image-363" /></a></p>
<p>The graph shows that Googlebot is busiest when users are less present. That is, when Google can see visitors coming to the site, the crawl volume is reduced. </p>
<p>This pattern of making Googlebot most active when the site visits are least active, seems to be the most common pattern that I can see in clients&#8217; web server logfiles. It makes a lot of sense for Google, too: </p>
<ul>
<li>Continuing visits by Googlebot allow them to check that the site is still working (preventing Google from delivering users to a 404&#8217;ed page)</li>
<li>Site performance under load can be monitored (helping Googlebot tune crawling rates, and verifying that users are getting responses from the site, mostly)</li>
</ul>
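<p>The hourly comparison behind a graph like the one above can be sketched in Python. This assumes the Apache &#8220;combined&#8221; log format, identifies Googlebot by its user agent string (which can be spoofed), and treats any referrer containing &#8220;google.&#8221; as a Google-inspired visit; all three are simplifying assumptions, and the function name is hypothetical.</p>

```python
import re
from collections import Counter

# Assumption: Apache/NCSA "combined" log format. The groups are
# hour of day, referrer and user agent.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^:\]]+:(\d{2}):[^\]]*\] "[^"]*" \d{3} \S+ "([^"]*)" "([^"]*)"'
)


def hourly_counts(lines):
    """Return (googlebot_hits, google_referred_hits) as Counters keyed by hour."""
    bot_hits = Counter()
    search_hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        hour, referrer, agent = match.groups()
        if "Googlebot" in agent:
            bot_hits[int(hour)] += 1
        elif "google." in referrer:
            search_hits[int(hour)] += 1
    return bot_hits, search_hits
```

<p>Plotting the two counters side by side, hour by hour, shows whether the crawl really does back off when search visitors are arriving on your own site.</p>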
<h2>Summary</h2>
<p>Googlebot seems to be quite smart about when it visits sites. The more users that are being sent to a site in a given hour, the lower the rate at which it crawls. So Googlebot should never get in the way of visitors, under normal conditions.</p>
<p>Simply disabling Googlebot looks like a weak way to go.</p>
<p>Following <a href="http://googlewebmastercentral.blogspot.com/2010/03/working-with-multi-regional-websites.html">suggestions from the Google Webmaster Blog</a>, if you have areas of the website that change at different speeds, you might want to validate multiple webmaster consoles for different sections of the site. That would allow setting different crawl rates. I&#8217;ve not tried this, yet&#8230; I don&#8217;t have a client for whom I want to restrict crawling speeds!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2010/04/08/googlebot-and-search-visitors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Hates Me, I&#8217;m Being Penalised. Or Not.</title>
		<link>http://blog.merjis.com/2010/02/25/google-hates-me-im-being-penalised-or-not/</link>
		<comments>http://blog.merjis.com/2010/02/25/google-hates-me-im-being-penalised-or-not/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 15:01:55 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spiders]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/?p=329</guid>
		<description><![CDATA[Great story from a Google staffer about how his site started to disappear from rankings. I&#8217;ve seen clients lose *huge* chunks of traffic for very similar reasons. Sometimes the reason you don&#8217;t show isn&#8217;t for the obvious search engine optimisation reasons or that you&#8217;ve lost Google&#8217;s love. Sometimes there&#8217;s a simple technological explanation&#8230;]]></description>
			<content:encoded><![CDATA[<p>Great story from a <a href="http://www.jasonmorrison.net/content/2010/how-my-site-disappeared-from-google-search/">Google staffer about how his site started to disappear from rankings</a>. I&#8217;ve seen clients lose *huge* chunks of traffic for very similar reasons. Sometimes the reason you don&#8217;t show isn&#8217;t for the obvious search engine optimisation reasons or that you&#8217;ve lost Google&#8217;s love. Sometimes there&#8217;s a simple technological explanation&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2010/02/25/google-hates-me-im-being-penalised-or-not/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spiders, IIS Caseless, Cookieless and Search Engine Indexes.</title>
		<link>http://blog.merjis.com/2008/09/15/spiders-iis-caseless-cookieless-and-search-engine-indexes/</link>
		<comments>http://blog.merjis.com/2008/09/15/spiders-iis-caseless-cookieless-and-search-engine-indexes/#comments</comments>
		<pubDate>Mon, 15 Sep 2008 09:47:22 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[SEO]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/2008/09/15/spiders-iis-caseless-cookieless-and-search-engine-indexes/</guid>
		<description><![CDATA[Digging into IIS web server log files is quite interesting. I&#8217;ve developed a number of in-house tools over the years that help understanding why web spiders go where they do. I&#8217;ve been reworking them from an Apache dominated view to include some of the things that IIS does. You can see requests like &#8220;GET /(J(1)S(4dab&#8230;..))/&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>Digging into IIS web server log files is quite interesting. I&#8217;ve developed a number of in-house tools over the years that help understanding why web spiders go where they do. I&#8217;ve been reworking them from an Apache dominated view to include some of the things that IIS does. </p>
<p>You can see requests like &#8220;GET /(J(1)S(4dab&#8230;..))/&#8221; all over the log files. You can go to Yahoo! Site Explorer and find pages that are indexed with these cookieless mode paths. But I can&#8217;t find any of the big 3 robots (Google, Yahoo and MS) crawling those pages. Perhaps the reason is where the cookieless paths are embedded &#8211; ASP seems to embed these URLs in form requests. </p>
<blockquote><p>method=&#8221;post&#8221; action=&#8221;/(Y(1)T(d4ab&#8230;&#8221;</p></blockquote>
<p>At the moment my best guess is that something else is crawling, capturing the links and leaving them on a third party site, which the SE Spiders are then finding. At some point I&#8217;ll try to find out where that may be. It&#8217;s a lower priority than making sure those references point to the right page. A few odd spiders do seem to crawl the IIS cookieless paths &#8211; things like the &#8220;<a href="http://www.majestic12.co.uk/">Majestic12</a>&#8221; search engine spider.</p>
<p>Why would I want this listing of strange paths stopped? Because some of the pages that it links to are pages that are SEO targets, and this multiple/duplicate listing dilutes the weight given to the real path. Unless the bots are smart enough to know that this is really known as a shorter path name&#8230; But if the spiders are smart enough to avoid crawling them, why are the SE&#8217;s dumb enough to list them? Hmm. An interesting puzzle. Maybe, (shudder), I&#8217;ll have to set up an IIS server and do some experiments. </p>
<h3>Case folding continued</h3>
<p>On the other hand, I do find <a href="http://www.webmasterworld.com/google/3628552.htm">spiders crawling case folded file names</a>. Looking for &#8220;/default.aspx&#8221; and &#8220;/Default.aspx&#8221;, for example. You can see GoogleBot and Slurp happily grabbing exactly the same file under names that are identical because of IIS case folding. </p>
<p>Even worse, you can see some search queries turning up at both forms. That means that even Google can be confused when the same file is known under multiple names, because link naming is not consistent. Worse yet, while *you* may control linking on your own site, you can&#8217;t control it from all external sites. </p>
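<p>Finding out how many case variations are in play is a simple grouping exercise over the request paths in the log files. A minimal Python sketch, with a made-up function name and assuming the paths have already been extracted from the log:</p>

```python
from collections import defaultdict


def case_variants(paths):
    """Group request paths that differ only by letter case.

    Returns a dict mapping the lower-cased path to the sorted list of
    spellings seen, for paths requested under more than one spelling.
    """
    groups = defaultdict(set)
    for path in paths:
        groups[path.lower()].add(path)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

<p>Any path that comes back from this is a candidate for a 301 redirect onto a single preferred spelling.</p>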
<p>I&#8217;m looking at an <a href="http://www.codeplex.com/IIRF">ISAPI Rewriter</a> as the only way to deliver the technological structure for these IIS based clients, and to make sure that <a href="http://www.isapirewrite.com/">the right URL is optimised</a>. </p>
<h3>Case Folding Surprises</h3>
<p>For the last three years, pretty much everyone we&#8217;ve worked with for SEO has used Apache. So all the tools we have are built around Apache, implicitly. It&#8217;s been quite an exercise to rework them to work with IIS. Intellectually, I&#8217;d known that case was important &#8211; I&#8217;ve been using UNIX systems since 1978 or so, and usually have been involved with Windows systems. I was shocked by the number of inbound links and search query references using a wide variety of case changes.</p>
<p>It&#8217;s made me even more determined to check inbound links and use mod_speling on standard Apache installations &#8211; it will easily catch and fix case problems. </p>
<p>I&#8217;m pretty sure that this wouldn&#8217;t be a problem if IIS was built on a case sensitive filesystem. I&#8217;ve been backwards and forwards over the <a href="http://blogs.msdn.com/webmaster/default.aspx">Microsoft SEO presentations on their blog</a> and I can&#8217;t see any real reference to this issue and how to resolve it. Perhaps MSN doesn&#8217;t suffer from it, so they don&#8217;t think they have to care? Maybe I glossed over a presentation or article that does explain the issue and how to resolve it. It seems like a pretty easy way to do some Google Bowling :(</p>
<h3>Summary</h3>
<p>Still no firm conclusions.  IIS Cookieless mode is worrying, but I can&#8217;t find any evidence that it affects the Big 3. </p>
<p>IIS (and probably Mac OS X) case folding is a problem, at least for Google. It looks as though there may be different page ranks involved for different case variations of the same file. Maintaining internal link naming consistency probably helps, but you can&#8217;t stop external sites from linking to case variations and thereby possibly reducing PR from time to time. Either use a case respecting file system (Mac OS X allows that choice), or use one of the ISAPI rewriters to issue a 301 redirect from case variations to the correct file name. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2008/09/15/spiders-iis-caseless-cookieless-and-search-engine-indexes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SEO, IIS case folding filenames, Spiders, Analytics, and Robots.Txt</title>
		<link>http://blog.merjis.com/2008/08/20/seo-iis-case-folding-filenames-spiders-analytics-and-robotstxt/</link>
		<comments>http://blog.merjis.com/2008/08/20/seo-iis-case-folding-filenames-spiders-analytics-and-robotstxt/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 15:32:41 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[microsoft]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spamfighting]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web analytics]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/2008/08/20/seo-iis-case-folding-filenames-spiders-analytics-and-robotstxt/</guid>
		<description><![CDATA[AFAICS, the best way to administer IIS for SEO purposes, seems to be to run screaming from the room and hide under a desk until you are allowed to use Apache. So many of the default behaviours create difficulties for users or SEO. Yes, I&#8217;ve been continuing to dig into web analytics and IIS web [...]]]></description>
			<content:encoded><![CDATA[<p>AFAICS, the best way to administer IIS for SEO purposes, seems to be to run screaming from the room and hide under a desk until you are allowed to use Apache. So many of the default behaviours create difficulties for users or SEO. Yes, I&#8217;ve been continuing to dig into web analytics and IIS web server log files for a couple of clients.  I&#8217;ve now seen this problem, again &#8211; first noticed many, many years ago  (1999 or so, I think) for another company and <b>still</b> not solved by default:</p>
<h3>What is the authoritative name of the web page?</h3>
<p>Assume that your web site uses IIS and has a home page available as &#8220;/index.htm&#8221; You can then use the following as silent synonyms of the home page:</p>
<ul>
<li>/index.htm</li>
<li>/Index.htm</li>
<li>/INDEX.HTM</li>
<li>/InDeX.hTm</li>
</ul>
<h3>Why is finding the same file under multiple names a problem?</h3>
<p>Whether you use a web browser, or a search engine bot crawls your web site, each one of those URLs appears to be a different file, albeit with identical content. If you have optimised the content, then each is a candidate to   be the answer for search users. So the spiders do need to track each file name variant, and check them, to see what has changed. </p>
<p>In each case you get an immediate web server response of &#8220;200&#8221; &#8211; file found. Link love can be spent on a wide variety of paths that lead to the same place &#8211; but the spiders aren&#8217;t told that. There is another way to do this, which does not work, out of the box, on Linux and Apache, but is fairly easy to set up.</p>
<h3>Web Brand Is All About Experience</h3>
<p>If the file system does not fold case &#8211; that is, it treats upper and lower case letters as two distinct things &#8211; then a request for a file with mismatched case delivers a 404 &#8211; File Not Found. Now that&#8217;s a bad user interface experience. Brand is all about experience, so why punish your users with a 404 because they can&#8217;t remember what the capitalisation of your ProDuct (sic) is? </p>
<p>You need to find a way to deliver both the web page that the user wanted to get to, and also let the spiders know that there is one authoritative page &#8211; there may be a lot of different links to get there, but just one resource. </p>
<p>On Merjis.com we use a technique that helps the spider understand that we have one page, and that case changes are a link problem, not a server duplication. If you try to reach:</p>
<ul>
<li><a href="http://merjis.com/contact">http://merjis.com/contact</a></li>
<li><a href="http://merjis.com/Contact">http://merjis.com/Contact</a></li>
<li><a href="http://merjis.com/CoNtAcT">http://merjis.com/CoNtAcT</a></li>
<li><a href="http://merjis.com/CONTACT">http://merjis.com/CONTACT</a></li>
</ul>
<p>then you should get to the same page, the Merjis contact information page. In terms of the user experience, this is just the same user experience as on IIS. But we issue a redirect on all the non-standard forms of the page name and IIS doesn&#8217;t. Spiders can see that only one page for &#8220;contact&#8221; exists on the Merjis site, even if it is accessible through many different URLs. This cuts down redundant crawls, focuses link love on a single page, doesn&#8217;t lose references from typographically challenged links, gets users to the page they want whatever the case of the URL they type, and is generally A Good Thing.</p>
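<p>As a rough illustration of the idea, and not the actual Merjis configuration, here is a minimal Python sketch of the decision a server could make for each request. The <code>CANONICAL_PATHS</code> table and the <code>resolve</code> function are hypothetical names, and the choice of a permanent (301) redirect is an assumption.</p>

```python
from typing import Optional, Tuple

# Hypothetical table of canonical paths, keyed by their case-folded form.
CANONICAL_PATHS = {"/contact": "/contact", "/about": "/about"}


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_target) for a requested path.

    The canonical spelling is served directly (200); any other
    capitalisation of a known path is redirected to the canonical
    spelling (assumed here to be a 301); unknown paths get a 404.
    """
    canonical = CANONICAL_PATHS.get(path.casefold())
    if canonical is None:
        return 404, None
    if path == canonical:
        return 200, None
    return 301, canonical
```

<p>With this sketch, <code>resolve("/CoNtAcT")</code> redirects to <code>/contact</code>, while <code>resolve("/contact")</code> is served directly, so a spider only ever sees one page answering with a 200.</p>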
<h3>How To See A Redirect</h3>
<p>If you don&#8217;t use a tool like &#8220;wget&#8221; or one of the Firefox HTTP inspection tools, your only real clue to our redirection is that whatever you did type in the URL bar, is replaced by our chosen URL for the resource. Between your input and our response, we added a redirect. Spiders will see the redirection, and only index one page and can pour all the link love on that page.</p>
<p>That&#8217;s completely unlike IIS standard behaviour. The default behaviour is to fold uppercase and lowercase. That means you see the URL that you typed. There&#8217;s no information that the file is a single file known by many names. </p>
<p>Spiders can&#8217;t guess that a single file is a single file &#8211; they only know what they are told. They get told what links exist in sitemaps and by other link references across the web. If a spidered site has references to mixed case versions of names, then the spiders will tromp madly off to each alternative case version of exactly the same file.</p>
<h3>SEO Means Never Saying Sorry To Stupid Spiders</h3>
<p>I&#8217;m of the opinion that helping the spiders to find the right information, helps SEO. Sending spiders to a dozen spelling variations of a path, doesn&#8217;t boost rank, unless the spiders are clever. Even without that, sending spiders to crawl redundant pages, when they could more frequently crawl real content, is a waste of the attention from search engines. If spiders were clever, they wouldn&#8217;t crawl redundant paths to the same content, repeatedly, across the whole server&#8230; So give them a hand to get to the right single file that should be taking all the page rank. </p>
<p>This default case folding behaviour means that IIS again contributes to spamming search engine indexes. Not so bad, except that it causes yet another problem.</p>
<h3>My Web Analytics Don&#8217;t Fold Case</h3>
<p>When you are trying to analyse what is happening to users, just one miskeyed filename can result in the analytics giving you multiple paths to, for example, conversion. Typically the JavaScript page bug reports the filename that was accessed &#8211; using the same capitalisation that the web server delivered. Why? Because on a large fraction of the other web servers, case does matter and &#8220;/index&#8221; is a different file from &#8220;/INDEX&#8221;.</p>
<p>So on an IIS delivered file, the same page can be known by a wide range of names that mean the same thing. But analytics packages don&#8217;t (usually) fold case &#8211; so each reference to a different capitalisation adds another meaningless node to journey analysis. </p>
<p>[ N.B. See Chris's comment below. I should quantify the assertion that Analytics don't fold case - of course, if any of them *do*, that's another problem... ]</p>
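<p>If your analytics package does not fold case, one possible workaround is to fold the paths yourself before counting. The sketch below is a hypothetical helper, not part of any analytics product, and it is only appropriate when the server really does treat every capitalisation as the same file.</p>

```python
from collections import Counter


def fold_paths(hits):
    """Count page hits after folding each requested path to one case.

    `hits` is any iterable of requested paths, for example pulled
    from a web server log. Only use this for a case-folding server.
    """
    return Counter(path.casefold() for path in hits)
```

<p>Feeding it the four home page variants listed earlier gives a single entry for &#8220;/index.htm&#8221; instead of four separate nodes.</p>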
<p><ins datetime="2008-08-21T23:50:46+00:00">The following deleted section is a bit rubbish. I failed to properly read and understand the robots.txt spec. I interpreted a line that meant all records in the file, to just mean the User-Agent line. robots.txt allows case folding &#8211; /MEMBERS and /members are identical according to the later spec; the earlier spec only clearly states that the User-Agent field ignores case &#8211; leaving the possibility of ignoring case or respecting it in a Disallow line. </p>
<p>However &#8211; I started this article because I found a client&#8217;s private area on IIS had been crawled and indexed, despite being listed in robots.txt. I still need to describe that investigation &#8211; but this article is long enough.<br />
</ins></p>
<p><del datetime="2008-08-21T23:49:03+00:00">And that&#8217;s not all. Oh no&#8230;</p>
<h3>Google Spiders Content Disallowed In Robots.txt</h3>
<p>If you have parts of the site that you don&#8217;t want to be in the index, you can use <a href="http://www.robotstxt.org/robotstxt.html">robots.txt</a> to exclude those directories or applications.</p>
<p>Except&#8230; you can&#8217;t. Not sensibly, not with IIS out of the box.</p>
<p>Say that you have a directory called &#8220;/members&#8221; and you want only signed in members to see the content. You exclude the spiders with:</p>
<p><code>Disallow: /members</code></p>
<p>However&#8230; case folding&#8230; This directory is also accessible as &#8220;/MEMBERS&#8221; and that <a href="http://www.webmasterworld.com/google/3357250.htm">isn&#8217;t excluded</a> in this customer&#8217;s Robots.txt. So your hidden content is now visible if just one link, somewhere on the internet, or even in your own site, uses a <a href="http://www.webmasterworld.com/forum93/75.htm">different capitalisation</a> from that which has been put in the Robots.txt file.</p>
<p>Is this Google and Yahoo!&#8217;s problem to resolve? IMO, not really. If you choose to use a server that makes the same content available under a range of paths, it is up to you to protect those paths, not for spider developers to guess that you may have shot yourself in the foot.</p>
<p>OTOH, the SEs do themselves no favours for reducing spam in the indexes, and decreasing crawl volumes and bandwidth usage, by failing to recognise that IIS can serve the same page under a multiplicity of case variations. That&#8217;s a different problem &#8211; but solving one would solve the other. If the Server can be detected as using case folding then using a case independent match of Robots.txt paths would be a useful extension. </p>
<p>I can even imagine adding a new directive to Robots.txt to express that pathnames do not respect case.</p>
<p></del></p>
<h3>Other Case Folding Systems</h3>
<p>Well, Apple OS X. It may be my favoured desktop OS, but it has a default FS that is caseless:</p>
<p><code><br />
$ echo boo > goose<br />
$ cat GOOSE<br />
boo<br />
$<br />
</code></p>
<p>I haven&#8217;t tested &#8211; I have no SEO clients with Mac servers &#8211; but I suspect that Mac servers with a default FS run into the same problem of a futile and useless Robots.txt protection.</p>
<p>IIRC, OS/2 aka &#8220;Warp&#8221; was used for some years as a web server and it used a case folding FS &#8211; so if you run one of those ancestral systems, watch out.</p>
<h3>Webmasters</h3>
<p>Your defence? Well, make IIS respect case in queries and do a proper redirect to the actual file. </p>
<p><del datetime="2008-08-21T23:54:22+00:00">That way, spiders could properly use Robots.txt and your hidden content wouldn&#8217;t be accessible. Try asking Microsoft about that configuration. Heh. Here&#8217;s <a href="http://support.microsoft.com/kb/217103">Microsoft&#8217;s page about creating a Robots.txt</a> &#8211; note the discussion about case folding? Oh, there *is* no mention of case folding? Hmm. Well, I think that&#8217;s a lesson in its own right. </p>
<p>Failing any rational advice from Microsoft, you could do a combinatorial madness on Robots.txt:</p>
<p><code><br />
Disallow: /members<br />
Disallow: /Members<br />
Disallow: /mEmbers<br />
...<br />
Disallow: /MEmbers<br />
...<br />
Disallow: /MEMBERS<br />
</code><br />
and so on. It&#8217;s easy. Yeah. Right.</p>
<p>Have I mentioned that I think IIS adds problems, rather than removing them?<br />
</del></p>
<p>It is all so much more complex if you have cookieless mode enabled for your ASP applications. Because the path given to the robots doesn&#8217;t match the path that is denied. Deny &#8220;/secret&#8221; and you get a path that starts &#8220;/(&#8221; and goes on to &#8220;))/secret&#8221;. Combine that with case folding and there is no end to index spamming. </p>
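<p>To make the mismatch concrete, here is a small sketch using the robots.txt parser from the Python standard library, which does simple prefix matching. The bot name and the infix &#8220;(X(1)S(abc))&#8221; are made up for illustration, and real spiders may not match paths in exactly this way.</p>

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that tries to hide "/secret" from every robot.
RULES = ["User-agent: *", "Disallow: /secret"]


def is_allowed(robots_lines, url, agent="ExampleBot"):
    """Report whether a prefix-matching robots.txt parser permits `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    parser.modified()  # mark the rules as loaded so can_fetch evaluates them
    return parser.can_fetch(agent, url)
```

<p>Here <code>is_allowed(RULES, "http://example.com/secret/page")</code> is refused, but the same page requested as <code>http://example.com/(X(1)S(abc))/secret/page</code> is permitted, because the path no longer starts with &#8220;/secret&#8221;.</p>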
<p>And, of course, Robots are the main thing that need to read and respect Robots.txt. </p>
<p>Given a modicum of sense, this whole area could be made a lot simpler for system admins and webmasters. If an IP address and user agent doesn&#8217;t accept cookies, and asks for &#8220;robots.txt&#8221;, it is probably a robot. Stop sending cookieless tracking paths to that IP and UA.</p>
<h3>Sticking In a Reverse Caching Proxy</h3>
<p>If you are a technological sophisticate, then you could insert Apache with a rewriter and mod_speling, to get the benefits of case matching and redirection to the single real file instance. You&#8217;ll possibly see a slight average speed up for users, as unchanged content is delivered direct from the Apache cache. </p>
<p>How to set up and configure one of these cute web servers is beyond the scope of this article, though. </p>
<h3>Scale Of Problems</h3>
<p>It&#8217;s quite nasty for people using hosted IIS, who have no significant control. I have no doubt that the SEs do duplicate detection. Matt Cutts has written that in-site duplication isn&#8217;t too awful a penalty &#8211; probably because they keep spidering IIS sites. OTOH, places like WebMasterWorld have a fair number of webmaster stories about having two or more case variations of the same page in the search engine indexes, with radically different page rank &#8211; and that depending upon what has been spidered recently, the position of the site will change radically.  </p>
<p>There is a difference between ignoring duplicates and sending full credit for an inbound link to the &#8220;master&#8221; version of the duplicates. Some folk seem to think that there&#8217;s no great loss. I&#8217;m pretty conservative about this &#8211; why risk losing the benefits of inbound links, just because of case folding? </p>
<p>Case folding for private content is a problem. I&#8217;ve seen many complaints about Google revealing information concealed by Robots.txt, and I have strong suspicions that in most of these cases, the complainants were running IIS and had <del datetime="2008-08-21T23:54:22+00:00">an undetectably case-mismatched link reference somewhere on the site, or had</del> enabled Cookieless mode for ASP. </p>
<p>I can&#8217;t find any authoritative discussion about this issue, especially in Microsoft&#8217;s Knowledge Base, hence the posting here. Of course it may be that I detest MS so much that I haven&#8217;t spent enough time poking around their resources. That&#8217;d be a quite valid criticism of me and this article. :)</p>
<h3>Other Solutions For Privacy</h3>
<p>Given that Robots.txt is effectively useless faced with both Cookieless mode and case folding, you really, really need to get to grips with your friendly <a href="http://searchengineland.com/070305-204850.php">page-specific metatags &#8220;NOINDEX&#8221;, &#8220;NOFOLLOW&#8221; and &#8220;NOARCHIVE&#8221;</a>.</p>
<p>Any private content should at least have &#8220;NOINDEX&#8221;.</p>
<p>Arguably, terminal pages (action pages, etc) and private content, get &#8220;NOFOLLOW&#8221; directives.</p>
<h3>Summary</h3>
<p>IIS, by default, apparently opens up web sites to some <a href="http://blog.merjis.com/2007/09/17/google-bowling-and-identity/">Google Bowling</a> activities. You can even bowl yourself out, if you use case variations in your own URLs.</p>
<p>Carefully review what you do have; probably the simplest suggestion is that you stick to all lower case in every link. Watch out for inbound links that use mixed case &#8211; those can scupper your rank.</p>
<p>Use the meta tag for robots to set NOINDEX and possibly NOFOLLOW on your private content on case-folding IIS systems to prevent your data leaking into the search engines.</p>
<p>Best bet &#8211; dump IIS. OK. That&#8217;s a biased view, and a very selective view. But if you want to rank highly in search engines, setting up and running a LAMP (Linux Apache MySQL PHP) or similar system is easy and cheap. While it does have comparable problems (unbranded 404s etc) the solutions are also cheap and easy to set up. If I&#8217;m really blunt, I wouldn&#8217;t actually use PHP, either &#8211; the script language design makes it too easy to embed SQL, IME. I&#8217;d steer for Python, Ruby On Rails or OCaml&#8230; probably ;)</p>
<p>It&#8217;s better than finding your private data leaked, or that you&#8217;ve been scuppered by a near unfindable typo.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2008/08/20/seo-iis-case-folding-filenames-spiders-analytics-and-robotstxt/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>IIS Cookieless Generates Spider Crawling Problems</title>
		<link>http://blog.merjis.com/2008/08/18/iis-cookieless-generates-spider-crawling-problems/</link>
		<comments>http://blog.merjis.com/2008/08/18/iis-cookieless-generates-spider-crawling-problems/#comments</comments>
		<pubDate>Mon, 18 Aug 2008 08:13:16 +0000</pubDate>
		<dc:creator>Jeremy Chatfield</dc:creator>
				<category><![CDATA[google]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[spamfighting]]></category>
		<category><![CDATA[spiders]]></category>
		<category><![CDATA[web analytics]]></category>
		<category><![CDATA[yahoo!]]></category>

		<guid isPermaLink="false">http://blog.merjis.com/2008/08/18/iis-cookieless-generates-spider-crawling-problems/</guid>
		<description><![CDATA[Another case of Web Server Log File Analysis on IIS being disturbed by bots, having the potential for SEO naughtiness and spamming the search engines. The problem is created by IIS&#8217;s cookieless model. The idea appears to be to present a unique string in the path so you can track sessions without needing a cookie. [...]]]></description>
			<content:encoded><![CDATA[<p>Another case of Web Server Log File Analysis on IIS being disturbed by bots, having the potential for SEO naughtiness and spamming the search engines. The problem is created by IIS&#8217;s cookieless model. The idea appears to be to present a unique string in the path so you can <a href="http://weblogs.asp.net/paulomorgado/archive/2008/08/01/iis-asp-net-cookieless-support-not-working-as-expected.aspx">track sessions without needing a cookie</a>. </p>
<p>Clients that use IIS do seem to suffer from the strangest problems. I kept finding indexed pages, some ranking absurdly highly, with a path infix like (X(kjhgkjfkuyfku)). In other words, if I had a page like &#8220;http://merjis.com/login&#8221;, I&#8217;d find exactly the same content as &#8220;http://merjis.com/(X(stuff))/login&#8221;, making it look as if I had lots of duplicate pages. Appended parameters as a session ID, I can understand. There are even ways to cope with those. </p>
<p>The format of the infix was quite rigid &#8211; 23 characters starting and ending with a parenthesis, and embedding at least one more parenthetical group. I think the regex &#8220;\/\([A-Z]\([A-Za-z0-9]*\)\)\/&#8221; will match every example that I&#8217;ve seen. Oddly, you could navigate to them using a web browser and they worked, but even spidering the site and grepping the resulting mirror failed to show these strange paths in the HTML&#8230; so how did they get there?</p>
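<p>For log file analysis, that regex can be used to spot the infix and strip it back to the real path. This is a minimal Python sketch using the pattern quoted above (the slashes need no escaping in Python); the function name is my own, and the pattern only covers the examples described here, so other infix shapes may slip through.</p>

```python
import re

# The pattern quoted in the post, written as a Python raw string.
COOKIELESS_INFIX = re.compile(r"/\([A-Z]\([A-Za-z0-9]*\)\)/")


def strip_cookieless_infix(path: str) -> str:
    """Replace every cookieless infix in `path` with a single slash."""
    return COOKIELESS_INFIX.sub("/", path)
```

<p>For example, <code>strip_cookieless_infix("/(X(kjhgkjfkuyfku))/login")</code> gives back <code>/login</code>, and a path without an infix is returned unchanged.</p>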
<h3>Cookieless Session Tracking</h3>
<p>This looks like an effort by Microsoft to allow tracking of people that don&#8217;t want to be tracked, or who might have, for example, an office based transparent proxy that blocks cookies. AFAICS, when IIS detects that a user doesn&#8217;t permit cookies, it starts sending unique paths. The use of the &#8220;referer_info&#8221; (sic) field allows tracking a single user across the site, looking at where they were and where they&#8217;ve gone to. The positive benefit is that users can be given access to stateful services (like logging in) without needing a cookie. It is, I think, effectively a session cookie, rather than a permanent cookie, as it can&#8217;t recognise you from previous sessions &#8211; your starting point will be the same as any other visitor. </p>
<p>This session tracking seems pretty barking to me. If a user prevents you from serving and saving a session cookie, and you then create a trackable session by using session keys in the URL, you have just violated the privacy that the user requested. If users try to use a resource that needs stateful dialogue &#8211; something that remembers whether you were logged in, for example &#8211; then I think that if the user disables cookies it is perfectly reasonable to tell them that this resource won&#8217;t work for any areas requiring an authenticated session, though the public areas may be freely roamed. If they want privacy in the session, then they can&#8217;t have access to anything that requires remembering from page view to page view, who they are. </p>
<h3>Cookies and Search Engine Spiders</h3>
<p>Robots don&#8217;t hold cookies. If faced with an IIS server configured to handle cookieless users, spiders end up being delivered with a false directory structure. Every new visit generates a new directory structure. You get excessive crawling and pages appear in the index as unique, when they are really duplicates caused by server behaviour. </p>
<p>Now, this too seems like a petty piece of madness. Given that some users want to maintain privacy to the extent that they do not even want session cookies, and this number is small on web servers that offer services involving a log in or other identification service, why would you make it more difficult for search engines to spider the site? I believe that you can get more users to a properly indexed site, than the number you&#8217;d lose from failing to handle uncookied users (unless you offer a specific service for the uncookied, of course). </p>
<h3>Robots.txt to the rescue</h3>
<p>Fortunately the workaround for this problem seems small. In Robots.txt, add a line:</p>
<p><code>Disallow: *(<br />
</code></p>
<p>For all the spiders that I care about, that seems to prevent crawling the special tracking URL. The consequences of that&#8230; well, I&#8217;m not convinced it is entirely good. But it does stop silly URLs from being indexed after a single crawl.</p>
<p>Technically, this line says &#8220;disallow spidering for a URL starting with anything that has a &#8216;(&#8216; in it somewhere&#8221;. Although this client seems mostly to suffer from the infix immediately after the domain name, reading around the web suggests that the infix could be put at any directory level slash. Otherwise &#8220;Disallow: /(&#8221; would work and avoids the failure possibilities of the wildcard. </p>
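<p>The difference between the two rules can be sketched in Python. This assumes the common wildcard extension, where &#8220;*&#8221; matches any run of characters and matching starts at the beginning of the path; not every spider implements it, and <code>rule_matches</code> is a hypothetical helper, not any search engine&#8217;s code. The infix in the usage note is made up.</p>

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumes the wildcard extension: '*' matches any run of characters,
    a trailing '$' anchors the end of the path, and the pattern is
    matched from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None
```

<p>Under those assumptions, &#8220;*(&#8221; catches an infix at any directory level, such as <code>/shop/(X(1)S(abc))/login</code>, while &#8220;/(&#8221; only catches one immediately after the domain name.</p>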
<h3>Spidering Improvements?</h3>
<p>I can&#8217;t see why spiders should behave like this.</p>
<p>Having grabbed a server identity from the headers and behaviour, it should be possible to then strip out the session tracking from the path. I can&#8217;t currently think of a reason to *not* do this &#8211; unless you were really trying to sneakily discourage IIS administrators from using this tracking method. </p>
<p>I can think of anti-competitive ways to use the cookieless IIS behaviour. For example, find the user sitemap, point to it and let spiders follow the unique links to every path &#8211; then you can explode the crawl on each site by doubling it. And it accumulates &#8211; because every session appears to be valid for a long time, allowing repeat crawls and generating new unique paths with every reference. </p>
<p>You&#8217;d have thought that this explosion was worth trapping and stopping in the &#8220;fast crawl&#8221; discovery bots, so that the slower inspection spiders wouldn&#8217;t add these redundant pages to the indexes. The presence of these idiot links in link reports, and the multiple crawling of them by various bots, suggests that spiders are still stupid. </p>
<h3>Webmasters?</h3>
<p>Well, if you can&#8217;t avoid running IIS, take a good hard look at your web server&#8217;s log files. Do you really have a useful volume of search from real users who are cookieless or have you just ramped up bandwidth so that bots can crawl redundant pages without adding any revenue? Is the majority usage of this feature an escalating collection of spiders? Might you get more users if link love wasn&#8217;t being directed to duplicate pages on redundant paths?</p>
<p>Add the crawl reduction &#8220;Disallow: *(&#8221; and possibly &#8220;Disallow: /(&#8221; lines to your Robots.txt and verify them with the Search Engine webmaster tools Robots.txt checkers. At least you&#8217;ll be focusing crawl on pages that should be indexed. </p>
<p>Ideally, turn off the cookieless mode. AFAICS, it is a breach of the privacy rights that users were trying to assert. If you offer a service that needs state, then when you detect that you can&#8217;t cookie, offer an apology that parts of the site are unusable. IMO, that&#8217;s perfectly acceptable for both users and designers and won&#8217;t lose any business that you couldn&#8217;t have gained. </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.merjis.com/2008/08/18/iis-cookieless-generates-spider-crawling-problems/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

