I’ve been involved in paid search click fraud measurement for about five years. It’s pretty interesting work, trying to understand whether the clicks you’ve bought match the traffic on the site, and any qualifiers that you’ve added, such as geotargets and keywords. Oddly, it has shed some sideways light on a topic in Search Engine Optimisation - web analytics attribution.
I don’t remember seeing this addressed in any of the web analytics blogs and books that I’ve read over the last few years. It’s possible I missed something, and I’ll gladly add references to those articles, if anyone contributes them in comments here, or in emails to me.
Observation
Like much of good science, this starts with an observation:
When you look at the sources for which you have control of the requested URL, a substantial fraction have no referring info.
To understand what that observation means, we need to know a bit more about how visitors arrive, and what sources of data we have about the visitor.
Page Requests
Paid search, banner advertising, email and so on, allow you to control the page request made by users. You can add tracking parameters, and that lets your web measurement system record that the request came from a particular source. That’s pretty important for click fraud measurement. If you aren’t adding tracking tags, then the visitor could be turning up from anywhere for any reason.
For example, let’s assume that you are advertising for an Organic Cosmetics business. You place an advert on a keyword offering “organic acne treatment”. You bring people to a page on the site for that treatment - http://www.organic acne treatment.com /our-products . You *MUST* add a tracking tag, or use some kind of autotagging (offered by AdWords, Yahoo! Search Marketing, etc) so that clicks from advertising can be measured by your web analytics service, and attributed to your paid search campaign.
What the tag looks like will depend on your Web Analytics package. If you use Google Analytics, you might have tags like “?utm_medium=ppc&utm_source=yahoo&utm_campaign=organic+cosmetics”. So the requested page will be something like:
http://www.organic acne treatment.com /our-products?utm_medium=ppc&utm_source=yahoo&utm_campaign=organic+cosmetics
Without that, all you know is that someone requested that page… or is it?
And, of course, knowing that the click arrived as a result of an activity where *you* control the page request (typical of paid advertising opportunities), means that you now have a handle on part of the click fraud questions.
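As a rough sketch of the tagging idea, here is one way to append tracking parameters to a landing page URL and read them back from a requested URL, using only Python's standard library. The helper names (`add_tracking`, `read_tracking`), the example.com address and the parameter values are made up for illustration; use whatever tags your own analytics package expects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_tracking(url, **tags):
    """Append tracking parameters to a URL, keeping any existing query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(tags.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


def read_tracking(url, prefix="utm_"):
    """Return the tracking parameters found in a requested URL."""
    return {
        key: value
        for key, value in parse_qsl(urlsplit(url).query)
        if key.startswith(prefix)
    }


# Hypothetical landing page and tag values, for illustration only.
landing = "http://www.example.com/our-products"
tagged = add_tracking(landing, utm_source="yahoo", utm_medium="ppc")

print(tagged)
print(read_tracking(tagged))
print(read_tracking(landing))  # an untagged request carries no origin data
```

The untagged request is the interesting case: nothing in the URL says where the visitor came from.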
Referer (sic)
Something else that a web browser can optionally tell the web server is known as the “referer”, in RFC 2616 about the HTTP protocol (see Section 14, on header fields). This “referer” header, sent from your web browser to the web server, gives the address of the page the user was on *before* making the request. So we might see that the request for our page in the example was referred by “http://www google com/search?q=organic+acne+treatment” (and there’s usually some other stuff in there, describing the language and the browser and so on). So we know that a search engine was the source, and we know the keyword.
However, we don’t *know* that this click was from paid search. A common frustration amongst new advertisers is failing to identify clicks from AdWords, because advert tracking tags have not been added. Without the tracking parameters, all you have to work from is the “referer” header volunteered by the browser - and that only tells you which page the user was looking at, not whether they clicked a free or a paid link.
So, one common misattribution is to *over-allocate* clicks to organic search and *under-allocate* to paid search, because no tracking tags are available and there is no common way to tell from the “referer” field that the source was paid search. I’m going to ignore the further complexities of referrer headers when using contextual advertising - it is much more complex :)
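To make the ambiguity concrete, here is a minimal sketch of how a measurement system might bucket a single request from its URL and its “referer”. The `classify` function, the bucket labels, the `utm_` prefix check and the short list of search engine names are all assumptions for illustration, not how any particular analytics package works.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed, deliberately incomplete list of search engine host labels.
SEARCH_ENGINES = {"google", "yahoo", "bing"}


def classify(requested_url, referer):
    """Bucket one hit using only the requested URL and the referer header."""
    query = parse_qs(urlsplit(requested_url).query)
    if any(key.startswith("utm_") for key in query):
        return "tagged"  # origin is known from the tracking tag
    if not referer:
        return "direct"  # nothing at all to go on
    host = urlsplit(referer).hostname or ""
    if SEARCH_ENGINES & set(host.split(".")):
        return "search (paid or organic unknown)"
    return "referral"


page = "http://www.example.com/our-products"  # hypothetical landing page
print(classify(page + "?utm_source=yahoo&utm_medium=ppc", None))
print(classify(page, "http://www.google.com/search?q=organic+acne+treatment"))
print(classify(page, None))
```

Most packages would report the second case as organic search, which is exactly the over-allocation described above.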
Tying the Pieces Together
Assume that we now have tagged clicks, so we can tell the paid sources that are sending traffic. We have the paid source telling us what they claim. We can match the two pieces of data. If they don’t match, then we have some further questions about the sources of mismatch. I’m not going into that, in this article. I’m going to take a sideways look at that other data-stream, the “referer”.
When you have a good source of paid clicks, one that you can trust to deliver a high fraction of the clicks they claim, and where you get good conversion rates and matching keywords and search queries, you can infer that the people being sent are also “good” - no or few spoofing robots, no or few paid clickers or fraudsters, etc.
If we now look at the “referer”, we should see that all the visitors come from a page with paid search on it, shouldn’t we?
We don’t. Anywhere from 10% to more than 40% of the clicks have no referer_info at all, or have some clearly dummy data inserted instead of a real referrer.
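If you can export the tagged paid search hits from your logs, measuring this fraction takes only a few lines. The sketch below assumes each hit is a (requested URL, referer) pair and that the dummy values look like the ones in `DUMMY_REFERERS`; both the data layout and that set are assumptions, so check what your own logs actually contain.

```python
# Placeholder values treated as "no referer". This set is an assumption.
DUMMY_REFERERS = {"", "-", "null", "(none)"}


def missing_referer_rate(hits):
    """Fraction of hits whose referer is missing or an obvious dummy value.

    hits: iterable of (requested_url, referer) pairs, already filtered
    down to clicks that carry your paid search tracking tags.
    """
    total = 0
    missing = 0
    for _requested_url, referer in hits:
        total += 1
        if (referer or "").strip().lower() in DUMMY_REFERERS:
            missing += 1
    return missing / total if total else 0.0


# Made-up sample: five tagged clicks, two of them without a usable referer.
url = "http://www.example.com/our-products?utm_source=yahoo&utm_medium=ppc"
sample = [
    (url, "http://search.yahoo.com/search?p=organic+acne+treatment"),
    (url, None),
    (url, "http://search.yahoo.com/search?p=acne+cream"),
    (url, "-"),
    (url, "http://search.yahoo.com/search?p=organic+cosmetics"),
]
print(f"{missing_referer_rate(sample):.0%} of tagged clicks have no referer")
```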
Attribution Of Origin
So if we have no tracking parameters, we can only know what the origin is, if we have the “referer” field correctly filled in. And in about a fifth of the paid search cases, we don’t. So what does that tell us about “Direct” traffic?
It should tell us that Direct traffic is partially composed of search engine driven traffic, that is missing a referrer header. Some of it will be paid search, some organic search and some from other resources that are potentially trackable.
In other words, there may be more than 50% more clicks that should be attributable to organic search, than are showing in normal web measurements. And Direct is over-represented as a consequence of the way that data is collected.
If you have paid search clicks being tagged and tracked, this means that there is a systematic data error, in which clicks are allocated to Direct, when they come from unknown searches on unidentified organic search engine results.
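A rough way to put numbers on this, assuming (and it is only an assumption) that organic search visitors lose their referer at the same rate as your tagged paid search visitors: if a fraction f of search clicks arrive with no referer, the organic clicks you can see are only (1 - f) of the real total, so the hidden clicks sitting in Direct are about observed * f / (1 - f). Once f passes one third, that is more than 50% on top of what is reported. The function name below is made up for illustration.

```python
def estimate_hidden_search_clicks(observed, missing_rate):
    """Estimate search clicks misfiled as Direct.

    observed: organic search clicks the analytics package reports.
    missing_rate: fraction of known search clicks with no referer,
        for example measured on tagged paid search traffic.
    Assumes organic and paid visitors lose referers at the same rate.
    """
    if not 0 <= missing_rate < 1:
        raise ValueError("missing_rate must be at least 0 and below 1")
    return observed * missing_rate / (1 - missing_rate)


# How the correction grows across a range of missing-referer rates.
for rate in (0.10, 0.20, 1 / 3, 0.40):
    hidden = estimate_hidden_search_clicks(1000, rate)
    print(f"{rate:.0%} missing: about {hidden:.0f} hidden clicks per 1000 observed")
```

Treat the result as an estimate of scale, not a precise count; the real rate will vary by browser, mail client and traffic source.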
Summary
Watch out for over-allocation to “Direct” as a consequence of missing or misleading information fed by Web Browsers to web servers.
You should be able to use paid search data to estimate the likely misallocation of clicks to Direct when they should be organic search, and even to estimate the likely frequency for higher volume keywords.
AFAICS, most web measurement and analysis services do not compensate for missing “referer” fields - they don’t even directly report on the number of missing referrer fields in attributed clicks, making the estimation of misattributed clicks hard.

John K wrote,
JC,
What would be the ideal situation, if one is able to control everything downstream of the PPC click? I.e. what do you recommend for the technically capable, to make the tracking more accurate?
Link | November 25th, 2009 at 5:20 am
Jeremy Chatfield wrote,
Hi John - for PPC, email, bannering and other controlled sources, I’d use “query” parameters - the classic “?thing=value&other=stuff”. It’s pretty much the only way to pass information about the origin.
The “referer” header is useful, but what would be seriously helpful would be to (optionally?) re-encode origin data in the request. Unlikely to happen for privacy reasons. There are good reasons to be able to withhold the referrer.
If anything, this shows how frustrating the non-webmail mail user interfaces are. I’ve plenty of new clients for whom we can correlate an increase in direct traffic to email campaigns - clients who thought their emails weren’t working, or that perhaps the email triggered people to use a bookmark. Show them how a click from a non-web email has no referrer, and that the URL can encode the origin, and they get useful measurable data.
And it would be helpful for the Web Analytics companies to identify, by default, streams of clicks from webmail. The servers for many of the services are quite identifiable in the referers, but rarely shown as a class of origin. Not exactly the question you asked, though ;)
To make the data more valuable, break up campaigns and mark them separately, so you have different tracking parameters accessible for each of the classes of origin (Google Pages, Google Partner Searches, Google Content Network, Placement/Site Targeted) etc. That way you can better track bounce rates, ROI, etc, by controllable targeting options.
Link | November 26th, 2009 at 11:46 am
TAZook wrote,
I like the suggestions around tracking page requests. I was smarter by the end of the post than I’d been at the beginning. :)
We have a couple of in-house tracking programs–one is a custom analytics program and the other is tag tracking in emailed referrals. (We have GA, but I rarely dig into it.) When I started trying to match up email tags against our in-house data submission logs, I realized, for instance, how randomly some people fill out the fields in our online forms. Now I need to dig into Google Analytics and try to set something up to provide cleaner data.
I find I’m wondering if Jeremy is suggesting some kind of Analytics filter in his last paragraph? I’m not sure how else to track bounce rates or ROI by tracking parameters.
Link | February 6th, 2010 at 12:27 am
Jeremy Chatfield wrote,
Hi Theresa - absolutely - yes. At some point I’ll get around to documenting the multiplicity of filters we usually try to set up, so that when there is a change of ROI, we can nail it to the changes in the way that traffic is arriving. Especially important if you are paying for the traffic, but also quite useful in SEO.
If you have full tagging, and have separated the campaigns out so that you have adequate targeting control (fairly tricky when comparing Google Pages and Google Search Partners, for example - but again, something that analysing the referer_info can illuminate) then you can make better decisions about which target network should receive bid increases and decreases.
The problem with most web analytics packages is that extracting the proportion of misattributed data is external to the package, and hard to put back into the analysis. If your analytics package makes you have to rework your numbers, for a measurable cause, surely there’s a question you should be asking your analytics provider? :)
Link | February 6th, 2010 at 1:07 am