Effective Internet Marketing Strategy and Technique Through Experiments, Measurement and Audit

Click Fraud, Google AdWords and gclid

A tardy response to Matt Cutts’ posting and Shuman Ghosemajumder’s joining in the debate… I don’t see some important stuff in Google’s approach to this issue. Maybe I’m selectively vision impaired. Or maybe I read too many other reports about naughtiness on the net.

New (2008-03)! Google Blog article on click fraud forensics.
And newer! Richard Ball’s analysis of Google’s lousy management of click quality.

This recent Forbes news report about click fraud is probably just about right for the percentage - if we use an advertiser’s definition of click fraud, and ignore that Google doesn’t charge for the clicks it already identifies as invalid. Do you care if a click is fraudulent, if it isn’t charged? What does the presence of an identifiable and large volume of fraudulent or otherwise invalid clicks imply about the unidentified volume? Well, I’m not covering that here. I’ve tackled some of it in previous postings and I’m sure I’ll return to this topic. Again. And probably also cover, again, rational advertiser response to fraud (you should drop your bid, so your advertising costs remain the same) and why advertisers are irrational (and how this is good for search engines and great news for cheaters, social deviants and scumbags). No, this article is about measurement techniques and what they can tell you. Numbers…

If you can not measure it, you can not improve it - Lord Kelvin

I should clearly say that while this article is critical of Google, I find the state of many other search engines to be much worse. Google offers the highest quality and volume paid search engine of those that we consistently use or have used, as measured by conversion rates and volume of conversion from specifically targeted adverts. I have a lot of respect for what Google has done. I think AdWords is the smartest high volume paid search advertising system we deal with. That doesn’t mean that it is perfect, or that it can not be improved.

This gclid thing, what is it?

About two years ago, I noticed “&gclid=(stuff)” being added to web server log files on clicks from Google. Pretty cool idea, I thought, and we started writing web log analysis programs that used the gclid to determine whether visitors were unique. We realised that this gclid tag should give you some idea about whether visitors are seeing you for the first time from an advert, or returning, without an additional cost for the click. In other words, it provides illumination into whether Google is charging for duplicate clicks, and the behaviour of unique visitors (e.g. whether individual anonymous users bookmark or use the “back” button a lot) - visitor behaviour and click fraud insight with one tag!

You’ll see the tag in detailed web server logs in the request. You don’t (shouldn’t) add “gclid” yourself. It gets added by Google to each impression that is delivered. That is, you submit an advert, and each time that advert is shown, Google adds this tag to the destination URL. It makes each advert impression unique, and allows the adverts to be tracked by web analytic packages. Specifically, I believe that it helps Google to work out when they shouldn’t charge for a second click on an advert.

But gclid autotagging has some major weaknesses for advertisers to detect click fraud. It may be that Google could use web server log files to identify click fraud, but it is not a usable technique, on a mass scale, for advertisers. Why on earth not?

This is going to get technical. And financial. And microeconomic and other things… So strap on your hard hat, buckle up, and do whatever else you do to protect yourself online. We’re going for a ride.

Basics of AdWords & Click Tracking

Imagine that I open a Google AdWords account, and create a campaign and an AdGroup and an advert, and a keyword. I enable auto-tagging, which appends extra information to the destination URL, either “?gclid=(stuff)” or “&gclid=(stuff)” (depends on whether there is a previous tag on the URL). You can enable Auto Tagging on the “My Account” tab (up with the “Reports”, “Analytics” and “Campaign Management” tabs), under “Account Preferences”. It appears to be linked to using Google Analytics - you can register for Google Analytics, enable Auto Tagging, but you don’t have to actually use Google Analytics…

The “(stuff)” that is added appears to be unique for each advert impression, and appears to be unique in a clever way… The first part of the ID varies rapidly and the last part varies slowly. This is clever because when you are looking for string matches, you get an early failure in the string match, helping to speed the search up - an indication that some smart people may have been working on this.

Note that this is yet another way to identify Editorial Review related visits to your web site. That’s something that most web analytics packages fail to identify - but it is incredibly useful to know. With an editorial review click, if you use a macro, such as “{keyword}” or “{creative}” in the destination URL, then it is not substituted by the real value, as would be the case if the advert was being shown to a real user. Additionally, a gclid tag is not added to editorial review visits. Note this carefully - we have come across some sites (mostly eCommerce sites with URLs that are a database query) where the addition of the gclid tag can cause the site to fail with a 404 (page not found). Because Google’s site and content detection systems do not add the gclid tag, they never see the failure, and will gleefully direct hundreds of dollars of spend to the site, with very little likelihood of conversion. The only ways that you can find out if this affects you are to browse manually to the destination URL you use, appending a gclid parameter, or to click on one of your own adverts, in real search. Clicking an advert shown in the AdWords User Interface is no confirmation that the gclid is harmless to your site.
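The manual check described above is easy to script. What follows is a minimal sketch, not a tested tool: the function names and the dummy tag value are my own inventions, and I am assuming that a made-up gclid value provokes the same failure as a real one would.

```python
from urllib.parse import urlsplit, urlunsplit
import urllib.request
import urllib.error


def with_test_gclid(url, value="TEST-gclid-123"):
    """Append a dummy gclid the way auto-tagging appears to:
    '?' if the URL has no query string yet, '&' otherwise."""
    parts = urlsplit(url)
    query = parts.query + ("&" if parts.query else "") + "gclid=" + value
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def status_with_gclid(url):
    """Fetch the tagged URL and return the HTTP status code (needs network access)."""
    try:
        with urllib.request.urlopen(with_test_gclid(url), timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A 404 here is the failure mode described above.
        return err.code
```

Run status_with_gclid() over every destination URL in the account; anything other than a 200 deserves a look before you spend money on it.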

What is in the gclid?

I’ll guess that the last part of the gclid value encodes, or more likely references in some way, the advertiser ID, the keyword, adgroup, campaign and account ID’s. The first part, that changes rapidly, is probably some combination of timestamp and instance ID or advertising channel (where the advert was published). I suspect that the account and keyword part is a database ID that delivers a row with the account ID, campaign and so on - rather than being an encoding. I suspect that the first part is a timestamp and instance ID, which will also be recorded on Google servers and will tell them when the advert impression was delivered, on which site and how long it was between that impression and the click.

But the advertiser *only* knows that they’ve seen a gclid. Not that it has been delivered by Google, and not that it is related to their advertising. The gclid tag can be faked, in any request to the web server. You should see later in this article that there is an incentive for click fraudsters to forge gclids on requests…

Now, slightly more confusing is what happens on the web server. A request comes in shaped like “/foo.html?gclid=juiuyvyuvuyvjhasfdhgkhj”. The user’s browser will show what the server delivers - the web page that was requested. The tag? That’s left in place. So if the user finds the page helpful, the page may be submitted to delicious or another bookmarking system, complete with a tag that does not specifically identify the page - it’s just some random tracking information, left behind because the web is still pretty primitive and undeveloped. Advert tracking information is not a page name… at least not unless you can conceptually separate the idea of tracking from the page that is addressed - you reach the same content with or without the tag.

If you do some snazzy stuff with your web servers, you can both do the tracking and remove the gclid tag (and other tracking tags), so that users will bookmark only the base page name. You could, for example, use a rewriter that strips off standard tags; that is, when you request “/foo.html?gclid=xxx”, you get a redirection that is just “/foo.html”. A rewriter does, of course, introduce a few more delays in page delivery and increases the points of failure. It is also pretty rare. I can’t think of any major web site that does this. However, the user ends up with a more readable URL, better handled in bookmarks. And you’d get fewer appearances of gclid in web server logs. We’ll revisit this later - it is more important than it might seem at first.
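To make the rewriter idea concrete, here is a rough sketch of the URL handling only. The set of tag names and the function name are assumptions of mine, and wiring it into a real web server (log the removed tags first, then issue the redirect) is left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: only gclid is stripped here; add your own tracking tags.
TRACKING_PARAMS = {"gclid"}


def strip_tracking(url):
    """Return (clean_url, removed), where removed maps each stripped
    tag name to its value so it can be logged before redirecting."""
    parts = urlsplit(url)
    kept, removed = [], {}
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in TRACKING_PARAMS:
            removed[name] = value
        else:
            kept.append((name, value))
    clean = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
    return clean, removed
```

If removed comes back empty, serve the page as normal; otherwise record the tag and redirect to the clean URL. Note that urlencode() re-encodes the remaining parameters, so an oddly encoded query string may not come back byte for byte.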

OK, so that’s the basics. Where the gclid comes from. What it probably consists of. How it might be used by Google (I claim no inside knowledge). How it ends up being re-used legitimately. How it appears in web server log files and search engines.

Useless to advertisers? Come on… Get real.

Assume that I have enabled Autotagging and that I am using the uniqueness of the gclid to determine unique visitors.

In an ideal world, I get a unique visitor for each click that I pay for. If I don’t, then the advertising channel is lying to me about sending me users. That’s close to Google’s (narrow) definition of click fraud. Have they delivered the click they charge for? Google will gleefully send you the same person, charging for the click each time, if the user conducts a variety of searches with a time delay between clicks, from unique adverts. Google even appear willing to charge twice for sending the same person, if there’s a sufficient delay between clicks. Check your AdWords account - you will occasionally find keywords with a low impression rate, where the number of clicks is greater than the count of impressions. Google’s reporting interface is really part of their billing interface - this data represents what they think you should be charged for.

By counting clicks, and counting unique gclid values, I can match up the two and determine that Google is sending me unique visitors, unique at the level of having clicked on different adverts and not having double clicked a link, or suffered from a noisy button (you do get extra clicks from some mice, if the click-debouncing circuitry isn’t good enough).
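As an illustration of the counting, here is a small sketch that pulls gclid values out of web server log lines. I am assuming a common or combined style log format with the request in double quotes; the regular expression and function names are mine, and your logs may need something different.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: the request appears as "GET /path?query HTTP/1.1" in each line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def gclids_in_log(lines):
    """Yield the gclid value from every request line that carries one."""
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        values = parse_qs(urlsplit(match.group(1)).query).get("gclid")
        if values:
            yield values[0]


def summarise(lines):
    """Return (tagged_requests, unique_gclids) to set against billed clicks."""
    seen = list(gclids_in_log(lines))
    return len(seen), len(set(seen))
```

The second number is the one to compare with the click count in the AdWords reports; the gap between the two numbers is the double clicks, reloads and bookmark revisits.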

What value is this to an advertiser? It says that you have seen as many visitors as Google claims it is sending. It tells (for low volume keyword cases) whether Google is charging on second clicks on the same advert, and how long it takes for Google to decide that a second click is payable rather than free. This may match Google’s definition of click fraud, but it isn’t an advertiser’s definition of click fraud.

The problem with Google AdWords and click fraud is *NOT* only that advertisers think they aren’t getting visitors. It is what the intention of those visitors is. Counting visitors, and making sure they are unique, yeah, that’s important. Making sure they come from the geotarget you’ve selected - that’s important and not specifically addressed directly by AdWords; you need to use Google Analytics, or another web analytics package with a geolocation database or some geolocation databases with your web server log file analysis. I keep meaning to publish an article on geolocation, but it is a hugely complex area. I’ll probably end up publishing several bits of article, so they can be better maintained.

I’ve seen a few users complaining, mostly on the AdWords Help Forum, that they haven’t been delivered clicks. I’m pretty sure that most of these users are getting the visitors they’ve paid for. They simply haven’t tagged the adverts and can’t tell the difference between paid search clicks and ordinary search clicks, and especially they can’t identify paid search clicks from AdWords that bring in visitors from Google’s search partners and content matching programs. I have no serious quibble with Google about receiving the volume of clicks that my clients are paying for - it isn’t volume, but quality. We’ve previously written at moderate length about tagging adverts so you can identify whether you receive clicks from advertising.

Quality of clicks is a whole other story. And very little to do with gclid, as we will see.

In any case, gclid data isn’t authoritative. Assume that a user has bookmarked my site. Last year. So this is an old, old gclid value, that suddenly re-appears. I can’t just look at current day activity and test to see whether gclid and click counts match, I have to identify *unique* gclids, unique on the first usage - and ignore the rest. I have to check back through a year of records to see if this ID previously appeared (strictly, back into some point in 2005, when gclids were first issued).

If there was a financial value to my client in doing this long-record check, it might be worth doing. Of course, I could optimise it. I could just stuff gclid’s into a DB, with a custom written program, and query the DB. Shouldn’t take more than a week or so to develop, with some docs and a web interface of sorts. Then you have to interact with the tech team at the client site to embed this software with the infrastructure that they have, or that they’ve outsourced. Then you need access to web server log files going back to an unspecified date in 2005 (IIRC, it was around Q3 that we first noticed gclid) - which may mean trawling backups. Ever tried getting a backup from a technical organisation, who doesn’t really care whether you can spot a unique visitor for whom you’ve paid less than a couple of dollars? It’s a difficult argument, one that I rarely win, at least. Then it’ll take far longer to manage this exercise of trawling the records than to implement the technology… and at the end of it, you still only know whether this user was unique. A lot of effort for a small increment in confidence.
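For what it is worth, the core of the “stuff gclids into a DB” idea is tiny; it is everything around it that costs the week. A minimal sketch using SQLite follows. The table and function names are made up for illustration, and it assumes you feed it log records in chronological order.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) a table holding the first sighting of each gclid."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS gclid_seen ("
        "gclid TEXT PRIMARY KEY, first_seen TEXT NOT NULL)"
    )
    return db


def record(db, gclid, seen_at):
    """Return True if this gclid is new, False if it has appeared before."""
    cursor = db.execute(
        "INSERT OR IGNORE INTO gclid_seen (gclid, first_seen) VALUES (?, ?)",
        (gclid, seen_at),
    )
    db.commit()
    # rowcount is 1 when the row was inserted, 0 when the key already existed.
    return cursor.rowcount == 1
```

Only the hits for which record() returns True should be matched against billed clicks. The hard part remains getting hold of the old log files in the first place.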

Intention, not volume

There are a few of us with a fairly consistent message that Google persistently ignores. They go under names like John K, Richard Ball, CPC Curmudgeon and a few others… You can find them, fairly easily, using a blog search tool.

The message is this. We know that Google sends us visitors on each click. What we don’t believe, is that Google always sends us high quality visitors that are likely to buy. Some of those clicks may be bots. Some may be bored users clicking on anything. Some may be genuinely interested. I won’t speak for the other people that I mentioned above - because I haven’t consulted them - but I’m quite certain that Google understands the concept of the quality of a click. And I’m suspicious that Google manages the click quality to assure Google of a revenue stream. I have no evidence that Google manipulates the quality to benefit customers… Though I believe that large customers can demand that Google removes low quality click sources (such as, in most of my clients’ cases, domain parks and sites identified only by numeric IP addresses, etc).

Now, this allegation of naughtiness by Google is pretty tricky. After all, the quality of a click is in part down to the advertiser. For example, if I am selling saucepans, and I use as a keyword “cheap flights” or “used cars”, then I might expect a low CPC. However, my conversion rate will be low. People looking for cheap flights, a car or shoes, flowers, whatever… well, *some* of them may want new saucepans. Some people will just click on the advert to find out what it has to do with the search. The low conversion rate from this is my fault - I should have used relevant keywords and a more specific advert, and a better landing page (one that explains the relationship between cheap flights and saucepans, for example).

However, if Google or, for that matter, any other search engine, sends me people who searched for something different, or if they send me a lot of traffic from content pages, then my conversion rates will probably decline - and it isn’t my fault. My clients’ conversion rates depend, at least in part, on the quality of Google’s advertising partners, and how keen Google is to transfer my money to people whose primary interest is making money for themselves, rather than helping users or my clients’ business. That, in turn, depends on how broadly Google interprets broad match, and what web pages Google consider to be suitable for content match (and even for site targeting).

Google will say that they control click fraud, but they are controlling click fraud *FOR THEIR BENEFIT*, not mine. If the definition that Google used included conversion activity, then I’d be more convinced that they cared about my clients. As it stands, Matt, and Shuman and Eric Schmidt are essentially at pains to assure the world that Google is protecting Google’s revenues. Why they come under repeated questioning, is that they have shown no sign that they recognise the quality of a click can represent a fraudulent activity perpetrated, supported or tolerated by Google. Well, that and a bunch of advertisers who haven’t taken relatively trivial steps to allow themselves to identify paid search clicks, and relatively uniquefied paid search clicks (add tags to advert, and enable autotagging - it can’t get much easier to set your mind at rest, but it is so rarely done by default).

Google assures advertisers that we now have more control. It is true that Google has added new reports that help to identify poor content match sites. The new “Placement Report” tells us which sites result in AdWords Conversion Tracking events. Brilliant, if you can use AdWords Conversion Tracking. The Search Query report sheds a modest increment of light on the breadth of search queries matched by broad match (but not the immensely more useful report of search queries for which no-one clicked). And the Invalid Clicks data. And the Cost Per Action beta test. And they’ve now started blogging seriously about click fraud.

(I could do a whole aside here about clients that do not trust Google with their sales data, so refuse to use AdWords Conversion Tracking, or that have offline conversion and can not infer quality of the content network without significant effort - I’ll leave that out of this essay).

For example, Google has now removed the 500-site limit for the “site exclusion” mechanism. If you identify a low quality site (lots of clicks, lots of spend, no conversions) then you could remove that site from your content match advertising. However, identifying a good site is not cheap. Take one UK account, targeting the UK only. There’s about 10,000 rows of placement data, per month. Sites that have no conversions in one month account for more than 90% of the rows, and a large fraction of the expense. But most sites without conversions have too low a spend to reject them as useless… So we’ll continue to spend at a high rate, because we can’t (yet) reject these 8,000 to 9,000 sites, without an additional data source (in addition to impression, click and conversion data).
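One way to put a number on “too low a spend to reject” is to ask how likely a perfectly ordinary site would be to show zero conversions by bad luck alone. This is my own rough rule of thumb, not anything Google publishes: the baseline rate, the 5% threshold, the function names and the assumption that clicks convert independently are all mine.

```python
def chance_of_no_conversions(clicks, baseline_rate):
    """Probability that a site converting at baseline_rate shows zero
    conversions in this many clicks, treating clicks as independent."""
    return (1.0 - baseline_rate) ** clicks


def can_reject(clicks, conversions, baseline_rate=0.01, threshold=0.05):
    """True only when a zero-conversion record is unlikely to be bad luck."""
    if conversions > 0:
        return False
    return chance_of_no_conversions(clicks, baseline_rate) < threshold
```

With a 1% baseline and a 5% threshold, a site needs roughly 300 clicks without a conversion before this rule lets you exclude it. Most rows in a placement report are nowhere near that, which is the problem in a nutshell.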

Advertisers Costs, Buying Short and Selling Long

When I find a low quality click source, I reject it. But that costs the advertiser money, to discover… Let’s investigate what Google makes out of this learning process, shall we? I’ll make up some numbers. These are representative, but do not accurately portray any specific client that I have.

Assume that I see a conversion rate of 1%. Assume that I have an AvCPC of $0.10 for the keyword. I’ll spend about $10.00 for a conversion on keyword search. Assume that the content network averages the same conversion rate (this is not usually true - the true conversion rate is much lower for reasons discussed in another article). I can then afford to spend $10 on 100 clicks - the $0.10 AvCPC that we saw for keyword search. Google picks up 50% of that. They make $5.00 and the AdSense partners share $5.00. If there was just one site in the list, then that’s one AdSense partner that receives $5.00. I can easily see that this is good value and I’ll invest more to get more placements with them.

Now, if Google spreads the love, and displays my clients’ adverts on ten sites… well, if I get one conversion from this $10.00 spend, I can’t disprove that the other nine sites were useless. Assume an even spread of clicks (it isn’t - it looks more like a power law distribution than anything else, from the data that I have - a lot of sites with a few clicks and a few sites with a lot of clicks). That means that I have to spend (at least - the real cost is higher and more difficult to calculate - I’m simplifying, OK?) $100.00 to prove that the ten sites are worth advertising on. Google picks up $50.00 and I’ve still only got one conversion… So my response is to drop the bid price. If Google is going to spread my advert everywhere, then I need to spend less per click, in order to compensate for lower quality clicks on the content network. I need to drop the bid to $0.01 to justify the conversion rate.
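The made-up numbers above can be checked with a few lines of arithmetic. This sketch follows the same simplification as the text (an even spread of clicks, and each site needing enough clicks to expect one conversion); the function names are mine.

```python
from decimal import Decimal


def spend_per_conversion(avg_cpc, conversion_rate):
    """Expected spend for one conversion: cost per click / conversion rate."""
    return avg_cpc / conversion_rate


def spend_to_test_sites(avg_cpc, conversion_rate, sites):
    """Simplified: each site needs enough clicks to expect one conversion."""
    return spend_per_conversion(avg_cpc, conversion_rate) * sites


def bid_to_hold_budget(budget, conversion_rate, sites):
    """Bid at which testing every site costs no more than the given budget."""
    clicks_needed = sites / conversion_rate
    return budget / clicks_needed
```

With an AvCPC of Decimal("0.10") and a conversion rate of Decimal("0.01"), one site costs $10 to test and ten sites cost $100, of which Google keeps half. Holding the spend at $10 across ten sites means 1,000 clicks, hence the $0.01 bid.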

Next step in the arms race is that Google spots this and introduces “Smart Pricing”. This means that if I bid $0.10, some sites get the whole sum, and most of the others get a lot less. The others get a lot less because Google has decided that they are lower quality. Lower quality in what way? Hmm, interesting question. I’ll bet it has to do with Quality Score type metrics, traffic volume and CTR. And of course advertisers don’t know which are the premium sites and which aren’t.

Now, if Google manipulates the average price paid using those measures to allocate the payment, then they can make sure that most of the clicks my clients see come from a wide range of sites. The more sites used by Google, the more money spent to assure that these sites do not yield a conversion. So low quality sites, and many of them, work for Google’s benefit, drive up my clients’ costs, don’t yield a lot of conversions but don’t allow a lot of rationally based decisions on site exclusion.

Rational advertiser response to finding a low quality site is to add it to the site exclusion list. But identifying a low quality site, even for a high volume client, may take many months or even years.

Contrast what happens when a publisher starts to see a declining revenue… It takes about ten to twenty minutes to find a new site name, pay for it, and get a hosting plan working and to redirect AdSense to the new site. About 20 minutes after the income on a site starts to decline because it is being excluded, a money-grubbing leech can have a new site up with cheesy scraped content and stuffed with adverts for every network. Advertisers now have to identify this garbage site and exclude it all over again… costing advertising funds and wasting account management time (wasted compared with the case that there weren’t useless sites operating). The people who develop low value, poor conversion sites, can generate new sites rapidly, but advertisers spend a long time individually identifying poor value sites - this weights the system in favour of those who produce poor value sites.

Google, on the other hand, could take a message from advertisers… Get enough exclusions resulting from spend on the site, and you get lost from the AdSense network. Not just as a site, but as a publisher. Low conversion rates are not a problem for Google, unless advertisers make it a problem for them. Low conversion rate sites and publishers are something that Google wants. It generates more revenue for Google.

The point is that Google benefits from distribution click fraud. There is no incentive and no control over Google’s collusion (whether intentional, accidental or systemic) with the publishers of web sites who manipulate clicks for profit, with no intention of responding to the advertising. So, if you can, and if you trust Google with your commercial information, consider using the Beta CPA programme. Regrettably I have no current client that will do so.

So, this is why Google will continue to be the target of criticism about click fraud. While Google manages search quality for their own benefit, and while advertisers use the defaults that Google gives them, there will continue to be allegations that paid search traffic from Google is subject to fraudulent charges. It is because the things that Google believes to be click fraud are only part of what advertisers identify as click fraud.

Gclid and user behaviour tracking

Well, if gclid is at best peripheral to the problem, at least we can use the gclid for something useful. Identifying user behaviour. Or can we…

You can use the gclid as a proxy for a cookie. If your advertising includes the gclid, and the gclid is unique for each impression, then you can spot returning users from bookmarks - though you may pick up some bookmarks from social networking sites. You can therefore extract two more measurements… The number of times that a page is referenced in social networks (that referer_info field), and the ratios of bookmark using users with cookies versus those who have deleted the cookie. So you can infer the additional success of your programs that depend on cookies for measurement; while it is still an exercise in stats, it is at least a numerically based exercise, with your own data, rather than that of an industry pundit or terrifying percentage estimation from a commercial vested interest who wants to flog you an authentication based service, or a flash cookie service.
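A sketch of the revisit measurement might look like the following. It assumes you have already extracted (gclid, referer) pairs from the logs in time order, and that the first appearance of a gclid is the paid click; the function name and the example hosts are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit


def revisit_sources(hits):
    """hits: iterable of (gclid, referer) pairs in time order.
    The first hit for a gclid is taken as the paid click; later hits are
    revisits, tallied by referring host ('' means no referer was sent)."""
    seen = set()
    sources = Counter()
    for gclid, referer in hits:
        if gclid in seen:
            sources[urlsplit(referer).netloc.lower()] += 1
        else:
            seen.add(gclid)
    return sources
```

Revisits with an empty referer are most likely local bookmarks or typed URLs; revisits with a bookmarking site as the referer are the social ones. Comparing the first group with your cookie-based returning visitor count gives the cookie deletion ratio mentioned above.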

Other than that? Well, Google doesn’t publish a spec for the gclid. If we knew what the parts meant, we could do more with them. As could the bad guys…

Faking the gclid

So, what happens if there is a third party running a site designed to allow bots or human networks to click for revenue generation? Firstly, they should be using browsers and bots that neglect to offer the “Referer_info” (sic) field, or forge fake content in it. This field, which could be examined by Google during the redirect, tells you what the browser had previously requested. In other words, if the user was using Google to search, you find out that the last request was for a page on “google.com” (or “google.co.jp” or whatever), and you can find out the last query.
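For honest browsers, pulling the search query out of that field is straightforward. A minimal sketch follows, assuming the field holds the full URL from the HTTP Referer header and that Google carries the search terms in a “q” parameter; the host test is deliberately naive and the function name is mine.

```python
from urllib.parse import urlsplit, parse_qs


def search_query_from_referer(referer):
    """Return (host, query) for a Google search referer, else (host, None)."""
    parts = urlsplit(referer or "")
    host = parts.netloc.lower()
    # Naive test: any host with a 'google' label, e.g. www.google.co.jp.
    if "google" not in host.split("."):
        return host, None
    terms = parse_qs(parts.query).get("q")
    return host, (terms[0] if terms else None)
```

Bear in mind that everything this function returns is supplied by the client, so it is only as trustworthy as the browser or bot that sent it.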

Clearly, if you are running something a bit dodgy, you don’t pass on good information, such as the site where you found the advert - carried in the optional referer_info field from the browser and sent to the server on each request. Confusingly, some (legitimate) versions of the AOL browser suppress the referer_info field. Most bots don’t offer a referer_info field. If it is a fraudulent browser, there are reasons that it should offer a forged referer_info field. For example, you could be fooled into thinking that Google.com was offering a lot of users that did nothing on the site, if the referer_info was forged to pass on fake queries on Google properties - making content matched sites look more attractive.

Bogus traffic could even deliberately fake competitor information. This kind of anti-competitive activity is visible in the world of computer viruses, where one virus writer may embed messages taunting another writer. So, if I wanted to paint, ohh, MySpace as a bad source, then I could stuff the referer_info field with data for a MySpace page (or pages). That’d deflect attention from my scummy sites and have you add MySpace to your site exclusions list - making it more likely that you’ll see impressions from the poorer quality sites.

Even worse, what if the gclid was forged in other requests? That is, you see an excess of clicks received from Google, compared with the clicks they claim they sent. But if the clicks never originated with Google, then the value of gclid for users becomes weaker… If most of the traffic you see has a unique gclid, and you can’t tell which were added tags by Google and which were fake gclids added by botnets to confuse the server analysis, then gclid is rendered pointless for advertisers. Note that Google, because it knows which codes it used, *CAN* identify the real clicks in an analytics package, and is the only party that can do so.

Spiders and the gclid of doom

Now, if users have saved pages with a gclid to a social bookmarking site, then the tag is treated as part of the page ID. That is “/foo.html” and “/foo.html?gclid=hiufuyviuybfkjgkhghjkkh” are treated as two completely different pages. You can certainly eject spiders that attempt to crawl, if the tag is present.

As recommended in Matt Cutts’ blog, you can change the search engine spider response to a tagged page, by adding:

User-agent: *
Disallow: *gclid=*

to your robots.txt file. This will at least mean that you don’t get heavy gclid re-use from random strangers who’ve never seen the advert. But that doesn’t really add much to gclid usage. And why would you want to help reduce your page relevance, just because a tracking tag has been found on it? The search engines should be aware of the use of gclid, and actively removing valid gclid’s from tags. Shouldn’t they, Google? There’s no good excuse, that I can imagine at present, for Google to be crawling pages and identifying them as different pages, just because of tracking tags from their own advertising system.

Could the gclid be more useful?

Absolutely. For example, it should be possible for an advertiser to query the gclid with Google. If I authenticate to an account, I ought to be able to submit a bunch of gclid values and find out:

  • whether this advert impression was for my clients’ account
  • when the advert impression was served
  • where the impression was served (site and AdSense publisher ID)

And I can then infer the delay between impression and click (useful for visitor behaviour analysis). Of course, Google could offer that information, too. And they could save a bunch of work by confessing as to whether they treated the second and subsequent clicks from that advert as being uncharged (e.g. key bounce or users that double click links) or charged (e.g. too long an interval between the first and second clicks to ignore the second click as intentional revenue bearing activity by the user).
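To show what I mean, here is a sketch of the kind of record such a lookup might return, and the delay calculation it would enable. To be clear, this is entirely hypothetical: Google offers no such service, and every name and field below is invented from the wish list above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class GclidRecord:
    """HYPOTHETICAL response shape; no such Google lookup exists."""
    gclid: str
    belongs_to_account: bool   # was this impression for my client's account?
    impression_at: datetime    # when the advert impression was served
    site: str                  # where the impression was served
    publisher_id: str          # AdSense publisher ID


def impression_to_click_delay(record, click_at):
    """Delay between the impression and the click seen in my own server log."""
    return click_at - record.impression_at
```

A gclid for which belongs_to_account came back False would be forged or erroneous, and could be dropped from the analysis.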

I do realise that publishing this information allows bot writers to tailor their bots to avoid detection. In information security, security through obscurity has long been a failed defence. That is, you have to design systems that are open to scrutiny, but that preserve trust. Google’s approach to click quality is equivalent to a failed strategy in information security. It didn’t work for spies. It won’t work for advertising.

Summary

Google’s use of gclid helps Google to identify user activity in response to advertising.

You can enable Google’s use of gclid by turning on autotagging in the “My Account” area of AdWords.

With appropriately written web analytics, gclid will currently allow you some insight into user behaviour, and offers clues about Google’s unpublished policies on counting second clicks as revenue.

gclid can be subverted, and if use were widespread in analysing fraud, one possible reaction from fraudsters could render gclid useless for advertisers.

gclid could be more useful, if advertisers were allowed to verify tag values and extract information from Google - but only if they authenticate as an advertiser, and only about their own clicks. This would help mitigate the effect of forged or erroneous gclids.

The real click fraud problems with Google are undocumented policies on double clicking charges and Google controlled click quality. Not whether you get clicks, but from whom, via which sites.

We are not aware of any other paid search vehicles offering similar unique advert tagging mechanisms.

We are not aware of any web analytics package, other than Google Analytics, that uses gclid by default to help identify paid search adverts and unique users. This is a shame, because there’s a lot you can learn about users at the moment, if you recognise this tag. I’ll gladly maintain a list of analytics packages here, that *do* correctly and usefully (my opinion) handle gclid without special configuration. I expect the list below to be empty of competition for some time:

Web Analytics Packages that use gclid sensibly

  • Google Analytics - basic usage of the tag - doesn’t indicate revisits or whether the tag has been saved to bookmarking sites, doesn’t extract the referrer, etc.

What Should Google Do

Matt Cutts (praise be his name - seriously - the organic search indexes would be a real disaster area without him and his team) has asked what Google should be doing. I’ll suppress my mild annoyance that I’m acting as an unpaid Google product manager and I’ll take the question seriously. As much as I take anything seriously, that is. I haven’t thought a lot about Google’s actions… So expect this section to evolve as I think of stuff or people suggest things.

  1. Document the double click policy. It isn’t fair to advertisers to receive two charges for a single impression, without a rational description for why this is regarded as fair. This *will* open the door for fraudsters - when you know how a system works, it can be easier to subvert it. In InfoSec, the general rule is that if you don’t document, or obfuscate, only the miscreants understand the policy. Sophisticated bot-builders and human clicker fraudsters have already worked out where the edges are… it is advertisers that don’t know.
  2. Check that destination URLs, especially on new accounts, have tracking tags, and use the alerts system to draw attention to adding tracking tags. Document how to use destination URL tags, and autotagging, and what to look for in web server log files, for at least the major web analytics systems - or even allow web analytics vendors an area to document how to use their tags. I’ll drop rank for my articles, here, when Google does document this stuff properly, but that’s the right thing to recommend. Call me Cut-Me-Own-Throat Dibbler and see if I care.
  3. Look at what some of your competitor paid search vehicles are doing. Some of them make tagging *much* easier. For example, one competitor makes it easy to use a redirection server (e.g. Nedstat/Sitestats’ redirection service) and to append standard tags to destination keywords. Google requires an editorial review when I change a tracking tag - painful, that is…
  4. Add some more values that are substituted in AdWords. In addition to keyword, creative, and placement, could do with matchtype, position, search query, at least. Why? It’s a real sweetener to induce people to tag - that’ll help reduce your workload of advertisers who have no real data on which to base their claims, and helps build the third party analytics industry.
  5. When auto-tagging is enabled, add a dummy gclid (e.g. “gclid={gclid}”). This will protect advertisers and help Google to deliver adverts that have fewer 404’s.
  6. Allow advertisers to submit gclid values for checking. At the most basic, Google could simply confirm whether they were issued for an authenticated AdWords account. At most extensive, allow information on the timestamp of impression, position, publisher ID, and URL for the advert - allowing advertisers insight denied when browsers or bots suppress or fabricate referrer_info. I see a difference in ratio of clicks with and without referrer_info, when I compare keyword search and content match - this implies that there’s a different audience for content match and some of it is probably not acting in the best interests of my clients. If advertisers can compare what you think you’ve done, and what is reported, it makes it harder for the bots to hide and for browsers to obscure fraudulent activity.
  7. Allow automated web server log file submissions from advertisers - report on the valid and invalid gclid data found - until web analytics vendors get a clue about analysing marketing information (yeah, I might criticise Google, but I reserve my main loathing for useless web analytics). I’ll guess that you guys have a tool that does this sort of thing for your analyses. You could use it as a pre-screen. See no gclids? Then diagnose “enable auto-tagging”. See a lot of weird gclids and you guys will want to investigate.
  8. Allow advertisers a dial for experimentation. I’ve found ways to control content match that the account strategists I talk to at Google (previously known as maximisers) seem to think are novel. I think you could turn this into a way to allow advertisers to control their risk with content match. Turn the dial down, and you get exposed to better sites, with less traffic. Turn it up and you get more traffic, but it may be less tightly focused and hence lower converting. Also applies to broad match… Just because I bid high to gain position doesn’t mean that I want “chicken sandwich” matched with “turkish feudalism” (I made that one up, it’s based on the etymology of the sandwich).
  9. Automatically remove valid gclids from organic search indexes - combine the page ranks. Tracking tags are not a separate page id. And, for that matter, not just gclid, but Urchin tags (utm_.*), Core Metrics tags (cm_mmc), Nedstat tags (ns_.*), etc. Just because users end up linking to pages with tracking tags, doesn’t mean that Google has to reflect the tracking data as if the page was unique. Does it?
  10. Search is a very powerful tool. Why do so few web analytics packages do anything sane with the data from search? This includes Google Analytics. They aren’t ducking my venom, either. :) Have a conference on “ways to extract data from search, about user behaviour, for marketing purposes”. Have someone from Google ready to talk about anonymity and privacy - there’s a lot of nonsense in the industry and a bit of dancing on eggs that Google does. We don’t identify someone by name. We don’t pass on their IP address. But it is immensely important to be able to do stuff like saying “this user’s search evolved in normal ways” and “this user’s search is not evolving or represents an unusual search evolution that falls outside statistical bounds of normalcy”. We extract that information in order to understand what users are trying to achieve - but it accidentally sheds some light on click fraud…
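The ninth point, treating tracking tags as noise rather than as a separate page id, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Google or any analytics package does it. The parameter patterns come straight from the list above (gclid, utm_.*, cm_mmc, ns_.*); the function name and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Tracking parameter names from point 9: gclid, Urchin (utm_.*),
# Core Metrics (cm_mmc) and Nedstat (ns_.*).
TRACKING_PARAM = re.compile(r"^(gclid|utm_.*|cm_mmc|ns_.*)$")


def strip_tracking_tags(url):
    """Return the URL with known tracking parameters removed from the query."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not TRACKING_PARAM.match(name)
    ]
    # urlencode re-encodes the surviving parameters, so unusual escaping
    # in the original query string may come back in a normalised form.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical tagged URL; only the non-tracking parameter survives.
clean = strip_tracking_tags(
    "http://example.com/foo.html?gclid=abc123&id=7&utm_source=newsletter"
)
```

Two tagged variants of the same page reduce to the same string, which is the property you want before combining ranks or counting unique pages.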

Material Disclosure

Merjis manages paid search and does a little SEO. We also have some custom web server log file analysis, and a redirection service, search analysis and other similar software - unproductised, so far…

Change History

2007-08-17 - added first paragraph link to the new Google Click Traffic Quality resource, and noted, only here, the similar Yahoo!Search Marketing click traffic quality resource.

2007-07-26 - added link in the second paragraph to John K’s blog about Google results. Tightening up? Could be. I’ve some more analysis coming up shortly and may be able to infer changes. I’ve already screwed the hatches closed for my clients, so detecting the effect of Google cleaning up may not be possible for me. The loss of business to those clients is small, but the cost was high - really poor ROAS, couldn’t justify the spend, and we’re not usually compensated on a percentage of spend basis; no skin off my nose to implement ways to get a better ROAS.

2007-07-23 - added reference to the PaidContent.org article. And fixed an embarrassing typo in the change history, where I’d put 25% and not 15%. Oops. Adjacent key error and haste. Tut tut.

2007-07-21 - more error corrections and clarifications, mostly surrounding security-through-obscurity paragraphs.

2007-07-20 - tightened up some language; corrected typos; now a little less nasty about the self-serving scum that generate AdSense pages with no intention of helping users or advertisers - I was in need of committing a random act of kindness, I guess, but I’m feeling better now, thank you; added new section on “what Google should do”, following Matt Cutts’ comment. Added new second paragraph about the Forbes report. And yes, I do believe in an average of around 15% for low quality clicks for an unsophisticated advertiser using Google defaults for an account and bidding to be on page 1. I’ve got some numbers here…

2008-03-15 - updated link target for “other naughtiness” in the first paragraph - the draft had changed state and was getting a 404. Link rot.

2008-03-23 - added link to new Google Blog article by Shuman, about the kinds of data that they inspect and the inferences that they make.

"Click Fraud, Google AdWords and gclid" was published on July 16th, 2007 and is listed in google, intent, advert automation, adwords, click fraud, web analytics, conversion, content match, trust.


Click Fraud, Google AdWords and gclid: 20 Comments

  1. Jeremy Chatfield wrote,

    Google’s new property (FeedBurner) reports 25 visits from Palo Alto to this posting, within 24 hours of publishing. Highest response to any posting here, from a specific location. Hmm, I wonder…

  2. Matt Cutts wrote,

    I read through this and enjoyed the summary at the end, but I would have liked if you’d included 3-4 “I wish Google would do X” items, where X is very concrete/actionable. You hint at some of it (e.g. ability to verify or authenticate gclid tags).

  3. Jeremy Chatfield wrote,

    Thanks for the comment Matt!

    Mildly depressing though… You only liked the Summary? I tried so hard with the rest, on a really dry topic… :)

    Anyway, added ten points of “Things a Googler Can Do”. And tweaked and tightened up the text a bit. Ohh, just see the changelog at the end :)

    Cheers, JeremyC.

  4. Ari wrote,

    Excellent post, as usual!

  5. Richard Ball wrote,

    Thanks for the links. Nice that Matt commented. Pity there’s not a Google blogger on the PPC side to respond. We could do with an advocate, eh?

  6. Jeremy Chatfield wrote,

    Ari - thanks. I must do some catch up on your research, soon!

    Hi Rich, no problem with the links. You write up some insightful stuff.

    There’s the Inside AdWords crew, but it’s mostly pretty basic stuff. Looks more like corporate messaging from Google to the masses, rather than Matt’s personal voice plus experience. He’s a pretty canny blogger, steering the shoals the way he does.

    Oh, and I found your comment in my Akismet spam folder. I guess WP does the Akismet check before it checks to see whether you have posted before. Useful, in a way… but I almost missed your comment and submitted you to deletion. If I wasn’t so interested in spotting blogspam trends, I’d have missed it.

    Cheers, JeremyC.

  7. Google, Trust, Content Match, Placement Reports | Merjis Internet Marketing Blog wrote,

    […] You could try to look at each site, as soon as they appear on a placement report. (Look at my article on gclid to understand why you need this report, rather than using referrer_info.) But reviewing sites manually has a cost, and the frequency of appearance may be low. That is, of the 572 sites that appear in the report, more than 550 have two or fewer clicks. So you could invest five minutes per site to check the quality… a dozen to twenty sites per hour. 160 sites per day if you get fast… almost four full days to check this campaign. At developed economy rates, you’re looking at around £500 to £2000 to get this done (assuming you pick up someone that doesn’t need training). Well, that cost would blow the ROAS out of the water. It’s about 25% of the spend of the campaign, and way over the margin that most clients offer. […]

  8. Igor wrote,

    Great article!
    Have you checked the salesforce-adwords Adwords Lead Tracking functionality?
    Looks like Google opened its gclid tracking functionality to those guys…

  9. Jeremy Chatfield wrote,

    Hi Igor - nope, haven’t looked at SalesForce since the original announcement of cooperation with Google. Good idea.

    Cheers, JeremyC.

  10. Nicki C wrote,

    >>>>slightly more confusing is what happens on the web server. A request comes in shaped like “/foo.html?gclid=juiuyvyuvuyvjhasfdhgkhj”. The user’s browser will show what the server delivers…

    Sometimes…..

    From Google Help:
    “A small number of websites do not allow URL parameters and serve an error 404 page”

    So what happens if you spend money on Adwords and you happen to have one of those small number of web sites? You check all the URLs within Adwords and then find out weeks later that the Adwords investment has sent people to 404 error pages.

    What exactly is a small number? How many Adwords users have ended running Adwords campaigns that deliver people to 404 pages?

  11. Jeremy Chatfield wrote,

    Hi Nicki C -

    covered briefly in the tenth paragraph of the article. IME, this affects people using web stores with “primitive” search queries embedded in page URLs, who are directing visitors to a page in the web store. From my client base, it’s around 1% of “normal” small advertisers. That’s still in the area of few hundred million dollars since this was introduced - I’ll hazard the guess that there could be a US class action suit in this observation.

    My fifth recommendation would solve this - if Google added a dummy gclid when auto-tagging is enabled, then Google’s bots, Editorial Reviews and the account holder clicking on adverts, could all see the problem. Diagnosing the cause of a 404 with a valid base URL would be somewhat tricky and may be why Google hasn’t done this, considering how easy it should be, technologically.

    Cheers, JeremyC.

  12. John Nagle wrote,

    Advertisers here are complaining about “low-value clicks” draining their ad budgets. Typically these are from “bottom-feeder” web sites made for advertising. There’s interest here in trying to generate better “exclusion lists” for Google AdWords, but nobody seems to have anything that works.

    We may have something.

    We have a system, available at SiteTruth.com, which evaluates web sites for business legitimacy. SiteTruth tries to find the real-world business behind the web site, reading through the site, checking business directories, checking SSL certs, and performing other tests. We’re working on this as a technology for improving search, and a patent is pending.

    The online advertising community may be able to use these ratings, by extracting the referring site from incoming clicks, using SiteTruth to identify low-value domains, then blocking them via the Google AdWords exclusion list.

    So we’d like to run a test. If you are running an AdWords campaign with high per-click costs, know which clicks actually generate revenue, and have the tools to extract the referring domain from your clicks, we’d like to talk to you. We’ll want you to send us a list of referring domains (up to 10,000 or so). We’ll rate them and send the ratings back. Recalculate your ROI with the bottom-feeder domains excluded from both ad cost and revenue, and tell us the results.

    John Nagle / SiteTruth

    (Data available for US and UK only, please; we only have business databases for a few countries.)

  13. Spam in Comments, Unattributed Content | Merjis Internet Marketing Blog wrote,

    […] An older article has an interesting and recent comment by John Nagle, about a system to reduce exposure to spammy sites. Personally, I don’t think it will work for most advertisers. IME, the sheer quantity of low volume sites means that such a system will expend a lot of effort to deny sites that wouldn’t re-appear anyway. Put it like this… In a few tens of minutes I can sign up to new free domain host, and be publishing a spammy site, with an aged domain. However, getting traffic from each advertiser, sufficient to trigger a rejection, is a slow process - the site will make money. I believe that while making new, ranked, spam sites is faster than detecting and removing them (for all advertisers - not just the one) the problem will continue to drag down all content advertising. […]

  14. John Nagle wrote,

    The anonymous comment above “while making new, ranked, spam sites is faster than detecting and removing them … the problem will continue to drag down all content advertising” deserves an answer.

    The technical problems of blocking the bottom feeder ad sites quickly can be solved. SiteTruth goes out and looks at any site on request, and comes back with a rating immediately for ones it knows about. New sites take a minute or so to rate. So it is technically possible to keep up with the domainers.

    The conventional wisdom is that this problem can’t be solved. The conventional wisdom is wrong.

    John Nagle / SiteTruth

  15. Jeremy Chatfield wrote,

    Hi John - the “anonymous” comment was mine - it’s a pingback from a more recent article here. I’ll write a more comprehensive article about the problems that I ran into when I tried something similar, two or three years ago. Perhaps you’ve solved them :)

  16. Nancy wrote,

    I found this page when searching for the “gclid” code info. I was doing this because I realized that for weeks now, I have had no business from my google ppc advertising and yet they are charging me more than ever. When looking at some server info from my backroom at bluehost, I discovered that people were receiving a 404 page not found message and the links they were going to were my website link with the gclid code tacked on to the end. So then I did searches on google that pulled up my ppc ads and when I clicked on them, sure enough–page not found and the links in the address bar had that code tacked onto the end. I won’t bore you with all the details except to say that although Google is quick to suspend my ads when my website is down, they clearly don’t notice when their own codes debilitate the link. After finding your article I discovered the feature of autotagging (which I had never heard of) and went into my google adwords account and sure enough it had been enabled. I disabled it and now the links are working fine. But, Google owes me hundreds of dollars…and I didn’t set that autotag to on…so who did? I recently read a NY Times report on Google’s high profits… could this be why? Because my business had tanked, I went in and added new search words and upped my bids etc. etc… all the while, people were clicking on my ppc ads and going nowhere. This is not good.

  17. Nancy wrote,

    Additional note –I forgot to mention that both my hit counter and my stats at 123count.com show fewer than 1/4 of the hits (on just one of my landing pages) than google has charged me for since I put that page up.

  18. Jeremy Chatfield wrote,

    @Nancy - IME, this affects a small fraction of users. It is findable, and fixable, by Google. It is worrying that autotagging does seem to be enabled without positive action, and that automatically enabling autotagging without a check by Google that it is safe, can destroy any value that you might have gained from advertising.

    I’m reasonably certain that automatically enabling content match, “edgy content” pages and low quality domain parks accounts for more profit to Google than auto-enabling autotags.

    Since you can’t use gclid tags, and you probably don’t have any other tags on the destination URL, you won’t be able to positively identify whether visitors arrive from Google AdWords - when you have Content Match enabled you may see a large fraction of visitors arrive from sites that are not Google owned. Given that you have had gclid problems, you may well see reduced visitor counts.

    If you still see reduced counts after disabling autotagging then there’s probably a different problem - Google is usually *very* good about delivering paid clicks. The main question with Google should be the quality of the click, rather than the volume. I have cross verified hundreds of thousands, perhaps millions of clicks over the years, and delivery is simply not a significant problem.

  19. Dave Lineman / ClickTrue.net wrote,

    To add to this thread about click quality and click fraud, our company has been looking at this for many months as well. Advertisers are seeing an overall increase in volume and lowering of quality across the board. What we are seeing is that the obvious “click farms” are giving way to a large group of sites on the “official” partner networks from Google and Yahoo. We have a system that immediately “crawls” a web site as soon as you receive a click from it. While this won’t eliminate click fraud, it allows people to almost immediately detect suspect clicks and quickly block these sites. Web 2.0 style voting allows advertisers to benefit from the voting and blocking of other advertisers. Until Google and Yahoo dramatically increase their scrutiny of new sites, this is the best way we have seen to help stem the tide.

  20. Mark wrote,

    Hi All,

    I’m pretty new to this search marketing, so forgive me if this is a silly question.

    I use affiliate marketing to generate revenue for my web site (getting paid per sale for Experian & Equifax credit reports etc.).

    One of the affiliate marketing companies now captures all referral details (Yahoo / Overture id’s, keywords etc. & Google’s gclid).

    With the Yahoo information, I can easily identify which keywords have successfully converted for me.

    Is there a way for me to find out which particular ppc keyword the gclid actually relates to in the first instance?

    This would really help me to work smarter with my ppc budget, and evaluate where my profit is actually coming from

    Many Thanks,
    Mark
