Effective Internet Marketing Strategy and Technique Through Experiments, Measurement and Audit

Link Spam, Google Analytics and Content Match

What concepts join link spam, Google’s web analytics program and AdWords/AdSense Contextual Match? It feels like the challenge the Flying Karamazov Brothers offered: bring any three items to the show and they’d juggle them. I liked the sticky-slippery combination of bread dough, watermelon and whole fish…

The answer is, of course, Matt Cutts. Specifically, his two-week-old posting about reporting paid links to Google. When I saw this posting, I immediately thought, “gee whillickers” (or words to that effect)… “why not use your own blog to catch link spammers?”

I’m going to ignore my own qualms about whether I should be helping Google, even indirectly. I’m no longer entirely convinced that an information citation model should be responsible for guiding so much wealth around the internet. However, I’m putting that issue to one side for examination elsewhere.

The importance of failure

Actually, I’d previously had that thought about using blog spam for a positive effect. I have an aged unpublished article in my WordPress queue entitled “Identifying Bad Content Publishers Through Your Blog”. Article 60, right between the unpublished “Did Spambots Eat My AdWords Budget?” and the unpublished “Where’s Jeremy? - a Geotargeting Investigation”. I tend to get an idea, and start writing the article background, while the experiments are in progress.

Scientists often fail to write about, or have published, experiments that don’t pan out. It’s pretty easy to design an experiment badly, or simply to not get a result that would be regarded as positive. It’s also possible to design an experiment, but just not get results that other people find positive - the Ig Nobels being the result. While scientists should report negative results, you don’t get good citations or reputation from trying a lot of stuff that doesn’t work, even if well designed and executed.

I’ll admit that I tried to use my blog spam (and email spam) to identify spammers and see if I could match them to AdSense publishers.

Experiments - What Use Are They?

Many years ago, a weak understanding of science was underscored by Maggie Thatcher’s new government. I was working on low energy housing, in a university research group. Lady T’s government wanted to eliminate research done by non-industrial bodies that could and should be done by industry. I’ll avoid crowing too much about our success (better than 95% of data was collected in the target measurement period, versus some of the industrially sponsored projects that collected less than 10% of the useful data)… the key point is the visit from the civil servant, charged with finding us to be useless. “So, Mr Chatfield, what will success mean?” “We’ll have more data so that we can test the model of domestic energy usage.” “But if you are successful, how much energy will you save?”

“If we knew the answer, it wouldn’t be an experiment.”

The actual dialogue stumbled on in this vein for a while, as I slowly realised that this civil servant was from the Greek-Latin-History side of the British educational gulf, and he realised that I wasn’t going to name a specific energy saving figure as the goal. Even the concept of what an experiment does was missing from the discussion.

It’s clear that, consciously or unconsciously, there’s a feeling that experiments have to produce a result that is more than “we collected some good data and it has a meaning”. You actually need to have detected the neutrino, saved energy, or eliminated spammy publishers - or your work is of less value.

For good science, you really do need negative results. That is, you need to know the cases where a prediction was made, data was collected and the answer wasn’t what you wanted. So, I’m offering the result that…

With my dataset, I couldn’t identify blog spammers who were also AdSense publishers

I had about 10^6 content match clicks to examine, and 10^4 blog spam comments to look at. It wasn’t enough.

In any case, it was never clear to me that there should be a link between spamming blogs and being a weaker AdSense publisher. A publisher that is attempting to use SEO to promote their site is a good thing. At least they’d be less likely to be using AdWords Arbitrage. That is, driving volume by SEO is probably better than driving volume by PPC - another piece of research in the queue.

But…

Google has many orders of magnitude more data

Not only that, but Google is interested in paid links.

What is blog spam? It is, at least partially, an attempt to affect Google’s PageRank algorithm by automated means, sponsored by a site that wants an improved organic result and more traffic. That’s pretty close to the data that Google wants to find. Only a subset of link spammers will use blog spam, and it is a technique that they could stop using. But wouldn’t it be nice if at least one form of spam was solidly squelched?

Could Google use blogs, such as Matt Cutts’ own blog, to identify sites that are being actively promoted?

Detecting blog spammers

One technique that can be used is to check whether JavaScript gets executed. Bots tend not to run JS. The result is that bots that spam blogs should be invisible to web analytics packages that rely on a JS-based web beacon (page bug). Google also offers a free web analytics tool that uses a JS-based web beacon. So the next question is…

Does Google Analytics Count Blog Spammers as Visitors?

The answer is a resounding “hooray!” or maybe just “yay!”

My readership seems to be dominated by people who read on weekdays. At weekends my direct blog readership rate drops below my blog spam rate. This suggests that Google Analytics is dismissing blog spammers from web analytics results (or that there’s a way to submit comments without triggering the JS, e.g. via a blog reader that also submits comments without using JS to do so). In turn that gives an independent way to check on tools like “Akismet” - if Akismet marks something as spam, and I can’t see a web server log trail that shows a real user, then the comment is very probably link spam.
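The cross-check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Comment` and `BeaconHit` records, the IP-based matching and the 30-minute window are all hypothetical, and neither Akismet nor Google Analytics exposes data in this shape. The idea is simply that a comment flagged as spam, with no JS beacon hit from the same address shortly beforehand, is very probably bot-submitted link spam.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Comment:
    ip: str
    posted_at: datetime
    flagged_spam: bool  # assumed input: verdict from a spam filter such as Akismet


@dataclass(frozen=True)
class BeaconHit:
    ip: str
    seen_at: datetime  # assumed input: a page view recorded by a JS web beacon


def probable_link_spam(comments, beacon_hits, window=timedelta(minutes=30)):
    """Return flagged comments with no beacon hit from the same IP
    in the `window` before the comment was posted."""
    hits_by_ip = {}
    for hit in beacon_hits:
        hits_by_ip.setdefault(hit.ip, []).append(hit.seen_at)

    suspects = []
    for comment in comments:
        if not comment.flagged_spam:
            continue
        had_visit = any(
            timedelta(0) <= comment.posted_at - seen <= window
            for seen in hits_by_ip.get(comment.ip, [])
        )
        if not had_visit:
            suspects.append(comment)
    return suspects
```

Matching on IP address alone is crude (shared proxies and feed readers would confuse it), so the output is better treated as a list of candidates than as a verdict.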

Examining the content of the comment can drive up the confidence. The comments that I see tend to be pretty link heavy, and tend to have a poor linguistic structure - I think that tools from Natural Language Processing could probably identify the text as unusual.
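As a rough illustration of the “link heavy” signal, the sketch below scores a comment by the ratio of URLs to words. The function names, the URL pattern and the 0.2 threshold are assumptions chosen for the example, not measured values, and this is far short of real Natural Language Processing.

```python
import re

# Simple pattern for http/https URLs; deliberately loose, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def link_density(text):
    """Fraction of whitespace-separated tokens accounted for by URLs."""
    words = text.split()
    if not words:
        return 0.0
    links = URL_PATTERN.findall(text)
    return len(links) / len(words)


def looks_link_heavy(text, threshold=0.2):
    """True when the comment has at least `threshold` URLs per word (assumed cut-off)."""
    return link_density(text) >= threshold
```

For example, `looks_link_heavy("great post http://a.example http://b.example")` is true, because two of the four tokens are links, while an ordinary sentence with no URLs scores zero.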

At least for a while, Google should be able to use a combination of Matt Cutts’ blog, other honeypot blogs, web analytics, NLP, and a big database of links, to identify sites that are definitely being promoted by at least one type of link spammer.
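To make the aggregation step concrete, here is a minimal sketch that takes spam comments gathered from several blogs and counts, for each linked domain, how many distinct blogs it was spammed onto. The input shape, the `promoted_domains` name and the `min_blogs` cut-off are hypothetical; nothing here reflects how Google actually handles this data.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def promoted_domains(spam_comments, min_blogs=2):
    """spam_comments: iterable of (blog_id, comment_text) pairs already judged to be spam.

    Returns {domain: number of distinct blogs carrying spam links to it},
    keeping only domains seen on at least `min_blogs` blogs.
    """
    blogs_by_domain = defaultdict(set)
    for blog_id, text in spam_comments:
        for url in URL_PATTERN.findall(text):
            host = urlparse(url).hostname
            if host:
                blogs_by_domain[host].add(blog_id)
    return {
        domain: len(blogs)
        for domain, blogs in blogs_by_domain.items()
        if len(blogs) >= min_blogs
    }
```

Requiring a domain to show up on more than one blog is one simple guard against a single stray comment, though it does nothing about someone spamming on a competitor’s behalf.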

Conclusions

If Google used blogs to reap sites that are using blog spammers, it could point to at least a significant subset that is clearly prepared to pay for links, using entirely automated techniques. Google could be doing this already, and simply hasn’t told anyone (at the time that I started this research, I did a search for relevant articles, but couldn’t see anything to suggest that this was being done already).

This technique might, perhaps, pretty much overnight, crush my daily blog spam volume as the word gets out that using a botnet to spam blogs is now useless or even positively hazardous to a paid link program. There’ll be an eventual rise, again, as the smarter botnets avoid using Google blogs and honeypots, but at least we’ll get some decent quality blog spam from someone that has had to work for it, in the interim.

Of course, it could also be used by SEOs to report on each others’ behaviour, anonymously. If you think that a competing site is using paid links, and you don’t want to be identified by Google, use a blog spam campaign targeting the competitor - so their links come under scrutiny. I like game structures in which the community gains a benefit from having the bad guys expending effort on each other. It can get closer to a positive sum game. This could even be a way for Google to pay blog spam projects such as Akismet - the data that Akismet gathers is potentially economically valuable to search engines…

Reaping blog spam may, though, have the perverse effect of reducing the value of content match for advertisers… as sites that previously used SEO to drive visitors are forced to consider using AdWords Arbitrage to gain visitor volume again. I simply don’t have access to enough data to answer that… It’s a good question, though.

Anyway, time to get back to thinking about what makes for a good keyword, advert and landing page. My day job :)

"Link Spam, Google Analytics and Content Match" was published on April 30th, 2007 and is listed in google, advert automation, adwords, SEO, web analytics, content match.


Merjis Internet Marketing Blog is powered by WordPress and the YUI-Mainstream Theme by Buzzdroid.com