This blog’s visitor volume slid for a few weeks, a couple of months ago. So did the spam comment volume. It was actually easier to see the slide in the Akismet 15-day spam queue than anywhere else. Spam went down 10% over a period of less than two weeks, and the drop was strongly correlated with visitor volume. This wasn’t immediately obvious. The largely professional readership of this blog has a strong work-week (Mon-Fri) peak. In contrast, automated blog spamming is largely a 24×7 activity. Reader volume changes were only visible after several days, because the level of interest normally varies quite a lot, at random, from one day to the next.
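A minimal sketch of why the correlation only shows up after several days, using made-up numbers rather than this blog's real logs: the weekday peak in visitors swamps a slow slide until both series are averaged over a full week. The counts, the 7-day window and the function names are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean

def rolling_mean(values, window=7):
    """Trailing mean; a 7-day window averages out the weekday/weekend cycle."""
    return [mean(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Made-up data: three weeks starting on a Monday, both series sliding about 10%.
days = range(21)
trend = [1.0 - 0.005 * d for d in days]
visitors = [(100 if d % 7 < 5 else 40) * t for d, t in zip(days, trend)]  # weekday peak
spam = [500 * t for t in trend]                                           # flat, 24x7

raw_r = pearson(visitors, spam)
smoothed_r = pearson(rolling_mean(visitors), rolling_mean(spam))
print(f"raw r = {raw_r:.2f}, 7-day smoothed r = {smoothed_r:.2f}")
```

On the raw daily counts the weekly cycle hides the relationship; once both series are smoothed over seven days the shared slide dominates.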
Blog spam is interesting for several other reasons, of course – not least being that there is a widely used technique to de-rate comments – the NOFOLLOW attribute for links. As with a bunch of other bloggers, I tend to prefer using spam rejection tools rather than nofollow. IMO, it’s better for users to see real comments than to (ineffectively) defuse spamming efforts with NOFOLLOW. CAPTCHA is moderately effective, but still places a burden on real users – and also still allows outsourcing spam generation to low cost economies.
As should be quite obvious (try to submit a comment) we don’t use CAPTCHA, so we get human and machine submitted spam. A lot of machine submitted spam. I normally only see it when I read the spam queue. Yes. I do that. OK, I’m weird. I like to know what the buzz is, and whether big companies are sponsoring spam, knowingly or unwittingly. I discover all sorts of stuff – some of it seriously unpleasant.
I wondered whether there was a way to use machine submitted spam to measure blog rank. I couldn’t find anyone writing about measuring spam to individual blogs and the implications for the blog. The articles probably don’t have enough search rank to appear… I did find a recent article about spam and blogs and particularly about the value of closing comments to old articles. I’m not entirely convinced by the article – although there’s clearly a lot of research gone into it. I think there’s a key point of disagreement – Lorelle VanFossen asserts that blogs are found by robot crawlers, not by people using search engines. She says that blog spam doesn’t start until there is an inbound link. However, AFAICS, it’s the inbound link that causes search engines to rank blogs and allow the results to appear. It seems to me that the evidence for the link being involved is high, but I infer that the link gives pagerank… Something else I need to add to the research queue, I suppose.
Here’s an extract from an email that we’ve recently received, offering me low-cost-economy “link building”. Note the emphasis on relevance. IOW, this is built using search engine results, not webcrawling. This is typical of the unsolicited proposals that we get. These approaches clearly work or they wouldn’t be using this kind of text. They aren’t offering traffic directly, but rank, and they are clearly using search engine results to identify the targets.

My concern isn’t so much how spam starts, as the changes in volume. And that, I think, casts an incidental light on origins. If spam is a consequence of rich inbound links, then spam volume will tend to increase as a blog gains links. A model based on inbound links as the source would not tend to show decreases in volume. Yet… that can happen. Blog spam can decrease as well as increase – just like your other investments.
Why does spam volume change?
The point of spam is usually to improve the page rank of the targeted content, and partly to get readers to click on the links (direct traffic generation rather than indirect through rank increases). Spammers look for high page rank blogs that accept comments, especially when the area of interest is close to the spammy link to be planted.
Monitoring search engine queries that lead to this blog is often informative. Several times per day, I’ll see a search query like “internet marketing blog comments”, “shoe blog comment”, or “adwords blog add comment”. These guys are not looking for a substantive blog in order to join the community – they are looking for a blog that has reasonable page rank and offers the possibility that their spam will be shown. Because I’ve used shoes in examples of advertising in various articles over the years, this blog used to be quite highly ranked amongst shoe blogs. Amusing.
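As a rough illustration of that kind of monitoring, the sketch below pulls the search phrase out of referrer URLs and flags the ones that mention both a blog and commenting. It assumes the search engine passes the phrase in a `q` parameter, and the heuristic and function names are hypothetical, not what this blog actually runs.

```python
from urllib.parse import urlparse, parse_qs

def search_query(referrer):
    """Return the search phrase from a referrer URL, or None if there is none.

    Assumes the engine passes the phrase in a `q` query parameter.
    """
    values = parse_qs(urlparse(referrer).query).get("q")
    return values[0] if values else None

def looks_like_target_hunting(query):
    """Hypothetical heuristic: the phrase mentions both a blog and commenting."""
    words = query.lower().split()
    return (any(w.startswith("blog") for w in words)
            and any(w.startswith("comment") for w in words))

# Illustrative referrers only; the first two reuse phrases quoted above.
referrers = [
    "https://www.google.com/search?q=shoe+blog+comment",
    "https://www.google.com/search?q=adwords+blog+add+comment",
    "https://www.google.com/search?q=what+is+gclid",
    "https://example.com/some/page",
]
flagged = [q for q in map(search_query, referrers) if q and looks_like_target_hunting(q)]
print(flagged)
```

A word-prefix match is deliberately crude; it will miss cleverer phrasing, but it is enough to count how often these searches turn up.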
You can see this effect on other blogs that offer no moderation or despamming of submissions. Here’s a screen shot of part of a page on a fairly new blog that currently has low rank, and no significant protection:

So why does that blog page get no comments for weeks, and then suddenly have hundreds of spam comments? Why can some blogs survive for years with a handful of spammy comments and no identifiable protection, while other blogs get thousands of spam comments submitted, even though there are none or few visible on the blog?
Spam rises non-linearly with rank.
Once the observation has been made, I think there’s a pretty obvious interpretation:
The higher the page rank of a blog, the more that it will be spammed.
There is other evidence. An odd quirk of cut and paste in Google’s results used to lead to an embedded space in the search query. We’ve seen this signature every so often, in our web server logfile analysis. In addition to the obvious spammers looking for a blog to dribble on, there’s an even smarter crew doing product specific searches and then cutting and pasting to get to the site.
From observations of this blog and those of client accounts, the particular spam rejection technique seems irrelevant. Whether a “silent killer” like Akismet is used, or one of the CAPTCHA implementations, or the religious use of NOFOLLOW, spammers will try to submit spam. This implies that it is cheaper to submit spam to a lot of blogs, some of which are protected against it, than to develop the software that verifies whether the spam is accepted and published.
I had a look at web server logfiles, to see if there was any pattern between IP addresses that might indicate a separate spider followed later by a spammer. I can’t see one. I didn’t really expect to do so. Botnets, dynamic IP addresses, anonymising networks – the only reason to be closely correlated these days would be if you *wanted* to be found. The nature of automation is such that there should be no real reason why a technique like measuring packet time of flight should work – because there are no significant time dependencies involved. So I gave up on the idea of genuinely identifying and correlating the sources of spam…
The essence appears to be that spam targeted blogs are found, not by crawling the web, but by using search engines. This is fairly smart – because blogs that are higher in search engine results are obviously being crawled by the search engine spiders, and will contribute more to weighting, than a blog that isn’t even indexed by the search engines. Different spammers will have different search queries for finding blogs to target, but the overall effect will be that higher rank blogs will tend to attract more spammers than lower ranked blogs – because higher ranked blogs will appear in a wide range of results for different but related search queries – kind of like a Levenshtein distance of blogs. Higher ranked blogs will tend to appear on a wider range of searches, as well as being higher in each set of results – so there is an intrinsic non-linearity, where higher ranks attract significantly more spam than would be expected. There is a selection based on position – the higher the position the greater the likelihood of being a target – and on breadth of matching (a higher ranking blog will tend to be more authoritative about a wider range of stuff, with the way that Google ranking works). Hence a massively increased interest as blogs get higher ranked. Or a significant decrease in attention for minor slippages of rank.
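A toy model, not a measurement, makes the non-linearity concrete. Assume a blog's rank is a score between 0 and 1, that the number of related queries it appears in grows in proportion to that score (breadth), and that the chance of a spammer picking it from any one result page also grows in proportion to the score (position). Both proportionalities, and the names below, are assumptions chosen only to show the shape of the effect.

```python
def expected_attention(rank_score, total_queries=100):
    """Toy estimate of spammer attention for a rank score in (0, 1].

    Assumes breadth and per-query pick probability each scale linearly
    with the score, so their product scales with the square of the score.
    """
    breadth = rank_score * total_queries  # related queries the blog appears in
    pick_probability = rank_score         # chance of being chosen per query
    return breadth * pick_probability

before = expected_attention(1.0)
after = expected_attention(0.9)
print(f"attention falls by {100 * (1 - after / before):.0f}% for a 10% slip in score")
```

Under these assumptions a small slip in rank produces a proportionally larger drop in attention, which is the direction of the effect described above; the real exponents are unknown.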
Changes in Blog Rank
Changes in search engine rank appear to drive changes in volume of spam. The question is, what provokes changes in rank? This is a standard exercise in SEO, at least as I practice it.
I’ve used this blog to experiment on the factors that make a blog more, and less, interesting to users. I’ve also got experience of our clients’ blogs. This experimentation accounts, in part, for the frequency of postings here, and the variation in types and quality of article. A few articles (which I am in the process of marking up) are of dubious quality – the question in my mind was “will readers call me on it?”. Some articles are clearly antiquated – events have superseded them – what happens to those? Does the style of an article dictate the number and quality of inbound links? Which is the most popular page, and the page with the most and highest quality links?
Experimental Results
Really interesting results.
Linkbait articles, with or without logical flaws or internal inconsistency, are the most highly linked-to, but receive almost no traffic today, only a few months later.
A long, detailed, one year old, technical article about gclid, and another about adwords conversion tracking from two years ago, attract the most consistent day-in, day-out traffic. The third highly read, long duration article is about AdWords Editorial Review, but nearly all the people who read that are *not* coming from search engines, but from articles posted in the AdWords Help Forum. That’s probably because Editorial Review is still a pretty specialist term, much less likely to get attention than the problems it causes :)
Commenters pay no attention to the date of publication. An article that was correct at the time of publishing may be superseded by events, and will then receive comments as if the article were current.
Embedded links in either the top or tail of an article, referring to updated information, are almost completely unused. The average time spent reading this blog is typically more than two minutes – some days it hits more than 6 minutes as the average time, and only drops significantly below two minutes when we’ve been Stumbled. In other words, the articles are read, in depth – but less than 5% of readers even follow a primary attribution. If I’ve written in response to another article, the source is rarely clicked on.
One of the least clicked links, that I notice, is the one in the gclid article, near the end, referencing Cut-Me-Own-Throat-Dibbler. So far as I can recall, it has been clicked three times in about 18 months, despite being in the most popular article. A few dozen people spend about 4 to 8 minutes each reading that article, each day, or about 10,000 readers spending a cumulative 1,000 hours of reading (an entire solid six weeks of reading). Three in 10,000 = 0.03%. Not a lot of onward clicking, is it? OTOH, the idea of writing an article that has cumulatively taken six weeks of reading is pretty awesome and makes me feel guilty about article content… Important to avoid wasting that much time!
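The arithmetic can be checked in a few lines. The 6 minutes per read is an assumed midpoint of the 4 to 8 minute range quoted above; the other figures come straight from the paragraph.

```python
readers = 10_000       # approximate readers of the gclid article over about 18 months
minutes_per_read = 6   # assumed midpoint of the 4 to 8 minute range
clicks = 3             # recalled clicks on the link

total_hours = readers * minutes_per_read / 60
total_weeks = total_hours / (24 * 7)        # weeks of round-the-clock reading
click_rate_percent = 100 * clicks / readers

print(f"{total_hours:.0f} hours, {total_weeks:.1f} weeks, {click_rate_percent:.2f}% click rate")
```

So 1,000 hours is just under six solid weeks, and three clicks in 10,000 reads is 0.03%.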
Rank is not significantly affected by even quite extensive article rewriting. Some articles have been revised over the years to reflect changing information. They get remarkably consistent volumes and subjects of spamming, before and after rewrites. Comments don’t appear to have much effect on spam volume either. Highly commented articles are no more likely, AFAICS, to attract spam than uncommented articles. An article is more likely to get comments if someone comments shortly after publication. That is partially because most articles have a short lifetime. Once the lifetime is over, they get a few visitors, who rarely comment.
The comment policy page is one of the least visited pages on the site. I’ve written all sorts of stuff on there, over the years, mostly for my own amusement. It has, at times, been a draconian rant against spammers with severe words about leaving meaningful comments.
Changes in the comment policy have no effect whatsoever on spam volume.
Changes in the wording around the comment box affect the volume and type of real comments – but do nothing to dissuade spammers.
Properly constructed outbound links within articles with a FOLLOW attribute (implied by default), to credible sources, won’t damage rank, and won’t cause many readers to leave the site. If the article is interesting, they’ll stay with you. If it isn’t, they’ll leave rather than click on a link in a useless article.
Summary
Monitoring spam levels is an unusual and intriguing way to detect changes in search engine behaviour and the overall rank of the blog.
Linkbait, unsurprisingly, still works to get inbound links. That’s still a great way to get rank. Be contentious. Be provocative. Standard journalistic conventions for dispute and disagreement… Doesn’t engender wisdom, peace and happiness, but does get you noticed. If the article that follows is sufficiently provocative, it’ll get lots of links within minutes.
Article topic and treatment have a huge impact on longevity of traffic. A comprehensive article that solves user problems lasts for ages and may become permanently popular. A newsy article has a lifetime of hours to days and then gets random low traffic.
New articles do more for rank than refreshing old articles. Something in the range of one decent (as in “attracts inbound links”) article per week, will be just about enough to maintain rank. Two would be better. I infer and have observed that search engines will rapidly give rank to a new article, but do not do the same for revised content for an old article.
IME, popular articles covering technical issues should be corrected in place – even if it means significant rewriting. Strictly, I think that re-writing should force a re-dating of the article. Possibly it means that the article should be migrated to “static” content on the main website rather than being a blog article. I’m still muttering to myself about this conclusion, so the advice may change. The reason is that users don’t follow even very obvious and repeated links to later and better information.
Readers rarely call out the author, even on internal inconsistency. That’s actually worrying. If there’s an intent to convey factual accuracy, then the responsibility falls hard on the author. Blogs aren’t peer review…
Use plenty of links to authoritative references and similar articles. Isolating your blog doesn’t make it more authoritative. Referring to other blogs binds you in to the community and generates more inbound links, which brings you more readers than you lose to the links.
Most comments that pass spam filters will also pass any reasonable moderation – but occasionally some comments are clearly intended for the author of the article, or are amazingly off-topic (but not spam – requests for help with specific technical issues only peripherally related, or to do homework exercises, etc).
Very few readers and commenters look at the Comment Policy. Have one on your blog, but make it liberal. Most people are either spammers or really interested, interesting and helpful. I learn a lot by looking at the comments. I’m grateful for the insights you share with me, and exposure of the problems of interpretation that I inflict in the articles.
Use a welcoming message near your comment box, rather than some aggressive statement about spam. Assertive anti-spam statements will dissuade some real commenters and will do little or nothing to change submitted spam volume.
Monitor the searches that lead to your blog – changes in the nature of the searches may reflect changes in how the search engines perceive your content. I’ve found Lijit to be a helpful tool – more helpful than FeedBurner for understanding searches. If you visit this blog frequently – once every few weeks, or so – you can watch how the search cloud changes.
Consider joining the No NOFOLLOW crew. Nofollow hasn’t stopped blog spam, but it has provided a mechanism to penalise sites outside the trusted set. It reinforces isolation of high volume and high ranking sites from the rest and does almost nothing to stem the tide of blog spam. I wish Google had put their efforts into publicly accessible spam filters for blogs and web form submissions, not an attribute that creates extra work for little perceptible benefit. Actually, that’s not entirely true. I’m pretty sure the attribute helps Google. They’ve recruited webmasters and blog writers to do their job for them… identifying the garbage.
Updates
2008-10-21 – edits for clarity and typos. Added new illustration of typical link spamming UCE.


Eugene wrote,
Nice article. Thanks. :) Eugene
Link | October 20th, 2008 at 5:50 pm
website optimization wrote,
Very informative and helpful post for SEO beginners.. Thanks for sharing it..
Link | April 3rd, 2009 at 8:31 am