Effective Internet Marketing Strategy and Technique Through Experiments, Measurement and Audit

Anatomy of a Web Spam Attack

We’ve recently watched spammers at work, from initial contact through to dropping a litter trail across a site. This happened on one of our own sites rather than a client site, so no client data is involved, and the activity was clearly not intended to benefit us or other internet users. We’re therefore happy to share the analysis of the attack.

If you are doing academic research into web spamming, or come from an enterprise involved with search engines, or content management, especially out of the user-interactive Web 2.0 space, we’ll gladly let you have the logfile trace and commentary. We don’t think that any law enforcement agency would do anything about this, but we’ll gladly pass the information on to anyone in law enforcement. We don’t think that sharing this information is in breach of any UK statute, such as the Data Protection Act, and we do think that the intent to deceive for purposes of financial gain negates ethical and moral obligations of privacy.

We wrote a Content Management System for Internet Marketing in 2004, for our clients to use - it’s designed to be a White Hat, search engine friendly CMS, with end user accessibility factors designed in. In one of the presentation modes, it is a Wiki, and it is as a Wiki that we saw the attack. So this isn’t the same as blog spam, or spamming a discussion forum.

Is this particular spam activity massively significant? No. It’s not even the most common form of spam attack that we see, but it shares many features with other attacks. We hope that when you’ve read the analysis, you’ll understand a bit more about why people spam in the places that they do, how they spam, what you can do to defend your site(s) from spam, and to cast a little light on how this has become important and who the other significant actors are.

Outline of the activity

On the 9th August, we see a first usage, using a real email account. The future spammer makes a legitimate and mildly helpful change to the sandbox site. The following day, at roughly the same time of day, new but spammy links are added. On the third day, same sort of time, two active users from the same IP address make some spammy changes. And then the day after that, a flood of spammy links and text are submitted by yet another user.

A look at the anatomy of the changes gives some insight into how these guys think. On the first day, the arrival is from a site that lists CMS’s. The changes on the second day, when the first spammy links are added, start with a direct jump to the site. There is no referer_info - the spammers jump straight to the site. Is this browser bookmarks? Possibly - this spammer is using Opera, and we suspect that they are using the Speed Dial mechanism of Opera. IMO, Opera is a lot easier for fast local bookmarking than MSIE, FF, Safari or Camino (the other browsers I regularly use for various purposes).

The spammer then checks to see which pages rank best, verifies the best search for that page, and submits some spammy links. How do we infer this? The search string to find the page on the site includes a space prefix for the search. When you copy/paste from Google search results, that initial space character is fairly hard to avoid. It could be another search engine with the same characteristics, but why research well ranked pages on something that isn’t a target? All of this circumstantial evidence means that we suspect that the search was performed on Google, but can offer no direct evidence for that - not having Google’s records, and all…

Spammer searches for “ Installing on PostgreSQL 8” - inferred from logfile entry for http://sandbox.merjis.com/_search?q=+Installing+on+PostgreSQL+8
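A minimal sketch of how the leading space can be recovered from a logged request, assuming the search term arrives in a `q` query parameter as in the URL above. The function name `search_term` is ours, purely for illustration, and is not part of the CMS.

```python
from urllib.parse import urlsplit, parse_qs

def search_term(url):
    """Return the decoded value of the `q` parameter from a logged URL, or None."""
    query = urlsplit(url).query
    # parse_qs decodes "+" as a space, so a leading "+" becomes a leading space.
    values = parse_qs(query, keep_blank_values=True).get("q")
    return values[0] if values else None

term = search_term("http://sandbox.merjis.com/_search?q=+Installing+on+PostgreSQL+8")
# A leading space hints that the term was pasted rather than typed.
pasted = term is not None and term.startswith(" ")
```

A leading space is only a hint, of course. It says the term was probably pasted, not where it was copied from.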

After the first spammy submission, the spammers come back from the initial IP address, but get a new cookie. Hmm. Cleared the cache? New Browser? The User Agent signature looks the same as the previous sessions, so probably just cleared the cookies… In this session, it looks as though the spammer has performed concurrent activities to see what the site gets up to, and to test interaction with a second user editing content. So this is probably not one user, with two sessions, but two users.

We next see that the spammers are not coming from a consistent IP address. After the initial concurrent activity, we can see that the IP address changes. One of the users, but only one, is switching between IP addresses, independently of the activity. If this is really an asynchronous activity, this may form a recognisable signature. You’d need something other than the usual web analytics to detect this, though, as the switches between IP addresses are pretty quick. You’ll need something that thinks the cookie is more important than the originating IP address.
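As a rough sketch of what "thinking the cookie is more important than the IP address" could look like, the following groups log records by cookie and counts how often the IP address changes between consecutive requests within a short window. The record layout `(timestamp, cookie, ip)` and the two minute window are assumptions for illustration, not a description of our own tooling.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def fast_ip_switches(records, window=timedelta(minutes=2)):
    """Count, per cookie, consecutive requests whose IP changed within `window`.

    `records` is an iterable of (timestamp, cookie, ip) tuples (assumed layout).
    """
    by_cookie = defaultdict(list)
    for ts, cookie, ip in records:
        by_cookie[cookie].append((ts, ip))

    switches = {}
    for cookie, hits in by_cookie.items():
        hits.sort()  # order each cookie's requests by time
        count = 0
        for (t1, ip1), (t2, ip2) in zip(hits, hits[1:]):
            if ip1 != ip2 and t2 - t1 <= window:
                count += 1
        switches[cookie] = count
    return switches
```

A cookie with several fast switches is worth a closer look. A cookie with none, or with one change hours apart, probably is not.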

User sessions from a mobile wireless modem from someone in a car or on a train are a bit like this, but usually better localised to a network address range. Simply reacting to a changing IP address isn’t right.
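One hedged way to tell the two cases apart is to look at how widely a cookie's addresses are scattered, rather than at whether they change at all. The sketch below counts the distinct networks behind a list of IPv4 addresses. The /16 prefix length and the threshold of two networks are arbitrary assumptions that you would need to tune against your own traffic.

```python
import ipaddress

def network_spread(ips, prefix=16):
    """Return the set of networks of length `prefix` that contain the given IPv4 addresses."""
    return {ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips}

def looks_scattered(ips, prefix=16, max_networks=2):
    """True when the addresses fall in more networks than a roaming user plausibly would."""
    return len(network_spread(ips, prefix)) > max_networks
```

A mobile user hopping around inside one provider's range stays within a handful of networks, whereas the sessions we saw jumped between unrelated ranges around the world.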

Oh ho… later the same day at the same time we first saw the developer, we get the Ukraine IP address back with the initial cookie. So this suggests that we have at least two Ukrainian users.

Then a little while later the third spammer starts up. Activity starts on the same IP address that was used before, but has a new cookie. Is this the same machine+user, but with a cleared cookie, or a different machine+user? Bit of weak design to keep starting at the same IP address. Oh, and some of the other IP addresses that it uses are also the same, or in the same IP address range, as the previous session, in the same order. That’s lame.

Fun! After pausing activity for a minute, looks like the second spammer is back, same IP address, but needs a fresh cookie - is this a *fourth* user? Ah… probably not - you need to do cookie clearing to allow logging in from different accounts. Does some work, and a minute later, back again from the same IP address as the last activity, but needs a fresh cookie again. Looks like testing of the login process. Each of these new cookie groupings is followed by a new login and registration of a new address. Oh, *not* testing cookies. That must have been done before. This is setting up for usage of multiple identities. So we capture the accounts registered for each of these logins. Ah ha, we see all three cookies, with a chain of deletion and re-issue, being used, over a period. So this looks like three machines, possibly three different users, at work, from behind the same NAT-ed firewall, and at least one of the users is using proxies, perhaps open proxies, to try and disguise the origins.
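The observation that fresh cookies keep starting from the same address can be sketched as a grouping of cookies by the IP address of their first request. Again the `(timestamp, cookie, ip)` record layout is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict

def cookies_by_first_ip(records):
    """Map each starting IP to the cookies whose first request came from it.

    `records` is an iterable of (timestamp, cookie, ip) tuples in any order.
    Only starting IPs shared by more than one cookie are returned.
    """
    first_seen = {}
    for _ts, cookie, ip in sorted(records):
        first_seen.setdefault(cookie, ip)  # keep only the earliest IP per cookie

    groups = defaultdict(list)
    for cookie, ip in first_seen.items():
        groups[ip].append(cookie)
    return {ip: cookies for ip, cookies in groups.items() if len(cookies) > 1}
```

Several cookies sharing one starting address is consistent with cookie clearing or with several machines behind one NAT-ed firewall. The grouping alone cannot tell you which.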

Once the setup of new accounts is done, the spammer is off on a trip around the site looking for any formatting rules and examples, and ways to trigger emails to users and mailing lists.

Then we get a burst of posting from multiple identities. Typically one posting from each identity, unsynchronised with the IP switching. So you could get two users on the same transient IP? Nope, looks like the switching is too fast for that. Looks like you might get a request to edit content from one IP address, and the spammed content is submitted from another IP address.

Looks like one person/browser does initial research and page identification, and one does user account authentication and the third adds spammy links? Yes, from the logs, the first user does little on the site other than identifying it and doing some test postings. The second user seems mostly to set up accounts, with some test postings. It’s the third cookied user/browser that does the spam postings, I think. Three machines or three users? Hard to tell.

IP Addresses

For the first few days, it is a consistent IP address that appears to be in Ukraine. Later we see new cookies issued to users who start on the Ukrainian IP address, but rapidly switch off to IP addresses around the world.

Even more interesting is that the software is switching the IP address during an activity. That is, it can start by navigating to a page in the CMS from one IP address, and edit a piece of spam from a second address. This implies that whatever switches the proxy is not coordinated with user activity.
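A small sketch of how that mid-activity switch might be picked out of a log. It assumes, hypothetically, that an edit form is fetched with a GET and saved with a POST to the same path, and that each record is `(timestamp, cookie, ip, method, path)`. Your CMS will almost certainly differ in the details.

```python
def split_ip_edits(records):
    """Return (cookie, path, get_ip, post_ip) for edits fetched and saved from different IPs.

    `records` is an iterable of (timestamp, cookie, ip, method, path) tuples.
    Assumes (hypothetically) GET fetches the edit form and POST to the same path saves it.
    """
    last_get = {}  # (cookie, path) -> IP of the most recent GET
    found = []
    for _ts, cookie, ip, method, path in sorted(records):
        key = (cookie, path)
        if method == "GET":
            last_get[key] = ip
        elif method == "POST" and key in last_get:
            get_ip = last_get.pop(key)
            if get_ip != ip:
                found.append((cookie, path, get_ip, ip))
    return found
```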

Think… Tor? Yes! These IP addresses look to be Tor exit nodes. Kerching. These guys are using Tor to try and disguise their origins, but aren’t particularly clever about it - or they’d have started using Tor, and probably visited a few innocuous sites first, to help disguise their origin. They are also pretty dumb about clearing cookies, or we wouldn’t have detected this. We did not do a DNS log cross check, but I expect that we’d have seen some DNS lookups associated with these accesses, probably directly from the spammers - they don’t appear to be using the latest best practice efforts in anonymity, only the most obvious tools, poorly understood.
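If you already hold a list of exit addresses that was valid for the period in question, the cross check is simple. This sketch assumes such a list is supplied by you as plain address strings. How you obtain it, and how well it matches the dates in your logs, is outside what the code can vouch for.

```python
def cookies_using_exits(records, exit_ips):
    """Return {cookie: sorted list of its IPs that appear in `exit_ips`}.

    `records` is an iterable of (timestamp, cookie, ip) tuples (assumed layout).
    `exit_ips` is an iterable of address strings that you supply.
    """
    exits = set(exit_ips)
    hits = {}
    for _ts, cookie, ip in records:
        if ip in exits:
            hits.setdefault(cookie, set()).add(ip)
    return {cookie: sorted(ips) for cookie, ips in hits.items()}
```

Cookies that never touch a listed address drop out of the result, which keeps ordinary users out of the analysis.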

User Identity

These spammers use multiple user identities. They log in as users from:

  • Yahoo Mail
  • Google Mail
  • Microsoft’s Hotmail

They also register a few made up mail services, but of course can’t authenticate those, so don’t use them in the actual spam deposition. This usage of fake email addresses appears to be a check that the accounts are actually used for verification. I suppose we shouldn’t be surprised that the main email addresses used for generating web spam are all from free email services sponsored by companies with search engines. It is rather ironic, though. If the search engines want to control web spam, one thing they could do would be to better control opening new webmail accounts. Yeah, right.
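Flagging registrations from free webmail services is easy enough to sketch. The domain list here is a short illustrative assumption, not a complete or authoritative list, and a flag should prompt extra scrutiny rather than an automatic rejection.

```python
# Illustrative, incomplete list of free webmail domains (an assumption).
FREE_WEBMAIL = {"yahoo.com", "gmail.com", "googlemail.com", "hotmail.com"}

def email_domain(address):
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].strip().lower()

def is_free_webmail(address, domains=FREE_WEBMAIL):
    """True when the address belongs to one of the listed free webmail domains."""
    return email_domain(address) in domains
```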

One way to cause costs to spammers is to require additional verification steps. They’ll be more likely to use a service that doesn’t require verification by email. However, the more difficult you make interaction with a site, the less the interaction with users. This suggests, at least to me, that grades of identity are probably useful on the internet. A single user might want to maintain multiple grades of identity. The highest grade allows authenticated financial transfers - much like a bank. The lowest level grade is fully anonymised, protects identity above all, but has no trust - you wouldn’t let an authentication at that grade do anything on the site, except, perhaps, to browse.
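To make the idea of grades a little more concrete, here is a toy sketch. Only the top grade (financial) and the bottom grade (anonymous, browse only) come from the description above. The two middle grades and the action names are our own assumptions, added to show how a site might map actions to a minimum grade.

```python
from enum import IntEnum

class Grade(IntEnum):
    ANONYMOUS = 0       # fully anonymised, no trust
    EMAIL_VERIFIED = 1  # hypothetical middle grade
    ENDORSED = 2        # hypothetical middle grade
    FINANCIAL = 3       # authenticated well enough for financial transfers

# Minimum grade required for each action (illustrative policy).
REQUIRED = {
    "browse": Grade.ANONYMOUS,
    "comment": Grade.EMAIL_VERIFIED,
    "edit": Grade.ENDORSED,
    "pay": Grade.FINANCIAL,
}

def allowed(grade, action):
    """True when `grade` is at least the grade required for `action`."""
    return grade >= REQUIRED[action]
```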

Why so draconian on the authentication issue? Because trust is on both sides of the fence and trust is what allows positive sum games to develop. The more paranoid users who don’t want to allow a business any insight into how users use a site, deny the business information that potentially allows the site to be improved. Businesses don’t (IME) collect data to be used personally. They collect data so that they can see which pages work for users and which pages cause frustration. You don’t usually get that data from feedback forms, you get it from users abandoning the site on specific pages. It doesn’t matter who the user is, just that they have abandoned their session and don’t return. So cookies and authentication can help users to a better web experience - even though they can also be used for targeting adverts (umm, is that such a bad thing? Wouldn’t adverts be less annoying if they actually related better to your interests? How about using an AdBlocker rather than preventing tracking within a site?)

Here’s another intriguing number… How many different accounts do these spammers use? One identifiably real email address. And 13 other spam-validation addresses from search engine webmail services. So expect that any claims from the SE’s for user accounts are inflated by a large factor - it seems likely that most of them are bogus users… as if you hadn’t guessed that from the email spam you receive every day ;)

Automation and Botnets

When we first saw this, we suspected that it was a developer programming a bot to spam. We now think that the costs of this development and the relative inflexibility of software response, means that much spam is actually generated by humans. There certainly are bots active, but the analysed activity, and much of the activity on this blog, is human driven via real browsers, or astonishingly good mimics of those browsers.

Put it like this… When Indian SEO’s offer unique text links to your site, from topic relevant forums, blogs and social networking sites for $7.50/link, and article writers offer unique articles for $10.00/article, then the cost of developing content and links is so low that developing smart enough software to identify relevance, is too much. Even a smart AI programmer in a low cost economy can probably make a higher daily rate, in the short term, by personally spamming sites, than by writing software to do so.

This is because one of the costs of webspam, to a spammer, is making rejected postings. For example, this blog rejects about 99.9% of all postings, because most postings are spam. Most of the postings here also appear to be made by a series of bots. This is why they are rejected - they are insufficiently unique. Those that are unique are usually rejected by me or my peers here, because the content is dull, uninformative and irrelevant (“Your site design is good. Visit my site too.”, followed by a list of pills, porn, gambling, shoe and car links).

So a high quality spam effort won’t be automated, yet. It’ll be personal. Until the AI is improved. Work on beefing up AI and self-improving systems continues, but until we make a jump, the cheapest way to infer meaning is to use low cost human effort. And humans leave characteristic signatures so far not emulated by bots - such as setting up Tor connections badly, and leaving cookie trails.

What Can CMS’s do to protect themselves?

Open proxies are a problem for many Information Security reasons. Given how much has been written about the evils and dangers of open proxies, you’d think system admins would have stopped using them. However, there’s still a bazillion of them, and plenty of people running probes to find them.

So, do yourself and the rest of the world a favour. Find out if your organisation is hosting an open proxy and see if there’s a way to make it more secure. Even having it authenticate against validated users would be better than leaving it completely open…

You might want to consider restrictions on users coming from free webmail services and using anonymising services or open proxies. What you can usefully do and what you can legitimately do, will depend on the environment, but this analysis suggests that teams work on spamming, or at least multiple browsers are used - meaning that the spam attack may come from a related source, but not always one on which you’ve left a cookie or could have left any trace or tracking id. Simple cookie tracking, or even Flash Cookie tracking aren’t going to be enough. The spammer may well come from a different address than the person that first identified the site. If you can, you want to head them off, early… but beware that making signup into an onerous burden may put off the customer. If you make the validation sufficiently difficult that you’d trust a financial transaction, *before* you allow access, then you won’t see much access.

One possible way through this may be to establish trust chains. That is, you only get to add content and links if you have been endorsed by someone else, who has been endorsed. Then, if you get spam injections, all the endorsements in that tree may become suspect. Trust chains are pretty rare on most sites - though places like Facebook, LinkedIn, Orkut are based on the friend of a friend type model, with introductions and described relationships.
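A trust chain can be sketched as a mapping from each user to the user who endorsed them. When one account is caught spamming, everyone who shares its endorsement tree becomes suspect. The data structure and function names below are assumptions for illustration, and a real system would want something gentler than treating the whole tree alike.

```python
def endorsement_root(user, endorsed_by):
    """Follow endorsements upwards until reaching a user with no endorser."""
    seen = set()
    while user in endorsed_by and user not in seen:  # `seen` guards against cycles
        seen.add(user)
        user = endorsed_by[user]
    return user

def suspects(spammer, endorsed_by):
    """Return every user in the same endorsement tree as `spammer`.

    `endorsed_by` maps a user to the user who endorsed them.
    """
    root = endorsement_root(spammer, endorsed_by)
    members = set(endorsed_by) | set(endorsed_by.values())
    return {u for u in members if endorsement_root(u, endorsed_by) == root}
```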

Perhaps the owners of such sites could offer the internet a valuable service, extending their APIs to allow authentication. If you see an authenticated user, then you trust the posting more, even if it looks spammy. If it actually is spammy, then either individually, or collaboratively, you could reduce trust. Of course, this is also hard to do - you can have malicious trust attacks… Take a look at Charles Stross’ stories of Macx, in Accelerando for an entertaining science fictional account.

I predict between 300:1 and 1,000:1 ratio of spam to real comments for this article, over the next few weeks. That’s how bad it is. After the first two months, the ratio will worsen, because older articles are mostly commented on by spammers.

What Can Search Engines Do?

Drat. You would ask that. This is really, really hard to manage, I think.

Since the SE’s are merely using content and have no access to the CMS web server log files, they are even less likely to spot dodgy content.

The main thing they could do now, would be to monitor changing web pages. In our case, we reverted these edits. The CMS has a full version control system, and we simply rolled back the edits. And then blocked the various real and fake accounts…

For a search engine, the signature should be that the pages were modified, saw new links, and then a little while later the pages were reverted and the links removed. If there’s been a spate of other activity for those links, then it may be worthwhile to slow the rate at which PR is given to these sites… Though that would make it likely for the malicious to end up paying spammers to negatively promote competitor sites… Very, very tricky.
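We have no idea how any search engine actually stores crawl history, so treat this purely as a sketch of the signature described above. It assumes each crawl of a page has been reduced to a set of outbound link URLs, in crawl order, and reports links that were absent at the start, appeared for a while, and were gone again by the end.

```python
def transient_links(snapshots):
    """Return links present only in the middle of a page's crawl history.

    `snapshots` is a list of sets of link URLs, in crawl order (assumed layout).
    """
    if len(snapshots) < 3:
        return set()  # need a before, a during and an after
    first, last = snapshots[0], snapshots[-1]
    seen_between = set().union(*snapshots[1:-1])
    return seen_between - first - last
```

Links that show up this way across many unrelated sites would be the candidates for slower treatment, with all the caveats about negative promotion noted above.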

What this means, I think, is that pure citation based models face an end-game… But I’ve already blogged endlessly about that. It means that organic indexes will become increasingly dominated by sites put there by increasingly complex software, and humans from low cost economies, rather than a consequence of human judgement. At least, until the SE’s work out how to compensate for this rather insidious mechanism.

The other thing is to use FaceBook and LinkedIn to offer authority chains. And lo and behold, we hear this week, that FaceBook and Google are in talks… admittedly the publicly stated focus is to allow FaceBook users to be listed on Google. But once that relationship is in place, there’s the opportunity for looking at TrustRank relationships, and business relationships. So once you say that you are related to other users and to a website, you establish a degree of trust - in a way that PageRank type algorithms can probably use.

As you might guess, it takes me weeks to write one of these postings, between other activities. So the Google and FaceBook stuff still isn’t fully resolved in my head. I’ll probably return to this relationship after I’ve thought about it a bit more. I actually started this piece almost a month ago, just after the spam attack and our defensive reaction, long before the G/FB relationship was public.

Summary

Spammers identify higher page rank sites, and then look for the better ranking pages. They identify how they can use the sites, then start adding spammy content, using a range of appropriate forms and multiple identities.

We inspected our web server log files around the time frame of this attack series, manually looking for re-use of identified and previously used Tor exit nodes and the signature of an IP address changing frequently with the same cookie, going back a day before, and up to the point at which access was effectively prevented for this spammer. The primary use of Tor, on this specific site, in this time range, was the described spam activity. We infer that you should expect spammers to try and cover their trails, and expect them to use free webmail email addresses.

Note that the first edit was a meaningful change. This means that a system admin looking for new users to behave badly, would see some helpful changes, and may assume that these users were legitimate. Only the later changes are spam - and the volume of spam is far higher, just a few days later, than the initial changes.

Webspam probably can’t be fully tackled without default installation of webspam tools, like Akismet, on all blogs and discussion forums, and the next step will be to add user authentication, so you know from other sites, whether this identity can be trusted to post sane and relevant information.

Things To Do

If you can demonstrate that you come from an organisation that develops CMS’s or web server log file analysis, or are an academic security researcher, we’ll share the web server log file data of the attacks.

Make sure that you and your organisation aren’t offering open proxies.

Make sure that you use content spam rejection tools, to ensure that your organisation is not hosting spammy links and articles that reflect poorly on your brand value. This will help your own site reputation, trust, whatever.

Join FaceBook, LinkedIn and other professional accreditation sites. If you can link your identity to a professional accreditation and a series of credible users, then there’s a reasonable chance that next generation search will find even your most inane postings to be profound and useful - more so than some anonymous dweeb who posts via anonymising services to offer badly spelled and grammatically incorrect links to replica watches. This is a pre-emptive recommendation - but, as I’ve said in earlier postings, I think we’re at a time on the Internet where economic needs of users and businesses are forcing a review of what constitutes a good link and a good search result.

Don’t accept links on social networks from people you don’t know and trust. This will be the fastest way to have your own reputation damaged, and the future value of your postings and links devalued. At present this may mean offending new or casual acquaintances - I know I feel guilty when I deny links to New Zealand based holiday promoters and South African photographers that I’ve never met… but FOAF spam will become an important tool for people intending to deceive.

Material Disclosure

We do mine anonymised user behaviour from web log file analysis to improve websites. So any opinions above about how harmless this activity is, to users, may be treated with suspicion or with endorsement, depending on the settings of your current memeplex. It comes down to Trust. Do you trust the bulk of postings that I’ve put on here, to reflect a truthful insight into the ways in which a marketing organisation *can* use data? Or am I really a lying and scheming bastard who is clearly depriving you of a right to spam us and others - a right that I regard as being the net equivalent of the right to causelessly shout “Fire” in a theatre? Trust and responsibility is a two way street - users and businesses need to be able to trust each other, in order to make the best use of the internet.

Updates

2007-09-09 Shava Nerad of The Tor Project pointed out that a sentence in the original summary could be read as implying that we were making the claim that the primary use of Tor across the entire Internet was to add blog spam. The summary was amended to make it more clear that this is a time limited analysis of a connected series of spam injections to one web server. We’re grateful to the Tor Project for the opportunity to improve the article.

2007-09-10 minor edits to correct spelling and improve clarity of some sentences, none intended to change the sense, except for one. Under the IP addresses heading was a sentence left over from the period when we thought the attack was part of a software development exercise, and it referred to the code doing something, when we now believe that the attack was purely human in origin.

2008-07-24 FaceBook now launches an online identity platform - single sign on, competing with Open Identity, Google’s own authentication services and the early-to-mass-market Microsoft Passport.

"Anatomy of a Web Spam Attack" was published on September 9th, 2007 and is listed in internet strategy, web analytics, trust, spamfighting.


Anatomy of a Web Spam Attack: 5 Comments

  1. Julian Evans wrote,

    Very interesting article. We are leading the way here in the UK with driving education and awareness to consumers and SME’s.

    Where did you get your source information from? May well be of interest to our visitors. Julian

  2. Jeremy Chatfield wrote,

    Hi Julian,

    This was an analysis of web server log files from one of our servers, running a site we control, using our software. As for the rest - over the years I’ve worked with a bunch of fine people, who have educated me. The errors of interpretation are mine :)

    Your blog was pretty interesting. The overlap with social network sites was intriguing.

    Cheers, JeremyC.

  3. Julian Evans wrote,

    Thanks for your reply. The blog is being used as an awareness driver for consumers… amazed by the interest in it actually ;).. last month we improved the look and feel as well as adding some very useful content. Why not take a look?

    p.s. the social networking issue is a real biggie here in the UK- esp with Facebook.

    Cheers, Julian

  4. Paul Swearingen wrote,

    I did not set up e-DXN although I administer it, and of course I have to deal with spammers, many of them Ukrainian and Polish (I found your site via a Google search for “ukraine spam IP address”).

    e-DXN is a phpBB site, a subscription-only site to radio hobbyists. Those who register must not only provide a real name (so that I can cross-check payments via PayPal or personal check) but a real e-mail address as well as the contents of a graphic posted on the registration page. (We cater to blind hobbyists, so several times a year I find myself registering blind subscribers who cannot see the graphic.)

    If I am lucky enough to catch a spammer online, hacking into the members-only forums, I can easily ban his IP address and even range. But recently when I tried to block 91.*.*.* after I noted one or more Ukrainian spammers trolling our site, I also blocked two of our legitimate members who live in Finland. One was able to provide his ISP’s dynamic range, the other one not, so I’m now trying to figure out how much of a range I can block to keep the Ukrainian spamscum off (one of them seems to be promoting child porn) the site. Therefore, I’m trolling the web to see if I can find known Ukrainian spammers’ IP addresses, plus others, but inasmuch as we have members from Europe and other parts of the world, I may just have to nuke each spam registrant as they pop up, currently at the rate of 2-4 per day. (We don’t have any kind of IP tracker installed now, as the National Radio Club is a non-profit organization.)

    Your comments were helpful and informative to a neophyte. Thanks -pls.

  5. Jeremy Chatfield wrote,

    Hi Paul,

    Given the wide availability of open proxies, anonymising proxies and botnets, trying to block spam by IP address is pretty pointless. It doesn’t take much effort for pests to bypass IP address restrictions. You end up hurting real users, who neither know nor want to know the arcane processes to connect - so you penalise real users when you restrict IP addresses, while causing minimal inconvenience to the pests.
