This is the other article on the End Of Life As We Know It. See also “Rev A”, published previously.
A recent posting by Matt Cutts, inviting people to report paid links to Google, set me thinking. One train of thought led to a fairly predictable (for me) posting. The other has led in some stranger directions, involving psychology, microeconomics, and game theory. I think it leads to characterising the next generation of search engines…
First, the collapsed history… How did we get here, and where is here, anyway?
Yahoo!
In the beginning was the web, and the web was without navigation other than embedded hyperlinks. And Yahoo! looked on the web and saw that it was without form. And Yahoo! said “let there be a Directory”. And there was. Users looked upon the Directory and saw that it was good.
In 1994, Yahoo! was *the* way to navigate. Human reviewed directory entries guaranteed that you reached sites that were plausibly relevant to what you were trying to achieve. Of course, there is a limit… a moderated directory depends on the rate of reviewing, and the number of reviewers. At some point, there is a diminishing return to the directory service - adding another reviewer doesn’t increase the value of the directory, even if the reviewer is supported by extensive technology. (I’m going to ignore DMOZ, for the sake of brevity).
The big idea to carry away is that a human moderated directory is one way to navigate, reliant on a shared perception of the categorisation system, and limited by the number of humans that review.
Taxonomy and the Long Tail
People researching user behaviour find that users look for all sorts of stuff on the web. While a large proportion of searches fall into easily identifiable categories, another comparably large chunk consists of unpredictable words and unpredictable combinations of words. This observation falls under a rule sometimes known as “Zipf’s Law”, the Power Law or the Long Tail. This type of behaviour means that a categorisation system will tend to run into problems… You really need a search system to supplement or replace the directory system.
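To put a rough shape on the long tail, here is a minimal sketch. It assumes query popularity follows an idealised Zipf distribution (the query at rank r is searched in proportion to 1/r) and it invents a population of one million distinct queries. Both numbers are assumptions for illustration, not measurements.

```python
def zipf_head_share(num_queries: int, head_size: int, exponent: float = 1.0) -> float:
    """Fraction of total search volume captured by the `head_size` most
    popular queries, assuming the query at rank r has weight 1 / r**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_queries + 1)]
    return sum(weights[:head_size]) / sum(weights)


# Hypothetical population: 1,000,000 distinct queries, of which a directory
# might plausibly categorise the 1,000 most popular.
head = zipf_head_share(num_queries=1_000_000, head_size=1_000)
tail = 1.0 - head
print(f"head: {head:.0%} of searches, tail: {tail:.0%} of searches")
```

Under those assumptions the thousand most popular queries cover only a little over half of all searches. The rest is spread across queries that no categorisation scheme is likely to anticipate.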
This puts “search” into the same space as “directory”. It’s just a different way to navigate a set of resources. The neat thing about search, as we know it, is that it is automatable. Useful directories require a human to spot the category and review the site, and require a shared concept of the categorisation system. Search needs a server farm and categorises on the fly by identifying the search query words in, or referencing, the documents being indexed. That makes “search” scale better than “directory” - hence the popularity of “search”.
Search Engine Systems
The first generation of search engines used the words in the document in order to rank. If you asserted that the document was about “ball bearings” frequently enough, in the document, then you ranked for that term. That’s not too bad an idea. Documents about a topic tend to mention it.
It doesn’t mean that the document that mentions the term most frequently is the best document. I accidentally stumbled into this in first generation search engines. I was working for a startup in 1994 and I put together pages that named third party products, prominently, in the title, keyword, description, first header and in the first paragraph. Those pages ranked above the original manufacturer pages, because the phrases were so heavily seeded in the right places. We had people writing to us under the impression that we were the manufacturer. Their assumption was that by being first listed, we were the best answer for a product specific search, and that must mean we were the manufacturer, even if we denied it.
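A toy version of that first generation ranking shows why seeding worked. This is a sketch of plain term counting, not the algorithm of any particular engine. The two pages and the “Acme” name are made up for the example.

```python
import re
from collections import Counter


def term_frequency_score(document: str, query: str) -> int:
    """Count how many times the query's words occur in the document."""
    counts = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return sum(counts[term] for term in re.findall(r"[a-z0-9]+", query.lower()))


def rank(documents: dict[str, str], query: str) -> list[str]:
    """Order document names by descending term-frequency score."""
    return sorted(
        documents,
        key=lambda name: term_frequency_score(documents[name], query),
        reverse=True,
    )


# Hypothetical pages: the manufacturer mentions the product once,
# the reseller seeds the phrase wherever it will fit.
pages = {
    "manufacturer": "Acme makes precision ball bearings for industrial use.",
    "reseller": "Ball bearings. Buy ball bearings here. Ball bearings in stock: "
                "all ball bearings shipped today.",
}
print(rank(pages, "ball bearings"))
```

The reseller page scores 8 against the manufacturer’s 2, so it is listed first, even though it says nothing that the manufacturer page doesn’t.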
Environmental Influences on Search
If you are an academic researcher, you think in terms of documents being rationally organised. If this paper (here) refers to this other paper (there), then there is a link between the two, with authority flowing through citations. So long as the people submitting documents are in an environment where citation is a reward system and incorrect citation is penalised (by reducing the probability of being published and by being regarded as an idiot), you get good linking and can infer important papers from the number of citations they get. You can use the linking system of citations to identify the most influential papers in a sphere of knowledge. In an academic environment, PageRank looks like a good idea.
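For readers who haven’t seen it, the citation idea can be sketched in a few lines. This is the simplified textbook form of PageRank (repeatedly passing each page’s score along its outbound links), not whatever Google runs in production. The damping value of 0.85 is the commonly quoted one, and the four “papers” are invented for the example.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank by power iteration.

    `links` maps each page to the pages it cites. Pages with no outbound
    links spread their score evenly over every page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: three papers cite paper_c, which cites paper_a.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_d": ["paper_c"],
    "paper_c": ["paper_a"],
}
scores = pagerank(citations)
print(max(scores, key=scores.get))
```

The most cited paper comes out on top, and the one paper it cites inherits most of that authority. That is exactly the property that makes a link worth buying once the pages have commercial value.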
At the time that Google got going, the majority of links were formed “naturally” with no thought apparently applied to how the links between documents would affect document popularity. Google’s PageRank was therefore able to trump directories and first generation search engines. Documents cited other documents on the same “virtuous reward” cycle as in academia.
The problem is that navigation is valuable. By appearing in position 1, as a highly referenced document, that page is marked as being popular. It’s the best. Even if it doesn’t rationally appear to be the best response to a search, there’s a built-in tendency to go with the crowd. Popular is good - popular means that someone doesn’t have to do the research and become expert. They can go with the answer that many others have picked and can trust that, even if not the best, it’s a pretty good answer.
Again, that’s not a problem… until documents have value. As soon as documents have an economic value, then links have a value. Once links have a value, then they can be formed for reasons other than the page being the best. It might simply have the most paid links.
Links have a value
As the web has matured, the people that develop it have become increasingly aware of how valuable traffic reaches web pages. If your site is well designed and you naturally get inbound links, then you rank well. If your site is designed for reasons other than search engine ranking (for example, you use AJAX in a way that isn’t SE friendly), then even though your site is potentially popular, it won’t be ranked - unless you buy links or do something else that provokes links.
So the position now is that you can increasingly tell only how well a page is referenced, but you can’t tell whether a population of users would find the page to be relevant. The original basis on which PageRank worked is now broken. Documents cite other documents, not because they are trusted, but because the link is of value, because the terminal web page has monetary value. The cycle of virtuous links has been broken, because citation is no longer based on merit, but on value.
The result is that Google has, over the years, added patch after patch to the PageRank algorithm, in order to try to avoid believing in the votes from untrustworthy sources. Meanwhile, Black Hat search engine optimisation experts find other ways to create links that Google might believe, and marketers try to find ways to get pages mentioned in viral (social networking) sites. This imposes costs on all sides. It costs Google to use talented staff to dismiss links as irrelevant to the citation. It costs companies who pay SEOs to create links that may later be dismissed as irrelevant by some future patch that Google applies.
Identifying Influence
There are clearly some techniques whose only purpose is to present link value to search engines in order to influence ranking. Matt Cutts has written about penalising hidden links, for example. A system that identifies hidden links, even if hidden using CSS, is potentially do-able.
What about other links? I think we can take as a model, spam.
Assume that spam costs next to nothing. So does a link. Assume that automation is used to eliminate spam (using heuristics and Bayesian statistical systems and shared notifications and so on). Assume that link spam can be eliminated from a list of links, in the same sort of way - a bunch of software can clean the list, using the same sorts of statistical techniques that work for email spam.
Let’s assume that the engineers for link despamming and email spam rejection are of similar competence. For comparison, I’ll use my Gmail account for some statistics, and we’ll assume that Google uses similar qualities in the staff for Gmail and the search results.
My Gmail account shows a few hundred spam messages per day, of which a couple of handfuls creep through every month - about one a day. That’s a false negative rate (spam that evades the filter) of about 0.5%. I also lose real email to my spam filter. I have to check the traps every so often and try to recognise email from people I know, to rescue the message. I find a variable amount per month - recently it’s been around 10-20 false positives per month (emails identified as spam that I actually wanted) - that’s about 0.25% of what lands in the spam folder.
If you were interested in link spamming, then using a variety of techniques you might be able to achieve a similar result. In other words, of every 1,000 links, a few would count. So it becomes economically interesting to generate large tranches of links in the hopes that some of them evade the filters. Do this enough, and the system starts categorising good links as bad.
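The arithmetic behind those percentages, and what it implies for links, can be written out. This is a back-of-envelope sketch: “a few hundred” is taken as 200 spam messages a day and the false positives as 15 a month, both assumptions chosen to match the rough figures above. The function names are made up.

```python
def rate(events: float, population: float) -> float:
    """Events as a fraction of the population they were drawn from."""
    return events / population


def expected_surviving_links(links_created: int, evasion_rate: float) -> float:
    """Links expected to get past a filter with the given evasion rate."""
    return links_created * evasion_rate


SPAM_PER_DAY = 200             # assumption: "a few hundred"
MISSED_PER_DAY = 1             # spam that reaches the inbox
FALSE_POSITIVES_PER_MONTH = 15  # assumption: midpoint of 10-20
DAYS_PER_MONTH = 30

false_negative_rate = rate(MISSED_PER_DAY, SPAM_PER_DAY)
false_positive_rate = rate(FALSE_POSITIVES_PER_MONTH, SPAM_PER_DAY * DAYS_PER_MONTH)

print(f"false negatives: {false_negative_rate:.2%}")
print(f"false positives: {false_positive_rate:.2%}")
print(f"links surviving per 1,000: {expected_surviving_links(1_000, false_negative_rate):.0f}")
```

If a link filter were only as good as that spam filter, about five links in every thousand would get through. Since a generated link costs next to nothing, the spammer’s answer is simply to make more of them.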
Of course, you can add filters that check on the rate of link formation - Matt has already written about doing this. So the link spammers just need to throttle the rate to a level that seems rational. Of course, this might mean that sudden net surges in popularity could also be penalised. Humorously, if there was a better competitor to Google, we might not learn of it because enthusiastic user endorsement would trip the excess link rate filters?
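A crude rate filter makes both halves of that problem visible. This sketch is my own guess at the general shape of such a check (flag any day whose new links exceed some multiple of the trailing average), not a description of what Google does. The window, the threshold and the link counts are all invented.

```python
def flag_link_surges(daily_new_links: list[int], window: int = 7,
                     max_ratio: float = 3.0) -> list[int]:
    """Return the indexes of days whose new-link count is more than
    `max_ratio` times the average of the preceding `window` days."""
    flagged = []
    for day in range(window, len(daily_new_links)):
        trailing = daily_new_links[day - window:day]
        baseline = max(sum(trailing) / window, 1.0)  # avoid a zero baseline
        if daily_new_links[day] > max_ratio * baseline:
            flagged.append(day)
    return flagged


# Hypothetical histories: a spammer adding one more link each day,
# and a genuinely popular site that gets a sudden burst of attention.
throttled_spam = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
genuine_surge = [10, 10, 10, 10, 10, 10, 10, 200, 10, 10]

print(flag_link_surges(throttled_spam))
print(flag_link_surges(genuine_surge))
```

The throttled spammer is never flagged, while the genuine surge is flagged on the day it happens. Tightening the threshold catches more spam but penalises more real enthusiasm.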
I’ve recently started seeing sites that simulate user contributed content (high entropy text, English-like sentence structure, appropriately embedded links, forum and blog presentation) clearly intended to attract the attention of search engines. So long as the postings look plausible and don’t either duplicate content or suffer from characteristic signatures, they are likely to evade the filters. I have no idea whether these currently evade the filters. I don’t, particularly, care… Because there is a compelling economic incentive to develop a Turing capable system - something that, to a blogger or discussion forum, looks like a real user, but is actually just a piece of software.
Forget the even more difficult issues of what constitutes a paid link. The big issue is that we are now, I think, within a few gnat’s whiskers of having systems that are indistinguishable, to a search engine spam filter, from a human. Add a botnet so the sources become unidentifiable… Well, when that happens, the whole citation based model will fail, catastrophically.
We are now up to date…
Moving Forward
I’ll avoid further describing more problems. If you’ve got this far, you can probably infer what else I’d say. If not, think about it for a while.
Are there possible solutions that result in “fair” ranking systems, when monetary value is a consideration, and links themselves have become a currency? I’m not going to describe a specific search engine, but the kinds of techniques that should work, and that might avoid some of the timewasting tactics that are used now. Are all of these techniques right? Probably not - today, I’m a search marketer, not a search engine developer. I spend most of my time thinking about segmentation and offers, bids and bidding strategies, snippets and page design - not search engine design.
Because ranking has value, links have value. Inevitably people will be compensated to generate links (either manually or by machine). So a future system should probably allow advertising (paid ranking) - otherwise we’ll end up with the same situation that “natural” linking strategies are influenced by payment.
Separation of paid and organic results stems from the feeling that organic results are in some way cleaner, purer and more noble. Are organic links better than paid search links, when they are subject to undeclared and (if correctly executed) undetectable influence? What’s better - knowing that the advert has been paid for because that company wants you there? Or that the results are unguessably tweaked by a combination of unknown and unknowable links and patches made to a hidden ranking system?
Why separate paid search and organic results?
Google (and other search engines) have been scrupulous in separating paid search and organic. Once you accept that organic results will inevitably be influenced by paid linking, the division begins to look rather arbitrary. If it costs as much to run a linking strategy as to run paid search… what’s the difference?
Well, accuracy of targeting, for one thing! You can’t use negative keywords on organic search, and you can find that you get ranked for geoterritories that are not of interest (or fail to be ranked more highly for relevant geoterritories). Paid search is a much more precise system for targeting than organic. For certain types of search, paid search is arguably more useful to users than organic. Note that caveat - *ONLY* for certain types of search - those towards the end of the buying process, not the early phases.
This suggestion will inevitably be criticised by people who never buy anything on the internet, or who find the whole concept of commercial messaging repellent, or who use search engines primarily to find non-commercial resources. Annoyingly for those people, there are a bunch of users who are seeking commercial responses. These are typically searches of the form “{insert company name here}” or “{insert product name here}”. Paid search activity tells me that paid search adverts can achieve more than 60% CTR on very specific brand and product keywords - so at least some users think that paid search results are currently comparable in value (in terms of presenting their traffic) to organic search. (Note: I do know that some users can’t tell or don’t need to tell the difference between a URL bar and a search form and the effect that has on paid search).
In fact, for some search users, I’d argue that *only* adverts represent the best results. The costs of advertising in a crowded market mean that you get quite focused on making sure that the right advert shows to the right consumer. Accidentally showing an advert to the wrong consumer can eat your budget for no return.
For other searches, there are no plausible adverts… So the model needs to allow research oriented citation based linking. But if the end page is economically (rather than informationally) valuable, then links become equivalent to a paid advert… How to achieve this transition and recognition? I have no idea. Yet. Give me an incentive to think about it :)
Geotargeting
The next system that takes over from PageRank will probably need geographical relevance. Problem there, of course. Current web site designs make finding the business location very difficult. Worse, PageRank focuses on the “World” part of “World Wide Web”. There is no current mechanism that consistently presents search results because the business is local - though Google Local is making an attempt to do so. This is probably a hangover from academic days - geography isn’t a barrier to citation, but it sure disincentivises a lot of business.
If I look up “lawnmower repair”, I’m probably not considering travelling more than a few tens of minutes from where the mower is. It’s cheaper to buy a new mower if I have to travel much further than that (especially if I include my opportunity costs as a consultant). Ideally, when I enter “lawnmower repair”, I’d be offered opportunities for a local repair shop, as well as documents about DIY repair. The way in which indexes are formed doesn’t make this a natural factor.
Local search is really important, though. An awful lot of business is conducted very locally, and this is likely to become increasingly important for environmental reasons. A “green” search engine would offer local resources first. It’s the old environmentalist mantra - “Think globally. Act locally.”
Accountability
Search directs billions, trillions of dollars of spend, worldwide, with nothing other than a promise to do no evil… if you are lucky to get even that. The move to online business means that search engines, which used to be economically irrelevant, will probably not be allowed to get away with making and breaking businesses on a whim. I suspect that the avalanche has yet to start. When the right snowball hits the right part of the mountain… it will collapse. Is there a defence against this?
Right now, the Search Engines often try to disguise the links that they use in order to choose rank. That’s because the links have value, both to other search engines (discovering the links is part of the battle for improving results) and to obscuring the techniques used to deny artificial boosts to rank. Unfortunately, the current preoccupation with links means that other techniques that depend on an unpolluted set of links can’t work. I suspect that one way to handle the paid search problem is to identify all the links that were used to increase the position.
This will mean that subtractively one can guess that a spidered resource is being deliberately ignored or zero-weighted. Is that a problem? Yes. Ideally, I think one should be able to offer all links that were considered. When you have billions of dollars of sales riding on the results, being able to demonstrate that you haven’t got an arbitrary weighting that biases the results, is pretty important.
I expect that a new generation of search engine may actively publish the resources that it used to determine the assigned rank. That way, users can actively challenge resources that they think are incompatible with a fair rank. But since the system shouldn’t be reliant entirely on weight flowing from other pages, links alone should be a less important part of the control system.
Positive Sum Game
The critical factor for a new generation of search engines will be, I think, that there is a positive sum game from interaction between users, suppliers and the SE. Right now, I suspect that the game is negative sum - each side gives up something to make a profit (users don’t get the best results, Google expends effort to reject “unfair” links, and businesses either don’t rank as they should, or pay for activities to promote rank). If it can be made into a positive sum game, then everyone wins. How to do that? Hmm. Good question.
Material Disclosure
We perform search marketing on behalf of clients. We advise on search marketing. We have not, yet, paid for a link for ourselves or on behalf of clients, other than from a human mediated directory service (e.g. Yahoo!) or from a professional association (who may link to members as part of a legitimate service to identify professional membership).
Our stance so far has been that White Hat SEO is ethically the correct position. Web sites that engage users can benefit both user and brand. These sites should be rewarded by higher ranking than sites that… don’t deliver user needs. However, good sites are not always rewarded by links. Some businesses are intrinsically pretty boring online and aren’t going to be enthusiastically endorsed by links. For example, sites that are largely oriented to a drive offline (from web visitor to phone call, such as is done by many professional organisations and consultants), are less likely to have content online that generates links, no matter how highly their clients value them. Their competitors will use paid links… Makes me wonder about abandoning my pure white horse, and rubbing at least a little mud on my white hat. Perhaps I’m just growing up, and out of idealism, at last.
Update
26th April 2007
Danny Sullivan has a really good description of PageRank. It made me realise that I’m mixing the use of PageRank to mean both the original, citation like system, and as a model for inferring value to users based solely on information held in pages and links. There may be a Rev C, then, as I clarify the differences.
30th April 2007
Should cite some earlier stuff - especially Jeremy Zawodny’s 2005 article about paid link value, referencing his 2003 article on PageRank.
Also see Gray Wolf’s contribution to the “what is a paid link, anyway” memethread, and a similar “Oops, I think we broke Organic Search” moment.
I was actually trying to steer clear of the whole “what is a paid link” problem, and address the basic problem. Why is a link model based on citation, a vote for economic utility for a searcher? Information searchers and buyers are two different groups - aren’t they? Is there a better way to organise rank for people trying to buy stuff? Is a better method, in fact, used, while Google waves a magic wand and calls a whole bunch of different techniques PageRank, because PageRank used to be a magic word of power?
7th May 2007
Ian Feaveryear (a prolific and helpful AdWords Help Forum poster) has a response to Jill Whalen’s article about relying on SEO to generate traffic. Much though I respect Jill’s opinions on how to do SEO, businesses do rely (perhaps not solely) on ranking highly. The causes are complex - such as users being unable or unwilling to distinguish between URL bars and search fields, and so your site has to rank highly on the terms that define your URL.

Jeremy Chatfield wrote,
Cool - John K says I should be hired by Google. I was thinking Baidu. A billion users in a fast growing economy? I’ve been trying to learn Chinese using internet resources…
Link | April 25th, 2007 at 11:47 am
Rev A: SEO, Game Theory and Intrinsically Corruptible Systems | Merjis blog wrote,
[…] I’m pretty sure that I can see a rising interest in, and reasons for, dislodging Google’s search and search advertising dominance. However, the causes are complex and the way in which it would happen are even more subject to unpredictable accidents. I’ve now written this article twice, with different perspectives… And I couldn’t think of why I shouldn’t publish both. So here’s Rev A. […]
Link | May 4th, 2007 at 7:01 am
Google is destroying the web! | Merjis Search Marketing Blog wrote,
[…] Adam Lasnik, Google’s missionary to the heathens, fired up to convince webmasters to use links only if they fit the citation model, wrote a few months ago that he and Matt joked that people are often bragging they have an undetectable technique to raise rank. The interview (second link in this paragraph, to Stone Temple) makes it clear that the focus of their effort is paid links on high PR sites. But the economic significance of Google has another effect: web design and web marketing decisions are influenced by the commercial reality that high rankings drive wealth. […]
Link | July 30th, 2007 at 9:29 am