Identifying link farm spam pages
From the Abstract:
In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are reweighted, providing a modified web graph to use in ranking page importance.
To summarize:
They have developed an algorithm that performs three steps:
1. Generate a seed set from the whole data set.
2. The expansion step to propagate the initial badness value to additional pages.
3. The ranking step which will combine the badness value together with a normal link-based ranking algorithm, such as ranking by popularity, HITS, or PageRank.
To create the initial seed set, they simply look for pages that link to each other at the domain level, so even if a subpage of one site links to the home page of the other, the two sites still count as interlinking. If A links to B and B links to A, both A and B are added to the seed set. A search engine such as Google could perform this check while it is spidering and computing PageRank.
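As a rough illustration of the seed-set rule as summarized in this post (not the paper's own code), here is a minimal Python sketch. The function names `domain` and `seed_set`, the `(source_url, target_url)` input format, and the `.example` URLs are all made up for the example, and using the URL host as the "domain" is a simplifying assumption.

```python
from urllib.parse import urlparse

def domain(url):
    """Return the lowercased host of a URL (a crude stand-in for 'domain')."""
    return urlparse(url).netloc.lower()

def seed_set(page_links):
    """page_links: iterable of (source_url, target_url) pairs (hypothetical format)."""
    # Collapse page-level links into domain-level edges, dropping links within one domain.
    edges = set()
    for src, dst in page_links:
        s, d = domain(src), domain(dst)
        if s != d:
            edges.add((s, d))
    # Two domains that link to each other both go into the seed set.
    seeds = set()
    for s, d in edges:
        if (d, s) in edges:
            seeds.add(s)
            seeds.add(d)
    return seeds

links = [
    ("http://a.example/cars/page1.html", "http://b.example/"),  # subpage of A -> home of B
    ("http://b.example/links.html", "http://a.example/"),       # B links back to A
    ("http://c.example/", "http://a.example/"),                 # one-way link, no seed
]
print(sorted(seed_set(links)))
```

Here the subpage on a.example and the links page on b.example are enough to put both domains in the seed set, while c.example, which only links one way, stays out.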
For the expansion step, they pick a threshold, for example 2. If site C links to 2 or more sites in the seed set, it is added to the list of “bad sites”. So if C links to both A and B, it is also considered bad, but if it links only to A, it isn’t brought into the set.
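The expansion rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `expand` function and the `out_links` dictionary are hypothetical names, and repeating the check until nothing changes is an assumption suggested by the later mention of sites "a few degrees of separation" away, not something this post spells out.

```python
def expand(out_links, seeds, threshold=2):
    """out_links: dict mapping each site to the set of sites it links to (hypothetical format)."""
    bad = set(seeds)
    changed = True
    # Assumption: keep sweeping until no new site crosses the threshold.
    while changed:
        changed = False
        for site, targets in out_links.items():
            if site in bad:
                continue
            # Count the distinct bad sites this site links to.
            if len(targets & bad) >= threshold:
                bad.add(site)
                changed = True
    return bad

out_links = {
    "A": {"B"},
    "B": {"A"},
    "C": {"A", "B"},  # links to two seed sites -> pulled in
    "D": {"A"},       # links to only one -> stays out
    "E": {"A", "C"},  # pulled in once C is marked bad
}
print(sorted(expand(out_links, {"A", "B"})))
```

With a threshold of 2, C joins the set because it links to both A and B, D stays out, and E is caught on the next pass because C has by then been marked bad.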
Once the set of bad sites is identified, they suggest leaving the links between those sites out of the equation when measuring the number of incoming links. So when performing a backlink check for site A, the links pointing to A from B and C would NOT be taken into account: those links would simply be ignored or downgraded in value.
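To make the "ignored or downgraded" idea concrete, here is a small Python sketch of a weighted backlink count. The name `inlink_scores` and the `bad_weight` parameter are assumptions for illustration; the post does not say what weight, if any, an engine would actually assign.

```python
def inlink_scores(edges, bad, bad_weight=0.0):
    """Sum incoming link weight per site.

    edges: iterable of (source_site, target_site) pairs (hypothetical format).
    Links where both ends are in `bad` get `bad_weight` instead of 1.0.
    """
    scores = {}
    for src, dst in edges:
        weight = bad_weight if (src in bad and dst in bad) else 1.0
        scores[dst] = scores.get(dst, 0.0) + weight
    return scores

edges = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B"), ("C", "B")]
bad = {"A", "B", "C"}

print(inlink_scores(edges, bad))                  # links inside the bad set ignored
print(inlink_scores(edges, bad, bad_weight=0.5))  # links inside the bad set downgraded
```

With the default weight of 0, site A is credited only for the link from D; the links from B and C contribute nothing. The resulting weights could then feed a ranking method such as PageRank instead of a plain count.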
What does this mean for SEOs?
This seems like an obvious, simple check for any engine to make, and it appears to be scalable. It is therefore a fairly safe assumption that most engines are already implementing something like it.
So if you are interlinking your own sites, or exchanging direct links with other sites, the engines have a simple algorithm that can detect this and ignore the links. Additionally, if you LINK TO more than 1 site that is already in the seed set, your site will be considered part of the same spam alliance and your links won’t count.
For example, suppose you have a site about cars, and you link to a site about Toyotas and a site about Hondas. If the Toyota and Honda sites link to each other, and you link to both of them, your site will be seen as associated with these 2 sites and you’ll be brought into the “penalty” set.
So it’s also important to be careful who you link to, and to make sure that you’re not linking to sites that are engaging in extensive reciprocal link campaigns. If you link out to them, and you also receive links from sites that are part of that reciprocal link scheme, then all the sites will most likely be pulled into the penalty set, and none of those links will count for you.
So what can you do about this?
1. Try to get one-way links in any way possible.
2. Be careful who you link to - try to make sure you’re not linking to sites that are engaging in reciprocal link campaigns and that also link to your site. If you do link to sites running reciprocal link campaigns, you could be pulling your site into the penalty set, and links coming to you from sites a few degrees of separation away in that same set will not count.
Instead, follow the standard SEO advice of creating valuable, unique content to generate legitimate one-way inbound links.