Understanding TrustRank


Disclaimer: This is our INTERPRETATION of the data that
we read in this paper. As we are not programmers, mathematicians,
or IR specialists, we have used our knowledge of marketing
and SEO to extrapolate meaning and ideas from the information
in this paper. Feel free to email us if you disagree with
our conclusions or if you would like to give us your own
ideas.

Abstract

“Web spam pages use various techniques to achieve
higher-than-deserved rankings in a search engine’s
results. While human experts can identify spam, it is too
expensive to manually evaluate a large number of pages. Instead,
we propose techniques to semi-automatically separate reputable,
good pages from spam.

We first select a small set of seed pages to be evaluated
by an expert. Once we manually identify the reputable seed
pages, we use the link structure of the web to discover other
pages that are likely to be good. In this paper we discuss
possible ways to implement the seed selection and the discovery
of good pages. We present results of experiments run on the
World Wide Web indexed by AltaVista and evaluate the performance
of our techniques. Our results show that we can
effectively filter out spam from a significant fraction of
the web, based on a good seed set of less than 200 sites.”

  1. Preliminaries
    a. Web Model

    i. The web is modeled as a graph consisting of pages and a set of directed links that connect them.
    ii. Self links are removed, and multiple links from one page to the same page are collapsed into a single link.

    b. PageRank
    i. The proposed algorithm relies on PageRank, in which the importance of a page
    influences, and is influenced by, the importance
    of other pages.
    ii. Regular PageRank assigns the same static score to
    each page, but a biased PageRank version may break this
    rule: a non-zero static score can be assigned to a set
    of special pages only. The score of these pages is then
    spread during the iterations to the pages they point
    to.
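
To make the difference between regular and biased PageRank concrete, here is a minimal sketch in Python. It is our own illustration, not code from the paper: the function name, the example graph, the damping value of 0.85, and the fixed iteration count are all assumptions. Every link target must also appear as a key in the graph.

```python
def biased_pagerank(out_links, static_scores, alpha=0.85, iterations=20):
    """Repeat: new score = alpha * (score received via inlinks) + (1 - alpha) * static score."""
    scores = dict(static_scores)  # start from the static score distribution
    for _ in range(iterations):
        new_scores = {p: (1 - alpha) * static_scores[p] for p in out_links}
        for page, targets in out_links.items():
            if not targets:
                continue  # simplification: a page without outlinks passes nothing on
            share = alpha * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: nobody links to "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

# Regular PageRank: every page gets the same static score.
uniform = biased_pagerank(graph, {page: 0.25 for page in graph})

# Biased PageRank: only the special page "a" gets a non-zero static score.
biased = biased_pagerank(graph, {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0})
```

In the biased run, page "d" ends with a score of 0 because it is not a special page and nothing links to it, while in the uniform run it keeps a small score of its own.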

  2. Assessing Trust
    a. Oracle and Trust Functions

    i. The oracle (a human reviewer) assigns a value to each page it checks: 0 if the page
    is bad, 1 if the page is good.
    ii. As this is expensive and time consuming, the oracle should only review a
    subset of pages.
    iii. Approximate isolation of the good set: good pages seldom link to bad pages.
    iv. Trust Function: yields a value between 0 and 1 for each page. Ideally, it gives
    the probability that the page is good.

    b. Ordered Trust Property: The trust function should
    predict the likelihood of a page being good, so
    the results can be ranked by their trust value (a high
    probability of being good means a page gets ranked higher in a
    list, and vice versa).

    c. Threshold Trust Property: if a page receives
    a score above a threshold, it is considered good.

  3. Evaluation Metrics
    a. Pairwise Orderedness: signals whether a bad page received
    an equal or higher trust score than a good page (a violation
    of the ordered trust property). This evaluates the accuracy
    of the trust function T.

    b. Precision: the fraction of good pages among all pages in a
    sample set X that have a trust score above a threshold.

    c. Recall: the ratio between the number of good pages with
    a trust score above a threshold and the total number
    of good pages in X.
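
The three metrics can be sketched in a few lines of Python. This is our own reading of the definitions above, so treat it as an illustration under assumptions: the function names, the sample pages, and the scores are made up, and pairwise orderedness is computed as the fraction of page pairs that do not violate the ordered trust property.

```python
from itertools import combinations


def pairwise_orderedness(pages, oracle, trust):
    """Fraction of page pairs whose trust scores do not contradict the oracle."""
    pairs = list(combinations(pages, 2))
    violations = 0
    for p, q in pairs:
        # Violation: the bad page of the pair scored at least as high as the good one.
        if oracle[p] > oracle[q] and trust[p] <= trust[q]:
            violations += 1
        elif oracle[q] > oracle[p] and trust[q] <= trust[p]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)


def precision(pages, oracle, trust, threshold):
    """Share of good pages among the pages scoring above the threshold."""
    above = [p for p in pages if trust[p] > threshold]
    return sum(oracle[p] for p in above) / len(above) if above else 0.0


def recall(pages, oracle, trust, threshold):
    """Share of all good pages that score above the threshold."""
    good = [p for p in pages if oracle[p] == 1]
    good_above = [p for p in good if trust[p] > threshold]
    return len(good_above) / len(good) if good else 0.0


# Hypothetical sample X: the oracle's verdicts and some trust scores.
sample = ["a", "b", "c", "d"]
verdicts = {"a": 1, "b": 1, "c": 0, "d": 0}
scores = {"a": 0.9, "b": 0.4, "c": 0.6, "d": 0.1}
```

With these numbers, the only violation is that the bad page "c" outscores the good page "b", so 5 of the 6 pairs are ordered correctly. At a threshold of 0.5, both precision and recall come out to 0.5.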

  4. Computing Trust
    a. Ignorant Trust Function: pages that have not been reviewed
    by the oracle are given a value of ½, which
    means that nothing is known about those pages.

    b. Trust Propagation: The oracle is invoked to check a
    random selection of L pages. Then, expecting that good
    pages only link to good pages, we assign a score of 1
    to all pages that are reachable from a good seed page
    in M or fewer steps (1 and 2 steps gave the best
    results).

    i. The problem with this is that sometimes
    good pages link to bad pages. The further away we are
    from good pages, the less certain we are that a page
    is good.
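
A minimal sketch of this M-step propagation, assuming a small made-up link graph (the function name, page names, and graph are ours, not the paper's). Reviewed pages keep the oracle's verdict, pages within M links of a good seed get 1, and everything else keeps the ignorant value of ½.

```python
def m_step_trust(out_links, reviewed, max_steps):
    """Assign trust from oracle verdicts plus reachability within max_steps links."""
    good_seeds = {page for page, verdict in reviewed.items() if verdict == 1}
    reached = set(good_seeds)
    frontier = set(good_seeds)
    for _ in range(max_steps):
        # Follow one more link outward from the pages reached in the previous step.
        frontier = {t for page in frontier for t in out_links.get(page, [])} - reached
        reached |= frontier
    scores = {}
    for page in out_links:
        if page in reviewed:
            scores[page] = float(reviewed[page])  # oracle verdict wins
        elif page in reached:
            scores[page] = 1.0                    # reachable from a good seed
        else:
            scores[page] = 0.5                    # nothing known
    return scores


# Hypothetical chain: seed -> a -> b -> spam, plus an unconnected page.
chain = {"seed": ["a"], "a": ["b"], "b": ["spam"], "spam": [], "far": []}
two_steps = m_step_trust(chain, {"seed": 1}, max_steps=2)
three_steps = m_step_trust(chain, {"seed": 1}, max_steps=3)
```

With M = 2 the spam page stays at ½, but with M = 3 it is scored 1 like any good page, which is exactly the weakness described in point i.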

    c. Trust Attenuation: trust should be reduced the
    further we are from the seed pages

    i. Trust Dampening: the trust score is reduced the
    further away we are from a good site. So if good seed
    A has a score of 1, site B (linked from A) has a score of b < 1,
    and site C (linked from B) has a score of b * b (reduced more the further
    away you are from the good site).

    ii. Trust Splitting: This handles pages with multiple outlinks. That is, if a
    good page has only a handful of outlinks, then it is likely that the pointed
    pages are also good. However, if a good page has hundreds of outlinks, it is
    more probable that some of them will point to bad pages.

    1. The trust score is split evenly among the outbound links.
    So if a good seed has 2 outbound
    links, its trust score of 1 is split into 2, so each
    page gets .5 trust points.

    2. The actual score of the page will be the sum of the score fractions received
    through its inlinks. The more “credit” a page receives, the likelier it is to be
    good.

    iii. Trust splitting can be combined with trust
    dampening.
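
Both ideas are easy to try out on toy graphs. The sketch below is our own illustration, with assumed function names and made-up graphs. For dampening, it assumes a page reachable by several paths takes the score of its shortest path, which is only one of the possible choices. The dampening factor of 0.5 is picked purely to keep the numbers round.

```python
def dampened_trust(out_links, seed, b=0.5):
    """Trust b**d for a page whose shortest link distance from the seed is d."""
    scores = {seed: 1.0}
    frontier = [seed]
    while frontier:
        next_frontier = []
        for page in frontier:
            for target in out_links.get(page, []):
                if target not in scores:  # first visit is via a shortest path
                    scores[target] = b * scores[page]
                    next_frontier.append(target)
        frontier = next_frontier
    return scores


def split_trust(out_links, scores):
    """One step of splitting: each page divides its score evenly among its outlinks."""
    received = {}
    for page, score in scores.items():
        targets = out_links.get(page, [])
        for target in targets:
            received[target] = received.get(target, 0.0) + score / len(targets)
    return received


# Dampening on the chain A -> B -> C from the example above.
chain = {"A": ["B"], "B": ["C"], "C": []}
dampened = dampened_trust(chain, "A", b=0.5)

# Splitting: seed s1 has two outlinks, seed s2 has one; both link to x.
fan = {"s1": ["x", "y"], "s2": ["x"]}
split = split_trust(fan, {"s1": 1.0, "s2": 1.0})
```

In the chain, B ends with 0.5 and C with 0.25 (b and b * b). In the splitting example, y receives half of s1's trust, while x collects 0.5 from s1 plus the full 1.0 from s2. Combining the two techniques would mean multiplying each passed-on share by b.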

  5. The TrustRank Algorithm
    a. Select Seeds: identifies the most desirable pages for the
    seed set (those most useful for identifying other good pages).
    The seed set needs to be relatively small.

      i. Inverse PageRank: Preference is given to pages with many
      outbound links (the more outlinks, the more likely a page is
      to get picked). Here the importance of a page depends
      on its outlinks, not on its inlinks.
      ii. High PageRank: Preference is given to pages
      with high PageRank, as they are more likely to
      link to other high-PageRank pages.

    b. Order the candidate pages according
    to their desirability as seeds.

    c. Select Good Seeds: invokes the oracle, so a person reviews
    those sites and gives each one a value.

    d. Normalize static score distribution vector: the scores of
    the good seeds are scaled so that they sum to 1.

    e. Compute TrustRank scores: uses a biased PageRank computation,
    with the uniform distribution replaced by the normalized seed scores.

      i. Uses trust dampening and splitting: the trust
      score of a page is split among its neighbors and dampened by
      a factor.
      ii. TrustRank “refines” the original
      scores given by the oracle according to the structure
      of the links, since it has more information to use.

    f. Unreferenced pages have a score of 0, unless
    they are selected as seeds.

    g. Pages can be organized first by PageRank,
    and only pages with high enough PageRank
    are used to compute TrustRank;
    otherwise it’s a waste of resources.
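
Steps a through e can be strung together in a short Python sketch. This is our interpretation, not the authors' code. We assume inverse PageRank means running PageRank on the graph with every link reversed, and the function names, the tiny example web, the stand-in oracle, the seed budget, and the damping value of 0.85 are all hypothetical. Every link target must also appear as a key in the graph.

```python
def pagerank(out_links, static, alpha=0.85, iterations=20):
    """Biased PageRank: split each page's score among its outlinks, dampen by alpha."""
    scores = dict(static)
    for _ in range(iterations):
        new_scores = {page: (1 - alpha) * static[page] for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                new_scores[target] += alpha * scores[page] / len(targets)
        scores = new_scores
    return scores


def reverse_graph(out_links):
    """Flip every link so that outlinks become inlinks."""
    reversed_links = {page: [] for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            reversed_links[target].append(page)
    return reversed_links


def trustrank(out_links, oracle, seed_budget, alpha=0.85, iterations=20):
    pages = list(out_links)
    uniform = {page: 1 / len(pages) for page in pages}
    # Steps a-b: score seed desirability with inverse PageRank, then order the pages.
    desirability = pagerank(reverse_graph(out_links), uniform, alpha, iterations)
    ranked = sorted(pages, key=desirability.get, reverse=True)
    # Step c: the oracle reviews only the top candidates; keep the good ones.
    good_seeds = [page for page in ranked[:seed_budget] if oracle(page) == 1]
    # Step d: static scores are non-zero for good seeds only and sum to 1.
    static = {page: (1 / len(good_seeds) if page in good_seeds else 0.0) for page in pages}
    # Step e: biased PageRank spreads the seeds' trust along the links.
    return pagerank(out_links, static, alpha, iterations)


# Hypothetical web: three good pages linking to each other, two spam pages doing the same.
web = {
    "hub": ["news", "wiki"],
    "news": ["wiki", "hub"],
    "wiki": ["hub"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}


def human_review(page):
    """Stand-in for the human reviewer."""
    return 0 if page.startswith("spam") else 1


trust = trustrank(web, human_review, seed_budget=2)
```

Because no good page links to the spam pair and the oracle never accepts a spam page as a seed, both spam pages finish with a trust score of 0, while the three good pages share the whole score of 1 among themselves.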

    Conclusion:

    Basically, this enables search engines to modify PageRank. PageRank
    can be easily manipulated as it doesn’t care about
    quality. By combining PageRank with trust, the basic PageRank
    formula can still be used (it is cheap to use and works well), then
    modified according to trust factors.

    Adding human interaction enables them to then compute
    scores automatically. People manually review sites and
    assign a trust score. Then, this trust score is split amongst
    its outbound links using the algorithm. So the trust score
    of other sites, even if they are not manually reviewed,
    is then based upon the trust score received from other
    sites (with a max of 1). Sites with a higher trust score
    can then rank higher.

    What this means for SEOs

    • Try to identify good sites in your industry. These
      sites were chosen by number of outbound links as well
      as by high page rank scores. Remember that those pages would’ve
      been reviewed by a person, so only select sites that
      are genuinely valuable.
    • Good sites are bound to be ranking in the SERPs,
      as they will have high TrustScores, which modify their
      PageRank and push out spam sites.
    • Try to get links from those good sites, or at
      least from pages that are receiving links from good sites.
    • Use up to 3 levels of separation from the good
      sites
    • The more links you receive from good sites, the
      higher your TrustScore.
    • If you have too many links from bad sites, it’ll
      lower your score. Bad sites can be sites considered “unworthy” by
      human reviewers, or sites that received low points from
      other sites
    • Avoid having too many links from bad sites, as
      the more you have, the more it’ll work against you
      based on your “trust score”.
    • Try to only have links from trusted sites, and
      to have as few links as possible from bad sites
    • The higher your trust score, the higher you rank, as
      it’ll modify your PageRank score positively.
    • PageRank is still used, so you still need to
      get links, but try to get links mainly from good sites.
    • They are “collapsing” multiple links
      from one URL, and only counting it as one link
    • Self links are removed and only links from external
      sites are taken into account

    Make sure this is an important aspect of your SEO campaign,
    for these trusted links enable you to get past the sandbox
    and to boost your rankings significantly.
