A White Paper on changes to the PageRank algorithms.

Reg Charie NBS-SEO & DotCom-Productions


If you are a developer or SEOer, hobbyist or professional, you cannot help but notice the changes being brought about in search.
Panda is the latest to stir the pot, but the current basic, major changes have roots that reach back to the beginnings, with the biggest changes implemented over the last couple of years.
http://nbs-seo.com
http://dotcom-productions.com 


Let me quickly define the state of PageRank back in October '09.

Before this point, the effect of PageRank on SERPs was still under debate. Google had posted in 2007 about how PR was being devalued, but the community did not listen.

On or about the 13th of October, Google removed the PageRank display from its Webmaster Tools and told the dev community that they "shouldn't focus on PageRank so much".

They added further comment by saying "..worry less about PageRank, which is just one of over 200 signals that can affect how your site is crawled, indexed and ranked."

Note that they did not say anything about how much influence PageRank had on SERPs after the change.
Testing has shown that while the browser toolbar is still very much in evidence, the metrics of PageRank and the SERPs are divorced.
There is no longer any measurable effect of links on SERPs. See #1.
 

Defining Relevance.

Compliant SEO

Let's all take 3 giant steps back.

Google's Susan Moskwa has come out again and stated that PageRank is not a good metric to focus on.

Beyond PageRank: Graduating to actionable metrics

Thursday, June 30, 2011 at 10:18 PM

Webmaster level: Beginner

As Susan pointed out, Udi Manber, VP of engineering at Google wrote in his blog in 2008:
“The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system.”

Let's look at the problems with PageRank.

The basic problem is that those doing SEO ignore the "organic" requirements by building links in an effort to influence the search results.
When the PR system was first built, the calculations were done on a strictly mathematical basis, the basic formula being:
PR (new page) = 0.85 x PR (of linking page) / number of links on the linking page.
A full outline may be found here: http://en.wikipedia.org/wiki/PageRank

The rank value indicates an importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it "incoming links". A page that is linked to by many pages with high PageRank receives a high rank itself.  (My bold).
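
To make the arithmetic concrete, here is a minimal Python sketch of the classic damped PageRank iteration described by that formula. The tiny link graph, the iteration count, and the variable names are my own illustrative assumptions; this is not Google's production code.

    # Minimal sketch of the classic PageRank iteration (damping factor 0.85).
    # The tiny link graph below is an illustrative assumption, not real data.
    DAMPING = 0.85
    ITERATIONS = 20

    # graph[page] = list of pages that 'page' links out to
    graph = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    pages = list(graph)
    pr = {page: 1.0 / len(pages) for page in pages}  # start with an even split

    for _ in range(ITERATIONS):
        new_pr = {}
        for page in pages:
            # Each inbound link passes 0.85 x PR(linking page) / (links on the linking page)
            inbound = sum(
                pr[other] / len(graph[other])
                for other in pages
                if page in graph[other]
            )
            new_pr[page] = (1 - DAMPING) + DAMPING * inbound
        pr = new_pr

    for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))

Run as is, page C ends up with the highest rank because the most PR flows into it, while page D, with no incoming links at all, settles at the 0.15 floor. Notice that nothing in the calculation cares what any of the pages are about.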

(Chart: amount of PR algo changes made from '04 to '09.)
For a deeper look into the History of SEO - Click here.

This is the crux of the problem. Links were assumed to have relevance, as they would in the academic citation world.
Without the influence of relevance, authority becomes blurred.
Just because a linking page has a high value does not mean that the value bears any relevance to the topic on the linked page.
In the old PR system, all things being equal, a link on a high-PR page meant more PR for the linked page.

The Wikipedia article goes on to confirm this by stating:
In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.

But how about links from documents with genuinely acquired PR, but which are not authorities on the subject?
How would you value a link from a PR8 site concerned with IT security to a page about knitting a sweater?
The PR value assigned would not be accurate.

The old PR had two components that caused problems:
  1. The organic component was ignored or faked by linking schemes. Links gained by other than organic means still counted, unless they were found out and classed as black hat (as in the JC Penney case).
    Other than warning people or placing penalties, there was not much that Google could do.

  2. The calculations did not include any consideration of subject matter; they were based entirely on a mathematical formula that only considered the amount, formatting, and value of links.


Fixing the PageRank problems.

 
If the primary calculation is switched from one based on the PR of the linking page to one based on the relevance between the linking and linked pages, all sorts of previous problems disappear.

Previous Linking Problems

  1. Reciprocal Linking

  2. Paid Links

  3. Link Farms

  4. Spammy Links

  5. Three Way Linking

  6. Follow/NoFollow

  7. Undeserved PR (Off topic link with no relevance.)

 
Because of the calculation method (using the linking page's PR as a base), all of the above pass PR unless they are found to be spam by Google.
By calculating PageRank using relevance as a metric, numbers 1 through 5 do not matter anymore.
If a link is based on relevance, it does not matter whether it is reciprocal, whether it sits on a paid link page, whether it uses three way linking, or whether it is on a high or low PR page.
It is the degree of relevance that counts.
The Follow/NoFollow tags are rendered unnecessary.
Google just has to follow the links that have relevance; it can ignore all the others.
This alone will save Google a ton of time and CPU cycles and improve the information silos at the same time.
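
To illustrate what a relevance-based calculation could look like, here is a rough Python sketch. The word-overlap (cosine) similarity and the 0.25 cutoff are my own assumptions for the example; Google has not published how, or even whether, it weights individual links by topical relevance in this way.

    # Illustrative sketch of relevance-weighted link scoring.
    # The similarity measure and threshold are assumptions for the example,
    # not a description of Google's actual calculations.
    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        """Cosine similarity between two pages' word-count vectors."""
        a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    RELEVANCE_THRESHOLD = 0.25  # arbitrary cutoff for the sketch

    def link_value(linking_page_text, linked_page_text, links_on_page):
        """Value passed by a link, weighted by topical relevance instead of raw PR."""
        relevance = cosine_similarity(linking_page_text, linked_page_text)
        if relevance < RELEVANCE_THRESHOLD:
            return 0.0  # off-topic link: ignored entirely
        return relevance / links_on_page

    # An off-topic (IT security) link passes nothing, while an on-topic
    # knitting link passes value scaled by its degree of relevance.
    print(link_value("enterprise IT security firewalls and intrusion detection",
                     "how to knit a wool sweater step by step", links_on_page=10))  # 0.0
    print(link_value("free knitting patterns how to knit a wool sweater and scarves",
                     "how to knit a wool sweater step by step", links_on_page=10))  # roughly 0.05

Under a scheme like this it would not matter whether a link was reciprocal or paid; only its relevance, and the number of links sharing the page, would decide how much it passes.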

 

Original anatomy of Google, as presented by Page and Brin before the search engine went into production.

Note that the URL Resolver flows to "Links", which in turn feeds PageRank and then the Searcher.

 

In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results).
 
One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not.

People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98].

In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering.
Google makes use of both link structure and anchor text
(see Sections 2.1 and 2.2). (My bold)

2. System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link [text] to improve search results.

2.2 Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves.

We use anchor propagation mostly because anchor text can help provide better quality results

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information
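
Here is a rough Python sketch of what anchor-text propagation means in practice, using a toy inverted index. The data structures and URLs are my own illustration, not the paper's actual implementation.

    # Toy sketch of anchor-text propagation: the words in a link's anchor text
    # are indexed under the page the link points TO, not just the page the link is on.
    from collections import defaultdict

    index = defaultdict(set)  # word -> set of pages it is associated with

    def index_page(url, body_text, outgoing_links):
        """outgoing_links is a list of (target_url, anchor_text) pairs."""
        for word in body_text.lower().split():
            index[word].add(url)
        for target_url, anchor_text in outgoing_links:
            for word in anchor_text.lower().split():
                index[word].add(target_url)  # propagate anchor words to the target

    index_page(
        "http://example.com/reviews",
        "independent reviews of image editing software",
        [("http://example.com/photo-tool", "best free photo editor")],
    )

    # The target page is now findable by "photo editor" even though its own
    # text was never crawled here -- useful for images and other non-text pages.
    print(index["photo"], index["editor"])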

NOTE: Very important:
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words.
This refers to the position of words in the content and in the code.
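
A simplified Python sketch of that idea: each "hit" for a word records where it appears and how it is presented, and hits in larger or bolder text are weighted higher. The weight values and field names are my own assumptions for illustration.

    # Simplified sketch of weighting hits by presentation, as described above.
    # The weight values are assumptions for illustration only.
    HIT_WEIGHTS = {
        "title": 4.0,  # word appears in the page title
        "h1": 3.0,     # word appears in a large heading
        "bold": 2.0,   # word appears in bold or emphasized text
        "body": 1.0,   # plain body text
    }

    def score_hits(hits):
        """Each hit records the word's position and how it is presented."""
        return sum(HIT_WEIGHTS.get(hit["presentation"], 1.0) for hit in hits)

    page_hits = [
        {"word": "pagerank", "position": 3, "presentation": "title"},
        {"word": "pagerank", "position": 120, "presentation": "body"},
        {"word": "pagerank", "position": 340, "presentation": "bold"},
    ]
    print(score_hits(page_hits))  # 4.0 + 1.0 + 2.0 = 7.0

A fuller sketch would also use the recorded positions to reward query words that appear close together (proximity).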

3.1 Information Retrieval

This goes beyond exact matching and the number of repetitions:
For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and picture from a "Bill Clinton" query.

Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position.
Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.
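
To see why the plain vector space model rewards a page that is little more than the query itself, here is a small worked example in Python. The documents and the query are made up for illustration.

    # Why the plain vector space model favors very short documents:
    # a page that is little more than the query itself gets a very high
    # cosine score, while a substantive on-topic page scores much lower.
    import math
    from collections import Counter

    def cosine(query, document):
        q, d = Counter(query.lower().split()), Counter(document.lower().split())
        dot = sum(q[w] * d[w] for w in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0

    query = "bill clinton"
    spam_page = "bill clinton sucks"  # the query plus one word
    real_page = ("the white house biography of president bill clinton covers his "
                 "two terms in office his policies and his early life in arkansas")

    print(round(cosine(query, spam_page), 3))  # about 0.82
    print(round(cosine(query, real_page), 3))  # about 0.26, despite being on-topic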

3.2 Differences Between the Web and Well Controlled Collections

Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available.
Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic, and companies which deliberately manipulate search engines for profit become a serious problem.

Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.

Recalculating PageRank on an Individual or Global Basis.

There have been major changes in the search algorithms and they are beginning to be seen.

  1. The first is that PageRank (PR) used to be based on a strictly mathematical formula.
    Generally speaking, a linking page "assigned" a numerical ranking to a linked page, based on the linking page's PR multiplied by a factor of 0.85.
    The amount of PR given to the linked page also depended on how many links there were on the page.  (PR = 0.85 x PR of linking page / number of links on the page.)
    This did not work, as too many folks tried to influence the search results by placing non-organic links.
     
  2. The influence of PageRank (PR) on the Search Engine Result Pages (SERPs) has been removed (to combat the spam linking described above).
    Google has told us that PR is no longer an 'actionable metric'.
    They have been downplaying PR since 2008.*1
     
  3. The current PR algos differentiate between previously gained PR and new calculations, unless manually reset.*2
    If you gained PR back before the major changes were made, you keep the assigned PR, and the new algos only calculate what is currently happening with your new links.
    If the site got its old links by automated or less than 'organic' methods, and a lot of those links are on pages that are not relevant, a complete PR recalculation based mainly on relevance would hurt standings, if the manual recalculation were allowed to influence the SERPs.
     
  4. There is a manual override available to Google if they intend to set examples and weed out the non-organic offenders.
    Panda is an example of a manually instigated filter.