An incomplete history of the Google search engine from an SEOer's viewpoint.
"Incomplete" as Google is constantly evolving.
Reg Charie Sept 16, 2010.
Being of a curious nature and wanting to make money online, I naturally got into trying to understand how search engines work.
Back in '96, when I started developing websites, my main interest was to generate traffic.
I did this by promoting the websites: placing links and writing webpages to conform to the different algorithms each search engine used.
In practice that meant building doorway pages for each search engine and directing their robots to those pages using a CGI script.
In other words, cloaking. The public never saw these pages; only the search engines did.
This would not succeed now, as most search engines have methods to detect the practice and will remove you from their index when it is found.
Back in the days before Google asserted its dominance, there were over a dozen search engines competing for market share, and each had a different method of calculating site rankings. If you wanted a high-ranking site in each, you had to conform to each engine's ranking requirements; hence the cloaking.
This was not considered "black hat" at the time and was highly successful. I built one client's sales from about $400 a month to over $1,000 a day by building a set of doorway pages and cloaking.
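To make the mechanism concrete, here is a minimal sketch of how that era's user-agent cloaking worked. This is an illustration only: the filenames and the list of crawler strings are my assumptions, not the actual script I used.

```python
# Illustrative sketch of late-90s user-agent cloaking (filenames and bot
# strings are hypothetical). A CGI script inspected the User-Agent header
# and served a keyword-tuned doorway page to known crawlers, while human
# visitors were served the normal page.

KNOWN_BOTS = ("googlebot", "scooter", "slurp", "lycos")  # era crawlers

def choose_page(user_agent: str) -> str:
    """Return which page to serve for the given User-Agent string."""
    ua = user_agent.lower()
    if any(bot in ua for bot in KNOWN_BOTS):
        return "doorway.html"   # engine-specific, keyword-optimized page
    return "index.html"         # the page human visitors actually see

if __name__ == "__main__":
    print(choose_page("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # doorway.html
    print(choose_page("Mozilla/4.0 (compatible; MSIE 5.0)"))               # index.html
```

As the article notes, engines now detect this (for example by crawling from unadvertised IP addresses), which is why the technique died.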
When Google was designed by Larry Page and Sergey Brin (1) in 1997 (2), it was coded to return results based on two major factors:
- Content. Google stated: "Hypertext-Matching Analysis: Our search engine also analyzes page content. However, instead of simply scanning for page-based text (which can be manipulated by site publishers through meta-tags), our technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user's query."
- Authority. (PageRank).
Google stated: "PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results." (ED: Each page indexed is given a vote to pass on to other pages. Depending on the quality and quantity of votes a page has received the PR vote can be worth more than a page with only one vote.)
"PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance."
Content was scored using on-page keyword matches, and authority by how many sites linked to the page.
During this initial stage, results could be influenced by the frequency of keyword use on the page and by placing links on high-PR, non-relevant pages.
Links were judged according to a mathematical formula (3) that assigned a PageRank (PR) to each page based on the PR of the linking pages.
Results were first sorted by the search terms, and then PR was factored in to adjust the results.
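The "vote" arithmetic described above can be sketched from the published formula (3): each page's PageRank is the damped sum of the PageRank of pages linking to it, divided by each linker's outbound-link count. This toy implementation uses the classic damping factor of 0.85; the three-page web is invented for illustration.

```python
# Minimal sketch of the published PageRank iteration:
#   PR(A) = (1 - d) + d * sum(PR(T) / C(T))  over pages T linking to A,
# where d is the damping factor (0.85) and C(T) is T's outbound-link count.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}               # start every page with one "vote"
    for _ in range(iterations):
        new = {}
        for page in pages:
            # Sum the vote share passed on by every page linking here.
            incoming = sum(pr[t] / len(outs)
                           for t, outs in links.items() if page in outs and outs)
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr

if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    ranks = pagerank(web)
    print(sorted(ranks, key=ranks.get, reverse=True))  # C collects the most votes
```

Note how this captures the article's point about vote quality: a single link from a high-PR page with few outbound links can be worth more than many links from low-PR pages.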
The old SERP Rank
The SERP (Search Engine Results Page) is the actual result returned by a search engine in response to a keyword query.
The SERP rank of a web page refers to the placement of the corresponding link on the results page, higher being better.
The SERP rank of a web page is not only a function of its PageRank but depends on a relatively large and continuously adjusted set of over 200 factors. (4) (5) (6)
The new SERP algos move the PageRank factor from a primary to a secondary influence, factored in after relevance is satisfied.
From inception to June 2002, small tweaks were continually made to the ranking process with a minimum of 19 days and a maximum of 54 days between updates.
In June, July, August, September, and October 2002, major changes were made to the PR toolbar and the Google Directory, and another 12 updates were done, culminating in the Nov 16, 2003 "Florida" update.
Florida threw the SEO community into a frenzy as it made some major changes and the timing could not have been worse for commercial sites hit by the update.
Florida brought in a number of new and/or modified algos: The Web Workshop lists the possible modifications. (7)
- SEO filter (search engine optimization filter)
One of the main theories put forward was that Google had implemented an 'SEO filter'.
The theory proposes that when a search query is made, Google gets a set of pages matching the query and then applies the SEO filter to each of them. Any pages found to exceed the threshold of 'allowable' SEO are dropped in the results.
The influencing factors can come from keyword stuffing in page content or meta tags, from cloaking or hidden text, or simply from using too many SEO techniques.
"Over-optimization" became a concern.
- LocalRank
Another idea that took hold is that Google implemented LocalRank, a method of modifying the rankings based on the interconnectivity between the pages that have been selected to be ranked.
- Commercial list
It was noticed that many search results were biased towards information pages, and commercial pages were either dropped or moved down the rankings. From this sprang the theory that Google is maintaining a list of "money-words", and modifying the rankings of searches that are done for those words, so that informative pages are displayed at and near the top, rather than commercial ones.
The most fundamental change that Google made with the Florida update is that they now compile the results set for the new results in a different way than they did before.
This was the first major algo change that affected a large number of sites.
One of the things Google had been experimenting with was a system for ranking "expert" pages developed by Krishna Bharat, who is thought to have been a Google employee at the time. He called his system Hilltop, and every effect the Florida update caused can be attributed to a Hilltop-type, expert-based system.
As you can see, Google was not satisfied with its ranking protocols and was seeking to build results showing not only relevance but authority.
On Jan 11, 2004, Google brought in the Austin update. Following the Florida shake-up, Austin targeted more specific factors:
- Free-for-all link pages
- Big meta tags stuffed with keywords
- Invisible text
- Cloaking (Google changed its IP addresses worldwide to combat IP-based cloaking)
- Freshness: content that is updated regularly began helping rankings
- Google's Hilltop algorithm was increased in importance.
Right on the heels of Austin, Google brought in Brandy on Feb 11, 2004.
- An increase in the file size of Google's indexes, in order to provide a larger number of indexed pages.
- Latent Semantic Indexing (LSI), giving importance to semantically related words used in the creation of website content.
Close analysis of synonyms and other related words was added to the ranking process.
- Links and Anchor Text.
Google shifted its focus to the nature, quantity, and quality of the inbound and outbound link anchor text of a website. It decreased the importance of PageRank, which had been Google's unique ranking system.
This translates into: are the links to and from your website related links, or links for the sake of linking? The more relevant your site's links, the better its rankings.
- Downgrading of the Traditional Tag-Based Optimization
The Brandy update was introduced to add new signals of quality and improve the method by which pages were ranked.
Another aim was to defeat the SEO methods used by the search engine optimization and Internet marketing industries to manipulate the indexing of their web pages on Google. This was another step against the spamming and unethical practices being used to achieve top rankings on the SERPs.
May 2005 saw the Bourbon update, which was introduced in three phases and was the biggest update yet.
- The Google Sandbox.
- Duplicate Content - Including the same or very similar content on more than one page, even on different domains/subdomains.
- Non-thematic Linking - Having links to pages which contain content irrelevant to the source page's subject matter.
- Low Quality Reciprocal Links - Links from "bad neighborhoods".
- Fraternal Linking - Creating a network of sites, which all link back to the same "master" site in an effort to boost the master site's rankings.
- Affiliate marketing was targeted.
(An associate lost her $150,000 annual affiliate income due to this and fraternal linking.)
- Sitemaps possibly becoming a factor.
Some complained that something had gone wrong with the duplicate-content algo, and this focused attention on URL canonicalization. (8)
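For a sense of how duplicate content can be detected at all, here is one common technique: word shingling plus Jaccard similarity. This is an illustrative method from the information-retrieval literature, not Google's actual algo, and the example pages are invented.

```python
# One common way to detect near-duplicate content: compare the sets of
# k-word "shingles" two pages share. Illustrative only - not Google's
# actual duplicate-content algorithm.

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

if __name__ == "__main__":
    page1 = "our widgets are the best widgets money can buy today"
    page2 = "our widgets are the best widgets money can buy online"
    page3 = "a completely different article about search engine history"
    print(round(similarity(page1, page2), 2))  # high - near duplicates
    print(similarity(page1, page3))            # 0.0 - nothing shared
```

Pages scoring high against an already-indexed page would be the ones at risk of being filtered, which is why "same content on different domains/subdomains" got sites into trouble.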
Linking was a big factor in Bourbon.
Lessons learned were:
- Do not include reciprocal links with any sites that contain content irrelevant to your own content.
- Do not include links to "bad neighborhoods".
- Stay away from networks of "similar" sites, designed specifically to boost rankings of a Master linked from all sites in the network.
- Use a 301 (permanent) redirect to consolidate duplicate URLs onto a single canonical address.
- Use absolute links instead of relative links; there is less chance for a spider to get confused.
- Do not use more than 100 outbound links on a page.
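The canonicalization lesson above amounts to picking one form of each URL and 301-redirecting the rest to it. This sketch shows the normalization step; the specific rules (force www, collapse /index.html, drop trailing slashes) are my assumptions for illustration - pick whichever forms you prefer, just be consistent.

```python
# Sketch of URL canonicalization: collapsing the duplicate addresses that
# updates like Bourbon punished onto one canonical form. The rules below
# (force www, fold /index.html into /, drop trailing slash) are example
# choices, not required ones. Note it lowercases the whole URL, which is
# fine for hosts but an assumption for case-sensitive paths.

from urllib.parse import urlsplit, urlunsplit

def canonical(url: str) -> str:
    scheme, host, path, query, _ = urlsplit(url.lower())
    if not host.startswith("www."):
        host = "www." + host                     # one hostname, not two
    if path.endswith(("/index.html", "/index.htm")):
        path = path.rsplit("/", 1)[0] + "/"      # /index.html duplicates /
    if path.endswith("/") and path != "/":
        path = path[:-1]                         # drop trailing slash
    return urlunsplit((scheme, host, path or "/", query, ""))

if __name__ == "__main__":
    for u in ("http://example.com/index.html",
              "http://www.example.com/",
              "HTTP://EXAMPLE.COM"):
        print(canonical(u))  # all three collapse to http://www.example.com/
```

In practice you would serve a 301 from every non-canonical form to its canonical one (e.g. via your server's rewrite rules), so spiders see one page instead of several "duplicates".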
June 1, 2005 saw a notice that Bourbon was not over and was only 14% complete. WebmasterWorld published a thread that grew to 41 pages (1,225 posts). (9)
Oct 16, 2005 saw the first of the three Jagger updates.
Jagger was a major algorithm update with the following results:
As reported by WebProNews (10), in November Google had several issues to deal with:
- Faux AdSense directory sites
- CSS spamming techniques
- Growing "generic" SERP irrelevancy
- Reciprocal linking abuse
In retrospect, Jagger's solutions dealt with:
- Value of incoming links
- Value of anchor text in incoming links
- Content on page of incoming links
- Keyword repetitions in anchor text
- Age of the incoming links
- Nature of sites linking to you
- Directory links
- Speed and volume of incoming links created
- Value of reciprocal links
- Impact of outbound links / links page on your website
- Sandbox effect / age of your site, domain registration date
- Size of your site’s content
- Addition and frequency of fresh content update
- Canonical / sub domains, sub-sub domains
- Multiple domains on same IP numbers
- Duplicate content on same site or on multiple domains
- Over-optimization, excessive text markup
- Irrational use of CSS
November 15, 2005 also saw the start of Google Analytics.
Googler Matt Cutts stated that black-hat SEOs may be leery of using Google for analytics, but regular site owners should be reassured.
Big Brother IS watching.
In December 2005, Google began the "Big Daddy" update, a software/infrastructure change applied to all of its data centers.
This took until the end of March to complete, and it was not without its problems.
The resulting fluctuations may have been a result of Big Daddy, but Eric Schmidt of Google hinted that there may have been other factors: "machines are full.... we have a huge machine crisis". He estimated that robot-generated spam was taking up between one fifth and one third of Google's index, which probably explains why they were on such a mission to defeat webspam.
As the update spread across the datacenters, people started to notice that many pages from their sites had disappeared from the regular index. Matt Cutts, a senior software engineer at Google, put it down to "sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling." (13)
Matt has stated that, with Big Daddy, they are now indexing more sites than before, and also that the index is now more comprehensive than before.
The rest of 2006, all of 2007, and 2008 saw about 22 recorded changes to PR and backlinks, including some changes and reversals of changes.
At the beginning of '07, Google came out with an algo to try to defeat Google bombing.
2009 was visibly about PR, with six "Google dances" due to PR updates, the removal of the PR toolbar from Google's Webmaster Tools, and the start of the Caffeine update in August.
However Search Engine Land reports Google made between 350 and 550 changes in its organic search algorithms in 2009. (14)
At its core, Caffeine is basically a major overhaul of the Google File System; it went fully live on June 8, 2010.
Google stated: "Caffeine provides 50 percent fresher results for web searches than our last index, and it's the largest collection of web content offered. Whether it's a news story, a blog or a forum post, you can now find links to relevant content much sooner after it is published than was possible ever before." (15)
Just before Caffeine went live on all datacenters, the Google Mayday update came in on the 27th of May.
Mayday was an algorithm change that looked more closely at long-tail search terms.
Deeper pages were impacted, and exact match was abandoned in favor of deeper relevance.
One telling thing that Google said was reported by Vanessa Fox (16) "I asked Google for more specifics and they told me that it was a rankings change, not a crawling or indexing change."
We have to understand that Google is intentionally vague but look at the terms.
- Rankings change
- Crawling change
- Indexing change
While these three terms can be interchanged, to me they each suggest a specific function.
"Rankings" refers to PageRank ratings.
"Crawling" refers to how often your pages are visited.
"Indexing" is all about SERPs.
Throwing in some extra factors for long tail terms shifts the focus away from the REAL effects of Mayday which is the reworking of PageRank.
In an update video (17) a few days later, Matt Cutts said "it's an algorithmic change that changes how we assess which sites are the best match for long tail queries." He recommends that a site owner who is impacted evaluate the quality of the site, ask whether the site really is the most relevant match for the impacted queries and what "great content" could be added, determine if the site is considered an "authority", and ensure that the page does more than simply match the keywords in the query and is relevant and useful for that query.
Since Mayday, I have seen the impact of the changes show in PageRank.
My new site nbs-seo.com rose from a PR0 to a PR4 with only 115 links.
If we went by the old measure of quality (higher-PR sites being higher quality), this would never have happened, as the site had one link on a PR5 page, one link on a PR3 page, and 113 links on PR0 pages.
It also has five times fewer links than one of my older sites, which has a PR3.
As you can see, over the years Google has been tweaking and reworking linking: defeating the scammers and spammers, the link builders who ignore relevance and just place links everywhere they can, and the schemers attempting to influence results with black hat techniques.
After numerous changes, Google declared PR a dead metric with the removal of the PR toolbar from its Webmaster Tools.
The complete algo change means PR is now judged by relevance.
You get more points from having a link on a page that is relevant to your page.
See Links Mean Little Now (18)
SERPs and PageRank are now separate entities.
SERPs are decided by on-page factors: relevance, presentation, silo structure, and synonym densities.
Links build PageRank.
PR determines how often your site is visited by Google.
The more frequently their spider visits your site, the fresher your content is in the index.
The more often you add to your content, the more chances you have for a top ranking page. Each new page is another chance to do well.
The latest Google change is Instant Search, which modifies the SERPs as you type.
It has not yet been determined whether this will be a factor in doing SEO.
Best of luck, all.