
Will the system described in Google’s recent patent become a new ranking algorithm to augment the existing PageRank?

PhraseRank

From the very beginning, Google’s distinctive feature has been hyperlink-induced popularity ranking. Algorithms that use text content to evaluate the relevancy of web documents have played a much lesser role. The reasons for this disparity are purely pragmatic: authors of web documents have total control over their content and are at liberty to modify it to deceive ranking algorithms and gain higher positions in search results. Hyperlinks, however, are much less influenced by webmasters and provide a more reliable measure of authority (link weight) and relevance (link anchor).

Now Google introduces a new way to evaluate the relevancy of a web document based on its content, one that might prove immune to manipulation attempts such as adjusting keyword density or automatically generating keyword-rich web pages. In fact, the new system could become a remedy against MFA (Made for AdSense) sites that display meaningless scraped keyword-rich content alongside paid contextual advertisements.

The new indexing and ranking system is based on the use of phrases. From a user’s point of view, search queries are in most cases phrases, or ‘concepts’, rather than sets of keywords. Despite this, conventional indexing systems still rely on individual terms. Indexing of phrases is avoided because identifying all possible combinations of words would require immense computational and memory resources. For example, a lexicon of 200,000 unique words could yield approximately 3.2×10²⁶ phrases – and no system is capable of storing such an amount of data in memory or manipulating it efficiently.
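As a quick sanity check of that figure, five-word combinations over a 200,000-word lexicon give exactly the quoted order of magnitude (capping phrases at five words is an assumption on my part, chosen because it reproduces the number above):

lexicon_size = 200_000
max_phrase_len = 5  # assumed cap; consistent with the 3.2×10²⁶ figure quoted above

possible_phrases = lexicon_size ** max_phrase_len
print(f"{possible_phrases:.1e}")  # -> 3.2e+26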

This problem is solved in the new system, which identifies phrases that are sufficiently frequent and distinguished in the crawled documents. By detecting phrases and marking them as ‘valid’, the system can identify multiple-word phrases. This eliminates the need to index all possible combinations of words in phrases of varying length.

Another important feature is the ability of phrases to predict the presence of other phrases in a web page. For example, the phrase ‘President of the United States’ indicates that the document most likely also contains the phrase ‘White House’. For every phrase the system creates a corresponding list of related phrases ordered according to their significance. This enables the system to detect spam pages based on an excessive number of related phrases.

So how does the system work?

Indexing

The indexing process includes identification of phrases and related phrases. The system analyses sequences of words and marks them as ‘good’ or ‘bad’ phrases. ‘Good’ phrases are those that occur frequently across the indexed documents or have a distinguished appearance, e.g. they are delimited by markup tags, punctuation or other markers. Another distinguishing feature is the ability of a ‘good’ phrase to predict a related phrase – as in the example above, where ‘President of the United States’ predicts ‘White House’. Some phrases, for example idioms (‘out of the blue’, ‘sitting ducks’, etc.), tend to appear with various unrelated phrases and are not able to predict anything. Therefore idioms and colloquialisms don’t count as ‘good’ phrases.
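To make the idea concrete, here is a minimal Python sketch of such a classifier. The information-gain style predictive measure (actual co-occurrence divided by the co-occurrence expected if the phrases were independent) and both thresholds are illustrative assumptions; the actual procedure is described in paragraphs 0026–0102 of [1].

def predictive_power(cooc, doc_freq, n_docs, j, k):
    # Ratio of actual to expected co-occurrence of phrases j and k.
    # Values well above 1 mean j 'predicts' k (an information-gain-style
    # measure; the patent's exact formula may differ).
    expected = doc_freq[j] * doc_freq[k] / n_docs
    return cooc.get((j, k), 0) / expected if expected else 0.0

def classify_phrases(phrases, cooc, doc_freq, n_docs,
                     min_freq=10, gain_threshold=1.5):
    # Keep phrases that are frequent enough AND predict at least one
    # other phrase; both thresholds here are illustrative.
    good = set()
    for j in phrases:
        if doc_freq[j] < min_freq:
            continue
        if any(predictive_power(cooc, doc_freq, n_docs, j, k) > gain_threshold
               for k in phrases if k != j):
            good.add(j)
    return good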

At the end of the indexing process the system produces a list of valid phrases along with a co-occurrence matrix used as a predictive measure. The estimated size of the list is 650,000 phrases.

The list of good phrases, or posting list, has the following structure:

Phrase i: list:(document d, [list: related phrase count][related phrase information])

For each phrase i there is a list of documents d containing i. For each document there is the number of occurrences of the phrases related to i and a bit vector containing information about the related phrases.

The bit vector consists of pairs of bits. In each pair, a value of 1 in the first position indicates that a related phrase k is present in document d; otherwise the value is 0. The second position indicates whether a phrase l related to phrase k is present. The related phrases l of the related phrases k are called ‘secondary related phrases of i‘. The bit vector is very important, as it is used to determine the relevancy of a document when the search results are ranked.

Example of a bit vector

Phrase i: document d: [related phrase counts:{3,4,3,0,0,2,1,1,0}]
related phrase bit vector:={11 11 10 00 00 10 10 10 01}

For phrase i there are 9 related phrases k. Now take a look at the bit vector. The first pair indicates that both related phrase k1 and one of its related phrases l are present in the document. The fourth and fifth pairs show that neither k4 and k5 nor their related phrases l are found. The last pair shows that although there is no occurrence of phrase k9, one of its related phrases l is present.
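The example can be reproduced with a few lines of Python. The value function below is just one plausible way to collapse the bit vector into a single comparable number (more significant bits for more significant related phrases), not necessarily the patent’s exact valuation:

def related_phrase_bit_vector(counts, secondary_present):
    # For each related phrase k: first bit = k occurs in the document,
    # second bit = some phrase l related to k occurs.
    return " ".join(f"{1 if c > 0 else 0}{1 if s else 0}"
                    for c, s in zip(counts, secondary_present))

def bit_vector_value(counts, secondary_present):
    # Interpret the concatenated bits as a binary number, so earlier
    # (more significant) related phrases dominate the score.
    bits = related_phrase_bit_vector(counts, secondary_present).replace(" ", "")
    return int(bits, 2)

counts    = [3, 4, 3, 0, 0, 2, 1, 1, 0]
secondary = [True, True, False, False, False, False, False, False, True]
print(related_phrase_bit_vector(counts, secondary))
# -> 11 11 10 00 00 10 10 10 01 (matches the example above)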

For each phrase i the documents d are sorted in descending order according to an information retrieval-type score assigned to them with respect to the given phrase. This pre-ranking significantly improves the performance of the system. To calculate the ranking score the system can employ a link-popularity algorithm such as PageRank.

Phrase Identification Process

For a detailed description of the phrase identification process, please refer to [1] (paragraphs 0026–0102).

Searching

The search system receives a query and identifies the phrases in it. Once the set Q of query phrases is created, the system retrieves the posting lists for the phrases in Q. The posting lists are then intersected to determine which documents appear on more than one list.
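In code, the intersection step might look like this minimal sketch (the phrases and document ids are hypothetical):

from collections import Counter

def intersect_posting_lists(posting_lists):
    # posting_lists maps each query phrase in Q to the set of document
    # ids on its posting list; documents found on more than one list
    # are the candidates for phrase-based ranking.
    doc_counts = Counter()
    for docs in posting_lists.values():
        doc_counts.update(docs)
    return {doc for doc, n in doc_counts.items() if n > 1}

posting = {
    "president of the united states": {1, 2, 5, 7},
    "white house": {2, 3, 7, 9},
}
print(intersect_posting_lists(posting))  # -> {2, 7}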

Phrase Based Document Ranking

Documents can be ranked according to their bit vector values. A document containing the most relevant phrases has the highest bit vector value and gets the highest ranking. Note that this approach uses information about related phrases to rank search results, so even documents with a low frequency of the query phrase q can rank highly, provided they have a sufficiently high frequency of related phrases.

To produce the final ranking score, the ‘body hit’ scores calculated above are combined with ‘anchor hit’ scores in a linear function with adjustable weights, e.g.

Rank = (body hit score)*weight1 + (anchor hit score)*weight2.

For each phrase the indexing system also creates lists of documents in which the given phrase appears as an anchor in incoming and outgoing links. The anchor hit score for document d can thus be calculated as a function of the related phrase bit vectors of the query phrases in Q, where a query phrase is an anchor term in a document that references document d.
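A trivial sketch of the final combination; the weights are the adjustable parameters from the formula above, and the values used here are placeholders, not Google’s:

def final_rank(body_hit_score, anchor_hit_score, w_body=0.7, w_anchor=0.3):
    # Linear combination with adjustable weights, as in the formula above.
    return body_hit_score * w_body + anchor_hit_score * w_anchor

print(final_rank(body_hit_score=0.82, anchor_hit_score=0.45))  # -> 0.709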

Detecting Spam Documents

The new phrase-based approach enables the future indexing system to detect and penalize spam documents. A statistical analysis of the document collection shows that a normal web page contains 8 to 20 related phrases. A spam document that tries to deceive a search ranking system with inflated keyword density can be expected to contain an excessive number of related phrases – 100 or more. Therefore, deviations from the expected number of related phrases can be used to detect and combat spam in search results.
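As a sketch, such a filter could be as simple as a threshold test; the 8–20 range and the 100+ figure come from the text above, while the patent’s actual penalty logic is more involved:

def looks_like_spam(related_phrase_count, spam_threshold=100):
    # Normal pages carry roughly 8-20 related phrases; counts far above
    # that suggest keyword-stuffed or auto-generated content.
    return related_phrase_count >= spam_threshold

for doc_id, n_related in [("d1", 12), ("d2", 147)]:
    print(doc_id, "spam" if looks_like_spam(n_related) else "ok")
# -> d1 ok, d2 spam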

This system can also be applied to identify automatically generated content intended to be displayed alongside paid contextual advertisements. This sort of content is often used on MFA (Made for AdSense) sites and is nothing more than a meaningless sequence of keyword-rich text blocks scraped from other websites, RSS feeds or search engine results pages. Although conventional indexing systems are already quite effective at preventing these sites from showing up in search results for popular terms, they can still occasionally appear in results for long-tail terms.

To Sum Up

The new indexing and ranking system proposed by Google uses page content (phrases) to rank search results in a way that is highly resistant to manipulation attempts. The properties of a web document used to rank it, i.e. its phrases and the relations between them, are influenced by the properties of all the other documents in the index, and are therefore beyond the control of webmasters.

The phrase-based approach also enhances the ability of search engines to detect unnatural patterns in text content, such as inflated keyword density or scraped content. It also enables search engines to provide more topically focused results by culling documents that cover multiple topics.

The new approach can be used to augment the existing link-popularity-based ranking systems as an additional parameter in the final score formula. Link popularity values are also used to pre-rank documents in the posting lists to improve the performance of the search system.

Reference:

1. Patterson, A.L., “Detecting spam documents in a phrase based information retrieval system”, United States Patent Application, December 28, 2006


32 Responses to “Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”?”

  1. Search Marketing Facts » Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”? Says:

    [...] Read full entry [...]

  2. TYPELiFE Says:

    Great report, I’m really enjoying reading all your articles. One complaint/suggestion though: at 1024×768 resolution (Firefox 2.0), your site overflows on the x-axis. It’s a minor thing, but it’s one of my huge pet peeves when a site scrolls to the side.

    Cheers!

  3. oleg.ishenko Says:

    Yes, the overflow is annoying, but I don’t see much of a problem in this particular case – it is about 15 pixels and there is no content there. One of the divs is not formatted properly, and I can’t figure out which one.

  4. Bob Says:

    It’s your footer. Get rid of the width; you don’t need it.

  5. oleg.ishenko Says:

    thanks, Bob

  6. David Temple Says:

    Thanks, nicely described. It sounds like natural writing will finally win out. My concern is that Google likes those MFAs and won’t want to get rid of them altogether. How do you think they’ll make up for the lack of revenue if one of their streams gets jammed?

  7. oleg.ishenko Says:

    That’s a question of proper balancing between AdSense and AdWords cash flows. By letting more MFA junk into search results, Google decreases the conversion ratio of its AdWords clients. Some of them would quit or restrict their PPC budgets. Plus, SERPs loaded with MFAs would annoy users and make them turn to MSN or Yahoo, again decreasing AdWords profits (fewer impressions = fewer clicks).

    Personally I think Google would be better off without MFAs.

  8. Google’s New Patent - Blogging on Blogging Says:

    [...] Although it was published on January 6th, I only just came across the article Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”? today. [...]

  9. David Harry Says:

    Welcome to the club… Bill (Slawski) and I are trumpeters of the age of Phrase Based Optimization. I have a piece here: http://www.reliable-seo.com/knowledge-base/technical_seo/phrase_based_optimization.html

    At the bottom of that page are links to all the other Phrase Based IR patents. If you haven’t checked them out, do so.

    Also check out Bill’s site (SEO By the Sea) – he has some great info as well

    Cheers

    David

  10. seo ranter Says:

    Thanks for the article M8…this really opened my eyes to a lot of things

  11. johns wu Says:

    Great article. I am an amateur SEOer, and 2 weeks ago my site was completely delisted from Google and then reappeared today! I have a feeling this sudden event was a result of this new algorithm taking effect.

  12. Seotopic Says:

    I noticed that too, and I wrote a post about it 8 months ago.

    I have translated my original post from Italian to English with Google language tools (my English is not good, and very bad for writing a technical document…)

    My post: http://www.seotopic.com/Eng/new-algo-google.php

    I think the new algo will work very well and will produce better rankings.

  13. TylerCruz.com: An Internet Entrepreneur’s Journey » PhraseRank: The Next Big Thing? Says:

    [...] The actual article he referred to goes on to explain PhraseRank, Google’s newest patent, as “a new way to evaluate relevancy of a web document based on its content which might prove itself to be immune to manipulation attempts such as adjusting the keyword density or the automated generation of keyword-rich web pages.” [...]

  14. Tim W Says:

    In reading your description of “Phrase Rank” and how it looks for semantically similar keywords, I was reminded of a tool I saw in Google Labs that was designed to derive similar sets of words.

    E.g. You enter Red, Green, Yellow… it will generate a list of related terms: Blue, Pink, Orange.

    One wonders if the technologies are related….

  15. Allcreatives.net » Phrase Rank another step towards a better web from google. Says:

    [...] For now you can take a look in more detail at the system here [...]

  16. pete Says:

    Currently there is carnage going on in the directory world; it will be interesting to see how it plays out.

  17. guitar hero cheats Says:

    You guys complaining about Firefox reading the page – just get a real browser like IE. – Sorry I couldn’t resist, actually I use both and if you are serious about SEO you should too.

    PR is so overrated. I am sure Google spends tens of millions, maybe hundreds of millions, of dollars developing anti-spam page technology. Phrase rank would be just one prong of the fork.

  18. Mike | J8 Internet Marketing Says:

    Interesting post! Thanks for sharing.

  19. sushilver Says:

    It’s nice information… and a good step taken by the search engines to track spam pages.

  20. Welcome to our website | Directorio Virtual Colombiano Says:

    [...] Phrase rank the new algorithm, find out about it : seoresearcher. [...]


  22. How to reach number one in Search - EXPERTS ADVICE FORUM Says:

    [...] offering free keyword density checks are: http://www.virtualpromote.com/tools/keyword_analyzer/ Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”? In order to check the keyword density of a text document to be used in the site, the number of [...]

  23. Affan Laghari Says:

    Nice article, though I am a bit late. But I will put in my 2 cents:

    As far as the claim that webmasters will find it difficult to manipulate their own pages to game the system, I think black hat SEOs are further ahead than we realize. I mean, suppose they want to optimize the web page text of an MFA site for the keyword “seo research”.

    They will just look at what documents are ranking for this keyword, like the top 20 results. By analyzing their text, they can get some hints as to what seems to be the normal/usual keyword density of the actual phrase and the related phrases in all these documents. Of course it will be difficult to analyze that in the beginning, and for only one phrase. But if you do the same procedure for 100 phrases and analyze around 2000 web pages, you will get a fair understanding of what Google is doing.

    What I mean is that it will be easier to optimize the web page itself. The real difficulty, however, will be in optimizing the linking pages: optimizing a page that links to you with the optimized anchor text (actual phrase or related phrase) will be the really tough thing.

    Firstly, it’s way too difficult to analyze the linking pages. Even though I can retrieve the top 30 results for the keyword “SEO research”, it’s not easy to get my hands on the hundred thousand documents which link to these documents and pass optimal link juice.
    So far as I know, Yahoo Site Explorer is the most comprehensive database that tells you how many pages are linking. But out of these, Google discounts a large number, such as nofollowed links or infamous directories. So the real toughness will come when PhraseRank is combined with the current PageRank and also TrustRank.
    But that’s my opinion, of course, and it may be wrong.

  24. IP Says:

    Hey Thanks for good info.

    It seems that you are talking about something similar to latent semantic analysis. I would say Google already started working on its LSA-based technique just after the Big Daddy update.

    Although I believe the relevancy to the searched keyword is the most important factor and is what Google considers for its indexing purposes and SERP results.

  25. texxs Says:

    I’m pretty sure this isn’t going to pan out. It’s too easy to check for spam with existing technology. Google just doesn’t want to; they make too much money from it.

  26. proper seo services Says:

    And what’s up with this new Google algorithm with the relevance thing? I heard they’re whining about loads of spam in the search results, hence this new tech.

  27. Clearsite webdesign Says:

    A bit more techie than the usual stuff I read, but very interesting. Thanks a lot.

  28. Rahul Singh Says:

    Thanks for the information… Very nice article, and I am new to this.

  29. Google Cracking Down on Plagiarists? Says:

    [...] other things in an effort to stop spam. That was also the year Google got a patent on a system to rank pages by phrase and they updated their spam reporting [...]

  30. How to Compete When Your Niche is Overwhelmed With Spam | ...:: aiogrup.net ::... Says:

    [...] website owners can rest positive that vital hunt engines like Google will detect spamming sites in time and either diminution their rankings or reject [...]

  31. Dawn Abraham Business Coach Says:

    I found this article very useful even though it was written years ago. I love learning how Google gets its information. Thanks for all the great info.

  32. Alex SEO Expert Says:

    It’s best to avoid spam in Google’s eyes. So many old tactics just don’t work anymore, and PR actually is a great ranking factor in some instances. Though having said that, you need to realise that you don’t need any PR to outrank a high-PR site!
