<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SEO and Web Marketing Research &#187; Search Engines Technology</title>
	<atom:link href="http://www.seoresearcher.com/category/search-engines-technology/feed" rel="self" type="application/rss+xml" />
	<link>http://www.seoresearcher.com</link>
	<description>A comprehensive SEO and Web Marketing study</description>
	<lastBuildDate>Wed, 03 Jun 2009 01:01:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>How Much Blog Spam? A Study of a Ping Dataset</title>
		<link>http://www.seoresearcher.com/how-much-blog-spam-a-study-of-a-ping-dataset.htm</link>
		<comments>http://www.seoresearcher.com/how-much-blog-spam-a-study-of-a-ping-dataset.htm#comments</comments>
		<pubDate>Mon, 12 Feb 2007 12:44:54 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>
		<category><![CDATA[WordPress Blogging]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/how-much-blog-spam-a-study-of-a-ping-dataset.htm</guid>
		<description><![CDATA[How much blog spam is produced in 5 minutes on a quiet Sunday evening? What is the ratio of spam blogs on the most popular blog services? To answer these questions, I present the results of an experiment analyzing ping data [...]]]></description>
			<content:encoded><![CDATA[<p><img width="180" height="165" align="left" src="http://www.seoresearcher.com/images/articles/sunday-spam.jpg" />How much <strong>blog spam</strong> is produced in 5 minutes on a quiet Sunday evening? What is the <strong>ratio of spam blogs</strong> on the most popular blog services? To answer these questions, I present the results of an experiment that analyzed ping data and manually reviewed blogs.</p>
<p>The relative ease of creating and maintaining blogs makes them ideal tools for spamming search engines. Spam blogs, or <strong>splogs</strong>, serve two basic purposes: <strong>making money from advertising and affiliate programs</strong>, and participating in <strong>link farms</strong>. But making money from AdSense and providing nepotistic links are not enough to call a blog a splog. Otherwise we would have to classify every blog that shows ads or promotes a business as spam, and thousands of popular, quality blogs would fall into this category. The distinctive feature of a splog is that it is of no use to its visitors. Should Google ban a splog from AdSense and prevent its links from passing on authority, the splog would have no remaining value or reason to exist. So my definition of a splog is “<em>a blog whose <strong>only</strong> purpose is to show contextual or affiliate ads, or to boost the link popularity of certain target sites</em>”.</p>
<p><span id="more-51"></span></p>
<p>How active are these splogs? This question calls for a little experiment, similar to the one described by P. Kolari, A. Java and T. Finin in their paper “<a target="_blank" href="http://www.blogpulse.com/www2006-workshop/papers/splogosphere.pdf">Characterizing the Splogosphere</a>”. They ran their experiment in early 2006, and I am going to repeat it on a smaller scale now, in early 2007.</p>
<p>Every time a blog is updated it sends a <a target="_blank" href="http://en.wikipedia.org/wiki/Ping_blog"><strong>ping</strong></a> to one of many ping servers to invite search engine crawlers to index the new post. I am going to use ping data provided by one of the most popular ping servers – <a target="_blank" href="http://www.weblogs.com/">Weblogs.com</a>. Because of the limited scale of the experiment I use the smaller dataset, which covers the last 5 minutes of pings. It is still pretty big: 8117 pings. I wrote a simple Java application to parse the XML file and extract the URLs and names of the blogs in the dataset. Some of the blogs were also classified by blog platform: <a target="_blank" href="http://www.blogger.com/">Blogspot (Blogger)</a>, <a target="_blank" href="http://www.myspace.com/">MySpace</a>, <a target="_blank" href="http://spaces.live.com/">Spaces.Live.com</a> etc. I discovered a number of popular blog services that I had not come across before, such as the popular Taiwanese site <a target="_blank" href="http://www.wretch.cc/">Wretch.cc</a>, or the Italian <a target="_blank" href="http://libero.it/">Libero.it</a> and <a target="_blank" href="http://www.splinder.com/">Splinder.com</a>. I was surprised to see how few pings came from some other popular blog services; <a href="http://www.livejournal.com/">Livejournal</a>, for instance, had only 6 pings! Obviously LJ doesn’t rely much on Weblogs.com, but LJ has little to do with my experiment anyway, as it is known to have a very small percentage of splogs.</p>
<p>Below is a breakdown of blogs by platform, according to a ping dataset retrieved on the evening of Sunday, Feb. 11. Do not confuse blogs in the <a target="_blank" href="http://www.wordpress.com/">Wordpress.com</a> category with blogs that use WP as a <strong>blog engine</strong>. Only blogs hosted by Wordpress.com are included in this category.</p>
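<p>The parsing and classification step could look roughly like the sketch below. It is written in Python rather than the Java used for the experiment, and it rests on assumptions: that the ping file has a <em>weblogUpdates</em> root with <em>weblog</em> elements carrying <em>name</em> and <em>url</em> attributes, and that a platform can be recognised by the host name. The <em>PLATFORMS</em> table and the function names are hypothetical.</p>

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to platform label (illustrative only).
PLATFORMS = {
    "blogspot.com": "Blogspot",
    "myspace.com": "MySpace",
    "spaces.live.com": "Spaces.Live.com",
    "wordpress.com": "Wordpress.com",
    "livejournal.com": "Livejournal",
}


def classify(url):
    """Return the platform label for a blog URL, or 'Rest' if none matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, label in PLATFORMS.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    return "Rest"


def platform_counts(xml_text):
    """Count pings per platform in a ping dataset given as an XML string."""
    root = ET.fromstring(xml_text)
    # Assumption: every ping is a 'weblog' element with a 'url' attribute.
    return Counter(classify(w.get("url", "")) for w in root.iter("weblog"))
```

<p>Calling <em>platform_counts</em> on the downloaded file contents returns a counter keyed by platform label, with everything unrecognised falling into 'Rest'.</p>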
<p><img width="400" height="312" src="http://www.seoresearcher.com/images/articles/spam02.jpg" /></p>
<p><em>Fig. 1 Popular Blog Services in the Sunday Weblogs Dataset</em></p>
<p>The huge ‘<strong>Rest</strong>’ category consists of <strong>standalone</strong> blogs and blogs hosted by <strong>minor blog services</strong>.<br />
A few words on the blogs in the dataset: many of them were not in English, I think as many as 70%. For instance, all Wretch.cc blogs and many Spaces.Live.com ones are in Chinese; there are also a lot of blogs in Italian, Spanish, Russian, Japanese and German.</p>
<p>Once the dataset was downloaded and processed, I started manually reviewing the blogs to find spam. Of course I couldn’t visit all 8117 blogs, so I randomly selected 20 blogs from each category.</p>
<p>How did I classify spam blogs? While blogs with automatically generated content or dictionary dumps are easily classified as spam, those with plagiarized content or in foreign languages required a bit more effort. Nepotistic links with keyword-stuffed anchors were a good indicator of spam. <a href="http://www.copyscape.com/">Copyscape.com</a> helped a lot in discovering plagiarized posts. Finally, affiliate and contextual ads completed the picture. It should be noted that very few blogs in languages other than English were classified as spam. I can be sure about my judgment of German and Russian blogs, since I know these languages, but when dealing with others I relied only on excessive advertising and nepotistic links as spam indicators. I skipped the Wretch.cc and Explog.jp samples, as I was totally unable to judge Chinese and Japanese blogs. In total, 36 of the 177 reviewed blogs were classified as spam.</p>
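<p>Drawing the per-category samples can be sketched as follows. This is an illustration only: the function name and the input layout (a dictionary of category name to list of blog URLs) are assumptions, and so is the choice to review a category in full when it has fewer than 20 blogs.</p>

```python
import random


def sample_per_category(blogs_by_category, k=20, seed=None):
    """Pick up to k blogs at random, without replacement, from each category."""
    rng = random.Random(seed)
    sample = {}
    for category, urls in blogs_by_category.items():
        # Assumption: categories with fewer than k blogs are reviewed in full.
        sample[category] = rng.sample(urls, min(k, len(urls)))
    return sample
```

<p>Passing a fixed <em>seed</em> makes the selection repeatable, which helps if the review has to be re-checked later.</p>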
<p>Below you can see two charts, one indicating a ratio of spam within a sample,    and another showing how much each blog platform contributes to the total amount    of spam.</p>
<p><img width="414" height="309" src="http://www.seoresearcher.com/images/articles/spam03.jpg" /></p>
<p><em>Fig 2. Percentage of Spam Blogs in 20-blogs Samples</em></p>
<p><em><img width="306" height="254" src="http://www.seoresearcher.com/images/articles/spam01.jpg" /></em></p>
<p><em>Fig 3. Contribution of Each Category to the Total Blog Spam</em></p>
<p>With the notable exception of Blogspot, the majority of blogs hosted by popular blog services are spam free. Of course one can question their quality, as many of them are of little value to others. But let’s not forget that most of those blogs are private diaries or personal playgrounds never intended to have big audiences; as long as they have value to the author and his or her close circle of friends, we can’t call them spam.</p>
<p>Thus, according to my reviews, blogs hosted by beon.ru, Libero.it, Spaces.Live.com, Livejournal.com, splinder.com, and typepad.com showed no instances of blog spam in their 20-blog samples. Among 20 MySpace blogs I discovered 1 splog, and the Wordpress.com sample contained 2. Google’s popular service Blogspot confirmed its unofficial name of <span style="font-weight: bold">Splogspot</span> with a 50% spam ratio. ‘The Rest’ category, made up of standalone blogs and blogs attached to commercial sites, showed an even bigger proportion of blog spam: 23 of the 27 blogs reviewed were classified as spam. The relatively low number of splogs hosted by public services can be explained by the anti-spam actions taken by the administrators of such services. Standalone splogs, however, are not subject to such moderation, which allows them to thrive, producing tons of junk content for SE crawlers and overloading ping servers with spam pings.</p>
<p>As you might have noticed I used the same style of charts introduced by the    famous blog <a target="_blank" href="http://www.modernlifeisrubbish.co.uk/">ModernLifeIsRubbish.co.uk</a>,    which has an excellent tutorial on <a target="_blank" href="http://www.modernlifeisrubbish.co.uk/article/howto-make-pretty-pie-charts">how    to create pretty pie charts in Adobe Illustrator</a>. Highly recommended!</p>
<p>If anybody is interested, here is the dataset I used: <a href="http://www.seoresearcher.com/files/WeblogsDataset.xls">Dataset</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/how-much-blog-spam-a-study-of-a-ping-dataset.htm/feed</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s New Algorithm to Rank Pages and Detect Spam: &#8220;PhraseRank&#8221;?</title>
		<link>http://www.seoresearcher.com/googles-new-algorithm-to-rank-pages-and-detect-spam-phrase-rank.htm</link>
		<comments>http://www.seoresearcher.com/googles-new-algorithm-to-rank-pages-and-detect-spam-phrase-rank.htm#comments</comments>
		<pubDate>Sun, 07 Jan 2007 01:32:44 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/googles-new-algorithm-to-rank-pages-and-detect-spam-phrase-rank.htm</guid>
		<description><![CDATA[Will the system described in Google’s recent patent become a new ranking algorithm to augment the existing PageRank?
PhraseRank
From the very beginning, Google’s distinctive feature was hyperlink-induced popularity ranking. Algorithms using text content to evaluate relevancy of web [...]]]></description>
			<content:encoded><![CDATA[<p><img width="270" height="210" align="left" alt="Phrase Rank" src="http://www.seoresearcher.com/images/articles/phrase-rank.jpg" />Will the system described in <a target="_blank" href="http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&#038;Sect2=HITOFF&#038;u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&#038;r=1&#038;p=1&#038;f=G&#038;l=50&#038;d=PG01&#038;S1=20060294155.PGNR.&#038;OS=dn/20060294155&#038;RS=DN/20060294155">Google’s recent patent</a> become <strong>a new ranking algorithm</strong> to augment the existing <strong>PageRank</strong>?</p>
<h2>PhraseRank</h2>
<p>From the very beginning, Google’s distinctive feature was <strong>hyperlink-induced popularity ranking</strong>. Algorithms using <strong>text content</strong> to evaluate the relevancy of web documents played a much lesser role. The reasons for this disparity are purely pragmatic: authors of web documents have total control over their content and are at liberty to modify it to <strong>deceive ranking algorithms</strong> and get higher positions in search results. Hyperlinks, however, are much less influenced by webmasters and provide a more reliable measure of authority (link weight) and relevance (link anchor).</p>
<p>Now Google introduces a new way to evaluate the relevancy of a web document based on its content, one that might prove to be <strong>immune to manipulation attempts</strong><span id="more-49"></span> such as adjusting the keyword density or automatically generating keyword-rich web pages. In fact, the new system could become a remedy against <strong>MFA</strong> (Made For AdSense) sites that display meaningless, scraped, keyword-rich content alongside paid contextual advertisements.</p>
<p>The new indexing and ranking system is based on the use of <strong>phrases</strong>. From a user’s point of view, search queries in most cases are phrases or ‘concepts’ rather than sets of keywords. Despite this, conventional indexing systems still rely on <strong>individual terms</strong>. Indexing of phrases is avoided because identifying all possible combinations of words would require immense computational and memory resources. For example, a lexicon of 200,000 unique words could have approx. 3.2&#215;10<sup>26</sup> phrases – and no system is capable of storing that much data in memory or manipulating it efficiently.</p>
<p>The new system solves this problem by identifying phrases that are sufficiently frequent and distinctive in the crawled documents. By detecting phrases and marking them as ‘valid’, the system can identify multi-word phrases. This eliminates the need to index all possible combinations of words in phrases of varying length.</p>
<p>Another important feature is the ability of phrases to predict the presence of other phrases in a webpage. For example, the phrase ‘<em>President of the United States</em>’ indicates that the document most likely contains the phrase ‘<em>White House</em>’. For every phrase the system creates a corresponding list of related phrases ordered by significance. This enables the system to detect spam pages based on an excessive number of related phrases.</p>
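<p>The text does not show how the figure above is obtained. One reading that reproduces it, and this is my assumption, is to count every ordered sequence of five words drawn from the 200,000-word lexicon. The function name below is hypothetical.</p>

```python
def ordered_sequences(vocabulary_size, length):
    """Number of ordered word sequences of a given length (repeats allowed)."""
    return vocabulary_size ** length


# Assumption: a 'phrase' here means any sequence of exactly five words.
count = ordered_sequences(200_000, 5)
print(f"{count:.1e}")  # prints 3.2e+26
```

<p>Under that reading the count grows by a factor of 200,000 with each extra word, which is why enumerating all candidate phrases is impractical.</p>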
<p>So how does the system work?</p>
<h2>Indexing</h2>
<p>The process of indexing includes the identification of phrases and related phrases. The system analyses sequences of words and marks them as ‘good’ or ‘bad’ phrases. ‘Good’ phrases are those that occur quite frequently across the indexed documents or have a distinctive appearance, e.g. are delimited by markup tags, punctuation or other markers. Another distinguishing feature is the ability of a ‘good’ phrase to <strong>predict a related phrase</strong> – as in the example above, where ‘<em>President of the United States</em>’ predicts ‘<em>White House</em>’. Some phrases, for example idioms (‘<em>out of the blue</em>’, ‘<em>sitting ducks</em>’ etc), tend to appear with different and unrelated phrases, and are not able to predict anything. Therefore idioms and colloquialisms don’t count as ‘good’ phrases.</p>
<p>At the end of the indexing process the system produces a list of valid phrases    along with a co-occurrence matrix as a predictive measure. An estimated size    of the list is 650,000 phrases.</p>
<p>The list of good phrases, or <strong>posting list</strong>, has the following structure:</p>
<pre>Phrase i: list:(document d, [list: related phrase count][related phrase information])</pre>
<p>For each phrase <em>i</em> there is a list of documents d containing <em>i</em>.    For each document there is the number of occurrences of the phrases related    to <em>i</em>, and a bit vector containing the information about related phrases.</p>
<p>The <strong>bit vector</strong> consists of pairs of bits. In each pair, the value 1 in the first position indicates that a related phrase <em>k</em> is present in the document <em>d</em>; otherwise the value is 0. The second position indicates whether a phrase <em>l</em> related to phrase <em>k</em> is present. The related phrases <em>l</em> of related phrases <em>k</em> are called ‘<em>secondary related phrases of i</em>’. The bit vector is very important, as it is used to determine the relevancy of a document when the search results are ranked.</p>
<h3>Example of a bit vector</h3>
<pre>Phrase <em>i</em>: document <em>d</em>: [related phrase counts:{3,4,3,0,0,2,1,1,0}]</pre>
<pre>related phrase bit vector:={11 11 10 00 00 10 10 10 01}</pre>
<p>For phrase <em>i</em> there are 9 related phrases <em>k</em>. Now take a look at the bit vector. The first pair indicates that both related phrase <em>k<sub>1</sub></em> and one of its related phrases <em>l</em> are present in the document. The fourth and fifth pairs show that neither <em>k<sub>4</sub></em> and <em>k<sub>5</sub></em> nor their related phrases <em>l</em> are found. The last pair shows that although there is no occurrence of phrase <em>k<sub>9</sub></em>, one of its related phrases <em>l</em> is present.</p>
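<p>To make the reading of the pairs concrete, here is a small sketch that decodes the example bit vector above. The string representation and the function name are illustrative choices of mine, not something the patent prescribes.</p>

```python
def decode_bit_vector(pairs):
    """Map each two-bit pair to (related phrase present, secondary phrase present)."""
    return [(pair[0] == "1", pair[1] == "1") for pair in pairs]


# The example vector from the text, one two-bit pair per related phrase k.
pairs = "11 11 10 00 00 10 10 10 01".split()
flags = decode_bit_vector(pairs)

for index, (k_present, l_present) in enumerate(flags, start=1):
    print(f"k{index}: related phrase present={k_present}, secondary present={l_present}")
```

<p>The first tuple is (True, True), the fourth and fifth are (False, False), and the ninth is (False, True), matching the walk-through above.</p>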
<p>For each phrase <em>i</em> the documents <em>d</em> are sorted in declining    order according to the information retrieval-type score assigned to them with    respect to the given phrase. This pre-ranking significantly improves performance    of the system. To calculate ranking score the system can employ a link-popularity    algorithm such as PageRank.</p>
<p align="left"><img width="405" height="310" alt="Phrase Identification Process" src="http://www.seoresearcher.com/images/articles/phrase-identification.jpg" /></p>
<p align="left"><em>Phrase Identification. For a detailed description of the process    please refer to <a target="_blank" href="http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&#038;Sect2=HITOFF&#038;u=/netahtml/PTO/search-adv.html&#038;r=1&#038;p=1&#038;f=G&#038;l=50&#038;d=PG01&#038;S1=20060294155.PGNR.&#038;OS=dn/20060294155&#038;RS=DN/20060294155">[1]</a>    (paragraphs 0026 &#8211; 0102)</em></p>
<h2>Searching</h2>
<p>The search system receives a query and identifies the phrases in it. Once the set <em>Q</em> of query phrases is created, the system retrieves the posting lists for the query phrases in <em>Q</em>. The posting lists are intersected to determine which documents appear on more than one list.</p>
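<p>A minimal sketch of that step, assuming each posting list is simply a list of document ids keyed by its query phrase (the data layout and function name are hypothetical):</p>

```python
from collections import Counter


def documents_on_multiple_lists(posting_lists):
    """Return ids of documents that appear on at least two posting lists."""
    counts = Counter()
    for doc_ids in posting_lists.values():
        # A set guards against counting a document twice within one list.
        counts.update(set(doc_ids))
    return sorted(doc for doc, n in counts.items() if n >= 2)
```

<p>For a query with only two phrases this reduces to a plain set intersection of the two lists.</p>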
<h2>Phrase Based Document Ranking</h2>
<p>Documents can be ranked according to their<strong> bit vector values</strong>.    A document containing the most relevant phrases has the highest bit vector value    and gets the highest ranking. Note that this approach uses the <strong>information    about related phrases</strong> to rank search results, so even documents with    low frequency of the query phrase <em>q</em> can get high rankings provided    they have sufficiently high frequency of related phrases.</p>
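<p>One way to picture a 'bit vector value', and this is an assumption about the representation, is to read the whole vector as a binary number. Because related phrases are ordered by significance, the leftmost pairs then carry the most weight. The document names and vectors below are made up for illustration.</p>

```python
def bit_vector_value(bit_vector):
    """Interpret a space-separated bit vector as an unsigned binary number."""
    return int(bit_vector.replace(" ", ""), 2)


# Hypothetical documents with their related phrase bit vectors.
docs = {
    "d1": "11 11 10 00 00 10 10 10 01",
    "d2": "10 00 00 00 00 00 00 00 00",
    "d3": "00 11 11 11 11 11 11 11 11",
}

# Highest value first.
ranked = sorted(docs, key=lambda d: bit_vector_value(docs[d]), reverse=True)
print(ranked)  # prints ['d1', 'd2', 'd3']
```

<p>Note that d2, which contains only the most significant related phrase, still outranks d3, which contains every related phrase except that one.</p>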
<p>To produce the final ranking score, the ‘<strong>body hit</strong>’ scores calculated above are combined with ‘<strong>anchor hit</strong>’ scores in a linear function with adjustable weights, e.g.</p>
<pre>Rank = (body hit score)*weight1 + (anchor hit score)*weight2.</pre>
<p>For each phrase the indexing system also creates lists of documents in which    the given phrase is an <strong>anchor</strong> in incoming and outgoing links.    So the <strong>anchor hit</strong> score for document <em>d</em> can be calculated    as a function of the related phrase bit vectors of the query phrases <em>Q</em>,    where <em>Q</em> is an anchor term in a document that references document <em>d</em>.</p>
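<p>The linear combination itself is simple to express. In the sketch below the default weights are placeholders; the article does not give actual values.</p>

```python
def final_rank(body_hit_score, anchor_hit_score, weight1=0.5, weight2=0.5):
    """Linear combination of body hit and anchor hit scores.

    The default weights are illustrative placeholders, not values from the patent.
    """
    return body_hit_score * weight1 + anchor_hit_score * weight2
```

<p>Raising <em>weight2</em> relative to <em>weight1</em> makes anchor text in referencing documents count for more than the phrases in the document body.</p>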
<h2>Detecting Spam Documents</h2>
<p>The new phrase-based approach enables the future indexing system to detect and penalize spam documents. A statistical analysis of the document collection shows that a web page normally contains 8 to 20 related phrases. A spam document that deceives a search ranking system with an inflated keyword density is expected to contain an excessive number of related phrases, such as 100 or more. Therefore, deviations from the expected number of related phrases can be used to detect and battle spam in search results.</p>
<p>This system can also be applied to identify automatically generated content intended to be displayed along with paid contextual advertisements. This sort of content is often used on MFA (Made for AdSense) sites and is nothing more than a meaningless sequence of keyword-rich text blocks scraped from other websites, RSS feeds or search engine results pages. Although conventional indexing systems are already quite effective in preventing these sites from showing in search results for popular terms, they can still occasionally appear in results for long-tail terms.</p>
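<p>As a rough illustration, the check could count the related phrases flagged as present in a document's bit vector and compare the total with a cutoff. The cutoff of 100 echoes the figure mentioned above but is only an illustrative default, and the function names and sample vectors are hypothetical.</p>

```python
def count_related_phrases(bit_vector):
    """Count pairs whose first bit is set, i.e. related phrases that are present."""
    return sum(1 for pair in bit_vector.split() if pair[0] == "1")


def looks_like_spam(bit_vector, spam_threshold=100):
    """Flag a document whose related phrase count reaches the threshold."""
    return count_related_phrases(bit_vector) >= spam_threshold


# Made-up vectors: 12 related phrases present versus 150 present.
normal_page = " ".join(["10"] * 12 + ["00"] * 200)
stuffed_page = " ".join(["11"] * 150)
```

<p>With these inputs the first page falls inside the normal 8 to 20 range and is not flagged, while the second one is.</p>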
<h2>To Sum Up</h2>
<p>The new indexing and ranking system proposed by Google uses page content (<strong>phrases</strong>)    to rank search results in a way that is highly immune to manipulation attempts.    The properties of a web document used to rank documents, i.e. phrases and relations    between them, are influenced by the properties of all the other documents in    the index, and therefore are out of control of webmasters.</p>
<p>The phrase-based approach also enhances the ability of search engines to detect unnatural patterns in text content, such as inflated keyword density or scraped content. It also enables search engines to provide more topically focused results by culling documents covering multiple topics.</p>
<p>The new approach can be used as an augmentation to the existing link-popularity    based ranking systems as an additional parameter in the final score formula.    Link popularity values are also used to pre-rank documents in posting lists    to improve the performance of the search system.</p>
<h2>Reference:</h2>
<p>1. Patterson, A.L. &#8220;<a target="_blank" href="http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&#038;Sect2=HITOFF&#038;u=/netahtml/PTO/search-adv.html&#038;r=1&#038;p=1&#038;f=G&#038;l=50&#038;d=PG01&#038;S1=20060294155.PGNR.&#038;OS=dn/20060294155&#038;RS=DN/20060294155">Detecting spam documents in a phrase based information retrieval system</a>&#8221;, United States Patent Application, 12.28.2006</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/googles-new-algorithm-to-rank-pages-and-detect-spam-phrase-rank.htm/feed</wfw:commentRss>
		<slash:comments>27</slash:comments>
		</item>
		<item>
		<title>Duplicate Content &#8211; What You Ought to Know About</title>
		<link>http://www.seoresearcher.com/duplicate-content-what-everybody-ought-to-know-about.htm</link>
		<comments>http://www.seoresearcher.com/duplicate-content-what-everybody-ought-to-know-about.htm#comments</comments>
		<pubDate>Wed, 22 Nov 2006 17:17:28 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/duplicate-content-what-everybody-ought-to-know-about.htm</guid>
		<description><![CDATA[Take  a look at your website. How much of your content might be considered as duplicate  by a search engine algorithm? Even though you never copy anyone you can&#8217;t answer  &#8216;none&#8217; because someone can be copying you. Duplicate content is  one of the biggest issues both for search engines trying to [...]]]></description>
			<content:encoded><![CDATA[<p><img width="225" height="186" align="left" alt="Duplicate Content" src="http://www.seoresearcher.com/images/articles/duplicated.jpg" />Take  a look at your website. How much of your content might be considered as duplicate  by a search engine algorithm? Even though you never copy anyone you can&#8217;t answer  &#8216;none&#8217; because someone can be copying you. <strong>Duplicate content </strong>is  one of the biggest issues both for search engines trying to keep their results&#8217;  relevancy high, and webmasters trying to avoid search engine penalties.</p>
<p><strong>Penalties</strong> for having duplicate content can be really harmful. The penalty is not just a downgrade in rankings but a move to the supplemental results, which are hardly visible to most web users. Normally one would expect Google to select one URL over another to display in SERPs, while the duplicates could be found in <strong>supplemental results</strong>. Unfortunately this is not always so. In this thread <a target="_blank" href="http://www.webmasterworld.com/forum30/31430.htm">[1]</a> of the WebmasterWorld forum you can read about a case in which an original, high-quality, authoritative page was removed from Google&#8217;s index together with its duplicates. Considering that this can happen even to the most honest webmaster, one can imagine the amount of attention this issue gets on any SEO forum.<span id="more-43"></span></p>
<h2>Types of Duplicate Content</h2>
<p>Duplicate content has a wider definition than &#8216;copy-paste&#8217; plagiarism; it is not just content scraped from a competitor&#8217;s site, a SERP or an RSS feed. There are a few more cases that are generally referred to as duplicate content.</p>
<h3>Circular Navigation</h3>
<p><strong>Jake Baille</strong> from <em>TrueLocal</em> loosely defines circular navigation as having multiple paths across a website <a target="_blank" href="http://www.seroundtable.com/archives/003398.html">[2]</a>. This can be understood as the same content being accessible via different URLs. An example of circular navigation is an article that can be retrieved by links like<br />
<em>- www.mysite.com/articles/1/<br />
- www.mysite.com/article1/<br />
- www.mysite.com/articles.php?id=1 </em></p>
<p>Another legitimate use of multiple URLs is forum threads. Each thread can be accessible by a link like <em>www.myforum.com/index.php/topic.1201.html</em>, and each message within the thread has a URL like <em>www.myforum.com/index.php/topic.1201.msg.01.html</em>. In the eyes of a search engine, all these links lead to different pages with identical content. The solution? Think of a consistent way of linking, or apply <em>robots.txt</em> <strong>exclusion rules</strong>.</p>
<p>The same problem arises when other people link to you using different-looking URLs. Since these external links are out of your control, you should create a 301 redirect to the canonical URL you want displayed. A tutorial on 301 redirects can be found here <a target="_blank" href="http://www.beyondink.com/howtos/301-redirect.html">[3]</a>.</p>
<h3>Printer-Friendly Versions</h3>
<p>Making a printer-friendly version is a common practice, and it adds value for visitors. But a printer-friendly version is also a prominent example of duplicate content! Fortunately, a simple solution such as adding a &#8216;noindex&#8217; meta tag to your print pages solves the issue.</p>
<h3>Product-Only Pages</h3>
<p>Similar-looking product pages are common among online stores. Typically they are created from a single template. Often two different product pages share a description that varies in just a few words or numbers, which causes them to be filtered out as duplicate content. This issue has no easy solution. Either you rewrite robots.txt to allow only one product description to be crawled and lose SE traffic to the rest, or you roll up your sleeves and add something different to each product page, such as testimonials, which is time-consuming or nearly impossible depending on the number of product types in your stock.</p>
<h2>How Do Duplicate Content Filters Work?</h2>
<p>Data mining offers several algorithms for detecting similar text passages. The one claimed to be used by search engines <a target="_blank" href="http://www.seroundtable.com/archives/003398.html">[2]</a> is w-shingling <a target="_blank" href="http://en.wikipedia.org/wiki/W-shingling">[4]</a>. Each document is represented by a fingerprint made of its shingles &#8211; contiguous subsequences of tokens (blocks of text). The ratio of the size of the intersection to the size of the union of two documents&#8217; shingle sets can be used to measure their resemblance. Other algorithms that can be used for duplicate detection are Levenshtein distance <a target="_blank" href="http://en.wikipedia.org/wiki/Levenshtein_distance">[5]</a> and Soundex <a target="_blank" href="http://en.wikipedia.org/wiki/Soundex">[6]</a>.</p>
<p>It is natural to expect a duplicate content filter to be able to discover the origin and rank it higher. The simplest way to detect the origin would be to compare indexing dates, on the assumption that the original source is uploaded and crawled earlier than its copies. But with the advent of RSS feeds new content can be distributed instantaneously, and this approach is no longer valid.</p>
<p>As for the origin&#8217;s right to be ranked higher &#8211; this is not always honored. In this article <a target="_blank" href="http://www.seochat.com/c/a/Google-Optimization-Help/Duplicate-Content-Penalties-Problems-with-Googles-Filter/">[9]</a> you can read about an article distribution experiment. An article was syndicated twice, yielding as many as 19,000 copies. After some time Google, Yahoo and MSN purged their indices, leaving just a few of the duplicates. MSN&#8217;s filter managed not only to discover the origin but also to put it at the top of the search results. Yahoo also discovered the origin, but in the results for a search on the article&#8217;s title the origin&#8217;s position fluctuated, apparently reflecting the way Yahoo weighs relevancy and authority.</p>
<p>To the tester&#8217;s amusement, Google&#8217;s refined index did not include the original at all! Evidently Google featured only those copies of the article that it considered relevant and authoritative, with no regard to the original source of the content. I&#8217;ve already mentioned a thread <a target="_blank" href="http://www.webmasterworld.com/forum30/31430.htm">[1]</a> where a similar problem is discussed. Both stories took place in 2005 and early 2006, and so far I have found no evidence that this issue has been resolved.</p>
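<p>The resemblance measure can be sketched in a few lines. This is a generic illustration of w-shingling with whitespace tokenisation and a shingle width of 4, both of which are my choices; it makes no claim about what any search engine actually runs.</p>

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token subsequences of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Size of the intersection divided by size of the union of the shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # convention chosen here: two texts with no shingles count as identical
    return len(sa.intersection(sb)) / len(sa.union(sb))
```

<p>A score of 1.0 means the two texts share every shingle and 0.0 means they share none; a filter would treat pairs above some chosen cutoff as duplicates.</p>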
<h2>References and links to read about Duplicate Content</h2>
<ol>
<li>&#8216;<a target="_blank" href="http://www.webmasterworld.com/forum30/31430.htm">Duplicate Content Observation</a>&#8217;. WebmasterWorld.com</li>
<li>&#8216;<a target="_blank" href="http://www.seroundtable.com/archives/003398.html">Duplicate Content Issues</a>&#8217;. SERoundtable.com, 2006.02.28</li>
<li>&#8216;<a target="_blank" href="http://www.beyondink.com/howtos/301-redirect.html">301 Redirect &#8212; a How-To</a>&#8217;. BeyondInk.com</li>
<li>&#8216;<a target="_blank" href="http://en.wikipedia.org/wiki/W-shingling">W-Shingling</a>&#8217;. Wikipedia</li>
<li>&#8216;<a target="_blank" href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein Distance</a>&#8217;. Wikipedia</li>
<li>&#8216;<a target="_blank" href="http://en.wikipedia.org/wiki/Soundex">Soundex</a>&#8217;. Wikipedia</li>
<li>&#8216;<a target="_blank" href="http://www.webconfs.com/duplicate-content-filter-article-1.php">Duplicate Content Filter: What it is and how it works</a>&#8217;. WebConfs.com</li>
<li><a target="_blank" href="http://www.copyscape.com/">CopyScape.com</a> &#8212;      discovers copied and similar pages.</li>
<li>&#8216;<a target="_blank" href="http://www.seochat.com/c/a/Google-Optimization-Help/Duplicate-Content-Penalties-Problems-with-Googles-Filter/">Duplicate Content Penalties Problems with Googles Filter</a>&#8217; by J.S.Cassidy, published at SEOChat.com</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/duplicate-content-what-everybody-ought-to-know-about.htm/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Search Engines vs. SEO Spam: Statistical Methods</title>
		<link>http://www.seoresearcher.com/search-engines-battle-against-seo-spam-statistical-methods.htm</link>
		<comments>http://www.seoresearcher.com/search-engines-battle-against-seo-spam-statistical-methods.htm#comments</comments>
		<pubDate>Mon, 13 Nov 2006 23:31:37 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/search-engines-battle-against-seo-spam-statistical-methods.htm</guid>
		<description><![CDATA[High    placement in a search engine is critical for the success of any online business.    Pages appearing higher in the search engine results to queries relevant to a    site&#8217;s business will get higher targeted traffic. To get this kind of competitive    advantage Internet [...]]]></description>
			<content:encoded><![CDATA[<p><img width="244" height="173" align="left" src="http://www.seoresearcher.com/images/articles/web-spam.jpg" />High placement in a search engine is critical for the success of any online business. Pages that appear higher in the search engine results for queries relevant to a site&#8217;s business get more targeted traffic. To gain this kind of competitive advantage Internet companies employ various <strong>SEO</strong> techniques to optimize the factors search engines use to rank results. In the best case SEO specialists create relevant, well-structured, keyword-rich pages, which not only please a search engine crawler but also have value for the human visitor. Unfortunately this strategic approach takes months to produce tangible results, so many search engine optimizers resort to so-called <strong>&#8220;black-hat&#8221; SEO</strong>.<span id="more-41"></span></p>
<p>&#8220;Black-hat&#8221; SEO is responsible for the immense amount of <strong>search engine spam</strong>: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it lets them exclude spam pages from their indices but also because the identified pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.</p>
<h2>Using Statistics to Detect Search Engine Spam</h2>
<p>An example of applying statistical methods to detect web spam is presented in the paper <em>&#8220;Spam, Damn Spam and Statistics&#8221;</em> by <strong>Dennis Fetterly</strong>, <strong>Mark Manasse</strong> and <strong>Marc Najork</strong> from <em>Microsoft</em>. They used two sets of pages downloaded from the Internet. The first set (Set 1) was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.</p>
<p>The second set (Set 2) was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links; for the HTTP redirects, the source and the target URL. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).</p>
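<p>The sample figures quoted above can be checked with a few lines of Python. This is only a sketch: it assumes the margin of error comes from the usual normal approximation for a binomial proportion, which the article does not state, and the function name <em>spam_share</em> is made up for illustration.</p>

```python
import math

def spam_share(spam_pages, sample_size, z=1.96):
    """Return (proportion, margin of error) for a manually inspected sample.

    Assumption: normal approximation for a binomial proportion;
    z=1.96 corresponds to a 95% confidence level.
    """
    p = spam_pages / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

p1, m1 = spam_share(61, 751)   # sample taken from the first set
p2, m2 = spam_share(37, 535)   # sample taken from the second set
print(f"First sample:  {p1:.1%} +/- {m1:.2%}")
print(f"Second sample: {p2:.1%} +/- {m2:.2%}")
```

<p>Under this assumption the first sample gives 8.1% with a margin of roughly 1.95%, which agrees with the figures quoted from the paper.</p>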
<p>The research concentrates on studying the following properties of web pages:</p>
<ul>
<li>URL properties, including length and percentage of non-alphabetical characters      (dashes, digits, dots etc.).</li>
<li>Host name resolutions.</li>
<li>Linkage properties.</li>
<li>Content properties.</li>
<li>Content evolution properties.</li>
<li>Clustering properties.</li>
</ul>
<h2>URL Properties</h2>
<p>Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the <strong>host component</strong> only, not the entire URL down to the page name.</p>
<p><img width="670" height="452" alt="Distribution of lengths of symbolic host names" src="http://www.seoresearcher.com/images/articles/host-name-length.jpg" /></p>
<p><em>Figure 1. Distribution of lengths of symbolic host names. Source [1]</em><br />
The chart above shows the distribution of the length of the URL&#8217;s host component in Set 2. The outliers in the right part of the chart are expected to contain spam pages. However, manual inspection of the 100 longest host names revealed that 80 of them belong to adult sites and 11 refer to financial and credit-related sites. Therefore, to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.</p>
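<p>The rule described above is easy to express in code. The following Python sketch applies it to the host component; the function name, the parameter names and the example host names are hypothetical, and only the threshold values come from the text.</p>

```python
def looks_like_spam_host(host, min_length=45, min_dots=6, min_dashes=5, min_digits=10):
    """Flag a host name that is long AND rich in dots, dashes or digits.

    The default thresholds are the ones quoted in the article; changing
    them trades flagged pages against false positives.
    """
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    long_enough = len(host) >= min_length
    noisy = dots >= min_dots or dashes >= min_dashes or digits >= min_digits
    return long_enough and noisy

# Hypothetical host names, for illustration only.
short_host = "www.example.com"
dashed_host = "cheap-pills-online-best-price-discount-pharmacy.example.com"
long_plain_host = "a" * 50 + ".example.com"

for host in (short_host, dashed_host, long_plain_host):
    print(host, looks_like_spam_host(host))
```

<p>Only the second host is flagged: the first one is too short, and the third one is long but contains too few dots, dashes and digits.</p>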
<h2>Host Name Resolutions</h2>
<p>One can notice that Google, given a query <em>q</em>, tends to rank a page higher if the host component of the page&#8217;s URL contains keywords from <em>q</em>. To exploit this, search engine optimizers stuff pages with URLs whose host names contain popular keywords and keyphrases, and set up DNS servers to resolve these host names to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.</p>
<p><img width="670" height="450" alt="Distribution of number of different host-names mapping to the same IP address" src="http://www.seoresearcher.com/images/articles/host-name-resolutions.jpg" /></p>
<p><em>Figure 2. Distribution of number of different host-names mapping to the    same IP address. Source [1].</em></p>
<p>This behavior can be detected relatively easily by observing the number of host names that resolve to a single IP. <em>Figure 2</em> shows this distribution for Set 2, drawn on a <strong>logarithmic scale</strong> for better display. You can see that the majority of IP addresses are referred to by 10 or fewer host names (the points in the upper left corner of the chart). For example, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. In the lower right part of the figure you can observe some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP referred to by 8,967,154 host names (the rightmost point).</p>
<p>To flag pages as spam a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of these pages showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.</p>
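<p>Counting host names per IP address takes only a few lines. The Python sketch below assumes the crawl data is available as (host name, IP) pairs; the function name and the sample data are made up, and the toy threshold of 5 stands in for the 10,000 used in the study.</p>

```python
from collections import defaultdict

def crowded_ips(resolutions, threshold=10_000):
    """Return {ip: number of distinct host names} for IPs at or above the threshold.

    resolutions is an iterable of (host_name, ip) pairs.
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host)
    return {
        ip: len(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) >= threshold
    }

# Hypothetical resolutions: five keyword hosts on one IP, two ordinary hosts on another.
resolutions = [(f"keyword{i}.spam.example", "203.0.113.7") for i in range(5)]
resolutions += [("www.example.org", "192.0.2.10"), ("example.org", "192.0.2.10")]

flagged = crowded_ips(resolutions, threshold=5)
print(flagged)
```

<p>Using a set per IP means that a host name seen in several crawled URLs is counted only once.</p>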
<h2>Linkage Properties</h2>
<p>The Web, consisting of interlinked pages, has the structure of a <strong>graph</strong>. In graph terminology the number of outgoing links of a page is its <strong>out-degree</strong>, while its <strong>in-degree</strong> is the number of links pointing to the page. By analyzing the distributions of out- and in-degrees it is also possible to detect spam pages, which show up as outliers in those distributions.</p>
<p><img width="605" height="465" alt="Distribution of out-degrees" src="http://www.seoresearcher.com/images/articles/out-degree.jpg" /></p>
<p><em>Figure 3. Distribution of out-degrees. Source [1].</em></p>
<p><em>Figure 3</em> shows the distribution of out-degrees in Set 2 with both axes drawn on a logarithmic scale. The graph is linear over a wide range, a shape characteristic of a <strong>Zipfian</strong> distribution. Notice the outliers in the blue oval. They are out-degrees that occur more often than the Zipfian distribution predicts. For example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall 0.05% of pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution suggests, and according to manual inspection of a cross section, almost all of them are spam.</p>
<p><img width="615" height="470" alt="Distribution of in-degrees" src="http://www.seoresearcher.com/images/articles/in-degrees.jpg" /></p>
<p><em>Figure 4. Distribution of in-degrees. Source [1].</em></p>
<p>Similarly the distribution of in-degrees is drawn in <em>Figure 4</em>. Here there is an even larger portion of outliers. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.</p>
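<p>One simple way to find such outliers is to fit a straight line to the degree distribution in log-log space and flag the degrees that occur at least three times more often than the line predicts. The Python sketch below does this with an ordinary least-squares fit. The fitting method is an assumption (the article does not say how the trend was estimated), the function names are made up, and the data is synthetic.</p>

```python
import math

def fit_power_law(counts):
    """Least-squares line through the points (log degree, log page count).

    counts maps a degree (1 or more) to the number of pages with that degree.
    Returns (slope, intercept) of the fitted line in log-log space.
    """
    xs = [math.log(degree) for degree in counts]
    ys = [math.log(pages) for pages in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def degree_outliers(counts, factor=3.0):
    """Return {degree: (observed, expected)} where observed is at least factor times expected."""
    slope, intercept = fit_power_law(counts)
    flagged = {}
    for degree, observed in counts.items():
        expected = math.exp(intercept + slope * math.log(degree))
        if observed >= factor * expected:
            flagged[degree] = (observed, expected)
    return flagged

# Synthetic Zipf-like distribution with one planted spike at degree 150.
counts = {d: round(1_000_000 / d ** 2) for d in range(1, 201)}
counts[150] = 5000

flagged = degree_outliers(counts)
print(sorted(flagged))
```

<p>In this toy data only the planted spike is flagged. On real data the outliers themselves pull the fitted line slightly, so a more robust fit may be preferable.</p>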
<h2>Content Properties</h2>
<p>Despite the recent measures taken by search engines to diminish the effect    of <strong>keyword stuffing</strong>, this technique is still used by some SEOs    who generate pages filled with meaningless keywords to promote their AdSense    pages. Quite often such pages are based on a single template and even have the    same number of words which makes them especially easy to detect using statistical    methods.</p>
<p><img width="610" height="430" alt="Variance of the word count of all pages served up by a single host" src="http://www.seoresearcher.com/images/articles/word-count-variance.jpg" /></p>
<p><em>Figure 5. Variance of the word count of all pages served up by a single host. Source [1].</em></p>
<p>For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count across pages downloaded from a given host name. The variance is plotted on the x-axis and the word count is shown on the y-axis; both axes are drawn on a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 &#8220;OK&#8221;).</p>
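<p>The zero-variance case described above can be sketched in Python as follows. The input format, the function name and the host names are assumptions made for illustration; the minimum of 10 pages is the value quoted in the text.</p>

```python
from collections import defaultdict
from statistics import pvariance

def uniform_hosts(pages, min_pages=10):
    """Return hosts with at least min_pages pages that all share one word count.

    pages is an iterable of (host, word_count) pairs.
    """
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(
        host
        for host, counts in counts_by_host.items()
        if len(counts) >= min_pages and pvariance(counts) == 0
    )

# Hypothetical crawl records: one templated host and one ordinary host.
pages = [("template.example", 312)] * 12
pages += [("blog.example", n) for n in (120, 845, 97, 430, 2210, 64, 388, 1502, 271, 903)]

flagged = uniform_hosts(pages)
print(flagged)
```

<p>As the manual review in the study shows, hosts found this way are only candidates: many of them turn out to be soft error pages rather than spam.</p>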
<h2>Content Evolution</h2>
<p>The natural evolution of content on the Web is slow. Over a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam web pages are generated in response to an HTTP request independently of the requested URL and change completely on every download. Therefore, by looking into extreme cases of content mutation, search engines are able to detect web spam.</p>
<p><img width="605" height="440" alt="Average change week over week of all pages served up by a given IP address" src="http://www.seoresearcher.com/images/articles/content-evolution.jpg" /></p>
<p><em>Figure 6. Average change week over week of all pages served up by a given IP address. Source [1].</em></p>
<p>For this purpose the graph shown in <em>Figure 6</em> was drawn for Set 1. The vertical axis shows the number of pairs of successful downloads from a single IP. The two downloads in each pair were performed one week apart to capture the content changes. The horizontal axis represents the degree of content change: 85 means no change, 0 means total change. The outliers are marked in blue and represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page counted as a false positive.</p>
<h2>Clustering Properties</h2>
<p>Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same model and have only minor differences (like inserting varying keywords into a template). Pages with such properties can be detected by applying <strong>cluster analysis </strong>to our samples.</p>
<p>To form clusters of similar pages the <strong>&#8216;shingling&#8217;</strong> algorithm described by <em>Broder et al.</em> [2] is used. <em>Figure 7</em> shows the distribution of cluster sizes of near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.</p>
<p><img width="617" height="475" alt="Distribution of sizes of clusters of near-duplicate documents" src="http://www.seoresearcher.com/images/articles/cluster-size.jpg" /></p>
<p><em>Figure 7. Distribution of sizes of clusters of near-duplicate documents. Source [1].</em></p>
<p>The outliers can be put into two groups. The group marked in red did not contain any spam pages; pages in this group are more related to the duplicate content issue. The group marked in blue, on the other hand, is populated predominantly by spam documents. 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).</p>
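<p>The idea behind shingling can be illustrated with a short Python sketch. It is a simplified version: it compares complete sets of word shingles directly, whereas production implementations work on compact sketches of those sets. The function names, the shingle width of 3 and the sample texts are assumptions made for illustration.</p>

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word sequences) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(text_a, text_b, w=3):
    """Shared shingles divided by all distinct shingles of the two texts."""
    a = shingles(text_a, w)
    b = shingles(text_b, w)
    if not a and not b:
        return 1.0
    return len(a.intersection(b)) / len(a.union(b))

# Two pages built from one template with a swapped keyword, plus an unrelated page.
page_a = "buy cheap watches online today with free shipping and great discount prices"
page_b = "buy cheap laptops online today with free shipping and great discount prices"
page_c = "notes from a weekend hiking trip in the mountains with some photos"

print(resemblance(page_a, page_b))
print(resemblance(page_a, page_c))
```

<p>Pages whose resemblance exceeds a chosen threshold can then be grouped into the same near-duplicate cluster. The two template pages share most of their shingles, while the unrelated page shares none.</p>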
<h2>To Sum Up</h2>
<p>The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on <strong>machine learning</strong> technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources and not on technical tricks.</p>
<h2>References:</h2>
<p>1. Dennis Fetterly, Mark Manasse, Marc Najork. &#8220;Spam, Damn Spam, and Statistics:    Using statistical analysis to locate spam web pages&#8221; (2004). Microsoft    Research. Available at: <a target="_blank" href="http://research.microsoft.com/%7Enajork/webdb2004.pdf">http://research.microsoft.com/~najork/webdb2004.pdf</a></p>
<p>2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. &#8220;Syntactic Clustering    of the Web&#8221;. In 6th International World Wide Web Conference, April 1997.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/search-engines-battle-against-seo-spam-statistical-methods.htm/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Distribution of Clicks on Google&#8217;s SERPs</title>
		<link>http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm</link>
		<comments>http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm#comments</comments>
		<pubDate>Fri, 27 Oct 2006 00:29:44 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm</guid>
		<description><![CDATA[What is the distribution of clicks on a search engine results page? What percentage of clicks does each search result get according to its rank? How much more of the users&#8217; attention does the first listing get compared to the second? Or how often do users click the listing below [...]]]></description>
			<content:encoded><![CDATA[<p><img width="220" height="153" align="left" alt="Eye-tracking study" src="http://www.seoresearcher.com/images/articles/eye-tracking.jpg" />What is the distribution of clicks on a search engine results page? What percentage of clicks does each search result get according to its rank? How much more of the users&#8217; attention does the first listing get compared to the second? Or how often do users click a listing below the page fold? The way users interact with SERPs is one of the most frequently discussed topics in the SEO community and is also a very important field of study for search engine specialists. To answer the above questions researchers employ so-called eye tracking experiments.</p>
<p><span id="more-36"></span></p>
<h3>Eye-Tracking Studies</h3>
<p>The objective of eye tracking studies is gaining insight into how users    browse the presented abstracts and select links to click. The results of eye    tracking research provide Internet marketers with information on clickthrough    rates, thus allowing them to make correct predictions on traffic changes as    their rankings are gained or lost. For SE engineers the results provide a basis    for improving the interfaces of search engines and metrics to evaluate the relevancy    of the presented search results.</p>
<p>To detect users&#8217; interaction patterns, an eye tracking experiment observes a number of indicators of ocular behavior using a CCD (charge-coupled device) camera similar to the device used to read bar codes. The indicators of ocular behavior include eye fixations, saccades, scan paths and pupil dilation. Eye fixations are defined as a stable gaze lasting 200-300 milliseconds, representing visual attention to a specific area of a SERP. Pupil dilations, or changes in pupil diameter, are a measure of interest in a particular listing. This variable is especially important as it helps interpret implicit user feedback on the relevancy of the presented search results.</p>
<h3>Cornell University Eye-Tracking Analysis of SE Users&#8217; Behavior</h3>
<p>One of the most recent eye tracking studies was performed at Cornell University by Laura A. Granka, Thorsten Joachims and Geri Gay ([1]). They used a sample of undergraduate students instructed to search Google for 397 queries on topics covering movies, travel, music, politics, local and trivia. This study produced the following results.</p>
<p><img width="451" height="420" alt="Google Click Distribution map" src="http://www.seoresearcher.com/images/articles/click-distribution-serp.jpg" /></p>
<p><em>Fig 1. Google SERP Click and Attention distribution &#8216;heat-map&#8217;</em></p>
<h3>Study Results: Clicks and Attention Distribution</h3>
<p>As you can see from the graph below and the SERP &#8216;heat-map&#8217; based on it, the first two listings capture over half of the user&#8217;s attention in terms of eye fixation time. Whereas the attention is shared almost equally between them, the difference in the number of clicks between the first two listings is much more surprising: over four times! After the second listing the eye fixation drops sharply. Search results number 6 to 10 receive roughly equal attention. Interestingly, the 7th listing gets less attention than the succeeding 8th &#8211; apparently here we can observe the effect of the page fold. The 7th listing is just below the screen edge and is often skipped as users scroll the page down to the bottom (during the study the 7th listing was clicked only once). On the graph you can also see the 11th listing from the second page of the search results. It gets only about 1 percent of clicks and user attention &#8211; 2.5 times less than the lowest ranked result on page one.</p>
<p><img width="503" height="278" alt="Click and attention distribution" src="http://www.seoresearcher.com/images/articles/click-distribution-serp-2.jpg" /></p>
<p><em>Fig 2. Time spent on viewing each results compared to the number of clicks.    Source [1]</em></p>
<p>People often consider getting into the &#8216;top-ten&#8217; of Google a measure of SEO success. Evidently this is a rather rough approximation. The &#8216;top-ten&#8217; itself is a very diverse group, with the number of clicks increasing almost logarithmically as your rank improves. For instance, the first five positions get over 88% of the traffic, and the first three &#8211; 79%.</p>
<h3>SERP Browsing Patterns</h3>
<p>Another important result of this study is the discovery of the browsing pattern: the way people read a SERP. To assess the performance of the search algorithm it is vital to know how users evaluate the presented abstracts before clicking one of them. For example, if a user clicks the third listing, did they look at the abstracts above and below it? The following figure shows how many results above and below the selected listing are scanned on average.</p>
<p><img width="478" height="292" alt="Browsing pattern" src="http://www.seoresearcher.com/images/articles/click-distribution-serp-3.jpg" /></p>
<p><em>Fig 3. Number of results scanned above and below the selected abstract. Source [1]</em></p>
<p>The effect of the page fold is clearly demonstrated here as well. While the first 5 listings are clicked after browsing through 1 to 2.68 listings above and below, the 7th listing is clicked after the entire page is examined! The listings below the page fold (8-10) are clicked after the first four or five listings are scanned. You can also see that the number of listings scanned above the clicked result is much larger than the number of listings below. This indicates that users browse the list from top to bottom.</p>
<h3>To Sum Up</h3>
<p>While the study deals only with the first page of the organic search results,    it can be assumed that similar results can be produced for other pages and perhaps    even for the list of the paid ads in the right sidebar.</p>
<p>In addition to academic research, a number of companies produce eye-tracking studies for commercial use. The most notable of them are <a target="_blank" href="http://www.eyetools.com">Eyetools.com</a> and Poynterextra (<a target="_blank" href="http://www.poynterextra.org/EYETRACK2004/index.htm">http://www.poynterextra.org/EYETRACK2004/index.htm</a>).</p>
<h3>References:</h3>
<p>1. Laura A. Granka, Thorsten Joachims, Geri Gay. &#8216;Eye-tracking analysis of    user behavior in WWW search&#8217;, SIGIR, 2004. Available at <a target="_blank" href="http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf">http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf</a>    Retrieved on 26.10.06</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm/feed</wfw:commentRss>
		<slash:comments>143</slash:comments>
		</item>
		<item>
		<title>The 5 Myths about Google PageRank</title>
		<link>http://www.seoresearcher.com/the-5-myths-of-google-pagerank.htm</link>
		<comments>http://www.seoresearcher.com/the-5-myths-of-google-pagerank.htm#comments</comments>
		<pubDate>Fri, 06 Oct 2006 22:10:34 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Search Engine Optimization]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/the-5-myths-of-google-pagerank.htm</guid>
		<description><![CDATA[The recent Toolbar PageRank update has once again generated a lot of discussion in the SEO community. Webmasters report that their websites receive little additional traffic despite the increased visible PageRank. In numerous forum threads people question the reliability of toolbar values. By unveiling the following five myths I hope to address some of [...]]]></description>
			<content:encoded><![CDATA[<p>The recent Toolbar PageRank update has once again generated a lot of discussion in the SEO community. Webmasters report that their websites receive little additional traffic despite the increased visible PageRank. In numerous forum threads people question the reliability of toolbar values. By unveiling the following five myths I hope to address some of the uncertainties caused by this update. <span id="more-32"></span></p>
<p><strong>1. PageRank values range from 0 to 10.</strong></p>
<p>While some people believe that PageRank is an integer, or at least converges to an integer after intensive recursive calculations, it is actually a floating point number. Google rounds the real value to the closest integer on the 0-10 scale, and that is the value displayed in your browser toolbar.</p>
<p><strong>2. PageRank value displayed in the toolbar is the one used to rank    the results.</strong></p>
<p>As you might have noticed, the toolbar value is updated every few months at irregular intervals. At present Google calculates and updates PageRank continuously, so the actual PageRank and its toolbar value can sometimes differ. The toolbar value should be considered not as the current rank but as the level your page had reached by the time of the latest toolbar update.</p>
<p><strong>3. PageRank is the primary factor to rank the search results.</strong></p>
<p>Not exactly. PageRank was the backbone of Google&#8217;s success as a search engine because of its integrity, its ability to use the unique democratic nature of the web and hyperlinks, and its relatively high immunity to abuse. But as years passed, Google&#8217;s technology became far more sophisticated. Now Google uses a cloud of factors to rank its search results. Some of them are query specific (keyword saturation of the page copy and the backlinks&#8217; anchor text) and some of them are domain specific (domain age, keywords in the domain name, and of course PageRank). Nobody outside Google&#8217;s offices knows the actual weight of each factor, and it is quite possible that PageRank is no longer the primary one.</p>
<p><strong>4. Google toolbar shows an increase of PageRank for my pages. My traffic    is going to skyrocket!</strong></p>
<p>Wrong. There won&#8217;t be any sudden traffic increase after toolbar updates any more. As I said before, the continuous calculation and update of Google&#8217;s internal PageRank means that the rankings also adjust gradually as your pages gain or lose backlinks. So the toolbar update itself will not cause any changes in search results.</p>
<p><strong>5. Toolbar PageRank is of no use, it is just for entertainment.</strong></p>
<p>This is allegedly a quote from a Google representative, and it is only partially true. The reason Google doesn&#8217;t show the actual PageRank any more is that there have been repeated attempts by hackers to access and exploit these data. Since 2004 the toolbar value updates are no longer synchronized with the actual ranking changes, and therefore should not be taken too seriously in terms of SEO. However, the toolbar rank remains the easiest and most obvious way to evaluate the quality of a page, and millions of web users regularly judge websites according to what the Google toolbar shows them.</p>
<p>See also the following resources:</p>
<p><a target="_blank" href="http://www.mattcutts.com/blog/more-info-on-pagerank/">Matt Cutts explains PageRank</a></p>
<p><a target="_blank" href="http://www.cre8asiteforums.com/forums/index.php?showtopic=41576">Can Google Multitask?</a></p>
<p><a target="_blank" href="http://forums.searchenginewatch.com/showthread.php?t=3054">Google says: Toolbar PageRank is for entertainment purposes only</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/the-5-myths-of-google-pagerank.htm/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Authority Threshold Algorithm</title>
		<link>http://www.seoresearcher.com/authority-threshold-algorithm.htm</link>
		<comments>http://www.seoresearcher.com/authority-threshold-algorithm.htm#comments</comments>
		<pubDate>Wed, 19 Jul 2006 11:20:04 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Link Popularity Algorithms]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/blog/2006/07/19/authority-threshold-algorithm/</guid>
		<description><![CDATA[Authority Threshold Algorithm (AT(k))
The idea behind AT(k) Algorithm is using only k highest authority weights instead of calculating average weight from every authority pointed by a hub. The parameter k  is called authority threshold. A variant of an AT algorithm is MAX algorithm, where k=1, i.e. a hub is as good as the best [...]]]></description>
			<content:encoded><![CDATA[<h2>Authority Threshold Algorithm (AT(k))</h2>
<p>The idea behind the <strong>AT(k) Algorithm</strong> is to use only the <em>k</em> highest authority weights instead of calculating the average weight over every authority pointed to by a hub. The parameter <em>k </em> is called the <em>authority threshold</em>. A variant of the AT algorithm is the <strong>MAX </strong>algorithm, where <em>k</em>=1, i.e. a hub is as good as the best authority it links to.</p>
<p>In general the AT(k) algorithm uses the same formula as <a title="HITS Algorithm" href="http://www.seoresearcher.com/link-analysis-algorithms-hits.htm"><strong>HITS</strong></a>. The difference is that when calculating the weight of a hub we consider the top <em>k</em> authorities only, i.e. <em>Fk(i) </em>is a subset of the outgoing links <em>F(i)</em>. If the number of outgoing links <em>|F(i)| </em>is less than or equal to <em>k</em>, then the AT(k) algorithm works exactly the same as HITS.</p>
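<p>The hub update of AT(k) can be sketched in a few lines of Python. This shows a single hub update only, without the normalization and the repeated iterations a full implementation needs; the function name and the sample weights are made up for illustration.</p>

```python
import heapq

def at_k_hub_weight(authority_weights, k):
    """Sum of the k highest authority weights among a hub's outgoing links.

    With k at least as large as the number of links this is the plain
    HITS hub update; with k=1 it is the MAX algorithm.
    """
    return sum(heapq.nlargest(k, authority_weights))

# Hypothetical authority weights of the pages one hub links to.
weights = [0.5, 0.0625, 0.25, 0.125]

print(at_k_hub_weight(weights, k=2))    # two best authorities only
print(at_k_hub_weight(weights, k=1))    # MAX variant
print(at_k_hub_weight(weights, k=10))   # k exceeds the link count, same as HITS
```

<p>Because low-weight authorities beyond the top <em>k</em> are ignored, adding a link to a poor authority does not change the hub weight.</p>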
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/authority-threshold-algorithm.htm/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Link Analysis Algorithms: HUBAVG</title>
		<link>http://www.seoresearcher.com/link-analysis-algorithms-hubavg.htm</link>
		<comments>http://www.seoresearcher.com/link-analysis-algorithms-hubavg.htm#comments</comments>
		<pubDate>Wed, 19 Jul 2006 11:17:05 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Link Popularity Algorithms]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/blog/2006/07/19/link-analysis-algorithms-hubavg/</guid>
		<description><![CDATA[HUBAVG Algorithm
To overcome the shortcoming of the HITS algorithm, in which a hub gets a high weight when it points to numerous low-quality authorities, the following refinement was suggested. While the same formula is used to calculate authority weights, the hub score h is now averaged over the number of outgoing links &#124;F(i)&#124;:

So in order to achieve [...]]]></description>
			<content:encoded><![CDATA[<h2>HUBAVG Algorithm</h2>
<p>To overcome the shortcoming of the HITS algorithm, in which a hub gets a high weight when it points to numerous low-quality authorities, the following refinement was suggested. While the same formula is used to calculate authority weights, the hub score <em>h</em> is now averaged over the number of outgoing links <em>|F(i)|</em>:</p>
<p><img alt="HUBAVG weights calculation" title="HUBAVG weights calculation" src="http://www.seoresearcher.com/images/link-algorithms/hubavg.gif" /></p>
<p>So in order to achieve a high weight, a hub should link to good authorities. Unfortunately this approach has its own flaw. Consider two hubs pointing to an equal number of equally good authorities. The two hubs are identical until one of them adds a link to a low-quality authority. The average weight of the authorities it points to drops, and the hub is penalized. This is counterintuitive, but it can be fixed by using the so-called Authority Threshold Algorithm.</p>
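<p>The flaw is easy to reproduce with a small Python sketch. The function name, page ids and weights below are hypothetical and serve only to show the averaging step; the rest of the iteration is omitted.</p>

```python
def hubavg_hub_weight(authority_weights, out_links):
    """Average authority weight over the pages a hub links to."""
    if not out_links:
        return 0.0
    return sum(authority_weights[page] for page in out_links) / len(out_links)


# Hypothetical weights: two good authorities and one low-quality one.
authority = {"good1": 0.8, "good2": 0.8, "low": 0.1}

hub_one = hubavg_hub_weight(authority, ["good1", "good2"])
hub_two = hubavg_hub_weight(authority, ["good1", "good2", "low"])
```

<p>Both hubs link to the same two good authorities, yet <code>hub_two</code> ends up with a lower weight only because of the extra link.</p>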
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/link-analysis-algorithms-hubavg.htm/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Link Analysis Algorithms: HITS</title>
		<link>http://www.seoresearcher.com/link-analysis-algorithms-hits.htm</link>
		<comments>http://www.seoresearcher.com/link-analysis-algorithms-hits.htm#comments</comments>
		<pubDate>Tue, 18 Jul 2006 17:42:41 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Link Popularity Algorithms]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/blog/2006/07/18/link-analysis-algorithms-hits/</guid>
		<description><![CDATA[HITS Algorithm
This algorithm was first described by Jon Kleinberg in his work &#8220;Authoritative Sources in a Hyperlinked Environment&#8221; (1998). The idea behind the HITS (Hyperlink-Induced Topic Search) algorithm is that the authorities and hubs mutually reinforce each other. Authority weight of a page is calculated as a sum of hub weights pointing to it, [...]]]></description>
			<content:encoded><![CDATA[<h2>HITS Algorithm</h2>
<p>This algorithm was first described by <strong>Jon Kleinberg</strong> in his work &#8220;<em>Authoritative Sources in a Hyperlinked Environment</em>&#8221; (1998). The idea behind the <strong>HITS</strong> (Hyperlink-Induced Topic Search) algorithm is that authorities and hubs mutually reinforce each other. The authority weight of a page is calculated as the sum of the weights of the hubs pointing to it, and the weight of a hub as the sum of the weights of the authorities it points to. In other words, a hub is as good as the authorities it links to, and vice versa.<span id="more-11"></span></p>
<p>The notation of the algorithm is as follows. Let S be the set of pages for which hub and authority weights are being calculated, and n the number of pages in the set. Then H is the subset of S containing pages acting as hubs, and A is the subset of S containing authorities. Since each page can be both an authority and a hub, A and H overlap. For every page i in its hub role, F(i) is the set of pages it links to (its outgoing links). For every page i in its authority role, B(i) is the set of pages that link to it (its incoming links). The n-dimensional vector of authority weights is denoted as a, and the vector of hub weights as h. Then hub and authority weights are calculated by the following formula:</p>
<p><img title="HITS Algorithm calculation of weights" alt="HITS Algorithm calculation of weights" src="http://www.seoresearcher.com/images/link-algorithms/hits-weights.gif" /></p>
<p>The process is iterative. First all the weights are set to 1. Then the hub and authority weights are calculated and the vectors are normalized. This step is repeated until the vectors a and h converge.</p>
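<p>A minimal Python sketch of this iteration is given below. It makes a few assumptions that the text does not specify: the vectors are normalized to unit Euclidean length, a fixed number of iterations stands in for a convergence test, and the function names and the toy graph are invented for the example.</p>

```python
import math


def normalize(weights):
    """Scale a weight vector to unit Euclidean length (an assumed choice of norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(out_links, iterations=50):
    """Run a fixed number of HITS iterations; out_links maps page -> linked pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # All weights start at 1.
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages linking to it.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub weight: sum of the authority weights of the pages it links to.
        new_hub = {
            page: sum(new_auth[t] for t in out_links.get(page, ()))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical graph: h1 links to a1 and a2, h2 links to a1 only.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(graph)
```

<p>In this toy graph <code>a1</code> receives links from both hubs and ends up with a higher authority weight than <code>a2</code>, while <code>h1</code> is the stronger hub because it links to both authorities.</p>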
<p>The algorithm, however, has a number of flaws. For example, the nature of mutual reinforcement creates the following situation. Consider a hub that points to many authorities (hub B in the picture below), and a number of hubs pointing to a single authority (authority A in the picture). If the number of authorities pointed to by B is larger than the number of hubs pointing to A, then the HITS algorithm will allocate all the weight to the authorities in the right part of the picture, and the authority A will get a weight near zero.</p>
<p><img title="HITS Algorithm faults" alt="HITS Algorithm faults" src="http://www.seoresearcher.com/images/link-algorithms/hits-faults.jpg" /></p>
<p>The reason is that hub B will initially get a very high score and propagate it to the authorities it links to. At the same time the hubs on the left side will get very low scores, and consequently A will get a low weight too, although it obviously deserves more.</p>
<h2>Cited Resources</h2>
<ul>
<li>Kleinberg, J. May 1997, &#8216;<a title="Authoritative sources in a hyperlinked environment" target="_blank" href="http://citeseer.ist.psu.edu/article/kleinberg98authoritative.html">Authoritative sources in a hyperlinked environment</a>&#8217;. Technical Report RJ 10076, IBM. Available at http://citeseer.ist.psu.edu/article/kleinberg98authoritative.html</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/link-analysis-algorithms-hits.htm/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Topic-Sensitive PageRank</title>
		<link>http://www.seoresearcher.com/topic-sensitive-pagerank.htm</link>
		<comments>http://www.seoresearcher.com/topic-sensitive-pagerank.htm#comments</comments>
		<pubDate>Mon, 17 Jul 2006 18:15:51 +0000</pubDate>
		<dc:creator>oleg.ishenko</dc:creator>
				<category><![CDATA[Link Popularity Algorithms]]></category>
		<category><![CDATA[Search Engines Technology]]></category>

		<guid isPermaLink="false">http://www.seoresearcher.com/blog/2006/07/17/topic-sensitive-pagerank/</guid>
		<description><![CDATA[The link structure of the Web is highly sensitive to page topic. Pages tend to contain links pointing to other pages on the same broad topic, e.g. pages on investment banking often link to other business-related resources but rarely to sports portals. While using offline PageRank scores has an advantage of faster processing, it also [...]]]></description>
			<content:encoded><![CDATA[<p>The link structure of the Web is highly sensitive to page <strong>topic</strong>. Pages tend to contain links pointing to other pages on the same broad topic, e.g. pages on investment banking often link to other business-related resources but rarely to sports portals. While using offline PageRank scores has the advantage of faster processing, it also creates a situation where some highly linked pages receive a higher ranking on topics for which they have no authority. A query-time adjustment of the scoring function is necessary to refine the search results. Some algorithms, like <strong>HITS</strong> and <strong>Hilltop</strong>, allow such an adjustment. However, these algorithms have their own shortcomings that restrict their efficient use by search engines.</p>
<p>The <strong>HITS</strong> algorithm calculates <em>hubs</em> and <em>authorities</em> at query time but relies on a relatively small subset of the Web &#8211; the immediate neighborhood of a page, since otherwise computation time would be unacceptably long. The <strong>Hilltop</strong> algorithm analyses a query and calculates score values by finding pages that seem to be experts in the query-specific topic. This algorithm restricts itself to popular queries, since it can&#8217;t produce score values when no experts for an uncommon search term are found.</p>
<p><strong>Topic-Sensitive PageRank</strong> extends the original PageRank idea by adding a <em>query-time topic-sensitive adjustment</em>.<span id="more-10"></span> Instead of a single vector of PageRank values, multiple topic-specific PageRank vectors are calculated. Creating a PageRank vector for every possible topic would require extensive resources, so in practice the algorithm uses only 16 topic-specific ranking vectors representing the top categories of the <a title="Open Directory Project" href="http://dmoz.org/" target="_blank">ODP</a> project. Other sources of topics can be used for this purpose as well, but since the ODP is created and edited by a large number of independent volunteers, it is the least likely to be influenced by any one party. For each page on the Web a set of importance scores with respect to the various topics is precomputed and stored offline. At query time the topic-specific score is combined with other scores (e.g. content analysis) to form the final ranking for a page.</p>
<p>Let <em>c<sub>j</sub> </em>be one of the <a title="ODP Top-Level Categories" href="http://dmoz.org/" target="_blank">16 top-level</a> ODP categories. For each topic <em>c<sub>j</sub></em> it is necessary to calculate a biased PageRank vector. Let <em>M </em>be a <a title="Wiki: Modofied Adjancecny Matrix" href="http://en.wikipedia.org/wiki/Modified_adjacency_matrix" target="_blank">modified adjacency matrix</a>. Each element <em>m<sub>ji</sub> </em>has the value <em>1/N<sub>j</sub></em> if there is a link from <em>j</em> to <em>i</em>, where <em>N<sub>j</sub></em> is the number of outgoing links on page <em>j</em>; otherwise the element value is 0. Then the original PageRank formula in matrix notation looks as follows:</p>
<p><img title="PageRank formula in a matrix notation" src="http://www.seoresearcher.com/images/link-algorithms/pagerank-matrix-notation.gif" alt="PageRank formula in a matrix notation" /></p>
<p>Parameter <em>&#945;</em> here is the damping factor, which equals <em>1-d</em> in the original PageRank formula. The resulting vector of PageRank values is denoted as <em>PR(&#945;, p)</em>.</p>
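<p>The computation can be sketched as a simple power iteration in Python. The sketch assumes that the formula has the usual form, in which a page's rank is <em>(1-&#945;)</em> times the rank it receives over links plus <em>&#945;</em> times its entry in the damping vector <em>p</em>. It also assumes that every page has at least one outgoing link and that every link target is itself a key of the graph. The function name, the default <code>alpha=0.15</code> and the toy graph are illustrative choices, not taken from the article.</p>

```python
def pagerank(out_links, alpha=0.15, p=None, iterations=100):
    """Power iteration for rank = (1 - alpha) * M * rank + alpha * p (assumed form)."""
    pages = sorted(out_links)
    if p is None:
        # Uniform damping vector: every page gets the same share.
        p = {page: 1.0 / len(pages) for page in pages}
    rank = dict(p)
    for _ in range(iterations):
        new_rank = {page: alpha * p[page] for page in pages}
        for j in pages:
            # Page j passes an equal share 1/N_j of its rank along each link.
            share = (1.0 - alpha) * rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph in which every page has an outgoing link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(graph)
```

<p>Because <code>p</code> is an argument, the same routine also accepts a non-uniform damping vector in place of the uniform default.</p>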
<p>With minor modifications the same formula is used to calculate the topic-sensitive ranking vectors. Let <em>T<sub>j</sub> </em>be the set of pages under a topic <em>c<sub>j</sub></em>. Then instead of the uniform damping vector <em>p</em>, a non-uniform vector <em>v<sub>j</sub></em> is used, where:</p>
<p><img title="Non-uniform damping factor" src="http://www.seoresearcher.com/images/link-algorithms/dampingfactor.gif" alt="Non-uniform damping factor" /></p>
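<p>The formula in the image is not reproduced in this text. Assuming it is the definition from Haveliwala's paper cited below, where a page in <em>T<sub>j</sub></em> gets the value <em>1/|T<sub>j</sub>|</em> and every other page gets 0, the vector can be built as follows. The function name and page ids are hypothetical.</p>

```python
def topic_bias_vector(all_pages, topic_pages):
    """Return v_j: 1/|T_j| for pages in the topic set T_j, 0 for every other page."""
    topic_set = set(topic_pages) & set(all_pages)
    if not topic_set:
        raise ValueError("topic set T_j must contain at least one known page")
    weight = 1.0 / len(topic_set)
    return {page: (weight if page in topic_set else 0.0) for page in all_pages}


# Hypothetical index of four pages, two of which are listed under the topic.
v_j = topic_bias_vector(["a", "b", "c", "d"], ["a", "c"])
```

<p>The entries of the vector sum to 1, so it can take the place of the uniform vector <em>p</em> without any other change to the computation.</p>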
<p>The resulting PageRank vector is denoted as <em>PR(&#945;, v<sub>j</sub>)</em>. Additionally, using all the documents under each topic <em>c<sub>j</sub></em>, a term vector <em>D<sub>j</sub> </em>is constructed. The term vector elements <em>D<sub>jt</sub></em> are the numbers of occurrences of every term under topic <em>c<sub>j</sub>.</em> To detect the topic to which a search query relates, two scenarios are considered. In the first scenario a user highlights a keyword in a page and initiates a search. In this case the search topic is defined by the page content. For example, if the word &#8216;architecture&#8217; is highlighted in a page about famous buildings, pages on CPU architecture should not appear among the search results. So if a term <em>q</em> is highlighted in some page <em>u</em>, its context <em>q&#8217;</em> would be the words in <em>u</em>. In the second scenario a user enters a keyword into a search form in the conventional way. In this case the context of the query <em>q</em> is the search term itself:<em> q&#8217; = q</em>. When a history of search terms is kept, it is also possible to use it as the context <em>q&#8217;</em>. At query time the proximity of the query context <em>q&#8217; </em>to each of the topics <em>c<sub>j</sub></em> is calculated:</p>
<p><img title="Topic proximity value" src="http://www.seoresearcher.com/images/link-algorithms/proximity.gif" alt="Topic proximity value" /></p>
<p>The proximity value <em>P(q&#8217;|c<sub>j</sub>)</em> is calculated using the term vectors <em>D<sub>j</sub></em>. Then query-sensitive scores are computed for every page <em>d</em> in the index that contains the original search term <em>q</em>. For each document <em>d</em> we sum up the products of the topic proximity values <em>P(c<sub>j</sub>|q&#8217;)</em> and the page rank <em>r<sub>d</sub></em>. The rank <em>r<sub>d</sub></em> is the element for document <em>d</em> in the PageRank vector <em>PR(&#945;, v<sub>j</sub>)</em> of the corresponding topic <em>c<sub>j</sub></em>:</p>
<p><img title="Sorting value" src="http://www.seoresearcher.com/images/link-algorithms/sortingvalue.gif" alt="Sorting value" /></p>
<p>The final search results are sorted by the values of <em>s<sub>d</sub></em>. Since the PageRank calculations are performed in advance, the algorithm is able to perform the topic adjustment quickly at query time.</p>
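<p>The query-time step can be sketched in Python as follows. Since the formulas in the images are not reproduced here, the sketch assumes a simple unigram model: <em>P(q&#8217;|c<sub>j</sub>)</em> is the product of relative term frequencies taken from <em>D<sub>j</sub></em>, all topics have the same prior probability, and the topic probabilities are normalized to sum to 1. The fallback to a uniform distribution when no topic matches is an assumption added for the example. All names, counts and ranks are made up.</p>

```python
def topic_probabilities(context_terms, term_counts):
    """Estimate P(c_j | q') per topic from term counts, assuming a uniform prior."""
    likelihood = {}
    for topic, counts in term_counts.items():
        total = sum(counts.values())
        prob = 1.0
        for term in context_terms:
            # Relative frequency of the term among all terms seen under the topic.
            prob *= counts.get(term, 0) / total
        likelihood[topic] = prob
    norm = sum(likelihood.values())
    if norm == 0:
        # No topic contains all the context terms: fall back to a uniform guess.
        return {topic: 1.0 / len(term_counts) for topic in term_counts}
    return {topic: value / norm for topic, value in likelihood.items()}


def query_scores(context_terms, term_counts, topic_ranks, candidates):
    """Score each candidate d as the sum over topics of P(c_j | q') * rank of d."""
    probs = topic_probabilities(context_terms, term_counts)
    return {
        d: sum(probs[t] * topic_ranks[t].get(d, 0.0) for t in probs)
        for d in candidates
    }


# Hypothetical term vectors D_j and precomputed topic-specific ranks.
term_counts = {
    "arts": {"architecture": 8, "building": 12},
    "computers": {"architecture": 5, "cpu": 15},
}
topic_ranks = {
    "arts": {"gothic-page": 0.30, "cpu-page": 0.01},
    "computers": {"gothic-page": 0.02, "cpu-page": 0.40},
}
scores = query_scores(
    ["architecture", "building"], term_counts, topic_ranks,
    ["gothic-page", "cpu-page"],
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

<p>With the context words &#8216;architecture&#8217; and &#8216;building&#8217; all the probability goes to the arts topic in this toy data, so the page about buildings is ranked first even though the CPU page has a higher rank under the computers topic.</p>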
<h2>Cited resources</h2>
<ul>
<li>Haveliwala, T.H. &#8216;<a title="Topic-sensitive PageRank" href="http://citeseer.ist.psu.edu/haveliwala02topicsensitive.html" target="_blank">Topic-sensitive PageRank</a>&#8217;. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. Available at http://citeseer.ist.psu.edu/haveliwala02topicsensitive.html</li>
<li>Haveliwala, T.H. &#8216;<a title="Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search" href="http://citeseer.ist.psu.edu/rd/83310218%2C578979%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/27801/http:zSzzSzwww.stanford.eduzSz%7EtaherhzSzpaperszSztopic-sensitive-pagerank-tkde.pdf/haveliwala03topicsensitive.pdf" target="_blank">Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search</a>&#8217;. IEEE Trans. Knowl. Data Eng., 15(4):784&#8211;796, 2003. Available at http://citeseer.ist.psu.edu/article/haveliwala03topicsensitive.html</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.seoresearcher.com/topic-sensitive-pagerank.htm/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
