SEO Articles
How Much Blog Spam? A Study of a Ping Dataset
February 12th, 2007
How much blog spam is produced in five minutes on a quiet Sunday evening? What is the ratio of spam blogs on the most popular blog services? To answer these questions, I present the results of an experiment that analyzed ping data and manually reviewed blogs.
The relative ease of creating and maintaining blogs makes them ideal tools for spamming search engines. Spam blogs, or splogs, serve two basic purposes: making money from advertising and affiliate programs, and participating in link farms. But making money from AdSense and providing nepotistic links are not enough to make a blog a splog. Otherwise we would have to classify all blogs that show ads or promote a business as spam, and there are thousands of popular, quality blogs that would fall into this category. The distinctive feature of a splog is that it is of no use to its visitors. If Google banned a splog from AdSense and prevented its links from passing on authority, the splog would have no remaining value or reason to exist. So my definition of a splog would be “a blog with the only purpose of showing contextual or affiliate ads, or boosting link popularity of certain target sites”.
Google’s New Algorithm to Rank Pages and Detect Spam: “PhraseRank”?
January 6th, 2007
Will the system described in Google’s recent patent become a new ranking algorithm that augments the existing PageRank?
PhraseRank
From the very beginning, Google’s distinctive feature was its hyperlink-induced popularity ranking. Algorithms that use text content to evaluate the relevancy of web documents played a much lesser role. The reasons for this disparity are purely pragmatic: authors of web documents have total control over their content and are at liberty to modify it to deceive ranking algorithms and get higher positions in search results. Hyperlinks, however, are much less influenced by webmasters and provide a more reliable measure of authority (link weight) and relevance (link anchor).
Now Google introduces a new way to evaluate the relevancy of a web document based on its content, one which might prove immune to manipulation attempts. Interested? Read on!
Duplicate Content - What You Ought to Know About
November 22nd, 2006
Take a look at your website. How much of your content might be considered duplicate by a search engine algorithm? Even if you never copy anyone, you can’t answer ‘none’, because someone may be copying you. Duplicate content is one of the biggest issues both for search engines trying to keep their results relevant and for webmasters trying to avoid search engine penalties.
Penalties for duplicate content can be really harmful. The result is not just a downgrade in rankings but a move to the supplemental results, which are hardly visible to most web users. Normally one would expect Google to select one URL over another to display in SERPs, while the duplicates could be found in the supplemental results. Unfortunately this is not always so. In this thread [1] of the WebmasterWorld forum you can read about a case in which an original, high-quality and authoritative page was removed from Google’s index together with its duplicates. Considering that this can happen even to the most honest webmaster, one can imagine the amount of attention this issue gets on any SEO forum. Interested? Read on!
Search Engines vs. SEO Spam: Statistical Methods
November 13th, 2006
High placement in a search engine is critical to the success of any online business. Pages that appear higher in the results for queries relevant to a site’s business get more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results. In the best case, SEO specialists create relevant, well-structured, keyword-rich pages, which not only please a search engine crawler but also have value for the human visitor. Unfortunately, this strategic approach takes months to produce tangible results, and many search engine optimizers resort to so-called “black-hat” SEO instead. Interested? Read on!
Distribution of Clicks on Google’s SERPs
October 26th, 2006
What is the distribution of clicks on a search engine results page? What percentage of clicks does each search result get according to its rank? How much more user attention does the first listing get compared to the second? How often do users click a listing below the page fold? The way users interact with SERPs is one of the most frequently discussed topics in the SEO community and is also a very important field of study for search engine specialists. To answer the above questions, researchers employ so-called eye tracking experiments.
The 5 Myths about Google PageRank
October 6th, 2006
The recent Toolbar PageRank update has once again generated a lot of discussion in the SEO community. Webmasters report that their websites receive little additional traffic despite the increased visible PageRank. In numerous forum threads people question the reliability of toolbar values. By unveiling the following five myths I hope to address some of the uncertainties caused by this update. Interested? Read on!
Authority Threshold Algorithm
July 19th, 2006
Authority Threshold Algorithm (AT(k))
The idea behind the AT(k) algorithm is to use only the k highest authority weights instead of averaging the weights of every authority a hub points to. The parameter k is called the authority threshold. One variant of the AT algorithm is the MAX algorithm, where k=1, i.e. a hub is as good as the best authority it links to.
In general, the AT(k) algorithm uses the same formula as HITS. The difference is that when calculating the weight of a hub we consider the top k authorities only, i.e. the sum runs over Fk(i), the subset of the outgoing links F(i) that point to the k highest-weighted authorities. If the number of outgoing links |F(i)| is less than or equal to k, then the AT(k) algorithm works exactly the same as HITS.
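The hub update described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `at_k_hub_weights`, the dictionary-based graph, and the sample weights are assumptions made for the example, and the authority weights are taken as given rather than computed iteratively.

```python
def at_k_hub_weights(out_links, authority, k):
    """Hub step of AT(k): sum only the k largest authority weights a hub links to."""
    hubs = {}
    for page, targets in out_links.items():
        # Sort the authority weights of the linked pages, best first.
        weights = sorted((authority[t] for t in targets), reverse=True)
        # With len(targets) <= k this is the plain HITS sum.
        hubs[page] = sum(weights[:k])
    return hubs


# Hypothetical toy data: h1 links to three authorities, h2 to one.
out_links = {"h1": ["a", "b", "c"], "h2": ["a"]}
authority = {"a": 0.5, "b": 0.25, "c": 0.125}

print(at_k_hub_weights(out_links, authority, k=2))  # {'h1': 0.75, 'h2': 0.5}
print(at_k_hub_weights(out_links, authority, k=1))  # MAX: {'h1': 0.5, 'h2': 0.5}
```

With k=1 both hubs score the same, because each is judged only by its best authority; with k=3 or more the result equals the ordinary HITS hub sum.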
Link Analysis Algorithms: HUBAVG
July 19th, 2006
HUBAVG Algorithm
The HITS algorithm has a shortcoming: a hub gets a high weight when it points to numerous low-quality authorities. To overcome it, the following refinement was suggested. The authority weights are calculated with the same formula, but the hub score h is now averaged over the number of outgoing links |F(i)|:

$$h_i = \frac{1}{|F(i)|} \sum_{j \in F(i)} a_j$$
So in order to achieve a high weight, a hub should link to good authorities. Unfortunately this approach has a flaw of its own. Consider two hubs pointing to an equal number of equally good authorities. The two hubs are identical until one of them adds a link to a low-quality authority. The average weight of the authorities it points to then sinks, and the hub is penalized. This is quite illogical, but it can be fixed by using the so-called Authority Threshold algorithm.
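The two-hub scenario above can be reproduced with a short Python sketch. The function name `hubavg_hub_weights`, the page names, and the authority weights are illustrative assumptions; the authority weights are treated as fixed inputs to keep the example small.

```python
def hubavg_hub_weights(out_links, authority):
    """Hub step of HUBAVG: average authority weight over a hub's outgoing links."""
    hubs = {}
    for page, targets in out_links.items():
        if targets:
            hubs[page] = sum(authority[t] for t in targets) / len(targets)
        else:
            # A page with no outgoing links is not a hub at all.
            hubs[page] = 0.0
    return hubs


# Hypothetical data: two equally good authorities and one poor one.
authority = {"good1": 1.0, "good2": 1.0, "poor": 0.25}
out_links = {
    "hub_a": ["good1", "good2"],
    "hub_b": ["good1", "good2", "poor"],  # same as hub_a plus one weak link
}

print(hubavg_hub_weights(out_links, authority))  # {'hub_a': 1.0, 'hub_b': 0.75}
```

Here hub_b links to everything hub_a does, yet its score drops from 1.0 to 0.75 because of the one extra link, which is the penalty the paragraph describes.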
Link Analysis Algorithms: HITS
July 18th, 2006
HITS Algorithm
This algorithm was first described by Jon Kleinberg in his work “Authoritative Sources in a Hyperlinked Environment” (1998). The idea behind the HITS (Hyperlink-Induced Topic Search) algorithm is that authorities and hubs mutually reinforce each other. The authority weight of a page is calculated as the sum of the weights of the hubs pointing to it, and the weight of a hub as the sum of the weights of the authorities it points to. In other words, a hub is as good as the authorities it links to, and vice versa. Interested? Read on!
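The mutual reinforcement can be sketched as a simple iteration in Python. This is a minimal sketch, not a production implementation: the function names, the dictionary-based graph, the fixed iteration count, and the choice to normalize by Euclidean length after each round are assumptions of the example.

```python
import math


def normalize(weights):
    """Scale weights to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hits(out_links, iterations=50):
    """Return (authority, hub) weight dicts for a graph given as page -> linked pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub weight: sum of the authority weights of the pages it links to.
        new_hub = {p: sum(new_auth[t] for t in out_links.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: three hubs, two authorities.
graph = {"h1": ["a", "b"], "h2": ["a"], "h3": ["a", "b"]}
auth, hub = hits(graph)
print(sorted(auth, key=auth.get, reverse=True)[:2])  # ['a', 'b']
```

Page "a" ends up as the stronger authority because all three hubs link to it, and "h1" and "h3" outrank "h2" as hubs because they link to both authorities.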
Topic-Sensitive PageRank
July 17th, 2006
The link structure of the Web is highly sensitive to page topic. Pages tend to contain links pointing to other pages on the same broad topic, e.g. pages on investment banking often link to other business-related resources but rarely to sports portals. While using offline PageRank scores has the advantage of faster processing, it also creates a situation where some highly linked pages receive a higher ranking on topics for which they have no authority. A query-time adjustment of the scoring function is necessary to refine the search results. Some algorithms, like HITS and Hilltop, allow such an adjustment. However, these algorithms have their own shortcomings that restrict their efficient use by search engines.
The HITS algorithm calculates hubs and authorities at query time but relies on a relatively small subset of the Web, the immediate neighborhood of a page, since otherwise computation time would be unacceptably long. The Hilltop algorithm analyzes a query and calculates score values by finding pages that seem to be experts on the query-specific topic. This algorithm restricts itself to popular queries, since it can’t produce score values when no experts are found for an uncommon search term.
Topic-Sensitive PageRank extends the original PageRank idea by adding a query-time topic-sensitive adjustment. Interested? Read on!
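One common way to realize such an adjustment is to precompute one PageRank vector per topic offline, each biased toward that topic's pages, and to blend the vectors at query time according to how likely the query belongs to each topic. The Python sketch below follows that general scheme and is not taken from the full article; the function names, the toy graph, the topic seed sets, and the query's topic probabilities are all assumptions made for illustration.

```python
def biased_pagerank(out_links, topic_pages, damping=0.85, iterations=100):
    """PageRank whose random jumps land only on topic_pages (must be in the graph)."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: hand its rank back to the topic pages.
                for q in topic_pages:
                    new_rank[q] += damping * rank[p] / len(topic_pages)
        rank = new_rank
    return rank


def query_time_score(page, topic_ranks, topic_probs):
    """Blend the offline per-topic ranks using the query's topic probabilities."""
    return sum(prob * topic_ranks[topic][page] for topic, prob in topic_probs.items())


# Hypothetical graph: a business cluster, a sports cluster and a general portal.
out_links = {
    "bank": ["broker"], "broker": ["bank", "portal"],
    "club": ["league"], "league": ["club", "portal"],
    "portal": ["bank", "club"],
}
topics = {"business": {"bank", "broker"}, "sports": {"club", "league"}}
topic_ranks = {name: biased_pagerank(out_links, seeds) for name, seeds in topics.items()}

# Assumed classifier output for a finance-related query.
query_topics = {"business": 0.9, "sports": 0.1}
print(query_time_score("bank", topic_ranks, query_topics) >
      query_time_score("club", topic_ranks, query_topics))  # True
```

The expensive part, the per-topic rank vectors, is computed offline; only the cheap weighted sum happens at query time, which keeps the speed advantage of offline PageRank mentioned above.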




