How much blog spam is produced in 5 minutes on a quiet Sunday evening? What is the ratio of spam blogs on the most popular blog services? To answer these questions, I present the results of an experiment in which I analyzed ping data and manually reviewed blogs.

The relative ease of creating and maintaining blogs makes them ideal tools for spamming search engines. Spam blogs, or splogs, serve two basic purposes: making money from advertising and affiliate programs, and participating in link farms. But making money from AdSense and providing nepotistic links are not enough to call a blog a splog. Otherwise we would have to classify all blogs that show ads or promote a business as spam, and there are thousands of popular, quality blogs that would fall into this category. The distinctive feature of a splog is that it is of no use to its visitors. Should Google ban a splog from AdSense and prevent its links from passing on authority, the splog would have no remaining value or reason to exist. So my definition of a splog is “a blog whose only purpose is showing contextual or affiliate ads, or boosting the link popularity of certain target sites”.

How active are these splogs? This question calls for a little experiment, similar to the one described by P. Kolari, A. Java and T. Finin in their paper “Characterizing the Splogosphere”. They did their experiment in early 2006, and I am going to repeat it on a smaller scale now, in early 2007.

Every time a blog is updated, it sends a ping to one of many ping servers in order to invite search engine crawlers to index the new post. I am going to use ping data provided by one of the most popular ping servers, Weblogs.com. Due to the limited scale of the experiment I will be using the smaller dataset covering the last 5 minutes of pings. It’s pretty big though: 8117 pings. I’ve written a simple Java application to parse the XML file and extract the URLs and names of the blogs in the dataset. Some of the blogs were also classified by blog platform: Blogspot (Blogger), MySpace, Spaces.Live.com etc. I discovered a number of popular blog services that I hadn’t come across before, such as the popular Taiwanese site Wretch.cc, or the Italian Libero.it and Splinder.com. I was surprised to see how few pings came from some other popular blog services; Livejournal, for instance, had only 6 pings! Obviously LJ doesn’t rely much on Weblogs.com, but LJ has little to do with my experiment, as it is known to have a very small percentage of splogs.
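
The Java application itself is not shown here, so the following is only a rough Python sketch of the same idea, not the original code. It assumes the ping file is a list of `weblog` elements carrying `name` and `url` attributes; the sample entries, the `PLATFORMS` table and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample in the assumed shape of a ping file:
# one <weblog> element per ping, with "name" and "url" attributes.
SAMPLE_XML = """<weblogUpdates version="2" count="4">
  <weblog name="Example One" url="http://example-one.blogspot.com/" when="1"/>
  <weblog name="Example Two" url="http://blog.myspace.com/exampletwo" when="2"/>
  <weblog name="Example Three" url="http://examplethree.wordpress.com/" when="3"/>
  <weblog name="Example Four" url="http://www.example.org/blog/" when="4"/>
</weblogUpdates>"""

# Host suffix -> category label. Illustrative list, not the author's.
PLATFORMS = {
    "blogspot.com": "Blogspot",
    "myspace.com": "MySpace",
    "spaces.live.com": "Spaces.Live.com",
    "wordpress.com": "WordPress.com",
    "livejournal.com": "Livejournal",
    "wretch.cc": "Wretch.cc",
    "libero.it": "Libero.it",
    "splinder.com": "Splinder.com",
}


def classify(url):
    """Return the platform label for a blog URL, or 'Rest' if none matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, label in PLATFORMS.items():
        # Match the domain itself or any subdomain of it.
        if host == suffix or host.endswith("." + suffix):
            return label
    return "Rest"


def parse_pings(xml_text):
    """Extract (name, url) pairs from the ping XML."""
    root = ET.fromstring(xml_text)
    return [(w.get("name", ""), w.get("url", "")) for w in root.iter("weblog")]


pings = parse_pings(SAMPLE_XML)
counts = Counter(classify(url) for _, url in pings)
for platform, n in counts.most_common():
    print(platform, n)
```

Because the match is on the host name only, a standalone blog that merely runs WordPress software on its own domain ends up in ‘Rest’, not in WordPress.com.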

Below is a breakdown of blogs by platform, according to a ping dataset retrieved on a Sunday evening, Feb. 11. Do not confuse the blogs in the WordPress.com category with blogs that use WP as a blog engine: only blogs hosted by WordPress.com are included in this category.

Fig. 1 Popular Blog Services in the Sunday Weblogs Dataset

The huge ‘Rest’ category consists of standalone blogs and blogs hosted by minor blog services.
A few words on the blogs in the dataset: a lot of them were not in English, by my estimate as many as 70%. For instance, all Wretch.cc blogs and many Spaces.Live.com ones are in Chinese; there are also a lot of blogs in Italian, Spanish, Russian, Japanese and German.

Once the dataset was downloaded and processed, I started manually reviewing the blogs and looking for spam. Of course I couldn’t visit all 8117 blogs, so I randomly selected 20 blogs from each category.
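
For readers who want to repeat the sampling step, here is a minimal Python sketch of drawing a random sample per category. It is an illustration under assumptions, not the procedure actually used: the function name, the `seed` parameter and the example URLs are hypothetical.

```python
import random


def sample_per_category(blogs_by_category, k=20, seed=None):
    """Pick up to k blogs at random from each category.

    Categories with fewer than k blogs are taken whole.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return {
        category: rng.sample(urls, min(k, len(urls)))
        for category, urls in blogs_by_category.items()
    }


# Hypothetical input: category -> list of blog URLs from the ping data.
blogs_by_category = {
    "Blogspot": [f"http://blog{i}.blogspot.com/" for i in range(500)],
    "Livejournal": [f"http://user{i}.livejournal.com/" for i in range(6)],
}

samples = sample_per_category(blogs_by_category, k=20, seed=2007)
for category, urls in samples.items():
    print(category, len(urls))
```

Sampling the same number of blogs from every category keeps the manual review manageable, but it also means a small category gets the same weight in the review as a large one.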

How did I classify spam blogs? While blogs with automatically generated content or dictionary dumps are easily classified as spam, those with plagiarized content or in foreign languages required a bit more effort. Nepotistic links with keyword-stuffed anchors were a good indicator of spam. Copyscape.com helped a lot in discovering plagiarized posts. Finally, affiliate and contextual ads completed the picture. It has to be noted that very few blogs in languages other than English were classified as spam. I can be sure about my judgment of German and Russian blogs, since I know these languages, but when dealing with others I relied only on excessive advertising and nepotistic links as spam indicators. I skipped the Wretch.cc and Explog.jp samples, as I was totally unable to judge Chinese and Japanese blogs. In total, 36 of the 177 reviewed blogs (about 20%) were classified as spam.

Below you can see two charts: one shows the ratio of spam within each sample, and the other shows how much each blog platform contributes to the total amount of spam.

Fig. 2 Percentage of Spam Blogs in 20-Blog Samples

Fig. 3 Contribution of Each Category to the Total Blog Spam

With the notable exception of Blogspot, the majority of blogs hosted by popular blog services are spam-free. Of course one can question their quality, as many of them are of little value to others. But let’s not forget that most of those blogs are private diaries or personal playgrounds never intended to have big audiences; as long as they have value to the author and his/her close circle of friends, we can’t call them spam.

Thus, according to my reviews, blogs hosted by beon.ru, Libero.it, Spaces.Live.com, Livejournal.com, splinder.com, and typepad.com showed no instances of blog spam in their 20-blog samples. Among 20 MySpace blogs I discovered 1 splog, and the WordPress.com sample contained 2. Google’s popular service Blogspot lived up to its unofficial name of Splogspot with a 50% spam ratio. The ‘Rest’ category, made up of standalone blogs and blogs attached to commercial sites, showed an even bigger proportion of blog spam: 23 of the 27 blogs reviewed were classified as spam. The relatively low number of splogs hosted by public services can be explained by the anti-spam measures taken by the administrators of such services. The standalone splogs, however, are not subject to such moderation, which allows them to thrive, producing tons of junk content for SE crawlers and overloading ping servers with spam pings.
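
The two quantities plotted in the charts can be recomputed from the counts given in the text. The Python sketch below does that arithmetic; the only figure not stated outright is the Blogspot count, which is assumed here to be 10 of 20 (a 50% ratio in a 20-blog sample). Services with no spam found are left out, as they contribute nothing to the total.

```python
# (spam found, blogs reviewed) per category, taken from the text.
# Blogspot's 10 of 20 is an assumption derived from "50%" of a 20-blog sample.
spam_found = {
    "MySpace": (1, 20),
    "WordPress.com": (2, 20),
    "Blogspot": (10, 20),
    "Rest": (23, 27),
}

total_spam = sum(spam for spam, _ in spam_found.values())
total_reviewed = 177  # all reviewed blogs, including the spam-free samples

for name, (spam, reviewed) in spam_found.items():
    ratio_in_sample = spam / reviewed   # share of the sample that is spam
    share_of_spam = spam / total_spam   # contribution to all spam found
    print(f"{name}: {ratio_in_sample:.0%} of sample, "
          f"{share_of_spam:.1%} of all spam found")

print(f"Overall: {total_spam} of {total_reviewed} "
      f"({total_spam / total_reviewed:.1%})")
```

Under that assumption the four categories add up to the 36 splogs reported above, with the ‘Rest’ category accounting for roughly two thirds of them.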

As you might have noticed, I used the chart style introduced by the famous blog ModernLifeIsRubbish.co.uk, which has an excellent tutorial on how to create pretty pie charts in Adobe Illustrator. Highly recommended!

If anybody is interested, here is the dataset I used: Dataset

26 Responses to “How Much Blog Spam? A Study of a Ping Dataset”

  1. Pranam Kolari Says:

    You might also be interested in one of our more recent studies.

  2. Pranam Kolari Says:

    Could you please share your hand reviewed samples with us?

  3. oleg.ishenko Says:

    Yes, sure. Here it is: Dataset

    Thanks for commenting and for the link!

  4. TrackBacks » Blog Archive » links for 2007-02-13 Says:

    [...] » How Much Blog Spam? A Study of a Ping Dataset (tags: pings spam) [...]

  5. Search Marketing Facts » How Much Blog Spam? A Study of a Ping Dataset Says:

    [...] Read full entry [...]

  6. GMI Blog Says:

    I recall the day when spam was not a problem on the Internet. Between my junk mail folder for my Yahoo e-mail account, all the fake profiles on MySpace, and the million spam blog posts I am amazed the major search engines keep spam search results as low as they do. Spam is like bamboo or ivy. No way to get rid of it. It will always grow back.

  7. Movie Blog Says:

    I remember when spam wasn’t a problem, either. My work email would never get more than one or two items a day, now I get dozens (which is still better than some). The form on my own website delivers 80 spam emails a day… I need to block those.

  8. azrin Says:

    You did not include PING or TRACKBACK spamming.
    That gives us a lot more headaches than these.

    My own sample over the past 24 hours: 45 blogs running, each getting over 132 spam items; 95% of those are non-generic, and the majority point to redirects and porn.

  9. tiny signs Says:

    Well, that’s the problem when many blog owners and writers can’t compose their own articles. And those who have posted original articles sometimes do not really care whether they were copied by another blog or not.

    Thanks for the information!

  10. Leif Says:

    The difference between spam and just bad content is not that obvious in all cases. And that’s really a big part of the problem.
    Some days I also wish I could go back to the good old days when spam did not exist and when we didn’t need firewalls and the Internet was just some magical big network run by universities. Those were the days.

  11. Seattle SEO Says:

    My blog spam plummeted with the implementation of Captcha security. No more automated comment posters for me.

  12. MPBA.com Says:

    It is tough for firms and companies that need to post contact information, but that want to avoid SPAM too.

  13. Alastor Says:

    Splogs will be created as long as there is a purpose in creating them. If there were no point in creating one, there wouldn’t be so much spam all over the web.

  14. Loan Express Says:

    I have read the post. It was about spam on blog sites. I agree with you that splogs are usually created for business purposes, that is, for making money. Can you explain spam in different terms?

  15. club penguin cheats Says:

    I think a question that could be asked is at what rate spam blogging is increasing. At what point will the system simply fall apart because of all the spam?

  16. DUI Says:

    I cannot stand the blog scrapers that steal content and republish an article that took me hours to write. I would much rather put up with SPAM comments than a script that rips off my website.

  17. sohnamukhda Says:

    Nice Information here about blog spamming, I think you have considered only 1 factor that is excessive advertisement, but left link spamming at blogs. I think this is another big factor of spamming.

  18. Jack Payne Says:

    I’m with DUI. With moderation, putting up with spam is not, really, all that big a problem.

  19. Chloe Edwards Says:

    What an excellent post, and an excellent, detailed survey and analysis.
    I am not entirely surprised by the fact that 50% of Blogger pings were spam; a lot of the splogs I see are on .blogspot.com

  20. pass my drug test Says:

    That’s a really interesting article.

  21. oleg.ishenko Says:

    test

  22. seo specialist Says:

    I have a couple of blogs of my own, and a few months back I ran into a lot of spam comments and pingbacks, but I found a solution in a WP plugin. Still, this is one of the most annoying things for many bloggers. Thanks for the post.

  23. SEO packages Says:

    You could actually use a WP plugin to prevent this from happening, or at least to protect your blog from possible spam comments. Spammers also spend time leaving useless comments on idle blogs.

  24. John Says:

    very interesting story. I have definitely started to notice the fast growing number of splogs on the internet these days. Take a look at my post about the new software called Blog Hatter Pro 2010, “http://www.theeventof.com/2010/06/blog-hatter-killed-my-dreams.html” and let me know what you think about how much worse this is going to get.

  25. Dunlockel Wiiong Says:

    So most WordPress blogs are now nofollow!!

  26. sac ekim Says:

    Nice Information here about blog spamming, I think you have considered only 1 factor that is excessive advertisement, but left link spamming at blogs. I think this is another big factor of spamming.
