RSS Feed Scraper


It appears that I have a fan. Ok, maybe not a fan. There is a website scraper out there that is not smart enough to actually read the content it is scraping, so it is grabbing my RSS feed, extra footer content and all, and posting it on its site. The site has many of my posts, and the majority of them have this at the bottom:

Copyright © LGR Webmaster Blog. This feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement.

Visit the LGR Webmaster Blog for more great content.

You would think that would make it pretty obvious that the content was stolen from somewhere. I have sent an email to both the address on the whois record for the domain and the address I could find for the web host behind their IP address, in hopes of having the content removed from the site. Considering the email I sent to the whois address bounced, I don’t know if I will have much luck.

After the email to the whois address bounced, I thought I would have a little fun with this scraper site. If I can’t get the content removed, I can at least make sure people know that the content is stolen, just in case they don’t read the copyright notice at the bottom of the post. I post the odd image in my posts, but from now on I will make sure there is always an image in the post, even if it is just a blank image that you can’t see in the post itself. This image is important. It is placed in the WordPress uploads folder, but I suppose it could be placed anywhere on your website. Inside the WordPress uploads folder I have added another .htaccess file with the following:

    ErrorDocument 403 /images/403.gif
    RewriteEngine on
    RewriteCond %{HTTP_REFERER} websiteIwantBlocked\.com
    RewriteRule .* - [F]
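If more scrapers show up, the same idea extends to several referrers at once; here is a minimal sketch, assuming two hypothetical offending domains and adding the [NC] flag so the match is case-insensitive:

    ErrorDocument 403 /images/403.gif
    RewriteEngine on
    # Match either hypothetical scraper domain in the referrer, ignoring case
    RewriteCond %{HTTP_REFERER} (websiteIwantBlocked\.com|anotherScraper\.example) [NC]
    # Return 403 Forbidden instead of the requested image
    RewriteRule .* - [F]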

I changed the website name obviously, but you should get the idea. This stops the server from sending any of the images in the WordPress uploads folder to a request that arrives with a referrer of websiteIwantBlocked.com, and returns the 403 error document instead. Because everything served from this folder should be an image, I created a custom error document that is itself an image, and placed it in a different folder (images) so the error image is not caught by the same rule. Now when an image is requested from websiteIwantBlocked.com, instead of the server sending out the image I put in the post, it returns a 403 error and my custom error image, which by the way looks like this:

[403.gif: the custom error image]

Now when someone visits the scraper site I have listed, they get a nice warning that the site has stolen bandwidth, content, or both. It only does this for the sites I have listed, so feed readers should not be affected.

There are other things I have done as well. I have added the website’s IP address to the blog’s root .htaccess file and denied it access, in case the site is scraping the feed directly. It looks like this, if you are wondering:

    deny from IP ADDRESS YOU WANT BLOCKED
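For context, a deny from line like that normally sits inside the Apache 2.2-style access control directives; a minimal sketch, using a placeholder address from the documentation range:

    # Evaluate allow directives first, then deny; deny wins on a match
    order allow,deny
    allow from all
    # Placeholder address; substitute the scraper's actual IP
    deny from 203.0.113.45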

I use FeedBurner for my feeds, and its stats usually list uncommon uses of a feed, but there has been no mention of this one. I did notice that one of the bots is WordPress, so it is possible that the site is scraping the FeedBurner feed and not the site directly. One feature I wish FeedBurner had is the ability to block individual IP addresses from accessing a feed. That would make this so much easier, since the scraper’s requests have to come from some IP address.

I guess we will see if I get an email back from the web host. I am not holding my breath. I think I might have to make do with this, or move the feed away from FeedBurner so I can block individual IP addresses myself.

How do other people handle very persistent RSS feed scrapers?

Categories: rss web-programming 