RSS Feed Scraper


It appears I have a fan; ok, maybe not a fan. A website is scraping my content, and whoever runs it is not smart enough to actually read what they are scraping, so they are grabbing the extra content I append to my RSS feed and posting it on their site along with everything else. They have many of my posts, and the majority of them have this at the bottom:

Copyright © LGR Webmaster Blog. This feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement.

Visit the LGR Webmaster Blog for more great content.

You would think that would make it pretty obvious the content was stolen from somewhere. In hopes of having the content removed, I have emailed both the address on the whois record for the domain and the address I could find for the web host serving their IP address. Considering the email to the whois address bounced, I don't know if I will have much luck.

After the email to the whois address bounced, I thought I would have a little fun with this scraper site. If I can't get the content removed, I can at least make sure people know the content is stolen, just in case they don't read the copyright notice at the bottom of the post. I post the odd image in my posts, but from now on I will make sure there is always an image in the post, even if it is just a blank image that you can't see in the post itself. This image is important. It is placed in the WordPress uploads folder, but I suppose it could be placed anywhere on your website. Inside the WordPress uploads folder I have added another .htaccess file with the following:

ErrorDocument 403 /images/403.gif
RewriteEngine on
RewriteCond %{HTTP_REFERER} websiteIwantBlocked\.com
RewriteRule .* - [F]

I changed the website name obviously, but you should get the idea. This stops the server from sending any image in the WordPress uploads folder to a request arriving with a referrer of websiteIwantBlocked.com, returning the 403 error document instead. Because everything served from this folder should be an image, I made the custom error document an image as well, and placed it in a different folder (images) so it can still be served. Now when an image is requested from websiteIwantBlocked.com, instead of the server sending out the image I have in the post, it returns a 403 error and my custom error image, which by the way looks like this:

[custom 403 error image]

Now when someone visits the scraper site I have listed, they get a nice warning that the site has stolen bandwidth, content, or both. It only happens for the sites I have listed, so feed readers should not be affected.
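The same trick scales to more than one scraper, if it comes to that. A rough sketch, assuming two hypothetical domain names and using mod_rewrite's [NC] (case-insensitive) and [OR] flags:

ErrorDocument 403 /images/403.gif
RewriteEngine on
# Match the referrer against each scraper domain (names are placeholders)
RewriteCond %{HTTP_REFERER} firstScraperSite\.com [NC,OR]
RewriteCond %{HTTP_REFERER} secondScraperSite\.com [NC]
# Refuse the request; the 403 image above gets served instead
RewriteRule .* - [F]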

There are other things I have done as well. In case the website was scraping the feed directly, I have added its IP address to the blog's root .htaccess file and denied it access. It looks like this, if you are wondering:

deny from IP ADDRESS YOU WANT BLOCKED
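For context, a minimal sketch of the surrounding directives, assuming Apache 2.2-style access control and a documentation-reserved address standing in for the real one:

# Refuse all requests coming from the scraper's server (placeholder address)
order allow,deny
deny from 203.0.113.5
allow from all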

I use FeedBurner for my feeds, and it usually lists uncommon uses of a feed, but there has been no mention of this one. I did notice that one of the bots hitting the feed identifies itself as WordPress, so it is possible the site is scraping the FeedBurner feed rather than my site directly. One feature I wish FeedBurner had is the ability to block individual IP addresses from accessing a feed; since the scraper's requests come from its server's address, that would make this so much easier.

I guess we will see if I get an email back from the web host. I am not holding my breath. I think I might have to make do with this, or move the feed away from FeedBurner so I can block individual IP addresses myself.
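If I do bring the feed back onto the blog, blocking one scraper from just the feed should be simple. A sketch with mod_rewrite in the root .htaccess, assuming a WordPress-style /feed/ path and the same placeholder address:

RewriteEngine on
# Only requests from the scraper's address (placeholder) are refused the feed
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.5$
RewriteRule ^feed/ - [F]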

How do other people handle very persistent RSS feed scrapers?

Categories: rss   web-programming  