Absolute URLs make scraping harder
One of the problems with putting content on a public website (or a blog) is that it’s almost trivial for someone else to copy the content and place it on their own site, often without attributing it to you or linking back to the original site. You can often catch them after the fact using Google Alerts, as I mentioned last September in “Using Google Alerts to check for content theft”. Still, it’s better if you can prevent content theft in the first place.
At the duplicate content issues session of this week’s Search Engine Strategies conference, Googler Matt Cutts pointed out that using absolute links on web pages instead of relative links makes wholesale scraping of a site harder to do.
An absolute link is a link whose URL begins with “http://” (or “https://”) and a domain name. For example:
<a href="http://www.memwg.com/index.html">Home</a>
Drop the protocol (the “http://” part) and the domain name and you’re left with a relative link:
<a href="index.html">Home</a>
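If you want to see the difference for yourself, here is a small Python sketch. It uses the standard library's urljoin function, which resolves a link against the address of the page it appears on, much as a browser does. The scraper.example domain and the articles.html page are made up for illustration; they are not real sites or pages.

```python
from urllib.parse import urljoin

# Hypothetical page addresses: the original page and a scraped copy of it.
ORIGINAL_PAGE = "http://www.memwg.com/articles.html"
COPIED_PAGE = "http://scraper.example/articles.html"

RELATIVE_LINK = "index.html"
ABSOLUTE_LINK = "http://www.memwg.com/index.html"

# A relative link resolves against whichever site is serving the page.
relative_on_original = urljoin(ORIGINAL_PAGE, RELATIVE_LINK)
relative_on_copy = urljoin(COPIED_PAGE, RELATIVE_LINK)

# An absolute link points at the same place no matter where the page lives.
absolute_on_copy = urljoin(COPIED_PAGE, ABSOLUTE_LINK)

print(relative_on_original)  # http://www.memwg.com/index.html
print(relative_on_copy)      # http://scraper.example/index.html
print(absolute_on_copy)      # http://www.memwg.com/index.html
```

The relative link follows the page to its new home, while the absolute link keeps pointing at the original site.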
A lot of books and tutorials recommend using relative links. This is especially true if you’re building a static site (no PHP, ASP or JSP pages), because relative links make it trivial to test the site on your local computer before uploading it to the hosting service. Using absolute links makes it hard to test a local copy of the website, since the pages will all refer back to the real website, which won’t have the updated pages you’re testing.
The problem with relative links is that anyone copying pages from your website doesn’t have to do anything to make their copy of the site work on their web server. Absolute links, however, force them to go through each page of your website and change the links to refer to their site. Now, it’s not really hard to do with a good search-and-replace tool, but it makes the scraper’s life a bit more difficult. And if they forget to do it, or miss some links, then you’ll get some additional traffic sent back to your site (the original source of the content).
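If your site is static and already full of relative links, you could convert them with a short script before uploading. What follows is only a rough sketch in Python, not a polished tool: the absolutize function and the LINK_ATTR pattern are names I made up, and the regular expression only handles quoted href and src attributes. A real HTML parser would be more robust.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." using either double or single quotes.
LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src))\s*=\s*(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)

def absolutize(html, page_url):
    """Rewrite relative href/src values in html as absolute URLs.

    page_url is the address the page will have on the live site.
    """
    def fix(match):
        url = match.group("url")
        # Leave in-page anchors and non-web schemes untouched.
        if url.startswith(("#", "mailto:", "javascript:")):
            return match.group(0)
        quote = match.group("quote")
        # urljoin returns already-absolute URLs unchanged.
        return f'{match.group("attr")}={quote}{urljoin(page_url, url)}{quote}'

    return LINK_ATTR.sub(fix, html)

page = '<a href="index.html">Home</a> <img src="images/logo.png">'
print(absolutize(page, "http://www.memwg.com/articles.html"))
```

Run against your own pages, this turns the relative link from the earlier example into the absolute one. You would keep the relative version for local testing and run the conversion only on the copy you upload.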
Note that bloggers usually use absolute links when referring to images and other posts on their site, because otherwise those links break when the RSS feed is shown in news readers like Bloglines. But we’re talking about using absolute links as a general strategy across an entire site, blog or not.
Originally from An AdSense Blog: Make Easy Money with Google on March 3, 2006, 10:34am