Dynamic Websites: How to Avoid Indexing Duplicate Content


Search Engines are getting better and better at crawling dynamic websites. Unless you really put your mind to it and do everything you possibly can, dynamic content will get indexed – but that’s not the end of the story.

In some cases you’ll also have to deal with eliminating the duplicate content that has already made it into the search engine indexes.

How can you detect whether you have duplicate content in the search engine indexes?

That’s the easy bit.

I recently took on a client with an e-commerce site built from scratch and with no organic traffic. One of the first checks I run is the

site:www.yourwebsite.com

command to see how many pages are in the index. In my case there were thousands of pages, yet the number of product and category pages fell short of 950: the rest were duplicate content.

How can you eliminate the duplicate content pages?

If you’re like me (= not a programmer), then you’ll have to scrape the SERPs: they give you enough visibility on the different types of pages indexed to identify a pattern and spot the duplicate pages.

In my case the programmers had used a variable for navigational purposes that could potentially generate tens of thousands of duplicate pages. The structure of the duplicate URLs is

page.php?category=&empty_e=yes&id=xxxx

the real page is

page.php?category&id=xxxx
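
To put a number on it, you can combine the site: operator with inurl: and count how many indexed URLs carry the navigational variable (the domain is a placeholder, and since Google is loose with punctuation inside inurl:, matching on the parameter name alone is usually enough):

site:www.yourwebsite.com inurl:empty_e

Comparing that figure with the total returned by the plain site: query tells you how big the duplicate problem is.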

The duplicate pages can be eliminated by using an appropriate robots.txt file.

The official robots.txt website states:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.

but that’s not the case, at least not with Google: you can create a wildcard pattern to match those pages and block them from being spidered. To make sure your pattern is compatible with Google, you can plug it into Google Webmaster Tools:

  • Log on to your Google Webmaster Tools account
  • Go to the Tools section
  • Click on Analyze robots.txt
  • Scroll down to the end of the page, where you’ll find the testing box


Load your pattern into the box, click on the Check button, and you’ll find out whether it’s compatible.

In my case I was able to filter out the duplicate content by excluding URLs containing yes. All the duplicate content was identified and blocked via the robots.txt file with this pattern:

Disallow: /page.php?*yes*
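
In context, the complete robots.txt looks something like this (as the quote above notes, the * in the User-agent field means “any robot”, even though only Google is known to honour the wildcard in the Disallow line):

User-agent: *
Disallow: /page.php?*yes*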

You can learn more by reading this article on pattern matching.
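
For reference, Google’s pattern matching is much simpler than a full regular expression: * matches any sequence of characters and $ anchors the end of the URL. A couple of illustrative rules (the file extension and paths are just examples, not taken from the site above):

Disallow: /*?       # block any URL containing a question mark
Disallow: /*.gif$   # block any URL ending in .gif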

I know this will work with Google and will save you a lot of time and effort (= money) by avoiding programming workarounds, but that may not be the case with other search engines.

You want to keep these pages out of the index because letting Google index them goes against their Webmaster Guidelines. Under the Technical Guidelines paragraph they clearly state:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.

There is another interesting case I’ve come across where forms can be a problem and generate duplicate content – more on that one in my next post 🙂


13 Replies

  1. Good information, I have a few questions.
    Will pages with RSS feeds get indexed as duplicates, such as:
    http://jbode.info/internet-marketing/

    Also, can this robots.txt alteration be used for a WordPress blog?

    Thanks

  2. Hi Jeff,
    the robots.txt file is used to manage all search engine spidering activity on your website. From a spider’s perspective there is no difference between your website and your blog: you provide instructions on which files and folders should not be indexed. There can be only one robots.txt file on your website to manage this activity, even though I have seen people put up more than one file …
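
    For example, a single robots.txt at the root of the domain can carry rules for both the site and the blog, along these lines (the paths are purely illustrative and assume the blog lives under /blog/; the wildcard lines, as explained in the post, are only known to work with Google):

    User-agent: *
    Disallow: /page.php?*yes*
    Disallow: /blog/feed/
    Disallow: /blog/*/feed/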

  3. This is something that people need to be aware of. Many people may not be aware that they have duplicate content (unintentionally) and that it’s affecting their PageRank and SERPs badly.

  4. Wow, this article was a huge eye opener for me. I didn’t even realize I could use the robots.txt file to find and eliminate my duplicate content. This post was very helpful and over the next few days I’ll be referring to it often as I try to fix a few sites. Thanks

  5. Thanks, this will come in handy especially with dynamic sites… I try to use modules to rename titles to be more SEO friendly, but I haven’t even set up the robots.txt yet. Hopefully I will get to it ASAP.

  6. Time Tracker

    Agreed with all above. There’s always a tricky line to walk between exposure and duplicate content. Certainly, you want to get your stuff out there, but not at the cost of getting the smackdown from Google in the same process.

  7. admin

    Interesting article

  8. Hi,
    Great Post!
    This is something that people need to be aware of. Many people may not be aware that they have duplicate content (unintentionally) and that it’s affecting their PageRank and SERPs badly.

  9. Nice post and thanks for the advice

  10. Hi…

    Nice points about dealing with duplicate content. It’d also be great to know other tips or advice about this subject. It gets pretty intense trying to understand it so the clearer the better.

    Thanks – excellent work.

    Martin

  11. This is a new thing for me. I had never heard about duplicate content before, or how it affects sites. Thanks for bringing it to my attention.