Dynamic Websites: How to Avoid Indexing Duplicate Content

September 30th, 2008 by Sante J. Achille


Search engines keep getting better at crawling dynamic websites. Unless you go out of your way to prevent it, dynamic content will get indexed, but that’s not the end of the story.

In some cases you’ll also have to get duplicate content removed from the search engines’ indexes.

How can you detect if you have duplicate content in the search engine indexes?

That’s the easy bit.

I recently took on a client with an e-commerce site built from scratch and with no organic traffic. One of the first checks I run is the

site:www.yourwebsite.com

command to see how many pages are in the index. In my case there were thousands of pages, yet the product and category pages together came to fewer than 950: the rest were duplicate content.

How can you eliminate the duplicate content pages?

If you’re like me (not a programmer), you’ll have to scrape the SERPs. That gives you enough visibility into the different types of pages indexed to identify a pattern and spot the duplicate pages.
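For readers who are comfortable with a little scripting, the pattern hunting can be sketched roughly as follows. This is only an illustration, not the method used on the client site: it assumes you have already collected the indexed URLs into a list, and the sample URLs and function names are hypothetical. It groups URLs by their path and the names of their query parameters, so an unexpected parameter stands out.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def parameter_signature(url):
    """Return the path plus the sorted query parameter names of a URL."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps parameters such as "category" or "category="
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return parts.path + "?" + "&".join(names)


def group_by_signature(urls):
    """Count how many URLs share each parameter signature."""
    return Counter(parameter_signature(u) for u in urls)


# Hypothetical sample of URLs collected from a site: query
indexed_urls = [
    "http://www.yourwebsite.com/page.php?category=&empty_e=yes&id=101",
    "http://www.yourwebsite.com/page.php?category=&empty_e=yes&id=102",
    "http://www.yourwebsite.com/page.php?category&id=101",
]

for signature, count in group_by_signature(indexed_urls).most_common():
    print(count, signature)
```

A signature that accounts for far more URLs than you have real pages is a good candidate for a duplicate pattern.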

In my case the programmers had used a variable for navigational purposes that could potentially generate tens of thousands of duplicate pages. The duplicate URLs have this structure:

page.php?category=&empty_e=yes&id=xxxx

while the real page is:

page.php?category&id=xxxx

The duplicate pages can be eliminated by using an appropriate robots.txt file.

The official robots.txt website states:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.

but that’s not the case, at least not with Google. Googlebot supports wildcard pattern matching (not full regular expressions), so you can write a pattern that matches pages and blocks them from being spidered. To make sure your pattern is compatible with Google you can plug it into Google Webmaster Tools:

  • Log in to your Google Webmaster Tools account
  • Go to the tools section
  • Click on Analyze robots.txt
  • Scroll down to the end of the page, where you’ll find the following box

[Screenshot: the robots.txt test box in Google Webmaster Tools]

Enter your pattern here and click the Check button to find out whether it’s compatible.

In my case I was able to filter the duplicate content by excluding URLs containing yes. All duplicate content was identified and blocked via the robots.txt file with this pattern:

Disallow: /page.php?*yes*
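To get a feel for what a rule like this matches before testing it in Webmaster Tools, here is a minimal Python sketch. It only approximates wildcard matching as Google documents it ('*' matches any run of characters, a trailing '$' anchors the end of the URL); it is not Google's implementation, and the function names and sample URLs are made up for illustration.

```python
import re


def rule_to_regex(rule):
    """Translate a wildcard Disallow rule into a compiled regular expression.

    '*' matches any run of characters and a trailing '$' anchors the end of
    the URL; every other character, including '?', is matched literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule, path_and_query):
    """Return True if the rule matches from the start of the path."""
    return rule_to_regex(rule).match(path_and_query) is not None


RULE = "/page.php?*yes*"

# Duplicate URL: blocked
print(is_blocked(RULE, "/page.php?category=&empty_e=yes&id=1234"))  # True
# Real page: still crawlable
print(is_blocked(RULE, "/page.php?category&id=1234"))  # False
# Hypothetical real page whose value happens to contain "yes": also blocked
print(is_blocked(RULE, "/page.php?category=eyeshadow&id=7"))  # True
```

The last line shows why it pays to check a broad pattern against your real URLs first. If that kind of collision matters on your site, a narrower rule such as /page.php?*empty_e=yes (an untested suggestion, so run it through the tool) would avoid it.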

You can learn more by reading this article on pattern matching.

I know this works with Google, and it saves a lot of time and effort (and therefore money) compared with programming workarounds, but it may not work with other search engines.

You want to keep these pages out of the index because letting Google index them goes against the Webmaster Guidelines. Under the Technical Guidelines paragraph they clearly state:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.

There is another interesting case I’ve come across where forms can be a problem and generate duplicate content – more on that one in my next post :)


13 Responses to “Dynamic Websites: How to Avoid Indexing Duplicate Content”

  1. Jeff Says:
    September 30th, 2008 at 18:52

    Good information, I have a few questions
    Will pages with RSS Feeds get a duplicate index such as:
    http://jbode.info/internet-marketing/

    Also can this robots.txt alteration be used for a WordPress blog?

    Thanks

  2. Sante J. Achille Says:
    September 30th, 2008 at 18:59

    Hi Jeff,
    the robots.txt file is used to manage all search engine spidering activities within your website. From a spider’s perspective there is no difference between your website and your blog: you provide instructions on which files and folders should not be indexed. There can be only 1 robots.txt file on your website to manage this activity. I have seen people put more than 1 file …

  3. Dwayne Says:
    October 1st, 2008 at 01:21

    This is something that people need to be aware of. Many people may not be aware of the fact that they have duplicate content (unintentionally) and it’s affecting their PageRank and SERPS badly.

  4. Troy Says:
    October 1st, 2008 at 18:58

    Wow, this article was a huge eye opener for me. I didn’t even realize how I can use the robots.txt file to help me find and eliminate my duplicate content. This post was very helpful and over the next few days I’ll be referring to it often as I try and fix a few sites. Thanks

  5. Mat Says:
    October 2nd, 2008 at 07:24

    Thanks, this will come in handy especially with dynamic sites… I try to use modules to rename titles to be more seo friendly, but haven’t even set up the robots.txt yet. Hopefully I will get to it asap.

  6. Time Tracker Says:
    October 2nd, 2008 at 08:05

    Agreed with all above. There’s always a tricky line to walk between exposure and duplicate content. Certainly, you want to get your stuff out there, but not at the cost of getting the smackdown from Google in the same process.

  7. admin Says:
    October 2nd, 2008 at 08:16

    Interesting article

  8. Dynamic Websites: How Forms Create Duplicate Content Says:
    October 2nd, 2008 at 11:25

    [...] problem was addressed in a post I wrote just a few days ago entitled Dynamic Websites: How to Avoid Indexing Duplicate Content. That post had a focus on website architecture and problems that can arise when the programmers [...]

  9. Chelsa Says:
    October 3rd, 2008 at 07:53

    Hi,
    Great Post!
    This is something that people need to be aware of. Many people may not be aware of the fact that they have duplicate content (unintentionally) and it’s affecting their PageRank and SERPS badly

  10. Jammer Says:
    October 4th, 2008 at 05:19

    Nice post and thanks for the advice

  11. hurleyboy Says:
    October 5th, 2008 at 10:50

    Hi…

    Nice points about dealing with duplicate content. It’d also be great to know other tips or advice about this subject. It gets pretty intense trying to understand it so the clearer the better.

    Thanks – excellent work.

    Martin

  12. Crystal Layden Says:
    October 6th, 2008 at 14:08

    This is a new thing for me. I had never heard about duplicate content before or how it affects sites. Thanks for bringing it to my attention

  13. Large Website Indexing: How to get Lage Websites Indexed, Part 1 Says:
    August 19th, 2009 at 17:40

    [...] the results pages – you need to block this duplicate content out of the index – read this post on how to eliminate duplicate content, for starters and then Google encountered an extremely high number of URLs on your [...]
