Dynamic Websites: How to Avoid Indexing Duplicate Content

Search engines are getting better and better at crawling dynamic websites. Unless you go out of your way to prevent it, dynamic content will get indexed, but that’s not the end of the story.
In some cases you’ll have to find duplicate content and remove it from the search engine indexes.
How can you detect whether you have duplicate content in the search engine indexes?
That’s the easy bit.
I recently took on a client with an ecommerce site built from scratch and with no organic traffic. One of the first checks I run is the
site:www.yourwebsite.com
command to see how many pages are in the index. In my case there were thousands of pages, yet the product and category pages added up to fewer than 950: the rest were duplicate content.
How can you eliminate the duplicate content pages?
If you’re like me (= not a programmer), you’ll have to scrape the SERPs. That gives you enough visibility on the different types of pages indexed to identify a pattern and spot the duplicate pages.
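For anyone comfortable with a little scripting, the pattern-spotting step can be sketched in Python. This is only an illustration, not the method I used: the function names and the sample URLs (including the id values) are made up, and it assumes you have already collected the indexed URLs into a list. It groups URLs by path and parameter names, ignoring the values, so that each type of page shows up as one line with a count.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def url_signature(url):
    """Return the path plus the sorted query parameter names, ignoring values."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps parameters such as "category=" or a bare "category"
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return parts.path + "?" + "&".join(names)


def count_signatures(urls):
    """Count how many URLs share each signature."""
    return Counter(url_signature(u) for u in urls)


# Hypothetical sample of URLs collected from a site: query (ids are invented)
sample_urls = [
    "http://www.yourwebsite.com/page.php?category&id=101",
    "http://www.yourwebsite.com/page.php?category=&empty_e=yes&id=101",
    "http://www.yourwebsite.com/page.php?category=&empty_e=yes&id=102",
    "http://www.yourwebsite.com/page.php?category&id=102",
    "http://www.yourwebsite.com/page.php?category=&empty_e=yes&id=103",
]

# Most common signatures first: an unexpected parameter stands out quickly
for signature, count in count_signatures(sample_urls).most_common():
    print(count, signature)
```

A signature that appears far more often than you have real pages is a good candidate for duplicate content.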
In my case the programmers had used a variable for navigational purposes that could potentially generate tens of thousands of duplicate pages. The duplicate URLs have this structure:
page.php?category=&empty_e=yes&id=xxxx
while the real page is:
page.php?category&id=xxxx
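Once the pattern is known, separating real pages from duplicates is mechanical. The Python sketch below assumes, as on this site, that the navigational parameter empty_e=yes is what marks a duplicate. The function names and the id values are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def is_navigation_duplicate(url):
    """True when the URL carries the empty_e=yes navigational parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "yes" in query.get("empty_e", [])


def split_urls(urls):
    """Split a list of URLs into (real pages, duplicate pages)."""
    real, duplicates = [], []
    for url in urls:
        (duplicates if is_navigation_duplicate(url) else real).append(url)
    return real, duplicates


# Invented ids, following the two URL structures shown above
real, duplicates = split_urls([
    "page.php?category&id=1234",
    "page.php?category=&empty_e=yes&id=1234",
])
print(len(real), "real,", len(duplicates), "duplicate")
```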
The duplicate pages can be eliminated by using an appropriate robots.txt file.
The official robots.txt website states:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
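That’s not the whole story, at least not with Google: Googlebot understands simple wildcards in Disallow lines, where * matches any sequence of characters and $ marks the end of the URL. This is pattern matching rather than full regular expressions, but it is enough to match the duplicate pages and block them from being spidered. To make sure your pattern is compatible with Google you can test it in Google Webmaster Tools: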
- Log on to your Google Webmaster Tools account
- Go to the Tools section
- Click on Analyze robots.txt
- Scroll down to the end of the page, where you’ll find a test box

Enter your pattern in that box and click the Check button to find out whether it’s compatible.
In my case I was able to filter the duplicate content by excluding URLs containing yes. All duplicate content was identified and blocked via the robots.txt file with this pattern:
Disallow: /page.php?*yes*
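To see what this rule does and does not catch, here is a rough Python sketch of Google-style wildcard matching. It is an approximation for illustration, not Google’s actual implementation: it assumes * matches any run of characters, a trailing $ anchors the end of the URL, and rules are matched from the start of the path. The helper names and the id value are hypothetical.

```python
import re


def disallow_to_regex(pattern):
    """Translate a Disallow value with Google-style wildcards into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, url_path):
    """True when the path (including the query string) matches from its start."""
    return disallow_to_regex(pattern).match(url_path) is not None


rule = "/page.php?*yes*"
print(is_blocked(rule, "/page.php?category=&empty_e=yes&id=1234"))  # True
print(is_blocked(rule, "/page.php?category&id=1234"))  # False
```

Note that the rule is broad: any page.php URL with yes anywhere in its query string is blocked, so check that no real page matches before you rely on it.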
You can learn more by reading this article on pattern matching.
I know this works with Google, and it saves a lot of time and effort (= money) compared with programming workarounds, but it may not work with other search engines.
You want to keep these pages out of the index because letting Google index them goes against its Webmaster Guidelines. Under the Technical Guidelines paragraph they clearly state:
“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
There is another interesting case I’ve come across where forms generate duplicate content; more on that one in my next post.

September 30th, 2008 at 18:52
Good information, I have a few questions
Will pages with RSS Feeds get a duplicate index such as:
http://jbode.info/internet-marketing/
Also, can this robots.txt alteration be used for a WordPress blog?
Thanks
September 30th, 2008 at 18:59
Hi Jeff,
the robots.txt file is used to manage all search engine spidering activity within your website. From a spider’s perspective there is no difference between your website and your blog: you provide instructions on which files and folders should not be indexed. There can be only 1 robots.txt file on your website to manage this activity. I have seen people put more than 1 file …
October 1st, 2008 at 01:21
This is something that people need to be aware of. Many people may not be aware of the fact that they have duplicate content (unintentionally) and it’s affecting their PageRank and SERPS badly.
October 1st, 2008 at 18:58
Wow, this article was a huge eye opener for me. I didn’t even realize I could use the robots.txt file to help me find and eliminate my duplicate content. This post was very helpful and over the next few days I’ll be referring to it often as I try and fix a few sites. Thanks
October 2nd, 2008 at 07:24
Thanks, this will come in handy especially with dynamic sites… I try to use modules to rename titles to be more seo friendly, but haven’t even set up the robots.txt yet. Hopefully I will get to it asap.
October 2nd, 2008 at 08:05
Agreed with all above. There’s always a tricky line to walk between exposure and duplicate content. Certainly, you want to get your stuff out there, but not at the cost of getting the smackdown from Google in the same process.
October 2nd, 2008 at 08:16
Interesting article
October 2nd, 2008 at 11:25
[...] problem was addressed in a post I wrote just a few days ago entitled Dynamic Websites: How to Avoid Indexing Duplicate Content. That post had a focus on website architecture and problems that can arise when the programmers [...]
October 3rd, 2008 at 07:53
Hi,
Great Post!
This is something that people need to be aware of. Many people may not be aware of the fact that they have duplicate content (unintentionally) and it’s affecting their PageRank and SERPS badly
October 4th, 2008 at 05:19
Nice post and thanks for the advice
October 5th, 2008 at 10:50
Hi…
Nice points about dealing with duplicate content. It’d also be great to know other tips or advice about this subject. It gets pretty intense trying to understand it so the clearer the better.
Thanks – excellent work.
Martin
October 6th, 2008 at 14:08
This is a new thing for me. I had never heard about duplicate content before or how it affects sites. Thanks for bringing it to my attention
August 19th, 2009 at 17:40
[...] the results pages – you need to block this duplicate content out of the index – read this post on how to eliminate duplicate content, for starters and then Google encountered an extremely high number of URLs on your [...]