Dynamic Websites: How to Avoid Indexing Duplicate Content
Search engines are getting better and better at crawling dynamic websites. Unless you actively work to keep it out, dynamic content will get indexed – but that’s not the end of the story.
In some cases you’ll also have to manage and eliminate duplicate content from the search engine indexes.
How can you detect if you have duplicate content in the search engine indexes?
That’s the easy bit.
I recently took on a client with an e-commerce site built from scratch and with no organic traffic. One of the first checks I run is the site: command, to see how many pages are in the index. In my case there were thousands of pages, yet the number of product and category pages fell short of 950: the rest were duplicate content.
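For example, typing the following query into Google (example.com stands in for your own domain) lists every page currently in the index for that site:

site:example.com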
How can you eliminate the duplicate content pages?
If you’re like me (= not a programmer), you’ll have to scrape the SERPs: they give you enough visibility on the different types of pages indexed to identify a pattern and spot the duplicate pages.
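Once a suspicious URL pattern emerges, Google’s inurl: operator combined with site: will show you how many indexed pages carry it. The domain and parameter here are placeholders, not my client’s actual URLs:

site:example.com inurl:sort

Comparing that count with the plain site: count gives you a rough idea of how many duplicates a single variable is generating.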
In my case the programmers had used a variable for navigational purposes that could potentially generate tens of thousands of duplicate pages: the indexed duplicates carried the navigational variable in the URL, while the real page sat at the clean URL without it.
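To illustrate with a made-up example (the domain and parameter name are placeholders, not the actual URLs), a duplicate page looked something like

http://www.example.com/widgets/blue-widget.html?nav=yes

while the real page is

http://www.example.com/widgets/blue-widget.html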
The duplicate pages can be eliminated by using an appropriate robots.txt file.
The official robots.txt website states:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
but that’s not the case – at least not with Google, which supports simple pattern matching: ‘*’ matches any sequence of characters and ‘$’ anchors the end of a URL, so you can create a pattern that matches the duplicate pages and blocks them from being spidered. To make sure your pattern is compatible with Google you can plug it into Google Webmaster Tools (an example follows the steps below):
- Log on to your Google Webmaster Tools account
- Go to the tools section
- Click on Analyze robots.txt
- Scroll down to the bottom of the page, where you’ll find the URL test box
Load your pattern there, click on the Check button and you’ll find out if it’s compatible.
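For instance, the very patterns the official standard rejects are accepted by Google; a minimal file to test could look like this (the paths are illustrations only):

User-agent: Googlebot
# Google-only wildcard rules – not part of the original standard
Disallow: /tmp/*
Disallow: /*.gif$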
In my case I was able to filter out the duplicate content by excluding URLs containing yes. All the duplicate content was identified and blocked via the robots.txt file with a pattern built on that string.
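Assuming the navigational variable surfaces as yes in the URL, a Google-compatible rule would look something like this sketch (not the exact line from my client’s file):

User-agent: Googlebot
# block any URL containing “yes” – Google wildcard syntax
Disallow: /*yes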
You can learn more by reading this article on pattern matching.
I know this works with Google and will save you a lot of time and effort (= money) by avoiding workaround programming solutions, but the same may not be true of other search engines.
You want to keep these pages out of the index anyway: letting Google index them runs counter to the Webmaster Guidelines, which under the Technical Guidelines section clearly state:
“Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
There is another interesting case I’ve come across where forms can be a problem and generate duplicate content – more on that one in my next post 🙂