
Large Website Indexing is a challenge. At one time, content managed via dynamic websites was confined to the so-called invisible web because of search engine technological limitations and shortcomings:
- poor website architecture understanding and dynamic URL crawling
- spider traps causing infinite loops within the website and consequent server crash
Today search engines have no problems crawling billions of pages, and dynamic URLs are no longer a problem. If you’ve got the content, the search engines are willing to crawl it … but
- Is your content in the right format ?
- Can the Search Engines manage their way through your website ?
For very large websites and portals going from this daily crawl rate …

to this one

can mean exposing thousands of additional pages for indexing and ultimately for consultation by the end user.
This post (and the next one) will show you where to start and what to do in order to get your large website indexed following my indications: it’s not rocket science as you’ll see but …
Before you start be warned: this is going to be a lengthy process that will take months before you see improvement. There’s no quick fix to getting millions of pages indexed by search engines.
So you ready ? Here we go …
Trial and error are going to be your allies
In most cases, whatever you pull out of your hat after reading this post isn’t going to work the first time around, because you’re dealing with a living body (the website) undergoing constant change implemented by different people over the years, and you don’t have any documentation. I have yet to walk onto a project where there is documentation describing website architecture, database, URL structure … nobody does it. If your website has been online for some time, there are going to be different layers of programming, coded by different people, all with different approaches and different skills.
Getting Started: What should you do first off ?
Putting order in the content.
Run an inventory on your website. How many pages are there in your website ? You need a ballpark figure. Start by counting the number of products you host, add generic description pages.
Create a simple organizational chart identifying the different sections your website can be broken down into and map the variables used to identify section/sub-section/product and whatever other details or features which identify each and every page of the section.
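Here's a rough sketch of that mapping exercise, assuming a hypothetical URL scheme in which `cat`, `sub` and `id` are the query variables for section, sub-section and product. Neither the `catalog.php` script nor the parameter names come from a real site; swap in whatever your own URLs use.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Hypothetical dynamic URLs; script name and parameter names are assumptions.
urls = [
    "http://www.widgetsforyou.com/catalog.php?cat=garden&sub=tools&id=101",
    "http://www.widgetsforyou.com/catalog.php?cat=garden&sub=tools&id=102",
    "http://www.widgetsforyou.com/catalog.php?cat=kitchen&sub=knives&id=205",
    "http://www.widgetsforyou.com/catalog.php?cat=kitchen",
]


def page_signature(url):
    """Return the sorted tuple of query variables that identify a page type."""
    params = parse_qs(urlparse(url).query)
    return tuple(sorted(params))


def section_of(url, section_param="cat"):
    """Return the value of the section variable, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get(section_param)
    return values[0] if values else None


# How many URLs share each combination of variables (i.e. each page type)?
signatures = Counter(page_signature(u) for u in urls)
# How many URLs belong to each section?
sections = Counter(section_of(u) for u in urls)

for signature, count in signatures.items():
    print(signature, count)
for section, count in sections.items():
    print(section, count)
```

Each distinct combination of variables is a candidate box on your organizational chart, and the per-section counts feed straight into the inventory.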

What you’re actually dealing with is a database and structured queries identifying content organized in the shape and form of a page, looking something like this …

How to run an inventory of your website
It’s simple – run an estimate of your products and all other related pages:
PRODUCT PAGES
20,000 products -> 20,000 pages
CATEGORY PAGES
3 categories with 10 products per page
1st category 3,000 products -> 300 pages
2nd category 5,000 products -> 500 pages
3rd category 12,000 products -> 1,200 pages
consider an additional 10 generic pages
MAX NUMBER OF (EXPECTED) PAGES IN SEARCH ENGINE INDEX
20,000 (products) + 2,000 (category pages) + 10 (generic pages) = 22,010
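The same arithmetic as a short script, so you can plug in your own figures. The numbers are the ones from the example above; the only assumption added here is that a partly filled listing page still counts as a page, hence the rounding up.

```python
PRODUCTS_PER_CATEGORY_PAGE = 10
GENERIC_PAGES = 10

# Products per category, taken from the worked example above.
categories = {
    "1st category": 3_000,
    "2nd category": 5_000,
    "3rd category": 12_000,
}


def category_pages(product_count, per_page=PRODUCTS_PER_CATEGORY_PAGE):
    """Number of listing pages needed, rounding up for a partly filled last page."""
    return (product_count + per_page - 1) // per_page


product_pages = sum(categories.values())  # one page per product
listing_pages = sum(category_pages(n) for n in categories.values())
expected_total = product_pages + listing_pages + GENERIC_PAGES

print(product_pages, listing_pages, expected_total)  # 20000 2000 22010
```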
Compare your figures to those found on the search engines when you run a site: query.
You’ll need to use some of the advanced operators, especially inurl:
Example:
let’s say you run a widget online store at
www.widgetsforyou.com
You could be faced with one of the following scenarios:
- The overall number of pages reported is less than expected
- The overall number of pages reported is more than expected
Let’s say the site: command returns 210,275.
How can there possibly be more pages ?
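A small sketch of this comparison step. It only builds the query strings you would type into the search engine and classifies the number you read back; it does not query any search engine. The `search` term passed to inurl: and the 10% tolerance are illustrative assumptions, not figures from this post.

```python
def site_query(domain, inurl=None):
    """Build a site: query string, optionally narrowed with inurl:."""
    query = f"site:{domain}"
    if inurl:
        query += f" inurl:{inurl}"
    return query


def compare_counts(expected, reported, tolerance=0.10):
    """Classify the reported index size against the inventory estimate.

    tolerance is an arbitrary margin (an assumption): site: counts are
    estimates, so small differences are not treated as a problem.
    """
    if reported > expected * (1 + tolerance):
        return "more than expected"
    if reported < expected * (1 - tolerance):
        return "less than expected"
    return "roughly as expected"


print(site_query("www.widgetsforyou.com"))
print(site_query("www.widgetsforyou.com", inurl="search"))
print(compare_counts(expected=22_010, reported=210_275))
```

Running one inurl: query per section gives you the per-section figures to set against your inventory.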
The site: reports more pages than expected
There could be a number of adverse factors at work here, but in 9 out of 10 cases the search engines are spidering your internal search engine and indexing the results pages. You need to block this duplicate content out of the index: for starters, read this post on how to eliminate duplicate content, and then the one titled “Google encountered an extremely high number of URLs on your site”.
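As a hedged illustration of blocking internal search results, here is a minimal robots.txt rule checked with Python's standard `urllib.robotparser` before it goes live. The `/search` path is an assumption; use whatever path or variable identifies your own search results pages. Note that this standard-library parser does plain prefix matching, so the sketch sticks to a simple path prefix.

```python
from urllib.robotparser import RobotFileParser

# Assumption: internal search results live under /search (hypothetical path).
robots_txt = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_crawlable(url, agent="Googlebot"):
    """Return True if the rules above allow the given agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# Hypothetical URLs on the example store.
for url in (
    "http://www.widgetsforyou.com/search?q=blue+widget",
    "http://www.widgetsforyou.com/product?id=101",
):
    print(url, "->", "crawlable" if is_crawlable(url) else "blocked")
```

Testing a handful of real URLs from each section this way helps you avoid blocking product pages by accident.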
Create your large website indexing starting point
Before you put specific robots.txt instructions in place, create a Google Webmaster Tools account. Then wait some days, at least 1 week if not more, depending on how frequently spiders are visiting your website.
Here’s a tip.
Check your Google Cache – if the latest archive copy of your home page is older than 1 week then wait at least 1 month before implementing any modifications.
This is the only way you can document and observe the impact of changes.
After you implement your robots exclusion strategy you’ll see it at work in Google Webmaster Tools

(you’ll find this under the Diagnostics section of webmaster tools >> Crawl errors)
The site: reports fewer pages than expected
Now consider the other end of the spectrum: the site: command returns fewer pages than expected.
In this case your content is not entirely exposed, and one of the following factors is probably at work:
- Search Engines cannot reach the content
- Search Engines have reached the content but are not including it in the index
Search Engines Cannot Reach your Content
Start simple and pretend you are a search engine spider – use Lynx to surf your website and see just what the search engines are seeing: if your links are visible, the search engines can follow them; otherwise there could be a problem.
Links embedded in Flash animations and certain JavaScript links are among the most frequent examples of poor website architecture creating barriers between your content and the search engines.
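If you'd rather script the check, this sketch lists the plain `<a href>` links in a page, which is roughly what a text-only browser would let you follow. It's an approximation for spotting problems, not a description of how any search engine actually crawls. The sample markup and the `openProduct` function in it are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from ordinary <a> tags, skipping javascript: links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def visible_links(html):
    """Return the links a text-only client could follow in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up sample: one plain link, one javascript: link, one onclick handler.
sample = """
<a href="/widgets/blue.html">Blue widget</a>
<a href="javascript:openProduct(42)">Red widget</a>
<span onclick="location.href='/widgets/green.html'">Green widget</span>
"""

print(visible_links(sample))  # ['/widgets/blue.html']
```

Only the first widget is reachable through a plain link; the other two are the kind of barrier described above.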
Search Engines know of your content but haven’t included it
Yesterday it was a technical problem creating the barrier between you and your users. Today the search engines are willing to put that content under your prospects’ noses BUT at a price: everything has to be in tip-top shape, and you’re falling short of search engine quality standards.
Your pages are probably unique (each page describes a different widget) but you may discover all of them have the same <TITLE> tag and/or identical META DESCRIPTION tags. Google Webmaster Tools will help you spot these pages under
Diagnostics >> HTML suggestions

In 99 % of all cases examined large website indexing was unsuccessful because of these problems:
- duplicate <TITLE> tag
- duplicate META DESCRIPTION tag
Standard Content Management Systems sometimes do not allow webmasters to create unique tags for each page and this is a major obstacle between your content and users.
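You can also run this check yourself on pages you've already downloaded. The sketch below groups pages by `<title>` text and by META DESCRIPTION content and reports any value shared by more than one URL. The `pages` dictionary is made-up sample data, and how you fetch the HTML is left out on purpose.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadTags(HTMLParser):
    """Extract the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicates(pages):
    """pages maps URL -> HTML. Returns (duplicate titles, duplicate descriptions)."""
    by_title, by_description = defaultdict(list), defaultdict(list)
    for url, html in pages.items():
        tags = HeadTags()
        tags.feed(html)
        by_title[tags.title.strip()].append(url)
        by_description[tags.description].append(url)

    def shared(groups):
        # Keep only values used by more than one page.
        return {value: urls for value, urls in groups.items() if len(urls) > 1}

    return shared(by_title), shared(by_description)


# Made-up sample pages for illustration.
generic = ("<head><title>Widgets For You</title>"
           "<meta name='description' content='Buy widgets online'></head>")
pages = {
    "/widgets/blue.html": generic,
    "/widgets/red.html": generic,
    "/widgets/green.html": ("<head><title>Green widget</title>"
                            "<meta name='description' content='A green widget'></head>"),
}

dup_titles, dup_descriptions = find_duplicates(pages)
print(dup_titles)
print(dup_descriptions)
```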
So let’s wrap it up … for now
Go back to your website and do some hands-on research, get your hands dirty with this checklist:
- How many pages have been indexed ?
- How many pages are there in your website ?
- How many pages are there in each section ?
- How many pages of each section have been indexed ?
- Do you have an internal search engine ?
- If the answer is YES, then have those pages been indexed ?
- How many search pages have been indexed ?
- Which are the variables you can use to identify the search results pages being indexed ?
- Which are the filtering criteria required to prevent these pages from being crawled ?
- Are there pages with duplicate content issues ?
- Which pages present duplicate <TITLE> tags ?
- Which pages present duplicate META DESCRIPTION tags ?
- Is your CMS Search Engine Friendly ?
- Are you allowed to create unique META TAGS for each page ?
- What corrective actions are required which would allow you to create unique META TAGS for each page ?
- What is your current crawling rate prior to implementing any corrective actions ?
- What was the number the site: command returned before implementing corrective actions ?
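For the crawl-rate baseline, Webmaster Tools gives you the graph, but you can also get a rough figure from your own server logs. This is a different technique from the one described in this post: a sketch that counts requests per day whose log line mentions Googlebot, assuming an Apache-style access log with timestamps like `[12/Mar/2009:06:25:24 +0000]`. The log lines are invented sample data, and matching on the user-agent string is naive, since anyone can claim to be Googlebot.

```python
import re
from collections import Counter

# Matches the date part of a timestamp such as [12/Mar/2009:06:25:24 +0000]
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def daily_crawl_rate(log_lines, bot_token="Googlebot"):
    """Count log lines per day that mention the given bot token."""
    hits = Counter()
    for line in log_lines:
        if bot_token not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Invented sample log lines (Apache-style format assumed).
sample_log = [
    '192.0.2.10 - - [12/Mar/2009:06:25:24 +0000] "GET /product?id=101 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '192.0.2.10 - - [12/Mar/2009:06:25:30 +0000] "GET /product?id=102 HTTP/1.1" 200 4980 "-" "Googlebot/2.1"',
    '192.0.2.55 - - [12/Mar/2009:07:01:02 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [13/Mar/2009:05:10:00 +0000] "GET /product?id=103 HTTP/1.1" 200 5011 "-" "Googlebot/2.1"',
]

rate = daily_crawl_rate(sample_log)
for day, count in rate.items():
    print(day, count)
```

Record these figures before you change anything, so you have something to compare against later.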
See you here tomorrow. In my next post, Large Website Indexing: How to get Large Websites Indexed, Part 2, I’ll take you one step further in understanding how it can all work for you and considerably increase exposure of your content.

