Web Marketing & Search Engine Consultant

How to get Millions of Pages Indexed by the Search Engines – I

Large website indexing is a challenge. At one time, content managed via dynamic websites was confined to the so-called invisible web because of search engine technological limitations and shortcomings:

  • poor website architecture understanding and dynamic URL crawling
  • spider traps causing infinite loops within the website and consequent server crash

Today search engines have no problem crawling billions of pages, and dynamic URLs are no longer an obstacle: if you’ve got the content, the search engines are willing to crawl it … but

  • Is your content in the right format ?
  • Can the Search Engines manage their way through your website ?

For very large websites and portals going from this daily crawl rate …

spidering-1.png

to this one

Large Website Indexing - Website Diagram

can mean exposing thousands of additional pages for indexing and ultimately for consultation by the end user.

This post (and the next one) will show you where to start and what to do to get your large website indexed: it’s not rocket science, as you’ll see, but …

Before you start be warned: this is going to be a lengthy process that will take months before you see improvement. There’s no quick fix to getting millions of pages indexed by search engines.

So, are you ready ? Here we go …

Trial and error are going to be your allies

Whatever you pull out of your hat after reading this post isn’t going to work first time around in most cases, because you’re dealing with a living body (the website) that has undergone constant change, implemented by different people over the years, and you don’t have any documentation. I have yet to walk onto a project where there is documentation describing website architecture, database, URL structure … nobody does it. If your website has been online for some time there are going to be different layers of programming, coded by different people, all with different approaches and different skills.

Getting Started: What should you do first off ?

Put your content in order.

Run an inventory on your website. How many pages are there in your website ? You need a ballpark figure. Start by counting the number of products you host, then add the generic description pages.

Create a simple organizational chart identifying the sections your website can be broken down into, and map the variables that identify section, sub-section, product and any other detail or feature that distinguishes each page of a section.

Picture 5.png

What you’re actually dealing with is a database and structured queries identifying content organized in the shape and form of a page, looking something like this …

200908191503.jpg

How to run an inventory of your website

It’s simple – run an estimate of your products and all other related pages:

PRODUCT PAGES

20,000 products -> 20,000 pages

CATEGORY PAGES

3 categories with 10 products per page

1st category 3,000 products -> 300 pages

2nd category 5,000 products -> 500 pages

3rd category 12,000 products -> 1,200 pages

Consider an additional 10 generic pages

MAX NUMBER OF (EXPECTED) PAGES IN SEARCH ENGINE INDEX

20,000 (products) + 2,000 (category pages) + 10 (generic pages) = 22,010
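If you would rather let a script do the sums, here is a minimal Python sketch of the same estimate. The figures come from the example above; the category names and the function name are placeholders of mine, so swap in your own sections.

```python
import math

# Figures from the worked example above; category names are placeholders.
PRODUCTS_PER_LISTING_PAGE = 10
GENERIC_PAGES = 10
products_per_category = {
    "category_1": 3_000,
    "category_2": 5_000,
    "category_3": 12_000,
}


def expected_page_count(categories, per_page, generic_pages):
    """Estimate how many pages the site should expose to search engines."""
    # One detail page per product.
    product_pages = sum(categories.values())
    # Category listings are paginated; round up for a partly filled last page.
    category_pages = sum(math.ceil(n / per_page) for n in categories.values())
    return product_pages + category_pages + generic_pages


total = expected_page_count(
    products_per_category, PRODUCTS_PER_LISTING_PAGE, GENERIC_PAGES
)
print(total)  # 22010
```

Rounding up matters as soon as a category does not divide evenly: 11 products at 10 per page still need 2 listing pages.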

Compare your figures to those found on the search engines when you run a site: query.

You’ll need to use some of the advanced operators, especially inurl:

Example:

Let’s say you run an online widget store at

www.widgetsforyou.com
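To keep your counts organized, you can generate the queries you are going to type into the search engine, one for the whole domain and one per section. This is only a sketch: the URL fragments "category", "product" and "search" are hypothetical, so use whatever actually appears in your own URLs.

```python
def build_count_queries(domain, url_markers):
    """Return a site: query for the whole domain plus one per URL marker."""
    queries = [f"site:{domain}"]
    for marker in url_markers:
        # inurl: narrows the count to URLs containing the marker.
        queries.append(f"site:{domain} inurl:{marker}")
    return queries


# The markers below are assumptions for illustration only.
for query in build_count_queries(
    "www.widgetsforyou.com", ["category", "product", "search"]
):
    print(query)
```

Run each query by hand and write the reported number next to the figure from your inventory.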

You could be faced with one of the following scenarios:

  • The overall number of pages reported is less than expected
  • The overall number of pages reported is more than expected

Let’s say the site: command returns 210,275.

How can there possibly be more pages ?

The site: reports more pages than expected

There could be a number of adverse factors at work here, but in 9 out of 10 cases the search engines are spidering your internal search engine and indexing the results pages. You need to block this duplicate content out of the index: for starters, read this post on how to eliminate duplicate content, and then Google encountered an extremely high number of URLs on your site.
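Before you publish a robots.txt rule, it is worth checking it against a handful of real URLs. Here is a minimal sketch using Python’s standard urllib.robotparser module. It assumes your internal search results live under /search, which is a hypothetical path, so adapt the rule to the variables you found in your inventory. Keep in mind that the standard library parser does plain prefix matching, so wildcard rules are better tested in the search engine’s own tools.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: assumes internal search results live under /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="*"):
    """True if the rules above allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# A search results page should be blocked, a product page should not.
print(is_crawlable("http://www.widgetsforyou.com/search?q=blue+widget"))
print(is_crawlable("http://www.widgetsforyou.com/widgets/blue-widget.html"))
```

With these rules the first call reports the search page as blocked and the second reports the product page as crawlable.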

Create your large website indexing starting point

Before you put specific robots.txt instructions in place, create a Google Webmaster Tools account. Wait a few days, at least 1 week if not more, depending on how frequently spiders are visiting your website.

Here’s a tip.

Check your Google Cache – if the latest archive copy of your home page is older than 1 week then wait at least 1 month before implementing any modifications.

This is the only way you can document and observe the impact of changes.

After you implement your robots exclusion strategy you’ll see it at work in Google Webmaster Tools

Picture 1.png

(you’ll find this under the Diagnostics section of webmaster tools >> Crawl errors)

The site: reports fewer pages than expected

Now consider the other end of the spectrum: the site: command returns fewer pages than expected.

In this case your content is not entirely exposed, and one of the following factors is probably at work:

  • Search Engines cannot reach the content
  • Search Engines have reached the content but are not including it in the index

Search Engines Cannot Reach your Content

Start simple and pretend you are a search engine spider: use Lynx to surf your website and see just what the search engines are seeing. If your links are visible, the search engines can follow them; otherwise there could be a problem.

Links embedded in Flash animations and certain JavaScript links are among the most frequent examples of poor website architecture creating barriers between your content and the search engines.
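If you don’t have Lynx at hand, a few lines of Python give you a rough idea of which links exist as plain HTML. This is a sketch of mine, not a model of how any search engine actually crawls: it simply lists ordinary anchor links and ignores everything that needs JavaScript to work. The sample markup and the names are made up for illustration.

```python
from html.parser import HTMLParser


class PlainLinkExtractor(HTMLParser):
    """Collect href values from ordinary <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip javascript: pseudo-links; a text browser cannot follow them.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def crawlable_links(html):
    """Return the plain links found in an HTML string."""
    extractor = PlainLinkExtractor()
    extractor.feed(html)
    return extractor.links


# Made-up sample: one plain link and two that depend on JavaScript.
SAMPLE_HTML = """
<a href="/widgets/blue-widget.html">Blue widget</a>
<a href="javascript:openProduct(42)">Red widget</a>
<span onclick="location.href='/widgets/green-widget.html'">Green widget</span>
"""

print(crawlable_links(SAMPLE_HTML))
```

Only the blue widget link is reported here; the red and green widgets are reachable by a person with a mouse, but not by anything reading the plain markup.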

Search Engines know of your content but haven’t included it

Yesterday it was a technical problem creating the barrier between you and your users; today the search engines are willing to put that content under your prospects’ noses, BUT at a price: everything has to be in tip-top shape, and right now you’re falling short of search engine quality standards.

Your pages are probably unique (each page describes a different widget) but you may discover that all of them have the same <TITLE> tag and/or identical META DESCRIPTION tags. Google Webmaster Tools will help you spot these pages under

Diagnostics >> HTML suggestions

Picture 2.png

In 99% of all cases examined, large website indexing was unsuccessful because of these problems:

  • duplicate <TITLE> tag
  • duplicate META DESCRIPTION tag

Standard Content Management Systems sometimes do not allow webmasters to create unique tags for each page, and this is a major obstacle between your content and users.
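You can also look for these duplicates yourself on a sample of pages. The sketch below assumes you already have the HTML of each page in hand (how you fetch it is up to you); it groups pages by <TITLE> and by META DESCRIPTION and reports any value shared by two or more URLs. The class name, function name and sample pages are mine, for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadTagParser(HTMLParser):
    """Pull the <title> text and the meta description out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicates(pages):
    """pages maps URL -> HTML. Returns {field: {value: [urls]}} for shared values."""
    seen = {"title": defaultdict(list), "description": defaultdict(list)}
    for url, html in pages.items():
        parser = HeadTagParser()
        parser.feed(html)
        seen["title"][parser.title.strip()].append(url)
        seen["description"][parser.description].append(url)
    # Keep only non-empty values used by two or more pages.
    return {
        field: {v: urls for v, urls in values.items() if v and len(urls) > 1}
        for field, values in seen.items()
    }


# Made-up sample pages for illustration.
pages = {
    "/widgets/blue.html": '<title>Widgets For You</title>'
                          '<meta name="description" content="Buy widgets online">',
    "/widgets/red.html": '<title>Widgets For You</title>'
                         '<meta name="description" content="Red steel widget">',
    "/widgets/green.html": '<title>Green Widget</title>'
                           '<meta name="description" content="Buy widgets online">',
}
duplicates = find_duplicates(pages)
print(duplicates["title"])
print(duplicates["description"])
```

In this sample the blue and red pages share a title, while the blue and green pages share a description.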

So let’s wrap it up … for now

Go back to your website and do some hands-on research. Get your hands dirty with this checklist:

  • How many pages have been indexed ?
  • How many pages are there in your website ?
  • How many pages are there in each section ?
  • How many pages of each section have been indexed ?
  • Do you have an internal search engine ?
  • If the answer is YES, then have those pages been indexed ?
  • How many search pages have been indexed ?
  • Which are the variables you can use to identify the search results pages being indexed ?
  • Which are the filtering criteria required to prevent these pages from being crawled ?
  • Are there pages with duplicate content issues ?
  • Which pages present duplicate <TITLE> tags ?
  • Which pages present duplicate META DESCRIPTION tags ?
  • Is your CMS Search Engine Friendly ?
  • Are you allowed to create unique META TAGS for each page ?
  • What corrective actions are required which would allow you to create unique META TAGS for each page ?
  • What is your current crawling rate prior to implementing any corrective actions ?
  • What was the number the site: command returned before implementing corrective actions ?
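Several of the questions above come down to sorting a list of URLs into buckets. Here is a minimal sketch, assuming you have a list of indexed or crawled URLs from your own logs or tools, that the first folder in the path identifies the section, and that internal search pages carry a "q" query variable. All three are assumptions of mine; replace them with the variables you mapped in your inventory.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical: the query variable your internal search engine uses.
SEARCH_PARAM = "q"


def classify(url):
    """Label a URL as internal search, a named section, or a generic page."""
    parsed = urlparse(url)
    if SEARCH_PARAM in parse_qs(parsed.query, keep_blank_values=True):
        return "internal-search"
    segments = [s for s in parsed.path.split("/") if s]
    # Assumption: the first folder names the section; top-level pages are generic.
    return segments[0] if len(segments) > 1 else "generic"


def tally(urls):
    """Count URLs per bucket."""
    return Counter(classify(u) for u in urls)


# Made-up sample URLs for illustration.
urls = [
    "http://www.widgetsforyou.com/widgets/blue-widget.html",
    "http://www.widgetsforyou.com/widgets/red-widget.html",
    "http://www.widgetsforyou.com/gadgets/page/2",
    "http://www.widgetsforyou.com/search?q=blue",
    "http://www.widgetsforyou.com/search?q=red&page=3",
    "http://www.widgetsforyou.com/about.html",
]
for bucket, count in tally(urls).most_common():
    print(bucket, count)
```

The "internal-search" bucket tells you how many search results pages you need to filter out, and the per-section counts can be compared with your inventory.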

See you here tomorrow. In my next post, Large Website Indexing: How to get Large Websites Indexed, Part 2, I’ll take you one step further in understanding how it can all work for you and considerably increase exposure of your content.
