
The Spider Trap: Fixing 12 Million Wasted Crawls

Not Indexed: 12.6M
Actually Indexed: 103K
Crawl Waste: 99.1%

Diagnosis: Crawl Budget Inefficiency

During an audit of a large enterprise retailer, we analyzed the ratio of 'Crawled' to 'Indexed' pages in Search Console. The discrepancy was stark: Google had crawled more than 12.6 million URLs but indexed only 103,000.

Of the millions of 'filtered' pages crawled, only 132 received organic search impressions in the past year. The rest represented wasted server resources and squandered crawl budget.

The Root Cause: Uncontrolled Facets

The site allowed users to refine product grids by multiple attributes. Each filter selection appended a new query parameter to the URL. Because parameters could be stacked in any order, this created a near-infinite number of unique URLs.

```text
https://site.com/category?color=red,blue,green
https://site.com/category?color=blue,red,green
https://site.com/category?color=green,blue,red
```
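
The scale of the problem follows directly from the math of ordered combinations. The sketch below is illustrative only: the attribute names and value counts are hypothetical, not taken from the client's catalog. It estimates how many distinct URLs a single category can emit when filter values can be both combined and reordered.

```typescript
// Illustrative only: counts how many distinct filter URLs a single category
// can generate when values can be combined AND listed in any order.
// Attribute names and value counts below are hypothetical.

function factorial(n: number): number {
  return n <= 1 ? 1 : n * factorial(n - 1);
}

// Ordered selections of k values out of n: n! / (n - k)!
function orderedSelections(n: number, k: number): number {
  return factorial(n) / factorial(n - k);
}

// Distinct values per filter attribute (hypothetical numbers).
const attributeSizes: Record<string, number> = {
  color: 12,
  size: 8,
  brand: 15,
};

let totalUrls = 0;
for (const [attr, n] of Object.entries(attributeSizes)) {
  // Every non-empty ordered selection of values yields its own URL,
  // e.g. ?color=red,blue vs ?color=blue,red.
  let perAttribute = 0;
  for (let k = 1; k <= n; k++) {
    perAttribute += orderedSelections(n, k);
  }
  console.log(`${attr}: ${perAttribute.toLocaleString()} URL variants`);
  totalUrls += perAttribute;
}
console.log(`Single-attribute variants alone: ${totalUrls.toLocaleString()}`);
// Stacking attributes on top of one another multiplies these counts further,
// which is how the crawlable URL space becomes effectively unbounded.
```

With just 12 color values, ordering alone produces over a billion single-attribute variants; combining attributes pushes the total far beyond anything a crawler should ever be asked to fetch.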

The Strategic Solution: The 'Fragments' Strategy

To curb crawl waste while preserving the filtering experience for users, we prescribed a two-tier defense strategy.

1. Immediate Triage: Robots.txt Blocking

We identified specific URL patterns generated exclusively by the filter engine (e.g., `filter.color`) and implemented strict Disallow rules to halt crawler access immediately.

```text
User-agent: *
Disallow: /*?filter.color=*
Disallow: /*facets=*
```
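
Before shipping rules like these, it is worth sanity-checking them against representative URLs. The sketch below is a minimal approximation of the documented wildcard semantics (`*` matches any sequence of characters, `$` anchors the end of the path), not a full robots.txt parser, and the test URLs are hypothetical.

```typescript
// Minimal sketch: translate Disallow patterns into regular expressions that
// approximate Googlebot's documented wildcard matching, then test sample paths.

function disallowPatternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, then re-expand '*' into '.*'.
  const escaped = body
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

const disallowRules = ["/*?filter.color=*", "/*facets=*"];
const regexes = disallowRules.map(disallowPatternToRegExp);

// Hypothetical URLs to sanity-check the rules against.
const testPaths = [
  "/category?filter.color=red,blue", // should be blocked
  "/category?facets=brand:acme",     // should be blocked
  "/category",                       // canonical page, should stay crawlable
  "/category?page=2",                // pagination, should stay crawlable
];

for (const path of testPaths) {
  const blocked = regexes.some((re) => re.test(path));
  console.log(`${blocked ? "BLOCKED" : "ALLOWED"}  ${path}`);
}
```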

2. The Long-Term Fix: URL Fragments

For the permanent architecture, we recommended migrating filter state from query parameters (`?`) to URL fragments (`#`). The fragment is never sent to the server and is ignored by search engine crawlers, so only the canonical category page is crawled while filtering continues to work for users.
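
A minimal front-end sketch of the approach follows. The helper names (`readFiltersFromFragment`, `writeFiltersToFragment`, `applyFilters`) are hypothetical placeholders for the site's real filtering code; the point is only that filter state lives after the `#`, which the browser never sends to the server.

```typescript
// Minimal client-side sketch of fragment-based filter state.
// Helper names are hypothetical placeholders for the site's real filtering code.

type FilterState = Record<string, string[]>;

// Parse "#color=red,blue&size=m" into { color: ["red", "blue"], size: ["m"] }.
function readFiltersFromFragment(): FilterState {
  const hash = window.location.hash.replace(/^#/, "");
  const state: FilterState = {};
  for (const [key, value] of new URLSearchParams(hash)) {
    state[key] = value.split(",").filter(Boolean);
  }
  return state;
}

// Write filter state back into the fragment. The canonical URL (e.g. /category)
// is all a crawler ever requests; the fragment never reaches the server.
function writeFiltersToFragment(state: FilterState): void {
  const params = new URLSearchParams();
  for (const [key, values] of Object.entries(state)) {
    if (values.length > 0) params.set(key, values.join(","));
  }
  const fragment = params.toString();
  history.replaceState(null, "", fragment ? `#${fragment}` : window.location.pathname);
}

// Stand-in for the site's grid renderer: re-render whenever filters change.
function applyFilters(state: FilterState): void {
  console.log("Re-rendering product grid with filters:", state);
}

window.addEventListener("hashchange", () => applyFilters(readFiltersFromFragment()));
applyFilters(readFiltersFromFragment()); // initial render on page load
```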
