The Search Engineering Dictionary
The vocabulary of SEO has evolved from "keywords and links" to embeddings, retrieval, and agents.
These are the concepts that matter for how search and AI discovery actually work now.
Ranking Systems & Signals
- Information Gain Score
- A Google scoring system (patent US11354342B2) that measures how much novel content a document contains relative to what already exists in the result set. Documents that say the same thing as other ranking pages score near zero and can be demoted or excluded entirely.
- Site Quality Score (Panda)
- A site-level quality signal (patent US9031929B1) calculated from the ratio of navigational queries directed at the site vs. informational queries the site answers. A low score suppresses all pages from the domain — thin or repetitive pages across a programmatic build drag down the entire domain's ranking ceiling. The 2024 API leak revealed pandaDemotion as a pre-computed site-wide demotion in CompressedQualitySignals, plus babyPanda variants (lighter/faster iterations). Panda operates as 'algorithmic debt' — a site-level quality tax that functions as a ceiling no amount of page-level optimization can overcome.
- Pairwise Quality Scoring
- Pages are ranked via head-to-head comparison against other pages competing for the same query — not on an absolute quality score. An objectively strong page can rank #7 while a weaker page ranks #1 because the weaker page wins the pairwise comparison on specific signals (links, entity coverage, click data).
- Content Freshness Scoring
- A document scoring system (patent US8549014B2) that tracks the age distribution of content within a page — how much is old vs. recently added. The 2024 API leak revealed lastSignificantUpdate (tracking substantive revisions, not cosmetic edits), freshByDocFp (document fingerprinting that detects whether actual content changed vs. just timestamps), and bylineDateConfidence (confidence score for byline date accuracy — contradictory dates degrade the freshness signal). Google stores only the last 20 versions of a document. Cosmetic date changes without substantive edits do not improve freshness scores.
- N-gram Quality Prediction
- A quality detection system (patent US9767157B2) that builds a phrase model from sites of known quality and scores new content against it. Creates a linguistic fingerprint of what quality writing looks like. The patent-level mechanism behind detecting thin content, keyword stuffing, and machine-generated text with unnatural phrase distributions.
- Entity-Based Ranking
- A ranking system (patent US10235423B2) that identifies entities in search results via a knowledge graph, assigns weights by entity type (person, place, organization, product), and computes a composite ranking score. Being recognized as a distinct entity is a direct ranking input.
- Core Web Vitals (LCP, INP, CLS)
- Google's performance metrics for load speed (Largest Contentful Paint), interactivity (Interaction to Next Paint), and visual stability (Cumulative Layout Shift). A ranking signal and direct measure of user experience quality.
- Engagement Signal
- Behavioral metrics like dwell time, scroll depth, and repeat visits. NavBoost uses these per-topic to re-rank results. Can also feed back into RLHF loops for AI systems.
- Passage-Level Ranking
- Google evaluates and ranks individual passages within documents independently (patent US20090055389A1). Sections compete on their own merits — author popularity, word choice, passage diffusion, and whether the passage adds unique information. A single well-written section can surface even when the broader document isn't the best overall match.
- Time on Task
- Measures how long users take to complete a goal, or the percentage who succeed. A finer-grained signal than generic dwell time.
- Topic Centroid
- The average embedding vector computed from a site's keyword portfolio, representing its semantic center of gravity. Used to measure how well individual pages align with the site's core topics and to identify off-topic pages dragging down site-level quality scores.
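The centroid computation is simple enough to sketch. A minimal pure-Python version, using toy 3-dimensional vectors (real embedding models emit hundreds or thousands of dimensions; the numbers here are invented for illustration):

```python
from math import sqrt

def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of the site's page/keyword embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy 3-D embeddings; a real portfolio would use the site's keyword vectors.
site_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.15, 0.05]]
c = centroid(site_vectors)

on_topic  = cosine([0.9, 0.1, 0.0], c)   # page near the semantic center
off_topic = cosine([0.0, 0.1, 0.9], c)   # page drifting off-topic
```

Pages whose similarity to the centroid falls below a chosen threshold become candidates for pruning or consolidation.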
- contentEffort
- An attribute from the 2024 Google API leak described as an LLM-based effort estimation for article pages. May quantify human labor, originality, and resources invested in creating content — including unique images, original data, embedded tools, and linguistic complexity. If the leak is interpreted correctly, this could be the closest algorithmic proxy for the Experience dimension of E-E-A-T, though how Google currently weights it (or whether it remains active) is unknown.
- Q* (Quality Star)
- Google's aggregate site/document quality metric, revealed through the DOJ antitrust trial and 2024 API leak. Combines content quality, authority signals, and an evolved PageRank that measures 'distance from a known good source.' Q* is largely static and query-independent — a high Q* score applies across all topics for a domain. Described as 'deliberately engineered rather than machine-learned.' The relationship: E-E-A-T is the goal, Q* is the system, Site_Quality is the score.
- CompressedQualitySignals
- The pre-computed quality gatekeeper module in Google's ranking pipeline. Contains per-document signals — siteAuthority, pandaDemotion, navDemotion, anchorMismatchDemotion, exactMatchDomainDemotion, and others — that can disqualify a page before query-time ranking even begins. A poor CompressedQualitySignals score means no amount of on-page optimization can compensate. This is the mechanism behind why site-level quality acts as a ceiling.
- Twiddler
- Specialized re-ranking functions in Google's SuperRoot framework that adjust results after the primary ranking algorithm (Ascorer) runs. Named Twiddlers include NavBoost (click-based), QualityBoost, FreshnessBoost, RealTimeBoost, and WebImageBoost. 'Lazy Twiddlers' only process the top 20-30 results with granular adjustments. When Google says a system is 'not part of the core algorithm,' it often means it operates as a Twiddler — a post-ranking overlay, not a primary scoring input.
- Index Tiering (Base / Zeppelins / Landfills)
- Google's three-tier document storage system. Base (flash memory) holds the most important, frequently updated content. Zeppelins (SSDs) hold mid-priority pages. Landfills (hard drives) hold low-importance, irregularly updated content. Tier placement determines crawl frequency and serving speed — and directly affects outgoing link value, since links from Base-tier pages carry more signal than links from Landfill-tier pages.
- Copia / Firefly
- Google's scaled content abuse detection system, revealed in the 2024 API leak. Copia (Latin for 'abundance') monitors content velocity — the ratio of URLs generated against substantive articles produced. Firefly aggregates inputs from Copia, page quality scores, and NavBoost user dissatisfaction signals to make site-wide demotion decisions. This is Google's primary mechanism against AI-generated content farms — there is no single 'AI content detector,' but rapid content velocity combined with low engagement and low quality triggers Firefly.
- hostAge (Sandbox)
- A PerDocData attribute from the 2024 API leak explicitly described as used 'to sandbox fresh spam in serving time.' Represents the earliest date Google first encountered any page on a domain. Google has publicly denied the existence of a sandbox for new sites; the leak directly contradicts this. New domains face a trust-building period regardless of content quality — inherent disadvantage against sites with 13 months of accumulated NavBoost click signals.
Retrieval & Discovery
- Canonical Tag
- An HTML link element (rel="canonical") that identifies the master version of a page, preventing duplicate URLs from diluting authority.
- Robots.txt / Robots Meta
- Rules that allow or block crawling and indexing, managing crawl budget and LLM ingestion cost.
- Sitemap / RSS / Atom Feed
- XML files listing URLs and timestamps so crawlers and RAG sync jobs know what's new or updated.
- Query Fan-Out
- Google's AI systems decompose a single query into multiple sub-queries across eight variant types (equivalent, follow-up, generalization, specification, canonicalization, translation, entailment, clarification), run them in parallel, then stitch results together. Gemini 3 averages 10.7 sub-queries per prompt (78% increase over Gemini 2.5), and 95% of fan-out queries have zero traditional search volume. Fan-out coverage has a 0.77 Spearman correlation with AI citation likelihood.
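The shape of a fan-out is easy to picture as data. The variant labels below follow the eight types named above, but the sub-queries themselves are invented for illustration, not taken from any real system:

```python
# One user prompt decomposed into parallel sub-queries, each tagged with
# its fan-out variant type. All example queries are hypothetical.
fan_out = {
    "query": "Is NYC affordable?",
    "sub_queries": [
        ("specification",  "cost of living in Brooklyn"),
        ("equivalent",     "how expensive is New York City"),
        ("generalization", "most affordable large US cities"),
        ("follow-up",      "average rent in Manhattan"),
        ("entailment",     "NYC salary needed to live comfortably"),
    ],
}

# Each sub-query retrieves documents independently before results are stitched.
variant_types = {variant for variant, _ in fan_out["sub_queries"]}
```

Content that only answers the head query misses every sub-query retrieval pass; covering the variant space is what the 0.77 correlation with citation likelihood rewards.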
- Sub-Query
- A derivative search created during fan-out (e.g., "cost of living in Brooklyn" from "Is NYC affordable?"). Each sub-query feeds fresh documents into the retrieval pipeline independently.
- Thematic Search
- Fan-out groups results by themes, lets an LLM summarize each theme, and then composes the final answer — automated topic-cluster navigation built into the SERP.
- Cosine Similarity
- The mathematical distance between a query's embedding and your content's embedding in vector space. Content selection for AI grounding operates via embedding distance, not keyword matching — if content doesn't land close to query embeddings, it won't be selected regardless of traditional ranking.
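The metric itself is one line of math. A self-contained sketch with invented 3-dimensional vectors (real embeddings are far higher-dimensional):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.2, 0.8, 0.1]    # toy query embedding
close = [0.25, 0.75, 0.1]  # content near the query in vector space
far   = [0.9, 0.05, 0.4]   # content on a different topic
```

Here `cosine_similarity(query, close)` exceeds `cosine_similarity(query, far)`: the closer vector wins selection regardless of either page's traditional ranking signals.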
- Embedding Feed
- A scheduled job pushing new content as embeddings to a vector store, ensuring freshness for RAG assistants.
- Zero-Shot Retrieval
- The ability of a retriever to surface your page for queries the model never saw during training.
- Retrieval-to-Citation Drop-off
- The gap between being retrieved by an AI system and being cited in its final answer. AI platforms retrieve far more pages than they ultimately cite — the vast majority of retrieved pages never appear in the response. Being discovered is a prerequisite, not a guarantee; citation selection depends on title-query alignment, content clarity, readability, and query-type dynamics. The drop-off varies by intent: product-discovery and how-to queries convert at higher rates than comparison or validation queries.
- Persistent ID (GUID / GTIN / ISBN)
- A globally unique identifier letting different datasets resolve the same entity without confusion.
- Data Contract
- An explicit schema (often versioned) that defines the fields an API or feed will always supply.
- Content Delivery Network (CDN)
- A globally distributed cache that serves APIs and structured data closer to users and crawlers, boosting speed and uptime.
- API Versioning
- Labeling API releases (v1, v2…) so consumers — including LLMs — can rely on stable fields while you evolve the contract.
- Schema Markup Validator
- Tools (e.g., Google Rich Results Test) that check JSON-LD syntax and completeness, lowering Schema Error Rate.
- Schema Error Rate
- The percentage of URLs whose JSON-LD fails validation. High error rates hide facts from crawlers and embedding pipelines.
- Accessibility & Semantic HTML (WCAG)
- Proper headings, ARIA roles, alt text, and WCAG compliance so humans and multimodal agents can parse content.
- Schema.org Markup
- JSON-LD tags that declare entities (products, authors, FAQs, etc.) in a format search engines and LLM scrapers can parse.
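A minimal example of emitting a JSON-LD block. The product and its field values are hypothetical, but the @context/@type structure follows the public schema.org vocabulary:

```python
import json

# Hypothetical product; property names follow the schema.org Product type.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "sku": "EX-123",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

# Embedded in the page <head> so crawlers and LLM scrapers can parse it.
script_tag = f'<script type="application/ld+json">{json.dumps(product, indent=2)}</script>'
```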
- API Uptime
- The percentage of successful (200) responses from your data endpoints. Downtime means missing citations from AI systems and crawlers.
- API Latency
- The round-trip time of an API call. High latency degrades real-time chat or agent experiences.
- Embedding Sync Lag
- The time between publishing content and its appearance in your vector store, measured in minutes or hours.
- Interactive Tool / Calculator
- On-page widgets or callable APIs that let users compute something instantly, boosting engagement and task completion.
- Semantic Unit
- A 50-150 word content block capturing a single concept with explicit subject-predicate-object structure. The atomic unit of passage-level optimization. Splitting combined paragraphs into focused semantic units measurably improves cosine similarity, and adding proper headers compounds the gain.
- Model Context Protocol (MCP)
- An open standard (Anthropic, November 2024) for connecting AI systems to external tools, data sources, and systems. Adopted by OpenAI across all products including ChatGPT desktop. The de facto agent-to-tool layer — enables AI agents to call APIs, query databases, and execute actions through a unified interface.
- NLWeb
- A Microsoft protocol (co-developed with Schema.org co-founder R.V. Guha) that turns any website into an AI-queryable interface using existing Schema.org, RSS, and structured data. Every NLWeb instance is also an MCP server. Live on TripAdvisor and O'Reilly Media. Yoast integrated NLWeb into its Schema Aggregation feature, creating site-wide endpoints so AI agents can understand an entire site without page-by-page crawling.
- llms.txt
- A proposed plain-text file (modeled on robots.txt) intended to help LLMs understand a site's structure and content. Adoption is widespread, but no major AI platform has confirmed using it in retrieval, and Google's John Mueller confirmed it does not influence search rankings or AI Overview citations. Currently aspirational infrastructure, not a retrieval signal.
AI & Generative Search
- AI Overview (AIO)
- Google/Bing chat-style summary boxes that answer queries directly, usually citing a handful of URLs. Prevalence varies heavily by vertical and query type. Studies consistently show substantial organic CTR declines when AIOs appear, though being cited in an AIO significantly outperforms non-cited pages.
- AI Mode
- Google's conversational search surface, accessible as a dedicated tab. Studies indicate an extremely high zero-click rate — the most click-suppressive surface Google has launched. Queries tend to be significantly longer than traditional search. Despite high semantic similarity with AIO answers, citation overlap is very low — AI Mode and AIO are parallel retrieval systems, not the same system in different formats.
- Google Web Guide
- A Search Labs experiment (beta July 2025, still opt-in as of March 2026) that uses a custom Gemini model to decompose queries via fan-out and organize web results into thematic subtopic clusters instead of a ranked list. Unlike AIO (which absorbs clicks by synthesizing answers), Web Guide redistributes attention across curated source groups — structurally a traffic redistributor, not a traffic absorber. No empirical CTR data exists yet.
- Grounding Budget
- The total text an AI search system receives to synthesize an answer — roughly 2,000 words of context per query. The top-ranked source receives a disproportionate share; lower-ranked sources receive progressively less. Grounding plateaus at a few hundred words per source regardless of original page length — the strongest empirical argument for density over length.
- Content Survival Rate
- The percentage of a page's content that makes it into AI citations. Only a fraction of any page's content survives the grounding pipeline. Pages dense with service specifics, pricing, and process detail retain far more than pages front-loaded with promotional language — the AI pipeline is an aggressive filter that retains factual, task-relevant information and discards everything else.
- Extractive Summarization (Grounding)
- Google's grounding pipeline extracts exact sentences from source pages, not paraphrases. Query-focused selection with heavy lead/positional bias — opening paragraphs are extracted near-wholesale. Every sentence must function as a standalone extractable claim; pronouns and anaphora create extraction failures.
- Zero-Click Search
- Searches ending without a click to the open web. The majority of Google searches are now zero-click, and the rate is rising. A double headwind: search volume per capita is declining while the share of searches that produce no outbound click is increasing.
- Retrieval-Augmented Generation (RAG)
- A workflow that retrieves documents (often via a vector store) and feeds them into the LLM so it cites those facts instead of hallucinating. The architecture behind AI Overviews, ChatGPT web search, and Perplexity.
- Answer Snippet Engineering
- Crafting concise, self-contained paragraphs or bulleted answers so AIOs and voice assistants can quote you verbatim. Snippets are tightly length-constrained. Customer-centric language and clear value propositions are selected more frequently.
- Citation (in LLM output)
- An inline reference the LLM attaches to a statement — often a URL — that lets users verify the fact. AIO citations are highly volatile — the majority of cited pages turn over within a few months, meaning citation presence requires ongoing optimization, not one-time positioning.
- Content Licensing
- Deals (e.g., Reddit ↔ OpenAI) that feed your data into model training or private RAG stores.
- AI Agent / Autonomous Agent
- An LLM-powered system that chains tools and decisions to complete multi-step tasks with minimal human input.
- Multimodal Search
- Search results blending text, images, video, and voice. LLMs route queries to the best modality or combine several.
- Retrieval Weighting
- The scoring logic (similarity × recency × authority, etc.) used to rank documents returned to the LLM during RAG.
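A sketch of what such a composite score can look like. The multiplicative form and the 90-day half-life are assumptions chosen for illustration, not any platform's documented formula:

```python
from math import exp

def retrieval_score(similarity: float, age_days: float, authority: float,
                    half_life_days: float = 90.0) -> float:
    """Illustrative composite: similarity x recency decay x authority."""
    recency = exp(-age_days / half_life_days)  # newer documents decay less
    return similarity * recency * authority

# Hypothetical candidate documents for one query.
docs = [
    {"url": "/fresh-guide", "sim": 0.82, "age": 10,  "auth": 0.6},
    {"url": "/old-pillar",  "sim": 0.85, "age": 400, "auth": 0.9},
]
ranked = sorted(docs, reverse=True,
                key=lambda d: retrieval_score(d["sim"], d["age"], d["auth"]))
```

Under this weighting the fresher page outranks the more authoritative one despite slightly lower similarity, which is the kind of trade-off retrieval weighting encodes.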
- User Feedback Loop
- Thumbs-up/down or rating data on AI answers used to fine-tune future ranking or generation behavior.
- Tool Usage Rate
- The percentage of chat sessions where an agent invokes your calculator or API — a proxy for experience depth in AI contexts.
- Fraggle
- A fragmented passage extracted from a page and surfaced independently in AI Overviews — a short, discrete answer segment that functions as its own retrieval unit. The vast majority of AI Overview citations come from deep interior pages, not homepages, often via fraggle extraction.
- Agentic Commerce Protocol (ACP / UCP)
- Open protocols enabling AI agents to complete purchases within chat interfaces. OpenAI's ACP (with Stripe) is live in ChatGPT for 1M+ Shopify merchants. Google's UCP (with Shopify, Walmart, Target) covers discovery through post-purchase support. Structured product data is no longer markup for rich results — it's an API surface for agent-mediated transactions.
- Selection Rate (SR)
- The frequency at which AI systems incorporate a specific source from available grounding results, expressed as (selections / total available results) × 100. The generative AI equivalent of CTR. Primarily influenced by primary bias — the model's pre-training confidence in a brand's relevance — rather than page-level content signals.
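The definition above translates directly into code:

```python
def selection_rate(selections: int, available: int) -> float:
    """(selections / total available grounding results) x 100."""
    return selections / available * 100

# A source chosen 12 times out of 48 available grounding results:
sr = selection_rate(12, 48)  # 25.0
```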
- Ghost Citation
- When an AI system cites a source's content as evidence but does not mention the brand by name in its recommendation. Research shows 73% of AI brand presence consists of ghost citations, where brands supply evidence but competitors receive the recommendation. Being cited and being recommended require separate optimization strategies.
- Generative Intent
- A query category unique to AI search platforms where users ask the system to create, draft, or generate something rather than retrieve existing information ("write me a cover letter," "draft a meal plan"). Accounts for 37.5% of ChatGPT prompts and has no equivalent in traditional search intent taxonomy.
- Web Text Fragment (#:~:text=)
- A URL fragment directive (WICG Text Fragments spec) that scrolls to and highlights specific text on a page. Google's AI Mode and Gemini embed these in citation URLs, encoding the exact sentence selected for grounding — enabling sentence-level reverse-engineering of citation behavior. Fragment format: #:~:text=[prefix-,]textStart[,textEnd][,-suffix]. Shashko (2026) decoded 11,672 fragments to produce the first sentence-level AI citation study.
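Building such a URL is straightforward. A minimal helper covering the textStart and optional textEnd parts of the directive (the page URL and highlighted text are invented):

```python
from urllib.parse import quote

def text_fragment_url(page_url: str, text_start: str, text_end: str = "") -> str:
    """Append a WICG text-fragment directive: #:~:text=textStart[,textEnd]."""
    fragment = quote(text_start, safe="")
    if text_end:
        fragment += "," + quote(text_end, safe="")
    return f"{page_url}#:~:text={fragment}"

url = text_fragment_url("https://example.com/guide", "grounding budget")
# -> https://example.com/guide#:~:text=grounding%20budget
```

Harvesting these fragments from AI Mode citation URLs reveals exactly which sentences the model selected for grounding.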
LLM Fundamentals
- Large Language Model (LLM)
- A neural-network model (e.g., GPT-5, Gemini) trained on massive text corpora and capable of predicting tokens, answering questions, and following instructions.
- Token
- The smallest unit an LLM processes: roughly a word, word fragment, or punctuation mark. Costs, context limits, and output length are all measured in tokens.
- Embedding / Vector Embedding
- A fixed-length list of numbers that captures a text's meaning so similar texts sit close together in multi-dimensional space.
- Contextual Embedding
- An embedding generated with awareness of the surrounding document, not just the target passage. Resolves ambiguity when a paragraph's meaning depends on its page context — e.g., 'their pricing model' is meaningless without knowing which company the page discusses. Perplexity's pplx-embed-context-v1 is the first major open-source implementation, outperforming prior contextual models by 2-10 percentage points on ConTEB benchmarks.
- Bidirectional Encoder
- A text model that processes tokens with attention to both preceding and following context, unlike causal (left-to-right) language models. BERT pioneered the approach; modern retrieval embeddings (pplx-embed, gte-Qwen) convert causal decoder models into bidirectional encoders to produce richer text representations for search.
- Vector Store / Vector DB
- A database (e.g., Pinecone, Supabase, Elasticsearch KNN) optimized for storing embeddings and running "nearest-vector" queries.
- Context Window
- The maximum token count an LLM can "remember" per interaction (prompt + response). Governs chunk sizing in RAG.
- Chunking / Text Splitting
- The process of splitting longer content into smaller, coherent segments before creating embeddings. Effective chunking follows natural content boundaries (e.g., headings and paragraphs) so each vector corresponds to a complete idea that AI systems can reliably retrieve as context.
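A minimal boundary-respecting splitter. The 600-character budget is an arbitrary illustration; real pipelines size chunks in tokens against the embedding model's limits:

```python
def chunk_by_paragraph(text: str, max_chars: int = 600) -> list[str]:
    """Greedily pack whole paragraphs into chunks so each embedding
    covers a complete idea rather than an arbitrary slice of text."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)       # close the full chunk
            current = para               # start a new one at the boundary
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Splitting mid-sentence or mid-paragraph produces vectors that represent half an idea, which is why boundary-aware chunking retrieves more reliably.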
- Prompt / System Message
- Instructions prepended to the user prompt that set tone, policy, or formatting rules for an LLM conversation.
- Temperature
- A generation parameter (0-2) controlling randomness: lower = deterministic, higher = creative.
- Hallucination
- An LLM answer that sounds plausible but is factually wrong because the model filled gaps with guesswork.
- Reinforcement Learning from Human Feedback (RLHF)
- A training loop that uses human ratings of model outputs as rewards, aligning the LLM with helpful, harmless responses.
- Fine-tuning / Supervised Fine-tuning (SFT)
- Further training a pre-trained LLM on a smaller, task-specific dataset to adopt domain language, style, or private knowledge.
- Multimodal LLM
- A model that both consumes and produces text plus other media (images, audio, video), enabling richer search and UX.
- Function Calling
- When an LLM outputs a JSON payload instructing software to run a tool (e.g., calculateSavingsRate) mid-conversation.
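The mechanic in miniature. The JSON envelope below is a simplified stand-in (each provider wraps tool calls differently), and calculateSavingsRate is the hypothetical tool named above:

```python
import json

# Simplified model output requesting a tool call (envelope varies by provider).
model_output = '{"name": "calculateSavingsRate", "arguments": {"income": 6000, "expenses": 4500}}'

def calculate_savings_rate(income: float, expenses: float) -> float:
    return (income - expenses) / income * 100

call = json.loads(model_output)
if call["name"] == "calculateSavingsRate":
    result = calculate_savings_rate(call["arguments"]["income"],
                                    call["arguments"]["expenses"])
    # The result is fed back to the model, which continues the conversation.
```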
- Prompt Engineering
- The craft of writing clear, constrained prompts (plus examples) to steer LLM outputs toward desired style and accuracy.
- Zero-/Few-/Multi-Shot Prompting
- Supplying zero, a few, or many examples in the prompt to guide model reasoning and relevance.
- Cost per Token
- What you pay an LLM provider for each input/output token. Crucial for budgeting large-scale RAG or generation.
- Hallucination Rate
- The share of AI answers where the LLM asserts unverified or false information. Must be monitored when AI-generated content or AI-assisted tools are part of the product.
Diagnostic Framework
These concepts map to the Clinical Retrieval & Ranking Framework.
- Binding Constraint
- The first layer a site cannot pass in the diagnostic framework. Everything downstream is irrelevant until the constraint is cleared. The most common form of SEO capital destruction: optimizing Layers 4-7 while the site has an unresolved Layer 2 access problem.
- Evidence-Builder Loop
- Win achievable queries first to build authority priors, then use those priors to compete for harder queries. Topical authority measurably accelerates traffic acquisition. Sequencing matters: authority compounds on earlier wins.
- Ceiling vs. Weight
- Layer 1 (Eligibility) sets the ranking ceiling — the maximum achievable position given domain authority, penalties, and YMYL risk. Layer 7 (Competition) determines the weight needed to reach that ceiling — authority gaps, SERP feature concentration, and differentiation. Misdiagnosing a ceiling problem as a weight problem wastes capital on content and links that can never rank.
- Investment Screen
- A three-level strategic filter that runs before the diagnostic: (1) Channel qualification — is organic search the right channel? (2) Page category allocation — right page type for this business model? (3) Query-level expected value — does the return justify the investment? Each screen must clear before the next runs.
Strategy & Architecture
- Aggregator vs. Integrator
- The strategic archetype that determines which SEO levers exist. Aggregator SEO is product-led, inventory/UGC-driven, with SEO as the primary growth channel — wins on data scale (cost leadership). Integrator SEO is marketing-led, company-created content, with SEO as a supporting channel — wins on content quality (differentiation). Programmatic SEO architecture is fundamentally an aggregator play: the data asset is the product.
- Product-Led SEO
- A term coined by Eli Schwartz: SEO treated as a product experience rather than a traffic channel. The core principle is to build the product in the form search algorithms reward. The rendered data is the product optimized for search — directly applicable to programmatic architecture where template pages surface structured data at scale.
- Search TAM
- Total Addressable Market sizing applied to organic search: TAM (all search volume in category), SAM (queries you could realistically target), SOM (queries you can capture given current resources). Standard keyword-volume forecasting is highly unreliable — scenario planning with stage gates outperforms point estimates.
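The narrowing is simple funnel arithmetic. The volume and share figures below are invented to show the shape, not benchmarks:

```python
def search_funnel(tam_volume: int, sam_share: float, som_share: float) -> dict:
    """TAM -> SAM -> SOM: each stage is a fraction of the one above."""
    sam = round(tam_volume * sam_share)
    som = round(sam * som_share)
    return {"TAM": tam_volume, "SAM": sam, "SOM": som}

# 1M monthly category searches; 40% realistically targetable; 15% capturable.
funnel = search_funnel(1_000_000, 0.40, 0.15)
```

Because keyword-volume forecasting is unreliable, running this with pessimistic, base, and optimistic shares (scenario planning) beats defending any single point estimate.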
- Data Moat
- A competitive advantage built from proprietary data that creates self-reinforcing cycles: more data produces better products, which attract more users, which generate more data. Requires 2-3 years of consistent investment before delivering significant advantages. Zillow, TripAdvisor, and NerdWallet all built organic moats through proprietary data plus template infrastructure, not editorial volume.
- Content Half-Life
- The time it takes for a piece of content to lose half its organic visibility. Has compressed significantly for competitive topics. Refreshing legacy content consistently outperforms exclusive focus on new production. Strategy must allocate a meaningful share of content budget to maintenance.
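Modeled as exponential decay (an illustrative assumption; real decay curves vary by topic and vertical), visibility halves every half-life period:

```python
def remaining_visibility(initial_clicks: float, age_days: float,
                         half_life_days: float) -> float:
    """Visibility left after age_days under simple exponential decay."""
    return initial_clicks * 0.5 ** (age_days / half_life_days)

# A page earning 1,000 monthly clicks with a hypothetical 180-day half-life:
one_half_life  = remaining_visibility(1000, 180, 180)  # 500.0
two_half_lives = remaining_visibility(1000, 360, 180)  # 250.0
```

The practical point: a 180-day half-life means a refresh at month six recovers visibility more cheaply than replacing the page with net-new production.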
- Content Pruning
- Systematic removal or consolidation of pages that drag down site-level quality signals. Embedding-based methodology: generate topic centroids, score all pages against them via cosine similarity, layer in performance data and freshness, then apply kill/keep/review thresholds. Critical for programmatic builds where template pages can drift off-topic.
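The decision layer of that methodology can be sketched as a threshold function. The cutoff values here are assumptions to show the shape of the decision, not recommended numbers:

```python
def prune_decision(centroid_similarity: float, monthly_clicks: int,
                   keep_sim: float = 0.75, review_sim: float = 0.55) -> str:
    """Kill/keep/review call from centroid alignment plus performance data."""
    if centroid_similarity >= keep_sim or monthly_clicks > 100:
        return "keep"
    if centroid_similarity >= review_sim:
        return "review"   # borderline: refresh or consolidate
    return "kill"         # off-topic and unvisited: remove or redirect

# Hypothetical pages scored against the site's topic centroid:
decisions = [prune_decision(0.82, 5), prune_decision(0.60, 0), prune_decision(0.30, 0)]
```

In practice the review bucket gets human judgment; the kill bucket gets 301 redirects or removal so the pages stop dragging on site-level quality scores.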
Want to see how this applies in practice?
The glossary covers the vocabulary. The patterns go deeper — real architectural problems from real audits, with the diagnosis and fix.
See Tech SEO Patterns