The Search Engineering Dictionary

The vocabulary of SEO has evolved from "keywords and links" to embeddings, retrieval, and agents. These are the concepts that matter for how search and AI discovery actually work now.

Ranking Systems & Signals

Information Gain Score
A Google scoring system (patent US11354342B2) that measures how much novel content a document contains relative to what already exists in the result set. Documents that say the same thing as other ranking pages score near zero and can be demoted or excluded entirely.
Site Quality Score (Panda)
A site-level quality signal (patent US9031929B1) calculated from the ratio of navigational queries directed at the site vs. informational queries the site answers. A low score suppresses all pages from the domain — thin or repetitive pages across a programmatic build drag down the entire domain's ranking ceiling. The 2024 API leak revealed pandaDemotion as a pre-computed site-wide demotion in CompressedQualitySignals, plus babyPanda variants (lighter/faster iterations). Panda operates as 'algorithmic debt' — a site-level quality tax that functions as a ceiling no amount of page-level optimization can overcome.
Pairwise Quality Scoring
Pages are ranked via head-to-head comparison against other pages competing for the same query — not on an absolute quality score. An objectively strong page can rank #7 while a weaker page ranks #1 because the weaker page wins the pairwise comparison on specific signals (links, entity coverage, click data).
Topical Authority
Not a single signal but the combined output of multiple overlapping systems: site-level topic embeddings (QualityAuthorityTopicEmbeddings), siteFocusScore, siteRadius, NsrChunks (per-section topic evaluation), and ClusterUplift. 50 thin articles won't move the topic embedding vector — 20 deeply comprehensive, entity-rich articles will.
Content Freshness Scoring
A document scoring system (patent US8549014B2) that tracks the age distribution of content within a page — how much is old vs. recently added. The 2024 API leak revealed lastSignificantUpdate (tracking substantive revisions, not cosmetic edits), freshByDocFp (document fingerprinting that detects whether actual content changed vs. just timestamps), and bylineDateConfidence (confidence score for byline date accuracy — contradictory dates degrade the freshness signal). Google stores only the last 20 versions of a document. Cosmetic date changes without substantive edits do not improve freshness scores.
N-gram Quality Prediction
A quality detection system (patent US9767157B2) that builds a phrase model from sites of known quality and scores new content against it. Creates a linguistic fingerprint of what quality writing looks like. The patent-level mechanism behind detecting thin content, keyword stuffing, and machine-generated text with unnatural phrase distributions.
Entity-Based Ranking
A ranking system (patent US10235423B2) that identifies entities in search results via a knowledge graph, assigns weights by entity type (person, place, organization, product), and computes a composite ranking score. Being recognized as a distinct entity is a direct ranking input.
Core Web Vitals (LCP, INP, CLS)
Google's performance metrics for load speed (Largest Contentful Paint), interactivity (Interaction to Next Paint), and visual stability (Cumulative Layout Shift). A ranking signal and direct measure of user experience quality.
Engagement Signal
Behavioral metrics like dwell time, scroll depth, and repeat visits. NavBoost uses these per-topic to re-rank results. Can also feed back into RLHF loops for AI systems.
Passage-Level Ranking
Google evaluates and ranks individual passages within documents independently (patent US20090055389A1). Sections compete on their own merits — author popularity, word choice, passage diffusion, and whether the passage adds unique information. A single well-written section can surface even when the broader document isn't the best overall match.
Time on Task
Measures how long it takes users to complete a goal or the percentage who succeed — finer than generic dwell time.
Topic Centroid
The average embedding vector computed from a site's keyword portfolio, representing its semantic center of gravity. Used to measure how well individual pages align with the site's core topics and to identify off-topic pages dragging down site-level quality scores.
contentEffort
An attribute from the 2024 Google API leak described as an LLM-based effort estimation for article pages. May quantify human labor, originality, and resources invested in creating content — including unique images, original data, embedded tools, and linguistic complexity. If the leak is interpreted correctly, this could be the closest algorithmic proxy for the Experience dimension of E-E-A-T, though how Google currently weights it (or whether it remains active) is unknown.
Q* (Quality Star)
Google's aggregate site/document quality metric, revealed through the DOJ antitrust trial and 2024 API leak. Combines content quality, authority signals, and an evolved PageRank that measures 'distance from a known good source.' Q* is largely static and query-independent — a high Q* score applies across all topics for a domain. Described as 'deliberately engineered rather than machine-learned.' The relationship: E-E-A-T is the goal, Q* is the system, Site_Quality is the score.
CompressedQualitySignals
The pre-computed quality gatekeeper module in Google's ranking pipeline. Contains per-document signals — siteAuthority, pandaDemotion, navDemotion, anchorMismatchDemotion, exactMatchDomainDemotion, and others — that can disqualify a page before query-time ranking even begins. A poor CompressedQualitySignals score means no amount of on-page optimization can compensate. This is the mechanism behind why site-level quality acts as a ceiling.
Twiddler
Specialized re-ranking functions in Google's SuperRoot framework that adjust results after the primary ranking algorithm (Ascorer) runs. Named Twiddlers include NavBoost (click-based), QualityBoost, FreshnessBoost, RealTimeBoost, and WebImageBoost. 'Lazy Twiddlers' only process the top 20-30 results with granular adjustments. When Google says a system is 'not part of the core algorithm,' it often means it operates as a Twiddler — a post-ranking overlay, not a primary scoring input.
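The post-ranking overlay pattern can be sketched in miniature. This is a conceptual toy, not Google's implementation: the weights, field names, and adjustment functions are invented for illustration.

```python
# Toy re-ranking pipeline: a primary scorer produces an ordered list,
# then each "twiddler" applies a post-hoc score adjustment.
docs = [
    {"url": "/a", "base": 0.90, "clicks": 0.2, "fresh": 0.1},
    {"url": "/b", "base": 0.80, "clicks": 0.9, "fresh": 0.3},
    {"url": "/c", "base": 0.70, "clicks": 0.4, "fresh": 0.9},
]

def nav_boost(doc):        # click-based adjustment (NavBoost-like)
    return 0.15 * doc["clicks"]

def freshness_boost(doc):  # recency adjustment (FreshnessBoost-like)
    return 0.10 * doc["fresh"]

twiddlers = [nav_boost, freshness_boost]

def final_score(doc):
    # Primary (Ascorer-like) score plus every twiddler's adjustment.
    return doc["base"] + sum(t(doc) for t in twiddlers)

ranked = sorted(docs, key=final_score, reverse=True)
print([d["url"] for d in ranked])  # ['/b', '/a', '/c']
```

Note that /b overtakes /a despite a lower primary score: the overlay, not the core algorithm, decides the final order.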
Index Tiering (Base / Zeppelins / Landfills)
Google's three-tier document storage system. Base (flash memory) holds the most important, frequently updated content. Zeppelins (SSDs) hold mid-priority pages. Landfills (hard drives) hold low-importance, irregularly updated content. Tier placement determines crawl frequency and serving speed — and directly affects outgoing link value, since links from Base-tier pages carry more signal than links from Landfill-tier pages.
Copia / Firefly
Google's scaled content abuse detection system, revealed in the 2024 API leak. Copia (Latin for 'abundance') monitors content velocity — the ratio of URLs generated against substantive articles produced. Firefly aggregates inputs from Copia, page quality scores, and NavBoost user dissatisfaction signals to make site-wide demotion decisions. This is Google's primary mechanism against AI-generated content farms — there is no single 'AI content detector,' but rapid content velocity combined with low engagement and low quality triggers Firefly.
hostAge (Sandbox)
A PerDocData attribute from the 2024 API leak explicitly described as used 'to sandbox fresh spam in serving time.' Represents the earliest date Google first encountered any page on a domain. Google has publicly denied the existence of a sandbox for new sites; the leak directly contradicts this. New domains face a trust-building period regardless of content quality — inherent disadvantage against sites with 13 months of accumulated NavBoost click signals.

Retrieval & Discovery

Canonical Tag
An HTML link element (rel="canonical") that identifies the master version of a page, preventing duplicate URLs from diluting authority.
Robots.txt / Robots Meta
Rules that allow or block crawling and indexing, managing crawl budget and LLM ingestion cost.
Sitemap / RSS / Atom Feed
XML files listing URLs and timestamps so crawlers and RAG sync jobs know what's new or updated.
Query Fan-Out
Google's AI systems decompose a single query into multiple sub-queries across eight variant types (equivalent, follow-up, generalization, specification, canonicalization, translation, entailment, clarification), run them in parallel, then stitch results together. Gemini 3 averages 10.7 sub-queries per prompt (78% increase over Gemini 2.5), and 95% of fan-out queries have zero traditional search volume. Fan-out coverage has a 0.77 Spearman correlation with AI citation likelihood.
Sub-Query
A derivative search created during fan-out (e.g., "cost of living in Brooklyn" from "Is NYC affordable?"). Each sub-query feeds fresh documents into the retrieval pipeline independently.
Cosine Similarity
The mathematical distance between a query's embedding and your content's embedding in vector space. Content selection for AI grounding operates via embedding distance, not keyword matching — if content doesn't land close to query embeddings, it won't be selected regardless of traditional ranking.
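A minimal sketch of the computation, using toy 3-dimensional vectors. Production embeddings run to hundreds or thousands of dimensions, but the math is identical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query_vec = [0.8, 0.1, 0.6]
on_topic  = [0.7, 0.2, 0.6]   # content close to the query in vector space
off_topic = [0.1, 0.9, 0.1]   # content about something else entirely

print(cosine_similarity(query_vec, on_topic))   # high, close to 1
print(cosine_similarity(query_vec, off_topic))  # low
```

The selection logic follows directly: only pages whose vectors land near the query vector are candidates for grounding, whatever their keyword overlap.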
Embedding Feed
A scheduled job pushing new content as embeddings to a vector store, ensuring freshness for RAG assistants.
Zero-Shot Retrieval
The ability of a retriever to surface your page for queries the model never saw during training.
Retrieval-to-Citation Drop-off
The gap between being retrieved by an AI system and being cited in its final answer. AI platforms retrieve far more pages than they ultimately cite — the vast majority of retrieved pages never appear in the response. Being discovered is a prerequisite, not a guarantee; citation selection depends on title-query alignment, content clarity, readability, and query-type dynamics. The drop-off varies by intent: product-discovery and how-to queries convert at higher rates than comparison or validation queries.
Persistent ID (GUID / GTIN / ISBN)
A globally unique identifier letting different datasets resolve the same entity without confusion.
Data Contract
An explicit schema (often versioned) that defines the fields an API or feed will always supply.
Content Delivery Network (CDN)
A globally distributed cache that serves APIs and structured data closer to users and crawlers, boosting speed and uptime.
API Versioning
Labeling API releases (v1, v2…) so consumers — including LLMs — can rely on stable fields while you evolve the contract.
Schema Markup Validator
Tools (e.g., Google Rich Results Test) that check JSON-LD syntax and completeness, lowering Schema Error Rate.
Schema Error Rate
The percentage of URLs whose JSON-LD fails validation. High error rates hide facts from crawlers and embedding pipelines.
Accessibility & Semantic HTML (WCAG)
Proper headings, ARIA roles, alt text, and WCAG compliance so humans and multimodal agents can parse content.
Schema.org Markup
JSON-LD tags that declare entities (products, authors, FAQs, etc.) in a format search engines and LLM scrapers can parse.
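A minimal example of generating JSON-LD for a product page. The vocabulary (@context, @type, Product, Offer) is standard schema.org; the field values are illustrative.

```python
import json

# Minimal JSON-LD for a product entity; emit inside a
# <script type="application/ld+json"> tag in the page head.
markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Savings Account",
    "brand": {"@type": "Organization", "name": "Example Bank"},
    "offers": {
        "@type": "Offer",
        "price": "0",
        "priceCurrency": "USD",
    },
}

print(json.dumps(markup, indent=2))
```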
API Uptime
The percentage of successful (200) responses from your data endpoints. Downtime means missing citations from AI systems and crawlers.
API Latency
The round-trip time of an API call. High latency degrades real-time chat or agent experiences.
Embedding Sync Lag
The time between publishing content and its appearance in your vector store, measured in minutes or hours.
Interactive Tool / Calculator
On-page widgets or callable APIs that let users compute something instantly, boosting engagement and task completion.
Semantic Unit
A 50-150 word content block capturing a single concept with explicit subject-predicate-object structure. The atomic unit of passage-level optimization. Splitting combined paragraphs into focused semantic units measurably improves cosine similarity, and adding proper headers compounds the gain.
Model Context Protocol (MCP)
An open standard (Anthropic, November 2024) for connecting AI systems to external tools, data sources, and systems. Adopted by OpenAI across all products including ChatGPT desktop. The de facto agent-to-tool layer — enables AI agents to call APIs, query databases, and execute actions through a unified interface.
NLWeb
A Microsoft protocol (co-developed with Schema.org co-founder R.V. Guha) that turns any website into an AI-queryable interface using existing Schema.org, RSS, and structured data. Every NLWeb instance is also an MCP server. Live on TripAdvisor and O'Reilly Media. Yoast integrated NLWeb into its Schema Aggregation feature, creating site-wide endpoints so AI agents can understand an entire site without page-by-page crawling.
llms.txt
A proposed plain-text file (like robots.txt) intended to help LLMs understand a site's structure and content. Widespread adoption but no major AI platform has confirmed using it in retrieval. Google's John Mueller confirmed it does not influence search rankings or AI Overview citations. Currently aspirational infrastructure, not a retrieval signal.

Authority & Trust

Entity Stacking
Building 30-50 unique trust signals across trusted third-party sources (citations, social profiles, reference sites, press) before investing in content or links. Google needs these signals before it considers a brand a real entity. Because these are trusted sources, the entire stack can be built quickly without triggering spam signals.
ClusterUplift
Google groups sites with similar sites and applies collective quality boosts or demotions to the cluster. If the cluster has a quality problem, every site in it gets demoted — even clean ones. This explains why entire niches get hammered in updates while individual sites in other niches are untouched.
siteFocusScore / siteRadius
Two signals from the Google API leak measuring topical coherence. siteFocusScore quantifies how dedicated a site is to a single topic (specialist vs. generalist). siteRadius measures how much an individual page deviates from the site's central theme. High focus with low radius = strong topical authority signal.
Brand Mention
Your name in trusted publications without a link. Modern ranking systems treat these as implied links. Studies suggest brand search volume is a stronger predictor of AI citations than backlinks.
Author Markup / Bylines
Schema fields or HTML blocks that tie content to a real person, allowing knowledge graphs to attribute expertise. Author topic authority (patent US8458196B1) accumulates per-topic — authors who write repeatedly on a focused topic compound authority scores.
Citation Velocity
The rate at which new referring domains cite your content. Spikes often precede ranking gains.
Knowledge Graph
A search-engine database of entities and their relationships. Accurate representation fast-tracks authority in both traditional rankings and LLM answers.
Entity Linking / Disambiguation
Associating a text mention with the correct entity ID (e.g., Apple-fruit vs. Apple-Inc.), preventing authority leaks to competitors.
Third-Party Review
User or expert ratings hosted off-site (e.g., Trustpilot) that act as external validation of quality.
NAP Consistency
Exact match of Name, Address, Phone across directories — critical for local SEO and entity disambiguation.
User-Generated Content (UGC)
Reviews, comments, or forum posts that add fresh, authentic signals of experience. Reddit has become one of the most visible sites in Google US results, appearing in the vast majority of product review queries.
Knowledge-Graph Triple
A fact stored as (subject → predicate → object), e.g., "Citibank → offersRate → 3.30% APY." The atomic unit of structured knowledge.
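A toy in-memory triple store shows how atomic facts compose and can be queried by pattern. The stored fact is the example from the definition above; the query function is a minimal sketch, not a production graph API.

```python
# Facts as (subject, predicate, object) tuples.
triples = {
    ("Citibank", "offersRate", "3.30% APY"),
    ("Citibank", "type", "Bank"),
    ("3.30% APY", "appliesTo", "Savings Account"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the fixed fields; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(query(subject="Citibank", predicate="offersRate"))
# [('Citibank', 'offersRate', '3.30% APY')]
```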
Digital Signature / Provenance (C2PA)
Cryptographic metadata proving who created a file and whether it has been altered, boosting trust in LLM pipelines.
E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness)
Not a single ranking signal but a label for dozens of independent algorithmic features evaluated at document, domain, and entity levels. E-E-A-T enters rankings indirectly: Quality Raters evaluate pages using E-E-A-T criteria, those evaluations become training data for Google's RankEmbed models, and the models learn to predict rater-like quality scores at scale. Trust is the most important dimension — an untrustworthy page receives low E-E-A-T regardless of expertise or authority. For AI citation, E-E-A-T appears to act as a gate, not a weight: the overwhelming majority of AI Overview citations come from sources with strong E-E-A-T signals.
YMYL (Your Money or Your Life)
Google's classification for queries where low-quality results could harm users — health, finance, legal, civic information. YMYL queries receive differential E-E-A-T weighting: Google gives more weight to authoritativeness, expertise, and trustworthiness signals. Quality thresholds ratchet upward over time — even major health publishers saw drops in recent core updates. YMYL verticals trigger AI Overviews at significantly different rates, with health and legal far more likely to show AIOs than commercial or political queries.

LLM Fundamentals

Large Language Model (LLM)
A neural-network model (e.g., GPT-5, Gemini) trained on massive text corpora and capable of predicting tokens, answering questions, and following instructions.
Token
The smallest unit an LLM processes (≈ one word or punctuation mark). Costs, context limits, and output length are all measured in tokens.
More precisely, a token is often a fragment of a word rather than a whole one; a common rule of thumb is roughly three-quarters of an English word per token.
Embedding / Vector Embedding
A fixed-length list of numbers that captures a text's meaning so similar texts sit close together in multi-dimensional space.
Contextual Embedding
An embedding generated with awareness of the surrounding document, not just the target passage. Resolves ambiguity when a paragraph's meaning depends on its page context — e.g., 'their pricing model' is meaningless without knowing which company the page discusses. Perplexity's pplx-embed-context-v1 is the first major open-source implementation, outperforming prior contextual models by 2-10 percentage points on ConTEB benchmarks.
Bidirectional Encoder
A text model that processes tokens with attention to both preceding and following context, unlike causal (left-to-right) language models. BERT pioneered the approach; modern retrieval embeddings (pplx-embed, gte-Qwen) convert causal decoder models into bidirectional encoders to produce richer text representations for search.
Vector Store / Vector DB
A database (e.g., Pinecone, Supabase, Elasticsearch KNN) optimized for storing embeddings and running "nearest-vector" queries.
Context Window
The maximum token count an LLM can "remember" per interaction (prompt + response). Governs chunk sizing in RAG.
Chunking / Text Splitting
The process of splitting longer content into smaller, coherent segments before creating embeddings. Effective chunking follows natural content boundaries (e.g., headings and paragraphs) so each vector corresponds to a complete idea that AI systems can reliably retrieve as context.
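A minimal heading-aware chunker, as a sketch. The max_chars cutoff and the paragraph-split fallback are illustrative choices, not a standard.

```python
def chunk_by_headings(text, max_chars=1200):
    """Split content at heading boundaries so each chunk is one coherent idea."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new section starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())

    # Fall back to paragraph splits for oversized sections.
    final = []
    for c in chunks:
        if len(c) <= max_chars:
            final.append(c)
        else:
            final.extend(p.strip() for p in c.split("\n\n") if p.strip())
    return final

doc = "# Rates\nOur APY is 3.30%.\n\n# Fees\nNo monthly fee."
print(chunk_by_headings(doc))  # two chunks: the Rates section and the Fees section
```

Each resulting chunk maps to one embedding vector, which is why splitting on natural boundaries matters: a vector that straddles two topics sits between them in embedding space and retrieves poorly for both.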
Prompt / System Message
Instructions prepended to the user prompt that set tone, policy, or formatting rules for an LLM conversation.
Temperature
A generation parameter (0-2) controlling randomness: lower = deterministic, higher = creative.
Hallucination
An LLM answer that sounds plausible but is factually wrong because the model filled gaps with guesswork.
Reinforcement Learning from Human Feedback (RLHF)
A training loop that uses human ratings of model outputs as rewards, aligning the LLM with helpful, harmless responses.
Fine-tuning / Supervised Fine-tuning (SFT)
Further training a pre-trained LLM on a smaller, task-specific dataset to adopt domain language, style, or private knowledge.
Multimodal LLM
A model that both consumes and produces text plus other media (images, audio, video), enabling richer search and UX.
Function Calling
When an LLM outputs a JSON payload instructing software to run a tool (e.g., calculateSavingsRate) mid-conversation.
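A minimal dispatch loop, assuming a simplified payload shape. Each provider defines its own function-calling format; calculateSavingsRate is the hypothetical tool from the definition.

```python
import json

# The model's turn ends with a structured payload instead of prose
# (shape is illustrative, not any specific provider's format).
model_output = '{"name": "calculateSavingsRate", "arguments": {"income": 5000, "savings": 750}}'

def calculate_savings_rate(income, savings):
    """Percent of income saved, one decimal place."""
    return round(100 * savings / income, 1)

tools = {"calculateSavingsRate": calculate_savings_rate}

call = json.loads(model_output)
result = tools[call["name"]](**call["arguments"])
print(result)  # 15.0, fed back to the model as the tool result
```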
Prompt Engineering
The craft of writing clear, constrained prompts (plus examples) to steer LLM outputs toward desired style and accuracy.
Zero-/Few-/Multi-Shot Prompting
Supplying zero, a few, or many examples in the prompt to guide model reasoning and relevance.
Cost per Token
What you pay an LLM provider for each input/output token. Crucial for budgeting large-scale RAG or generation.
Hallucination Rate
The share of AI answers where the LLM asserts unverified or false information. Must be monitored when AI-generated content or AI-assisted tools are part of the product.

Diagnostic Framework

These concepts map to the Clinical Retrieval & Ranking Framework.

Binding Constraint
The first layer a site cannot pass in the diagnostic framework. Everything downstream is irrelevant until the constraint is cleared. The most common form of SEO capital destruction: optimizing Layers 4-7 while the site has an unresolved Layer 2 access problem.
Evidence-Builder Loop
Win achievable queries first to build authority priors, then use those priors to compete for harder queries. Topical authority measurably accelerates traffic acquisition. Sequencing matters: authority compounds on earlier wins.
Ceiling vs. Weight
Layer 1 (Eligibility) sets the ranking ceiling — the maximum achievable position given domain authority, penalties, and YMYL risk. Layer 7 (Competition) determines the weight needed to reach that ceiling — authority gaps, SERP feature concentration, and differentiation. Misdiagnosing a ceiling problem as a weight problem wastes capital on content and links that can never rank.
Investment Screen
A three-level strategic filter that runs before the diagnostic: (1) Channel qualification — is organic search the right channel? (2) Page category allocation — right page type for this business model? (3) Query-level expected value — does the return justify the investment? Each screen must clear before the next runs.

Strategy & Architecture

Aggregator vs. Integrator
The strategic archetype that determines which SEO levers exist. Aggregator SEO is product-led, inventory/UGC-driven, with SEO as the primary growth channel — wins on data scale (cost leadership). Integrator SEO is marketing-led, company-created content, with SEO as a supporting channel — wins on content quality (differentiation). Programmatic SEO architecture is fundamentally an aggregator play: the data asset is the product.
Product-Led SEO
SEO treated as a product experience rather than a traffic channel, coined by Eli Schwartz. The core principle: build the product in the way search algorithms optimize for. The rendered data is the product optimized for search — directly applicable to programmatic architecture where template pages surface structured data at scale.
Search TAM
Total Addressable Market sizing applied to organic search: TAM (all search volume in category), SAM (queries you could realistically target), SOM (queries you can capture given current resources). Standard keyword-volume forecasting is highly unreliable — scenario planning with stage gates outperforms point estimates.
Data Moat
A competitive advantage built from proprietary data that creates self-reinforcing cycles: more data produces better products, which attract more users, which generate more data. Requires 2-3 years of consistent investment before delivering significant advantages. Zillow, TripAdvisor, and NerdWallet all built organic moats through proprietary data plus template infrastructure, not editorial volume.
Content Half-Life
The time it takes for a piece of content to lose half its organic visibility. Has compressed significantly for competitive topics. Refreshing legacy content consistently outperforms exclusive focus on new production. Strategy must allocate a meaningful share of content budget to maintenance.
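Half-life visibility loss is ordinary exponential decay. A quick sketch with invented numbers:

```python
def visibility(initial, half_life_months, months_elapsed):
    """Exponential decay: visibility halves every half_life_months."""
    return initial * 0.5 ** (months_elapsed / half_life_months)

# A page starting at 10,000 monthly organic visits with a 6-month half-life:
print(round(visibility(10_000, 6, 6)))   # 5000
print(round(visibility(10_000, 6, 12)))  # 2500
```

The budgeting implication falls out of the math: without refreshes, a portfolio on a 6-month half-life loses three-quarters of its visibility within a year.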
Content Pruning
Systematic removal or consolidation of pages that drag down site-level quality signals. Embedding-based methodology: generate topic centroids, score all pages against them via cosine similarity, layer in performance data and freshness, then apply kill/keep/review thresholds. Critical for programmatic builds where template pages can drift off-topic.
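A compressed sketch of the methodology described above. The thresholds are invented for illustration; real kill/keep/review cutoffs must be calibrated per site, and the layered freshness data is omitted here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Element-wise average of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def triage(similarity, monthly_visits, sim_floor=0.60, traffic_floor=10):
    """Toy kill/keep/review thresholds."""
    if similarity >= sim_floor and monthly_visits >= traffic_floor:
        return "keep"
    if similarity < sim_floor and monthly_visits < traffic_floor:
        return "kill"
    return "review"  # mixed signals get a human look

core_pages = [[0.9, 0.1], [0.8, 0.2]]          # embeddings of on-topic pages
site_centroid = centroid(core_pages)

page = {"embedding": [0.1, 0.9], "visits": 3}  # off-topic and unvisited
print(triage(cosine(site_centroid, page["embedding"]), page["visits"]))  # kill
```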

Want to see how this applies in practice?

The glossary covers the vocabulary. The patterns go deeper — real architectural problems from real audits, with the diagnosis and fix.

See Tech SEO Patterns