Research Synthesis

In-Page Information Architecture: Structuring Content for Three Audiences

The SEO discipline has mature frameworks for site architecture and content strategy. The gap is how information is organized within a single page. Three consumers parse in-page structure using overlapping but distinct logic: human readers scan in patterns shaped by visual hierarchy, search crawlers segment text into ranked passages, and AI retrieval systems extract individual sentences under hard token constraints. The design patterns that serve all three converge on the same structural unit.

Compiled by Aviel Fahl · Last updated March 25, 2026

Key Findings

AI citation systems extract individual sentences with a median length of 10 words and a hard ceiling at 17 words. Structured content (headings, lists, tables) achieves a 2.3x citation advantage over unstructured prose. The grounding budget per query is approximately 2,000 words, and content survival drops from 61% for pages under 1,000 words to 13% for pages over 3,000 words. Eye-tracking research confirms that 57% of viewing time is spent above the fold and that scannable formatting improves usability by 124%.

These constraints converge on a single structural primitive: the 50-180 word semantic unit with an explicit heading, self-contained sentences, and at least one extractable data point. Pages built from these units serve human scanners, passage-level indexing, and AI extraction pipelines simultaneously. Finally, 94.8% of pages fail at least one WCAG 2 accessibility check, meaning compliance with accessibility standards that also benefit search and AI retrieval is a structural competitive advantage.

10 words

Median sentence length in AI citations (Shashko, 42,971 citations)

2.3x

Structured content citation advantage vs unstructured

94.8%

Pages with at least one WCAG 2 failure (WebAIM 2025)

~2,000

Words in AI grounding budget per query (DEJAN)

The Missing Middle


Site architecture has a mature body of research. Topic clusters produce +43% keyword visibility (HubSpot). Crawl depth beyond three clicks degrades crawl rates by 33% (Botify, 6.2B requests). Internal linking A/B tests show +25% organic uplift (SearchPilot). These findings shape how practitioners connect pages to each other.

Content strategy has its own evidence base. The content engineering discipline covers what to build: content models, metadata, structured data, governance systems. The gap is the layer between these two: how information is organized within a single page.

The gap matters because three distinct consumers parse in-page structure, each with different processing logic.

Human readers scan in predictable patterns shaped by visual hierarchy. NN/g eye-tracking across multiple studies (2006-2019, cumulative n=200+) found that 79% of users scan any new page. Only 16% read word-by-word. The layer-cake pattern, where users fixate on headings and skip body text, is the most effective scanning behavior. It only occurs when pages provide sufficient structural cues. Without headings, bolding, and whitespace, users fall into the F-pattern: a failure state where reading efficiency collapses.

Search crawlers extract text from rendered HTML, segment it into passages, and classify entities. Passage-level indexing (patent US20160078102) evaluates individual passages within a page independently of surrounding content. A well-structured page with clear section boundaries gives the passage indexer cleaner input.

AI retrieval systems extract at sentence-level granularity. Shashko's analysis of 42,971 AI citations found a median cited sentence length of 10 words and a hard ceiling at 17 words. The grounding budget per query is approximately 2,000 words (DEJAN, 7,060 queries). Content must be structured so the highest-value sentences are self-contained, because AI systems extract exact sentences, not paraphrases.

Why this is not 'content quality'

Content quality and in-page architecture are independent dimensions. A page can contain excellent, well-researched content and still fail all three audiences if the information is presented as undifferentiated prose. The content engineering research covers what to build. This page covers how to structure what you have built.

The Semantic Unit


The semantic unit is the atomic building block of in-page architecture. iPullRank defines a semantic unit as a 50-150 word block capturing a single concept with explicit subject-predicate-object structure. The size range aligns with multiple independent findings about how machines process content.

Source: Multiple independent studies converge on the same structural unit size: 50-180 words, self-contained, with explicit headings.

| Finding | Measurement | Source |
| --- | --- | --- |
| Optimal section length for AI citation | 120–180 words | SE Ranking, 129K domains |
| Structured content citation advantage | 2.3x (91.3% vs 39.3% sentence match) | Shashko, 42,971 citations |
| Splitting combined topics improves cosine similarity | +19.24% | iPullRank, relevance engineering |
| Adding proper headers after splitting | +17.54% additional lift | iPullRank, relevance engineering |
| Sequential headings and citation correlation | 2.8x higher citation likelihood | AirOps, 2026 State of AI Search |

The convergence is notable because these studies measured different things. SE Ranking measured AI citation rates across 129,000 domains. Shashko measured sentence-level extraction patterns across six AI platforms. iPullRank measured cosine similarity improvements from restructuring content. AirOps measured citation likelihood across their 2026 dataset. They all arrived at the same structural unit size.

The semantic unit is not just a search optimization concept. It maps directly to how humans process information in the layer-cake scanning pattern. Users fixate on headings to decide whether to read the block below. Each heading-plus-block is a decision point. If the block is self-contained, meaning the heading accurately previews the content and the content delivers without requiring context from adjacent blocks, both the human scanner and the AI extractor can process it independently.

NN/g paragraph attention data quantifies this: paragraph 1 is viewed by 81% of users, paragraph 2 by 71%, paragraph 3 by 63%, paragraph 4 by only 32%. The drop from first to fourth paragraph is nearly 50 percentage points. This steep decay makes the case for front-loading the most important information within each semantic unit, not just within the page overall.
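The sizing rule lends itself to automation. The sketch below splits a markdown-style draft on its headings and flags sections outside the 50-180 word band; the function name and the markdown-heading assumption are illustrative, not a prescribed tool.

```python
import re

UNIT_MIN, UNIT_MAX = 50, 180  # word-count bounds from the convergent findings above

def audit_units(markdown: str) -> list[tuple[str, int, str]]:
    """Split a markdown document on its headings and flag sections
    whose body falls outside the 50-180 word semantic-unit range."""
    results = []
    # re.split with a capturing group keeps each heading line in the output,
    # so headings and their bodies alternate in the parts list.
    parts = re.split(r"^(#{1,6} .+)$", markdown, flags=re.M)
    for heading, body in zip(parts[1::2], parts[2::2]):
        words = len(body.split())
        if words < UNIT_MIN:
            verdict = "too thin"
        elif words > UNIT_MAX:
            verdict = "split into smaller units"
        else:
            verdict = "ok"
        results.append((heading.lstrip("# "), words, verdict))
    return results
```

Running it over a draft yields one `(heading, word_count, verdict)` tuple per section, which an editor can scan in seconds.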

Grounding Budget and Extraction Mechanics


AI retrieval systems operate under hard constraints that dictate how much of a page gets used. The AI citation research covers the retrieval pipeline in detail. Here the focus is on what these constraints mean for page-level structure.

Source: DEJAN AI, 7,060 queries, 2,275 tokenized pages, 883,262 snippets (March 2026)

| Page Length | Content Grounded | Implication |
| --- | --- | --- |
| <1K words | 61% | Short pages lose less but have less to offer |
| 1–2K words | 35% | Sweet spot for most content types |
| 2–3K words | 22% | Diminishing returns begin |
| 3K+ words | 13% | Most content never reaches the model |

Grounding plateaus at approximately 540 words / 3,500 characters regardless of page length. The architectural question is not "how long should the page be" but "which 540 words will the model see."
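As a back-of-envelope illustration, the tiers above and the 540-word plateau combine into a rough estimator. The thresholds are lifted from the table; the linear interpolation within each tier is an assumption for the sketch, not DEJAN's actual model.

```python
# Approximate grounded share by page length (DEJAN tiers from the table above).
GROUNDING_TIERS = [
    (1_000, 0.61),          # <1K words
    (2_000, 0.35),          # 1-2K words
    (3_000, 0.22),          # 2-3K words
    (float("inf"), 0.13),   # 3K+ words
]

GROUNDING_PLATEAU = 540  # words actually surfaced, regardless of page length

def estimated_grounded_words(page_words: int) -> int:
    """Rough estimate of how many words an AI retrieval system will
    ground from a page, capped at the ~540-word plateau."""
    for ceiling, share in GROUNDING_TIERS:
        if page_words < ceiling:
            return min(round(page_words * share), GROUNDING_PLATEAU)
    return GROUNDING_PLATEAU
```

The estimator makes the plateau concrete: past roughly 2,500 words, adding length changes which words compete for the budget, not how many are used.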

Shashko's 42,971-citation study across six platforms supplies the granular picture. The median cited sentence is 10 words. The maximum is 17 words, a hard ceiling: nothing longer was cited in the entire dataset. 92.4% of citations fall between 6 and 20 words. Position bias is significant: the mean cited position is 34.9% down the page, with the 75th percentile at 48.8%. The top third of the page produces a disproportionate share of citations.

The structural gap between pages is stark. Pages with lists, tables, and headings achieved a 91.3% sentence-match rate. Unstructured pages: 39.3%. That 2.3x advantage is not about having better content. It is about making individual sentences extractable. A well-researched paragraph buried in a wall of text is invisible to the extraction pipeline. The same sentence with a heading above it and whitespace around it is a candidate for citation.

The extractive summarization constraint

Google's AI systems use extractive summarization, pulling exact sentences from source pages rather than paraphrasing. DEJAN confirmed this through direct comparison of AI output against source text. The structural implication: every sentence that could be a citation target must be grammatically complete and self-contained. Sentences that begin with "This" or "However" referring to a previous paragraph are invisible to extractive pipelines because they cannot stand alone.
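A simple pre-publication check can approximate this constraint. The heuristic below flags sentences that fall outside the cited length band or open with an anaphoric word; the opener list and function name are illustrative, not exhaustive.

```python
# Openers that signal a sentence depends on the previous paragraph.
DANGLING_OPENERS = ("This", "That", "These", "Those", "However", "It")

def citation_eligible(sentence: str) -> bool:
    """Heuristic check that a sentence could survive extractive
    summarization: within the observed length band (hard ceiling of
    17 words in Shashko's data) and no anaphoric opener that would
    need prior context to make sense."""
    words = sentence.split()
    if not 6 <= len(words) <= 17:
        return False
    return words[0].rstrip(",") not in DANGLING_OPENERS
```

Running every sentence of a draft through a check like this surfaces the two failure modes the research identifies: sentences too long to be cited at all, and sentences that cannot stand alone.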

Scanning Patterns as Structural Constraints


Eye-tracking research provides the empirical foundation for in-page architecture decisions. These are measured behaviors, not style preferences, and they determine whether content gets processed.

NN/g eye-tracking studies (multiple rounds, 2006-2019, cumulative n=200+) established two primary scanning patterns. The layer-cake pattern is optimal: users fixate on headings and subheadings, deliberately skipping body text between them. NN/g describes this as "by far the most effective way to scan pages." It occurs when pages provide clear visual hierarchy with distinct headings, varied formatting, and whitespace. The F-pattern is a failure state: users read the first line fully, scan partway through the second, then skim vertically down the left margin. It occurs when pages lack formatting cues. Reading efficiency collapses.

Source: NN/g eye-tracking study (2018, n=120, 130,000+ fixations, 1920x1080 screens)

| Position | Viewing Time | Implication |
| --- | --- | --- |
| Above fold | 57% | Critical content and key findings here |
| First two screenfuls | 74% | Most users never scroll past this |
| First three screenfuls | 81% | Effective page boundary for most visitors |
| Below three screenfuls | 19% | Only committed readers reach here |

Combined scannable + concise + objective formatting produces a +124% usability improvement (Morkes & Nielsen, 1997):

  • Scannable text (headings, bullets): +47%
  • Concise text (half word count): +58%
  • Objective tone (non-promotional): +27%

The sample was small by behavioral science standards, but the directional findings are consistent with all subsequent NN/g research.

First impressions compound the structural stakes. Users form reliable aesthetic judgments within 50 milliseconds (Lindgaard et al. 2006, 1,000+ academic citations, replicated by Google/University of Basel 2012). Visual complexity and prototypicality affect perception at 17ms. The page structure above the fold is evaluated before the first word is read. A dense wall of text triggers a negative aesthetic judgment before content quality can register.

The connection to NavBoost: these scanning patterns directly feed Google's most important ranking signal. If visual structure causes a user to stay and engage (goodClicks, lastLongestClicks), the page accumulates positive NavBoost signal on a rolling 13-month window. If poor structure causes a quick return to the SERP (badClicks), the page accumulates negative signal. In-page architecture is not a UX concern separate from ranking. It is a ranking input.

Content Format and Citation Rates


Specific content formats produce measurably different AI citation rates. The format is the container, not the content. The same information, restructured, produces different citation outcomes.

Source: Onely, compiled from Digital Bloom 2025, AmICited.com, Frase.io, Semrush. Sample: 768K+ citations, 67,394 content pieces.

| Format | Citation Rate / Lift | Source |
| --- | --- | --- |
| Data tables | ~2.5x vs paragraph text | Onely, compiled from multiple studies |
| FAQ structure | 3.2x more likely in AIOs | Onely |
| Comprehensive guides with data tables | 67% citation rate | Onely |
| Product comparison pages | 60–70% | Onely |
| Structured how-to guides | 54% | Onely |
| Comparative listicles | 32.5% of all citations | Onely |
| Narrative how-to | 25–40% | Onely |
| Opinion pieces | 18% | Onely |

The gradient from data tables (2.5x) to opinion pieces (18%) is a format effect, not a quality effect. Opinion pieces can be brilliant. Data tables can be trivial. The difference is extractability: tables present discrete, labeled data points that AI systems can lift directly. Opinion prose requires the model to identify the claim, which adds a processing step that reduces selection likelihood.

Source: Semrush, 337,000 URLs analyzed (2026)

| Quality | Citation Lift | Note |
| --- | --- | --- |
| Clarity (structure, readability) | +32.83% | Strongest signal, highest-leverage fix |
| E-E-A-T signals | +30.64% | Expertise, experience, authority markers |
| Q&A format | +25.45% | Self-contained answers to specific questions |
| Factual density | +22.17% | Statistics, data points, named sources |
| Comprehensiveness | +18.92% | Breadth of topic coverage |

Clarity is the content quality most strongly correlated with AI citation. Not depth, not authority, not comprehensiveness, but structural clarity: the property that makes content parseable by both human scanners and machine extractors. The AI citation research covers the full pipeline from retrieval to citation. Here, the takeaway is narrower: clarity is primarily a structural property, achieved through heading hierarchy, semantic units, data tables, and format diversity. It is an architecture decision, not a writing skill.

Progressive Disclosure: Tooltips as Case Study


Progressive disclosure is the principle of showing only core content initially and revealing detail on demand. NN/g research confirms it improves learnability, efficiency, and error reduction. The failure condition is more than two disclosure levels, where users lose orientation. Tooltips represent a single disclosure level, well within safe bounds.

Glossary tooltips are a specific implementation of this principle: domain-specific terminology is defined in context via hover/tap popovers. Baymard Institute validated this pattern across 4,400+ usability test sessions (25 rounds, Think Aloud protocol). Definitions served in tooltips on desktop or tappable links on mobile improved comprehension without adding page length. The tested examples, B&H Photo for video resolution and Crutchfield for audio terminology, demonstrate the pattern across different product vocabularies.

Implementation constraints from NN/g timing research: a 200ms open delay prevents accidental activation during normal cursor movement; a 150ms close delay prevents premature dismissal when the user moves their cursor to the tooltip content to click a link. WCAG 1.4.13 requires that tooltip content be dismissible (Escape key), hoverable (the user can enter the tooltip without it closing), and persistent (it stays visible until actively dismissed).
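The timing behavior amounts to a small gate that opens only after a sustained hover and closes only after a sustained exit. The Python sketch below models that gate with plain timestamps; it is illustrative only (a real tooltip lives in front-end code), and every class and method name is invented for the example.

```python
class TooltipTimer:
    """Models the NN/g hover timing gate: a 200ms open delay prevents
    accidental activation, a 150ms close delay lets the cursor travel
    into the tooltip body. All times are in milliseconds."""

    def __init__(self, open_delay: float = 200, close_delay: float = 150):
        self.open_delay = open_delay
        self.close_delay = close_delay
        self.visible = False
        self._entered_at = None  # timestamp of the last pointer-enter
        self._left_at = None     # timestamp of the last pointer-leave

    def pointer_enter(self, now_ms: float) -> None:
        self._entered_at, self._left_at = now_ms, None

    def pointer_leave(self, now_ms: float) -> None:
        self._left_at, self._entered_at = now_ms, None

    def tick(self, now_ms: float) -> bool:
        """Call periodically; returns whether the tooltip should be visible."""
        if self._entered_at is not None and now_ms - self._entered_at >= self.open_delay:
            self.visible = True
        if self._left_at is not None and now_ms - self._left_at >= self.close_delay:
            self.visible = False
        return self.visible
```

Note that the close delay is what satisfies the "hoverable" clause of WCAG 1.4.13: the tooltip stays open long enough for the user to move into it.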

The SEO mechanics are straightforward. Google indexes tooltip content present in rendered HTML via its Chromium-based rendering pipeline, including content rendered through the native Popover API. The content exists in the DOM regardless of visual state. But hidden content is weighted less than visible content. John Mueller has stated this for tabs and accordions, and the same logic applies. Tooltip definitions contribute to entity understanding and page-level semantics without carrying the full weight of visible body text.

Content effort signal

The 2024 Google API leak revealed contentEffort, an LLM-based scoring attribute that quantifies editorial investment computationally. A site-wide glossary system with canonical definitions, consistent terminology, and editorial governance is a form of measurable content effort. Whether this specific pattern registers in the contentEffort scorer is unknown, but the attribute's existence confirms Google measures effort through automated means, not just human quality raters. The Reality Gap research covers the full list of leaked quality signals.

The Information Gain patent (US20200349181A1, granted June 2024) provides one more conceptual connection. The patent defines information gain as "the amount of valuable information learned minus the amount of effort it took to learn." Tooltip definitions plausibly reduce the effort denominator by making content self-contained, eliminating the need for external lookups. The connection is inferential. No study has tested whether tooltips specifically affect information gain scoring.

The largest research gap in this area: no published A/B test measures tooltip impact on engagement metrics (time-on-page, bounce rate, scroll depth, conversion). The UX case for tooltips rests on usability testing observations, not quantitative engagement data. This is a gap worth closing for any site implementing the pattern at scale.

Accessibility as Structural Advantage


Web accessibility compliance is both a legal obligation and a structural advantage that overlaps with in-page architecture goals. The overlap is underappreciated: many WCAG requirements directly produce the structural properties that benefit search and AI retrieval.

94.8%

Pages failing WCAG 2 (WebAIM 2025, 1M pages)

51

Average errors per page

4,187

Accessibility lawsuits in 2024 (UsableNet)

Source: WebAIM Million 2025 (February 2025, n=1,000,000)

| Violation | % of Pages | Architecture Relevance |
| --- | --- | --- |
| Low contrast text | 79.1% | Reduces scanning speed and readability |
| Missing alt text | 55.5% | Image search and entity recognition |
| Missing form labels | 48.2% | Form conversion and screen readers |
| Empty links | 45.4% | Navigation and link equity signals |
| Empty buttons | 29.6% | Interaction and conversion |
| Missing document language | 15.8% | Language classification (rosettaLanguages) |

The ARIA paradox: pages with ARIA averaged 57 errors, compared to 27 on pages without (WebAIM Million 2025). ARIA does not cause the errors; complex implementations simply tend to be more broken. The finding is a caution against adding ARIA attributes as a checkbox exercise. Semantic HTML that needs fewer ARIA overrides produces better outcomes than ARIA layered on top of non-semantic markup.

The structural overlap between WCAG compliance and in-page architecture is concrete:

  • Semantic heading hierarchy (h1-h6) creates the layer-cake scanning pattern that both users and passage-level indexing depend on.
  • Alt text provides entity context for image understanding and multimodal retrieval.
  • Document language aids Google's language classification pipeline.
  • Keyboard navigation structure implies logical content ordering.
  • Color contrast improves readability, affecting scanning efficiency and time-on-page, feeding back into NavBoost behavioral signals.

A correlation finding: WCAG-compliant sites show 23% more organic traffic and 27% more keywords (Semrush/AccessibilityChecker.org, 2025, n=10,000). This is a correlation, not causation. The likely mechanism: sites that invest in accessibility tend to invest in other structural quality signals (semantic HTML, proper heading hierarchy, clean markup), and those cumulative signals produce the traffic differential.

The business case extends beyond search. Click-Away Pound (2019) found that 69% of disabled consumers abandon inaccessible sites; 4.9M disabled online shoppers represent GBP 17.1B/year in lost purchasing power. European Accessibility Act (EAA) enforcement began in 2025. In the US, UsableNet tracked 4,187 digital accessibility lawsuits in 2024, with projections trending upward for 2025.

Practitioner Reference

Operational Framework


The evidence above translates into a page-level architecture checklist. The seven layers form a construction sequence, not a scoring rubric. Layers are ordered by dependency: heading hierarchy must exist before semantic units can be evaluated, and semantic units must exist before extraction targets can be assessed.

Source: Derived from convergent evidence across NN/g, Shashko, DEJAN, iPullRank, Baymard, and WCAG 2.2

| Layer | Action | Validation |
| --- | --- | --- |
| 1. Heading hierarchy | H1 > H2 > H3, no skipped levels, each H2 scoping a semantic unit | Automated: heading-level audit |
| 2. Semantic unit sizing | 50–180 words per section, single concept, SVO sentences | Manual: review each section for self-containment |
| 3. Above-fold structure | Key finding or value proposition in first 100 words | Can someone understand the thesis without scrolling? |
| 4. Extraction targets | At least one table, one sourced data point, one self-contained definition per 500 words | Would an AI system find a citable sentence in each section? |
| 5. Progressive disclosure | Glossary tooltips for domain terminology, first occurrence only | Automated: tooltip coverage audit |
| 6. Accessibility baseline | WCAG 2.2 AA, semantic HTML, ARIA where needed, contrast ratio | Automated: axe-core or WAVE scan |
| 7. Format diversity | Mix of prose, tables, callouts, stat blocks within the page | Does the page look scannable at arm’s length? |

Layers 1 and 6 are automatable. A heading-level audit can flag skipped levels, missing H1s, or H2s that don't scope a single concept. An axe-core scan catches the 94.8% failure rate violations that WebAIM documents. These should run as part of any CI pipeline for content-heavy sites.
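A minimal version of that heading-level audit fits in a few lines of Python using the standard library's HTML parser; the violation labels and function names here are illustrative, not a fixed spec.

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collects h1-h6 levels from an HTML document and reports
    structural violations: missing or multiple h1, skipped levels."""

    def __init__(self):
        super().__init__()
        self.levels: list[int] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h1".."h6" match directly.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

    def violations(self) -> list[str]:
        problems = []
        if self.levels.count(1) == 0:
            problems.append("missing h1")
        elif self.levels.count(1) > 1:
            problems.append("multiple h1")
        for prev, cur in zip(self.levels, self.levels[1:]):
            if cur > prev + 1:
                problems.append(f"skipped level: h{prev} -> h{cur}")
        return problems

def audit_headings(html: str) -> list[str]:
    parser = HeadingAudit()
    parser.feed(html)
    return parser.violations()
```

A check like this runs in milliseconds per page, so it costs almost nothing to gate every deploy of a content-heavy site on an empty violations list.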

Layers 2-5 require editorial judgment. Evaluating whether a section is self-contained, whether the above-fold content communicates the page's thesis, whether extraction targets exist in each section, and whether tooltip coverage serves comprehension: these are human decisions. The clinical diagnostic framework provides a systematic approach to identifying which layer is the binding constraint for a given page.

Layer 7, format diversity, is a visual check. Print the page or view it at arm's length. If it looks like a wall of text, the structure is failing the layer-cake scanning requirement. If every section looks identical, the format lacks the diversity that produces different extraction opportunities for AI systems. Tables, callouts, stat blocks, and prose serve different extraction pipelines.

For programmatic builds

If you build template-driven pages at scale, the in-page information architecture is the template. Getting it right means every generated page inherits optimal structure for humans, crawlers, and AI systems. Getting it wrong means every generated page inherits the same structural flaw multiplied across thousands of URLs. The programmatic SEO architecture research covers template design at the system level. This checklist governs what each template produces per page. Topical authority compounds when every page in a programmatic build is structurally sound. It dilutes when structural flaws repeat at scale.

The checklist is intentionally lean. Seven layers, each with a clear validation method. Sites that pass all seven are structuring content for all three audiences simultaneously: human readers who scan in layer-cake patterns, search crawlers that index at the passage level, and AI systems that extract individual sentences under hard token constraints. The evidence from independent research streams converges on the same structural unit, the same attention distribution, the same format advantages. In-page architecture is not a design preference. It is an engineering specification with measurable outcomes.