Research Synthesis

In-Page Information Architecture: Structuring Content for Three Audiences

The SEO discipline has mature frameworks for site architecture and content strategy. The gap is how information is organized within a single page. Three consumers parse in-page structure using overlapping but distinct logic: human readers scan in patterns shaped by visual hierarchy, search crawlers segment text into ranked passages, and AI retrieval systems extract individual sentences under hard token constraints. The design patterns that serve all three converge on the same structural unit.

Compiled by Aviel Fahl · Last updated March 25, 2026

Key Findings

AI citation systems extract individual sentences with a median length of 10 words and a hard ceiling at 17 words. Structured content (headings, lists, tables) achieves a 2.3x citation advantage over unstructured prose. The grounding budget per query is approximately 2,000 words, and content survival drops from 61% for pages under 1,000 words to 13% for pages over 3,000 words. Eye-tracking research confirms that 57% of viewing time is spent above the fold and that scannable formatting improves usability by 124%.

These constraints converge on a single structural primitive: the 50-180 word semantic unit with an explicit heading, self-contained sentences, and at least one extractable data point. Pages built from these units serve human scanners, passage-level indexing, and AI extraction pipelines simultaneously. Finally, 94.8% of pages fail at least one WCAG 2 accessibility check, meaning compliance with accessibility standards that also benefit search and AI retrieval is a structural competitive advantage.

10 words

Median sentence length in AI citations (Shashko, 42,971 citations)

2.3x

Structured content citation advantage vs unstructured

94.8%

Pages with at least one WCAG 2 failure (WebAIM 2025)

~2,000

Words in AI grounding budget per query (DEJAN)

The Missing Middle


Site architecture has a mature body of research. Topic clusters produce +43% keyword visibility (HubSpot). Crawl depth beyond three clicks degrades crawl rates by 33% (Botify, 6.2B requests). Internal linking A/B tests show +25% organic uplift (SearchPilot). These findings shape how practitioners connect pages to each other.

Content strategy has its own evidence base. The content engineering discipline covers what to build: content models, metadata, structured data, governance systems. The gap is the layer between these two: how information is organized within a single page.

The gap matters because three distinct consumers parse in-page structure, each with different processing logic.

Human readers scan in predictable patterns shaped by visual hierarchy. NN/g eye-tracking across multiple studies (2006-2019, cumulative n=200+) found that 79% of users scan any new page. Only 16% read word-by-word. The layer-cake pattern, where users fixate on headings and skip body text, is the most effective scanning behavior. It only occurs when pages provide sufficient structural cues. Without headings, bolding, and whitespace, users fall into the F-pattern: a failure state where reading efficiency collapses.

Search crawlers extract text from rendered HTML, segment it into passages, and classify entities. Passage-level indexing (patent US20160078102) evaluates individual passages within a page independently of surrounding content. A well-structured page with clear section boundaries gives the passage indexer cleaner input.

AI retrieval systems extract at sentence-level granularity. Shashko's analysis of 42,971 AI citations found a median cited sentence length of 10 words and a hard ceiling at 17 words. The grounding budget per query is approximately 2,000 words (DEJAN, 7,060 queries). Content must be structured so the highest-value sentences are self-contained, because AI systems extract exact sentences, not paraphrases.

Why this is not 'content quality'

Content quality and in-page architecture are independent dimensions. A page can contain excellent, well-researched content and still fail all three audiences if the information is presented as undifferentiated prose. The content engineering research covers what to build. This page covers how to structure what you have built.

The Semantic Unit


The semantic unit is the atomic building block of in-page architecture. iPullRank defines a semantic unit as a 50-150 word block capturing a single concept with explicit subject-predicate-object structure. The size range aligns with multiple independent findings about how machines process content.

Source: Multiple independent studies converge on the same structural unit size: 50-180 words, self-contained, with explicit headings.

| Finding | Measurement | Source |
| --- | --- | --- |
| Optimal section length for AI citation | 120–180 words | SE Ranking, 129K domains |
| Structured content citation advantage | 2.3x (91.3% vs 39.3% sentence match) | Shashko, 42,971 citations |
| Splitting combined topics improves cosine similarity | +19.24% | iPullRank, relevance engineering |
| Adding proper headers after splitting | +17.54% additional lift | iPullRank, relevance engineering |
| Sequential headings and citation correlation | 2.8x higher citation likelihood | AirOps, 2026 State of AI Search |

The convergence is notable because these studies measured different things. SE Ranking measured AI citation rates across 129,000 domains. Shashko measured sentence-level extraction patterns across six AI platforms. iPullRank measured cosine similarity improvements from restructuring content. AirOps measured citation likelihood across their 2026 dataset. They all arrived at the same structural unit size.

The semantic unit is not just a search optimization concept. It maps directly to how humans process information in the layer-cake scanning pattern. Users fixate on headings to decide whether to read the block below. Each heading-plus-block is a decision point. If the block is self-contained, meaning the heading accurately previews the content and the content delivers without requiring context from adjacent blocks, both the human scanner and the AI extractor can process it independently.

NN/g paragraph attention data quantifies this: paragraph 1 is viewed by 81% of users, paragraph 2 by 71%, paragraph 3 by 63%, paragraph 4 by only 32%. The drop from first to fourth paragraph is nearly 50 percentage points. This steep decay makes the case for front-loading the most important information within each semantic unit, not just within the page overall.
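The sizing rule lends itself to automation. The sketch below splits a markdown-style draft on its headings and flags sections outside the 50-180 word band; the function name and the markdown-heading assumption are illustrative, not a prescribed tool.

```python
import re

UNIT_MIN, UNIT_MAX = 50, 180  # word-count bounds from the convergent findings above

def audit_units(markdown: str) -> list[tuple[str, int, str]]:
    """Split a markdown document on its headings and flag sections
    whose body falls outside the 50-180 word semantic-unit range."""
    results = []
    # re.split with a capturing group keeps each heading line in the output,
    # so headings and their bodies alternate in the parts list.
    parts = re.split(r"^(#{1,6} .+)$", markdown, flags=re.M)
    for heading, body in zip(parts[1::2], parts[2::2]):
        words = len(body.split())
        if words < UNIT_MIN:
            verdict = "too thin"
        elif words > UNIT_MAX:
            verdict = "split into smaller units"
        else:
            verdict = "ok"
        results.append((heading.lstrip("# "), words, verdict))
    return results
```

Running it over a draft yields one `(heading, word_count, verdict)` tuple per section, which an editor can scan in seconds.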

Grounding Budget and Extraction Mechanics


AI retrieval systems operate under hard constraints that dictate how much of a page gets used. The AI citation research covers the retrieval pipeline in detail. Here the focus is on what these constraints mean for page-level structure.

Source: DEJAN AI, 7,060 queries, 2,275 tokenized pages, 883,262 snippets (March 2026)

| Page Length | Content Grounded | Implication |
| --- | --- | --- |
| <1K words | 61% | Short pages lose less but have less to offer |
| 1–2K words | 35% | Sweet spot for most content types |
| 2–3K words | 22% | Diminishing returns begin |
| 3K+ words | 13% | Most content never reaches the model |

Grounding plateaus at approximately 540 words / 3,500 characters regardless of page length. The architectural question is not "how long should the page be" but "which 540 words will the model see."
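As a back-of-envelope illustration, the tiers above and the 540-word plateau combine into a rough estimator. The thresholds are lifted from the table; the linear interpolation within each tier is an assumption for the sketch, not DEJAN's actual model.

```python
# Approximate grounded share by page length (DEJAN tiers from the table above).
GROUNDING_TIERS = [
    (1_000, 0.61),          # <1K words
    (2_000, 0.35),          # 1-2K words
    (3_000, 0.22),          # 2-3K words
    (float("inf"), 0.13),   # 3K+ words
]

GROUNDING_PLATEAU = 540  # words actually surfaced, regardless of page length

def estimated_grounded_words(page_words: int) -> int:
    """Rough estimate of how many words an AI retrieval system will
    ground from a page, capped at the ~540-word plateau."""
    for ceiling, share in GROUNDING_TIERS:
        if page_words < ceiling:
            return min(round(page_words * share), GROUNDING_PLATEAU)
    return GROUNDING_PLATEAU
```

The estimator makes the plateau concrete: past roughly 2,500 words, adding length changes which words compete for the budget, not how many are used.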

Shashko's 42,971-citation study across six platforms supplies the granular picture. The median cited sentence is 10 words. The maximum is 17 words, a hard ceiling: nothing longer was cited in the entire dataset. 92.4% of citations fall between 6 and 20 words. Position bias is significant: the mean cited position is 34.9% down the page, with the 75th percentile at 48.8%. The top third of the page produces a disproportionate share of citations.

The structural gap between pages is stark. Pages with lists, tables, and headings achieved a 91.3% sentence-match rate. Unstructured pages: 39.3%. That 2.3x advantage is not about having better content. It is about making individual sentences extractable. A well-researched paragraph buried in a wall of text is invisible to the extraction pipeline. The same sentence with a heading above it and whitespace around it is a candidate for citation.

The extractive summarization constraint

Google's AI systems use extractive summarization, pulling exact sentences from source pages rather than paraphrasing. DEJAN confirmed this through direct comparison of AI output against source text. The structural implication: every sentence that could be a citation target must be grammatically complete and self-contained. Sentences that begin with "This" or "However" referring to a previous paragraph are invisible to extractive pipelines because they cannot stand alone.
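A simple pre-publication check can approximate this constraint. The heuristic below flags sentences that fall outside the cited length band or open with an anaphoric word; the opener list and function name are illustrative, not exhaustive.

```python
# Openers that signal a sentence depends on the previous paragraph.
DANGLING_OPENERS = ("This", "That", "These", "Those", "However", "It")

def citation_eligible(sentence: str) -> bool:
    """Heuristic check that a sentence could survive extractive
    summarization: within the observed length band (hard ceiling of
    17 words in Shashko's data) and no anaphoric opener that would
    need prior context to make sense."""
    words = sentence.split()
    if not 6 <= len(words) <= 17:
        return False
    return words[0].rstrip(",") not in DANGLING_OPENERS
```

Running every sentence of a draft through a check like this surfaces the two failure modes the research identifies: sentences too long to be cited at all, and sentences that cannot stand alone.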

Scanning Patterns as Structural Constraints


Eye-tracking research provides the empirical foundation for in-page architecture decisions. These are measured behaviors, not style preferences, and they determine whether content gets processed.

NN/g eye-tracking studies (multiple rounds, 2006-2019, cumulative n=200+) established two primary scanning patterns. The layer-cake pattern is optimal: users fixate on headings and subheadings, deliberately skipping body text between them. NN/g describes this as "by far the most effective way to scan pages." It occurs when pages provide clear visual hierarchy with distinct headings, varied formatting, and whitespace. The F-pattern is a failure state: users read the first line fully, scan partway through the second, then skim vertically down the left margin. It occurs when pages lack formatting cues. Reading efficiency collapses.

Source: NN/g eye-tracking study (2018, n=120, 130,000+ fixations, 1920x1080 screens)

| Position | Viewing Time | Implication |
| --- | --- | --- |
| Above fold | 57% | Critical content and key findings here |
| First two screenfuls | 74% | Most users never scroll past this |
| First three screenfuls | 81% | Effective page boundary for most visitors |
| Below three screenfuls | 19% | Only committed readers reach here |

Combined scannable + concise + objective formatting produces a +124% usability improvement (Morkes & Nielsen, 1997):

  • Scannable text (headings, bullets): +47%
  • Concise text (half word count): +58%
  • Objective tone (non-promotional): +27%

The sample was small by behavioral science standards, but the directional findings are consistent with all subsequent NN/g research.

First impressions compound the structural stakes. Users form reliable aesthetic judgments within 50 milliseconds (Lindgaard et al. 2006, 1,000+ academic citations, replicated by Google/University of Basel 2012). Visual complexity and prototypicality affect perception at 17ms. The page structure above the fold is evaluated before the first word is read. A dense wall of text triggers a negative aesthetic judgment before content quality can register.

The connection to NavBoost: these scanning patterns directly feed Google's most important ranking signal. If visual structure causes a user to stay and engage (goodClicks, lastLongestClicks), the page accumulates positive NavBoost signal on a rolling 13-month window. If poor structure causes a quick return to the SERP (badClicks), the page accumulates negative signal. In-page architecture is not a UX concern separate from ranking. It is a ranking input.

Content Format and Citation Rates


Specific content formats produce measurably different AI citation rates. The format is the container, not the content. The same information, restructured, produces different citation outcomes.

Source: Onely, compiled from Digital Bloom 2025, AmICited.com, Frase.io, Semrush. Sample: 768K+ citations, 67,394 content pieces.

| Format | Citation Rate / Lift | Source |
| --- | --- | --- |
| Data tables | ~2.5x vs paragraph text | Onely, compiled from multiple studies |
| FAQ structure | 3.2x more likely in AIOs | Onely |
| Comprehensive guides with data tables | 67% citation rate | Onely |
| Product comparison pages | 60–70% | Onely |
| Structured how-to guides | 54% | Onely |
| Comparative listicles | 32.5% of all citations | Onely |
| Narrative how-to | 25–40% | Onely |
| Opinion pieces | 18% | Onely |

The gradient from data tables (2.5x) to opinion pieces (18%) is a format effect, not a quality effect. Opinion pieces can be brilliant. Data tables can be trivial. The difference is extractability: tables present discrete, labeled data points that AI systems can lift directly. Opinion prose requires the model to identify the claim, which adds a processing step that reduces selection likelihood.

Source: Semrush, 337,000 URLs analyzed (2026)

| Quality | Citation Lift | Note |
| --- | --- | --- |
| Clarity (structure, readability) | +32.83% | Strongest signal, highest-leverage fix |
| E-E-A-T signals | +30.64% | Expertise, experience, authority markers |
| Q&A format | +25.45% | Self-contained answers to specific questions |
| Factual density | +22.17% | Statistics, data points, named sources |
| Comprehensiveness | +18.92% | Breadth of topic coverage |

Clarity is the content quality most strongly correlated with AI citation. Not depth, not authority, not comprehensiveness, but structural clarity: the property that makes content parseable by both human scanners and machine extractors. The AI citation research covers the full pipeline from retrieval to citation. Here, the takeaway is narrower: clarity is primarily a structural property, achieved through heading hierarchy, semantic units, data tables, and format diversity. It is an architecture decision, not a writing skill.

Progressive Disclosure: Tooltips as Case Study


Progressive disclosure is the principle of showing only core content initially and revealing detail on demand. NN/g research confirms it improves learnability, efficiency, and error reduction. The failure condition is more than two disclosure levels, where users lose orientation. Tooltips represent a single disclosure level, well within safe bounds.

Glossary tooltips are a specific implementation of this principle: domain-specific terminology is defined in context via hover/tap popovers. Baymard Institute validated this pattern across 4,400+ usability test sessions (25 rounds, Think Aloud protocol). Definitions served in tooltips on desktop or tappable links on mobile improved comprehension without adding page length. The tested examples, B&H Photo for video resolution and Crutchfield for audio terminology, demonstrate the pattern across different product vocabularies.

Implementation constraints from NN/g timing research: a 200ms open delay prevents accidental activation during normal cursor movement; a 150ms close delay prevents premature dismissal when the user moves their cursor to the tooltip content to click a link. WCAG 1.4.13 requires that tooltip content be dismissible (Escape key), hoverable (the user can enter the tooltip without it closing), and persistent (it stays visible until actively dismissed).
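The timing behavior amounts to a small gate that opens only after a sustained hover and closes only after a sustained exit. The Python sketch below models that gate with plain timestamps; it is illustrative only (a real tooltip lives in front-end code), and every class and method name is invented for the example.

```python
class TooltipTimer:
    """Models the NN/g hover timing gate: a 200ms open delay prevents
    accidental activation, a 150ms close delay lets the cursor travel
    into the tooltip body. All times are in milliseconds."""

    def __init__(self, open_delay: float = 200, close_delay: float = 150):
        self.open_delay = open_delay
        self.close_delay = close_delay
        self.visible = False
        self._entered_at = None  # timestamp of the last pointer-enter
        self._left_at = None     # timestamp of the last pointer-leave

    def pointer_enter(self, now_ms: float) -> None:
        self._entered_at, self._left_at = now_ms, None

    def pointer_leave(self, now_ms: float) -> None:
        self._left_at, self._entered_at = now_ms, None

    def tick(self, now_ms: float) -> bool:
        """Call periodically; returns whether the tooltip should be visible."""
        if self._entered_at is not None and now_ms - self._entered_at >= self.open_delay:
            self.visible = True
        if self._left_at is not None and now_ms - self._left_at >= self.close_delay:
            self.visible = False
        return self.visible
```

Note that the close delay is what satisfies the "hoverable" clause of WCAG 1.4.13: the tooltip stays open long enough for the user to move into it.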

The SEO mechanics are straightforward. Google indexes tooltip content present in rendered HTML via its Chromium-based rendering pipeline, including content rendered through the native Popover API. The content exists in the DOM regardless of visual state. But hidden content is weighted less than visible content. John Mueller has stated this for tabs and accordions, and the same logic applies. Tooltip definitions contribute to entity understanding and page-level semantics without carrying the full weight of visible body text.

Content effort signal

The 2024 Google API leak revealed contentEffort, an LLM-based scoring attribute that quantifies editorial investment computationally. A site-wide glossary system with canonical definitions, consistent terminology, and editorial governance is a form of measurable content effort. Whether this specific pattern registers in the contentEffort scorer is unknown, but the attribute's existence confirms Google measures effort through automated means, not just human quality raters. The Reality Gap research covers the full list of leaked quality signals.

The Information Gain patent (US20200349181A1, granted June 2024) provides one more conceptual connection. The patent defines information gain as "the amount of valuable information learned minus the amount of effort it took to learn." Tooltip definitions plausibly reduce the effort denominator by making content self-contained, eliminating the need for external lookups. The connection is inferential. No study has tested whether tooltips specifically affect information gain scoring.

The largest research gap in this area: no published A/B test measures tooltip impact on engagement metrics (time-on-page, bounce rate, scroll depth, conversion). The UX case for tooltips rests on usability testing observations, not quantitative engagement data. This is a gap worth closing for any site implementing the pattern at scale.

Accessibility as Structural Advantage


Web accessibility compliance is both a legal obligation and a structural advantage that overlaps with in-page architecture goals. The overlap is underappreciated: many WCAG requirements directly produce the structural properties that benefit search and AI retrieval.

94.8%

Pages failing WCAG 2 (WebAIM 2025, 1M pages)

51

Average errors per page

4,187

Accessibility lawsuits in 2024 (UsableNet)

Source: WebAIM Million 2025 (February 2025, n=1,000,000)

| Violation | % of Pages | Architecture Relevance |
| --- | --- | --- |
| Low contrast text | 79.1% | Reduces scanning speed and readability |
| Missing alt text | 55.5% | Image search and entity recognition |
| Missing form labels | 48.2% | Form conversion and screen readers |
| Empty links | 45.4% | Navigation and link equity signals |
| Empty buttons | 29.6% | Interaction and conversion |
| Missing document language | 15.8% | Language classification (rosettaLanguages) |

The ARIA paradox: pages with ARIA averaged 57 errors, compared to 27 on pages without (WebAIM Million 2025). ARIA does not cause the errors; complex implementations simply tend to be more broken. The finding is a caution against adding ARIA attributes as a checkbox exercise. Semantic HTML that needs fewer ARIA overrides produces better outcomes than ARIA layered on top of non-semantic markup.

The structural overlap between WCAG compliance and in-page architecture is concrete:

  • Semantic heading hierarchy (h1-h6) creates the layer-cake scanning pattern that both users and passage-level indexing depend on.
  • Alt text provides entity context for image understanding and multimodal retrieval.
  • Document language aids Google's language classification pipeline.
  • Keyboard navigation structure implies logical content ordering.
  • Color contrast improves readability, affecting scanning efficiency and time-on-page, feeding back into NavBoost behavioral signals.

A correlation finding: WCAG-compliant sites show 23% more organic traffic and 27% more keywords (Semrush/AccessibilityChecker.org, 2025, n=10,000). This is a correlation, not causation. The likely mechanism: sites that invest in accessibility tend to invest in other structural quality signals (semantic HTML, proper heading hierarchy, clean markup), and those cumulative signals produce the traffic differential.

The business case extends beyond search. Click-Away Pound (2019) found that 69% of disabled consumers abandon inaccessible sites; 4.9M disabled online shoppers represent GBP 17.1B/year in lost purchasing power. European Accessibility Act (EAA) enforcement began in 2025. In the US, UsableNet tracked 4,187 digital accessibility lawsuits in 2024, with projections trending upward for 2025.

Practitioner Reference

Operational Framework


The evidence above translates into a page-level architecture checklist. The seven layers form a construction sequence, not a scoring rubric. Layers are ordered by dependency: heading hierarchy must exist before semantic units can be evaluated, and semantic units must exist before extraction targets can be assessed.

Source: Derived from convergent evidence across NN/g, Shashko, DEJAN, iPullRank, Baymard, and WCAG 2.2

| Layer | Action | Validation |
| --- | --- | --- |
| 1. Heading hierarchy | H1 > H2 > H3, no skipped levels, each H2 scoping a semantic unit | Automated: heading-level audit |
| 2. Semantic unit sizing | 50–180 words per section, single concept, SVO sentences | Manual: review each section for self-containment |
| 3. Above-fold structure | Key finding or value proposition in first 100 words | Can someone understand the thesis without scrolling? |
| 4. Extraction targets | At least one table, one sourced data point, one self-contained definition per 500 words | Would an AI system find a citable sentence in each section? |
| 5. Progressive disclosure | Glossary tooltips for domain terminology, first occurrence only | Automated: tooltip coverage audit |
| 6. Accessibility baseline | WCAG 2.2 AA, semantic HTML, ARIA where needed, contrast ratio | Automated: axe-core or WAVE scan |
| 7. Format diversity | Mix of prose, tables, callouts, stat blocks within the page | Does the page look scannable at arm’s length? |

Layers 1 and 6 are automatable. A heading-level audit can flag skipped levels, missing H1s, or H2s that don't scope a single concept. An axe-core scan catches the 94.8% failure rate violations that WebAIM documents. These should run as part of any CI pipeline for content-heavy sites.
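A minimal version of that heading-level audit fits in a few lines of Python using the standard library's HTML parser; the violation labels and function names here are illustrative, not a fixed spec.

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collects h1-h6 levels from an HTML document and reports
    structural violations: missing or multiple h1, skipped levels."""

    def __init__(self):
        super().__init__()
        self.levels: list[int] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h1".."h6" match directly.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

    def violations(self) -> list[str]:
        problems = []
        if self.levels.count(1) == 0:
            problems.append("missing h1")
        elif self.levels.count(1) > 1:
            problems.append("multiple h1")
        for prev, cur in zip(self.levels, self.levels[1:]):
            if cur > prev + 1:
                problems.append(f"skipped level: h{prev} -> h{cur}")
        return problems

def audit_headings(html: str) -> list[str]:
    parser = HeadingAudit()
    parser.feed(html)
    return parser.violations()
```

A check like this runs in milliseconds per page, so it costs almost nothing to gate every deploy of a content-heavy site on an empty violations list.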

Layers 2-5 require editorial judgment. Evaluating whether a section is self-contained, whether the above-fold content communicates the page's thesis, whether extraction targets exist in each section, and whether tooltip coverage serves comprehension: these are human decisions. The clinical diagnostic framework provides a systematic approach to identifying which layer is the binding constraint for a given page.

Layer 7, format diversity, is a visual check. Print the page or view it at arm's length. If it looks like a wall of text, the structure is failing the layer-cake scanning requirement. If every section looks identical, the format lacks the diversity that produces different extraction opportunities for AI systems. Tables, callouts, stat blocks, and prose serve different extraction pipelines.

For programmatic builds

If you build template-driven pages at scale, the in-page information architecture is the template. Getting it right means every generated page inherits optimal structure for humans, crawlers, and AI systems. Getting it wrong means every generated page inherits the same structural flaw multiplied across thousands of URLs. The programmatic SEO architecture research covers template design at the system level. This checklist governs what each template produces per page. Topical authority compounds when every page in a programmatic build is structurally sound. It dilutes when structural flaws repeat at scale.

The checklist is intentionally lean. Seven layers, each with a clear validation method. Sites that pass all seven are structuring content for all three audiences simultaneously: human readers who scan in layer-cake patterns, search crawlers that index at the passage level, and AI systems that extract individual sentences under hard token constraints. The evidence from independent research streams converges on the same structural unit, the same attention distribution, the same format advantages. In-page architecture is not a design preference. It is an engineering specification with measurable outcomes.