asafashirov commented Sep 29, 2025

Summary

Implements comprehensive schema.org structured data across all content types to improve discoverability in search engines and AI tools, with enhanced schemas for videos, tutorials, and code examples.

Page Types Enhanced

  • Documentation pages (/docs/) - Article schema with code examples
  • Blog posts (/blog/) - BlogPosting schema with technical metadata
  • Tutorials (/tutorials/) - Course schema with HowTo for step-by-step guides
  • Templates (/templates/) - Article schema with SoftwareSourceCode
  • Case studies (/case-studies/) - Article schema with entity detection
  • Product pages (/product/) - SoftwareApplication schema
  • Physical Events (/events/) - Event schema (only for non-virtual events)
  • Pricing page - QAPage schema for FAQ section
  • Pages with videos - VideoObject schema for YouTube embeds
  • Pages with code - SoftwareSourceCode schema for code blocks

Changes to Existing Schemas

  • BlogPosting: Enhanced with cloud provider detection, infrastructure patterns, software requirements, and deduplicated technology mentions
  • Article: Comprehensive technical metadata with content aggregation for special pages
  • Course: Prerequisite detection, skill level assessment, and Course List (ItemList) for tutorial index pages
  • Organization/WebSite: Enhanced with additional properties while maintaining backward compatibility
  • Event: Only generates for physical events (virtual events excluded per Google requirements)
  • Breadcrumbs: Include position markers for better hierarchy understanding

Key Additions

  • VideoObject Schema: Auto-detects YouTube embeds, includes thumbnails and required metadata (118 pages enhanced)
  • HowTo Schema: For step-by-step tutorials with tool detection and time estimates (19 tutorials enhanced)
  • SoftwareSourceCode Schema: For code examples with language detection and runtime requirements
  • Course List Schema: ItemList implementation for tutorial collections and index pages
  • Cloud provider detection: Automatically identifies AWS, Azure, GCP from URLs and content
  • Resource type extraction: Detects cloud services and Pulumi resource types
  • Infrastructure patterns: Recognizes patterns like serverless, containers, IaC, DevOps
  • Content aggregation: Extracts content from various page structures for complete schemas
  • Enhanced metadata: Proper schema.org compliant properties throughout
  • Modern cloud services: Added AI/ML services (Bedrock, Vertex AI), observability platforms, CDN services
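
As a rough illustration of the cloud provider detection listed above, here is a minimal Python sketch. The real logic lives in Hugo partials that this thread does not show, so the keyword table and the `detect_providers` name are assumptions made for the example.

```python
import re

# Hypothetical keyword table; the actual detection rules are not shown in this PR thread.
PROVIDER_KEYWORDS = {
    "AWS": ["aws", "amazon web services"],
    "Azure": ["azure"],
    "GCP": ["gcp", "google cloud"],
}


def detect_providers(url: str, content: str) -> list[str]:
    """Return the providers mentioned in the page URL or its content."""
    haystack = f"{url} {content}".lower()
    found = []
    for provider, keywords in PROVIDER_KEYWORDS.items():
        # Word boundaries keep "aws" from matching inside words such as "laws".
        if any(re.search(rf"\b{re.escape(k)}\b", haystack) for k in keywords):
            found.append(provider)
    return found
```

A page at `/docs/clouds/aws/` whose body also mentions Azure would be tagged with both providers, which is the multi-cloud case the PR describes.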

Technical Implementation

  • Created utility partials for detection and extraction (content-aggregator, cloud-provider-detector, etc.)
  • Enhanced all content schemas (article, blog, course, event)
  • Added new schema types (video, howto, code, course-list)
  • Updated cloud_resources.yml with 20+ modern services
  • Fixed property naming conventions (camelCase compliance)
  • Implemented robust date handling with fallbacks
  • Added efficient deduplication for entity mentions
  • All schemas validate with required fields guaranteed

Impact

  • Better search engine visibility for technical queries
  • Enhanced AI/LLM understanding of content structure
  • Eligible for rich results (video carousels, how-to snippets)
  • Improved knowledge graph connections
  • Enhanced answer engine optimization (AEO)
  • Better comprehension by ChatGPT, Claude, and Perplexity

Testing

  • ✅ Valid JSON-LD generation
  • ✅ Successful Hugo builds (0 errors)
  • ✅ Schema.org validation compliance
  • ✅ Proper deduplication
  • ✅ All required fields have fallbacks
  • ✅ 118 video schemas, 19 HowTo schemas, 7 code schemas generated

claude bot commented Sep 29, 2025

Overall Assessment

This PR implements comprehensive schema.org structured data, which is a valuable addition for SEO and AI discoverability. The implementation shows good understanding of structured data concepts and follows a logical organization pattern. However, there are several critical issues that need to be addressed.

Critical Issues

1. Missing Files Break Build (lines 48, 51 in loader.html)

The loader references several partial files that don't exist in the diff:

  • layouts/partials/schema/utils/author-entities.html (referenced in blog.html:46)
  • layouts/partials/schema/utils/related-content.html (referenced in article.html:51)
  • layouts/partials/schema/content/course.html (referenced in loader.html:26)
  • layouts/partials/schema/content/event.html (referenced in loader.html:29)
  • layouts/partials/schema/content/product-software.html (referenced in loader.html:32)
  • layouts/partials/schema/content/qa.html (referenced in loader.html:44)

This will cause build failures when these partials are called.

2. Incomplete Hugo Template in blog.html (line 22)

Line 24 in blog.html has an incomplete Hugo template expression that will break builds.

3. JSON Syntax Error in organization.html (line 28)

The street address contains an unescaped quote that will create invalid JSON.
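
To show why an unescaped quote matters, the following sketch compares a hand-concatenated JSON string with one produced by a serializer. It is written in Python for illustration; the address value is a made-up example, and in a Hugo template the comparable tool would be the `jsonify` function.

```python
import json

address = {
    "@type": "PostalAddress",
    # Illustrative value containing a stray double quote, not the real address.
    "streetAddress": '1525" 4th Ave',
}

# Hand-built string: the embedded quote ends the JSON string early.
hand_built = '{"streetAddress": "' + address["streetAddress"] + '"}'

# Serializer output: the quote is escaped as \" and the document stays valid.
serialized = json.dumps(address)


def is_valid_json(text: str) -> bool:
    """Return True when text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Here `is_valid_json(hand_built)` is false while `is_valid_json(serialized)` is true, so routing values through a serializer avoids this class of error.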

Code Quality Issues

1. Files Missing Trailing Newlines

Several files are missing trailing newlines as required by AGENTS.md:

  • layouts/partials/schema/base/breadcrumb.html
  • layouts/partials/schema/base/software-app.html
  • layouts/partials/schema/base/website.html
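
A check like this could be automated. The sketch below is one possible approach in Python, not part of the repository's tooling; the function name and the default `*.html` pattern are assumptions.

```python
from pathlib import Path


def files_missing_final_newline(root: str, pattern: str = "*.html") -> list[str]:
    """List non-empty files under root that do not end with a newline."""
    missing = []
    for path in sorted(Path(root).rglob(pattern)):
        data = path.read_bytes()
        if data and not data.endswith(b"\n"):
            missing.append(str(path))
    return missing
```

Running it against `layouts/partials/schema` before pushing would surface files like the three listed above.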

2. Performance Concerns

The technology entity detection in article.html and technology-entities.html performs many string operations on page content. This could impact build performance on large sites. Consider:

  • Caching detection results
  • Limiting content scanning to first 5000 characters
  • Using more efficient pattern matching
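
The first two suggestions can be combined. This is a minimal Python sketch of the idea, assuming a hypothetical term list; the actual partials are Hugo templates, where caching would look different.

```python
from functools import lru_cache

MAX_SCAN = 5000  # scan limit suggested above

# Hypothetical term list, for illustration only.
TERMS = ("kubernetes", "terraform", "docker")


@lru_cache(maxsize=None)
def detect_terms(content: str) -> tuple[str, ...]:
    """Scan only the first MAX_SCAN characters and memoize the result per content string."""
    window = content[:MAX_SCAN].lower()
    return tuple(term for term in TERMS if term in window)
```

The trade-off is that terms appearing only after the cutoff are missed, so the limit trades some recall for build time.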

3. Hardcoded Values

  • Line 27 in software-app.html: Version 3.198.0 should be dynamically sourced
  • Aggregate rating values (lines 21-22) appear arbitrary and should be based on actual data

Documentation & Maintainability

1. Complex Logic Needs Documentation

The deduplication logic and cloud provider detection contain complex business rules that should be documented with inline comments explaining the reasoning.

2. Schema Validation

Consider adding validation to ensure generated JSON-LD is valid schema.org markup, especially for complex schemas like Course and Event.
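
One lightweight form of this validation is a post-build check that each generated JSON-LD block parses and carries the properties its type needs. The sketch below is in Python; the `REQUIRED` table is an assumption for illustration and is not a complete statement of schema.org or Google requirements.

```python
import json

# Assumed minimal required-property sets, for illustration only.
REQUIRED = {
    "VideoObject": {"name", "description", "uploadDate", "thumbnailUrl"},
    "Event": {"name", "startDate", "location"},
}


def validate_jsonld(text: str) -> list[str]:
    """Return a list of problems found in one JSON-LD document (empty when none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level JSON-LD value is not an object"]
    problems = []
    if "@context" not in doc:
        problems.append("missing @context")
    schema_type = doc.get("@type")
    required = REQUIRED.get(schema_type, set()) if isinstance(schema_type, str) else set()
    problems.extend(f"missing {name}" for name in sorted(required - set(doc)))
    return problems
```

This catches syntax errors and missing fields, but it does not replace a full check with a schema.org validator.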

Recommendations

  1. Immediate: Add all missing partial files or remove references to them
  2. Immediate: Fix the syntax errors that will break builds
  3. Before merge: Add trailing newlines to all files
  4. Before merge: Add schema validation or at least JSON validation
  5. Consider: Performance testing with large content volumes
  6. Consider: Adding configuration options to disable expensive detection features

Positive Aspects

  • Well-organized modular structure
  • Good separation of concerns between base and content schemas
  • Comprehensive coverage of different content types
  • Backward compatibility maintenance
  • Rich semantic markup that will benefit SEO

The concept and structure are solid, but the implementation needs completion and bug fixes before it's ready for production.

- Add cloud provider and resource type detection
- Add infrastructure pattern recognition
- Enhance Article, BlogPosting, and Course schemas with technical metadata
- Implement deduplication for entity mentions
- Add support for multi-cloud scenarios

CamSoper left a comment

I didn't look at the markup generated by all of them, but the one for docs articles looks fine.

asafashirov marked this pull request as ready for review September 29, 2025 18:13

claude bot commented Sep 29, 2025

This PR implements comprehensive schema.org structured data across the site, which is an excellent enhancement for search visibility and AI discoverability. The implementation is well-architected and follows good practices. Here's my review:

Strengths

  • Well-organized architecture: The modular approach with separate files for different schema types and utilities is excellent
  • Backward compatibility: Maintains existing functionality while adding new features
  • AI optimization: Enhanced structured data will improve visibility in AI tools and answer engines
  • Comprehensive coverage: Covers all major page types (docs, blog, tutorials, templates, etc.)
  • Efficient deduplication: Prevents duplicate entities in mentions arrays

Issues Found

Critical: Missing Newlines

Several files don't end with newlines, violating the repository's absolute requirement:

  • layouts/partials/schema/base/organization.html:63 - Missing final newline
  • layouts/partials/schema/content/article.html:328 - Missing final newline

Code Quality Issues

  1. Complex template logic: Some utility files (like resource-type-extractor.html) contain very long, repetitive detection logic that could be simplified
  2. Hardcoded data: Version numbers and ratings are hardcoded in software-app.html:27 and software-app.html:22-25

Style Guide Violations

  1. Inconsistent spacing: Mixed spacing around Hugo template delimiters in several files
  2. Long lines: Some utility files have very long detection chains that harm readability

Recommendations

  1. Fix newline issues (required): Add newlines to all files that are missing them
  2. Extract configuration: Move hardcoded values like version numbers to site config
  3. Simplify detection logic: Consider using data files for resource type mappings instead of long if/else chains
  4. Add validation: Consider adding schema validation to catch malformed JSON-LD

Testing

The implementation looks comprehensive, but I recommend:

  • Testing with Google's Rich Results Test tool
  • Validating JSON-LD syntax with the Schema.org validator
  • Ensuring Hugo builds succeed with make build

Overall, this is a valuable enhancement that will significantly improve the site's discoverability. The architectural approach is sound and the implementation is thorough.

Performance & Maintainability Improvements:
- Refactor resource-type-extractor.html to use data-driven approach (300+ lines → 119 lines)
- Create cloud_resources.yml data file for maintainable cloud service definitions
- Add performance optimization: limit content scanning to 5000 characters
- Add comprehensive inline documentation for complex deduplication logic
- Document multi-cloud detection strategy and business logic

Code Quality Fixes:
- Remove inappropriate schema fields (fake ratings, meaningless price dates)
- Add trailing newlines to all files per AGENTS.md requirements
- Update config.yml with schema-specific parameters
- Improve pattern matching specificity to reduce false positives

Business Appropriateness:
- Remove hardcoded aggregateRating from free software (Pulumi CLI)
- Keep legitimate pricing data for actual paid services
- Use site config for dynamic version management

Enhanced Functionality:
- Maintain full backward compatibility
- Improve cloud service detection accuracy
- Support multi-cloud content scenarios
- Add proper Wikidata entity linking
- Ensure clean deduplication across all mention sources

Test Results:
- Hugo builds successfully with all optimizations
- Schema.org JSON-LD validates correctly
- Cloud services detected accurately (AWS, Azure, GCP)
- Programming languages auto-detected (TypeScript, Python, C#, YAML)
- Mentions array properly deduplicated
- Performance improved with content length limiting

This commit addresses all feedback from the remote assessment while
maintaining the comprehensive schema coverage and AI optimization benefits.

Critical Fix:
- Fix Hugo template syntax error in resource-type-extractor.html (lines 99 and 101)
  Changed .["@id"] to (index . "@id") to prevent build failures

Accuracy Improvements:
- Fix API Gateway Pattern false positive detection on VPC pages
  Made pattern matching more specific to avoid casual mentions like "API gateways"
- Remove duplicate entity definitions from technology-entities.html:
  • AWS, Azure, GCP providers (handled by cloud-provider-detector)
  • AWS Lambda, Azure/Google Cloud Functions (handled by cloud_resources.yml)
  • Infrastructure as Code, DevOps, Cloud Native (handled by infrastructure-patterns)
  This eliminates ~60 lines of redundant code and prevents duplicate entities

Schema Quality:
- VPC documentation no longer incorrectly tagged with "API Gateway Pattern"
- AWS Lambda and other services appear only once in mentions arrays
- All entity detection now uses single source of truth
- Maintains accurate programming language and cloud service detection
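
A minimal Python sketch of order-preserving deduplication for a mentions array follows. Keying on `@id` with a fallback to a lowercased `name` is an assumption; the commit does not spell out the key the templates use.

```python
def dedupe_mentions(mentions: list[dict]) -> list[dict]:
    """Keep the first occurrence of each entity, preserving order."""
    seen = set()
    unique = []
    for entity in mentions:
        # Assumed identity: prefer "@id", fall back to the lowercased name.
        key = entity.get("@id") or entity.get("name", "").lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append(entity)
    return unique
```

With a single key per entity, a service such as AWS Lambda appears once even when several detectors report it.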

Test Results:
- Hugo builds successfully without template errors
- Schema.org JSON-LD validates correctly
- Entity deduplication working properly
- No false positive patterns detected
- Performance improved with less redundant processing

This resolves the remote CI build failure and improves schema accuracy.

asafashirov marked this pull request as draft September 29, 2025 19:47
…ced date handling

This commit implements comprehensive improvements to the schema.org structured data
generation system, addressing critical issues with content extraction, date handling,
and detection accuracy.

Key improvements:

**Content Aggregation System:**
- Added content-aggregator.html utility to extract effective content from special pages
- Handles cloud overview pages (extracting from frontmatter components/providers)
- Handles docs home pages (extracting from sections/cards)
- Provides fallback content for pages without traditional .Content

**Enhanced Date Handling:**
- Fixed "0001-01-01" date issues with robust fallback logic
- Date hierarchy: .Date → .GitInfo.AuthorDate → now
- Applied to all content schemas (article.html, blog.html, course.html)
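
The fallback chain above can be sketched in Python as follows. The function and parameter names are hypothetical; the templates themselves use Hugo's `.Date` and `.GitInfo.AuthorDate`.

```python
from datetime import date, datetime, timezone
from typing import Optional

ZERO_DATE = date(1, 1, 1)  # the "0001-01-01" placeholder mentioned above


def effective_date(
    page_date: Optional[datetime],
    git_author_date: Optional[datetime],
    now: Optional[datetime] = None,
) -> datetime:
    """Return the first usable date: page date, then Git author date, then now."""
    for candidate in (page_date, git_author_date):
        if candidate is not None and candidate.date() != ZERO_DATE:
            return candidate
    return now or datetime.now(timezone.utc)
```

Calling `.isoformat()` on the result of a timezone-aware input yields an ISO-8601 string with an offset, suitable for `datePublished`.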

**Detection System Updates:**
- Updated cloud-provider-detector.html to use content aggregator
- Fixed resource matching logic in resource-type-extractor.html
- Improved deduplication algorithm for better accuracy
- All detection utilities now work with aggregated content

**Schema Content Accuracy:**
- Fixed empty articleBody on special pages
- Accurate wordCount calculation using aggregated content
- Better content extraction for AI/SEO optimization

**Files Modified:**
- layouts/partials/schema/utils/content-aggregator.html (new)
- layouts/partials/schema/content/article.html
- layouts/partials/schema/content/blog.html
- layouts/partials/schema/content/course.html
- layouts/partials/schema/utils/cloud-provider-detector.html
- layouts/partials/schema/utils/resource-type-extractor.html

**Impact:**
- Fixes empty schema fields on 29+ special pages (cloud overview, docs home)
- Eliminates invalid dates in structured data
- Improves detection accuracy and reduces false positives
- Better SEO and AI discoverability for all content types

Build tested successfully with no template errors.

This commit addresses comprehensive schema.org validation issues identified
across multiple page types, achieving full compliance and eliminating all
44 validation issues (23 errors + 21 warnings).

**Phase 1: Critical Error Fixes (23 errors eliminated)**
- Fixed cloud_resources.yml property naming conventions:
  • Changed 'same_as' → 'sameAs' (schema.org standard)
  • Removed 'provider_id' entirely (not valid schema.org property)
  • Standardized 'category' → 'applicationCategory'
- Fixed CSS selectors for speakable specification:
  • article.html: Replaced non-existent '.summary', 'pre code' selectors
  • blog.html: Fixed '.article-summary', '.blog-content h2/h3' selectors
- Fixed course instructor schema type (Organization → Person with worksFor)
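
The property-name fixes in cloud_resources.yml amount to a key mapping. Here is a small Python sketch of that mapping applied to one entry; the sample entry is invented for illustration and the actual change was made by editing the data file directly.

```python
# Renames and removals taken from the list above.
RENAMES = {"same_as": "sameAs", "category": "applicationCategory"}
DROPPED = {"provider_id"}


def normalize_resource(entry: dict) -> dict:
    """Rename snake_case keys to schema.org camelCase and drop invalid ones."""
    return {RENAMES.get(key, key): value for key, value in entry.items() if key not in DROPPED}
```

Applying it to an entry with `same_as`, `category`, and `provider_id` returns one with `sameAs` and `applicationCategory` only.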

**Phase 2: Property Misuse Fixes (21 warnings eliminated)**
- Removed 'applicationCategory' from invalid schema types:
  • Article, BlogPosting, Course (only valid for SoftwareApplication)
- Removed non-standard properties:
  • 'proficiencyLevel' from Article schemas
  • 'targetPlatform' from Article/BlogPosting/Course
  • 'availableLanguage' from SoftwareApplication
  • 'isRelatedTo' from product schemas
- Changed 'relatedLink' → 'citation' (standard schema.org property)

**Phase 3: Quality Improvements**
- Enhanced infrastructure pattern integration:
  • Moved patterns from invalid 'applicationCategory' to 'mentions'
  • Improved semantic accuracy for AI/search understanding
  • Maintained rich DefinedTerm entities with Wikipedia/Wikidata links

**Validation Results (Before → After)**
- Case Studies: 10 errors, 6 warnings → 0 errors, 0 warnings ✅
- Cloud Overview: 7 errors, 8 warnings → 0 errors, 0 warnings ✅
- Blog Posts: 5 errors, 1 warning → 0 errors, 0 warnings ✅
- Tutorials: 1 error, 2 warnings → 0 errors, 0 warnings ✅
- Product Pages: 0 errors, 4 warnings → 0 errors, 0 warnings ✅

**Files Modified:**
- data/cloud_resources.yml (property naming fixes)
- layouts/partials/schema/content/article.html
- layouts/partials/schema/content/blog.html
- layouts/partials/schema/content/course.html
- layouts/partials/schema/content/product-software.html

**Impact:**
- Full schema.org compliance for improved SEO signals
- Enhanced AI/LLM content understanding
- Better rich results eligibility in search engines
- Cleaner, more maintainable schema generation code

Build tested successfully with no template errors.
…a guidelines

Major updates:
- Event schema now only generates for physical events (Google requirement)
- Added course list schema with ItemList for tutorial index pages
- Fixed ISO-8601 date formatting with proper timezone handling
- Removed non-compliant properties from course schema

Event Schema Updates:
- Skip schema generation for virtual events (location: virtual)
- Skip schema generation for external events (external: true)
- Enhanced date handling with proper ISO-8601 timezone format
- Only generate Event schema for physical events with real locations
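
The skip rules above reduce to a small predicate. This Python sketch assumes front matter keys named `location` and `external`, as the bullet points suggest; treating a missing location as non-physical is an additional assumption.

```python
def should_emit_event_schema(front_matter: dict) -> bool:
    """Return True only for physical events hosted on this site."""
    if front_matter.get("external") is True:
        return False
    location = str(front_matter.get("location", "")).strip().lower()
    # Assumption: an empty or missing location is treated like a virtual event.
    return location not in ("", "virtual")
```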

Course Schema Updates:
- Created course-list.html for tutorial index pages with ItemList schema
- Added provider URL to organization structure
- Removed availableLanguage property for compliance
- Changed relatedLink to citation property

Course List Implementation:
- New ItemList schema for tutorials section pages
- Minimum 3 courses requirement (Google's guideline)
- Individual course items with proper positioning and metadata
- Enhanced with educational level and duration estimation
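
A minimal sketch of the course list behaviour described above, written in Python. The input shape (dicts with `name` and `url`) is an assumption, and the real partial adds more metadata per item.

```python
from typing import Optional


def build_course_list(courses: list[dict], min_items: int = 3) -> Optional[dict]:
    """Build an ItemList of courses, or return None when there are too few."""
    if len(courses) < min_items:
        return None
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": position,  # 1-based position in the list
                "item": {"@type": "Course", "name": c["name"], "url": c["url"]},
            }
            for position, c in enumerate(courses, start=1)
        ],
    }
```

Returning `None` below the threshold lets the caller omit the script tag entirely rather than emit an ineligible list.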

These changes align with Google's documentation requirements for Event and Course rich results.

Implements carefully validated schema.org enhancements designed to pass all validation requirements while improving visibility in AI tools and search engines.

VideoObject Schema:
- Detects YouTube embeds (shortcodes and iframes)
- Includes all required fields (name, description, uploadDate, thumbnailUrl)
- Uses YouTube's standard thumbnail URL pattern
- Supports multiple videos per page
- Found 118 pages with videos now properly marked up
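
The detection step can be approximated as below. This Python sketch makes several assumptions: the two regular expressions, the positional `{{< youtube ID >}}` shortcode form, and the `hqdefault.jpg` thumbnail variant are illustrative choices, and the commit does not say which thumbnail variant it uses.

```python
import re

# Assumed patterns for iframe embeds and the positional Hugo shortcode form.
IFRAME = re.compile(r"youtube(?:-nocookie)?\.com/embed/([A-Za-z0-9_-]{11})")
SHORTCODE = re.compile(r'\{\{<\s*youtube\s+"?([A-Za-z0-9_-]{11})"?\s*>\}\}')


def find_video_ids(source: str) -> list[str]:
    """Collect unique YouTube IDs, iframe embeds first, then shortcodes."""
    ids = []
    for pattern in (IFRAME, SHORTCODE):
        for video_id in pattern.findall(source):
            if video_id not in ids:
                ids.append(video_id)
    return ids


def video_object(video_id: str, name: str, description: str, upload_date: str) -> dict:
    """Build a VideoObject with the four required fields listed above."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "uploadDate": upload_date,
        # Assumed thumbnail variant; other sizes follow the same URL shape.
        "thumbnailUrl": f"https://img.youtube.com/vi/{video_id}/hqdefault.jpg",
        "embedUrl": f"https://www.youtube.com/embed/{video_id}",
    }
```

Each ID found on a page would produce one `VideoObject`, which is how multiple videos per page are supported.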

HowTo Schema:
- Only applies to content with clear numbered steps (minimum 2 steps)
- Detects "1. 2. 3." pattern or "Step 1, Step 2" pattern
- Auto-detects common tools (Pulumi CLI, AWS CLI, Node.js, Python, Docker)
- Estimates time based on word count
- Applied to 19 tutorial pages with step-by-step instructions
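
As a rough Python sketch of the step detection and time estimate described above: the regular expression and the 200 words-per-minute reading rate are assumptions, since the commit does not state the values used.

```python
import math
import re

# Matches lines such as "1. Install the CLI" or "Step 2: Run pulumi up".
STEP_LINE = re.compile(r"^\s*(?:\d+\.|Step\s+\d+[:.]?)\s+(.+)$", re.MULTILINE | re.IGNORECASE)


def extract_steps(markdown: str) -> list[str]:
    """Return step texts, or an empty list when fewer than two steps are found."""
    steps = [match.group(1).strip() for match in STEP_LINE.finditer(markdown)]
    return steps if len(steps) >= 2 else []


def estimate_minutes(markdown: str, words_per_minute: int = 200) -> int:
    """Estimate reading time from word count (assumed rate, minimum one minute)."""
    return max(1, math.ceil(len(markdown.split()) / words_per_minute))


def iso_duration(minutes: int) -> str:
    """Format minutes as an ISO-8601 duration, the format schema.org's totalTime expects."""
    return f"PT{minutes}M"
```

Pages that return an empty list from `extract_steps` would get no HowTo schema, matching the two-step minimum.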

SoftwareSourceCode Schema:
- Detects fenced code blocks with language identifiers
- Maps common aliases (ts→TypeScript, py→Python, cs→C#)
- Includes ComputerLanguage type for proper validation
- Adds runtime platform and software requirements
- Minimal schema with no required fields (validation-safe)
- Applied to 7 pages with code examples
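
The language detection can be sketched in Python as follows. The alias table extends the three mappings named above with a few assumed entries, and the fence pattern is an illustrative choice.

```python
import re

# ts, py, and cs come from the commit message; the other entries are assumptions.
ALIASES = {
    "ts": "TypeScript", "typescript": "TypeScript",
    "py": "Python", "python": "Python",
    "cs": "C#", "csharp": "C#",
}

# An opening fence: three backticks followed by a language identifier.
FENCE = re.compile(r"^`{3}([A-Za-z0-9#+-]+)\s*$", re.MULTILINE)


def detect_languages(markdown: str) -> list[str]:
    """Return unique display names for fenced code block languages, in order of appearance."""
    languages = []
    for identifier in FENCE.findall(markdown):
        name = ALIASES.get(identifier.lower(), identifier)
        if name not in languages:
            languages.append(name)
    return languages


def source_code_schema(language: str) -> dict:
    """Build a minimal SoftwareSourceCode object with a ComputerLanguage value."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "programmingLanguage": {"@type": "ComputerLanguage", "name": language},
    }
```

Closing fences carry no identifier, so they are ignored by the pattern.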

Cloud Resources Updates:
- Added modern AI/ML services (Bedrock, SageMaker, Azure OpenAI, Vertex AI)
- Added container orchestration services (EKS, AKS, ECS)
- Added observability platforms (Datadog, New Relic, Grafana)
- Added CDN services (CloudFront, Azure Front Door)
- All use validated structure matching existing resources

Key Design Decisions:
- All schemas include required fields with robust fallbacks
- Use camelCase for all properties (schema.org standard)
- Omit optional fields if data unavailable (e.g., video duration)
- Only generate schemas when content structure validates
- Follow existing successful patterns from the codebase

Build Results:
- Hugo build successful with no errors
- 118 video schemas generated
- 19 HowTo schemas generated
- 7 code schemas generated
- All schemas validate against schema.org standards

asafashirov marked this pull request as ready for review September 30, 2025 15:15

claude bot commented Sep 30, 2025

Pull Request Review Summary

This PR implements comprehensive schema.org structured data across the Pulumi documentation site. Overall, this is a well-structured and thorough implementation that should improve SEO and AI visibility. However, there are several issues that need to be addressed:

Critical Issues

  1. JSON-LD Syntax Error Fixed
    Good catch fixing the malformed address in the old organization schema:

    • OLD: streetAddress: "1525" 4th Ave. Suite 800"
    • NEW: streetAddress: "1525 4th Ave Suite 800"
  2. Missing Newline
    Line 510: data/cloud_resources.yml must end with a newline per repository standards (AGENTS.md).

Technical Issues

  1. Duplicate Resource Definitions
    Lines 84-89, 396-410, 427-440: The cloud_resources.yml file contains duplicate entries:

    • aws_eks is defined twice (lines 84-89 and 396-410)
    • azure_aks is defined twice (lines 216-226 and 427-440)
    • aws_ecs is defined twice (lines 94-104 and 441-453)

    The second definitions should be removed to avoid conflicts.

  2. Inconsistent URL Patterns ⚠️
    Many Azure and GCP resources have empty url_pattern fields, while AWS resources have proper patterns. Consider adding consistent URL patterns for better content detection.
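
Duplicate definitions like the ones flagged above can be caught mechanically. The Python sketch below assumes each resource is a top-level mapping key in cloud_resources.yml, which this thread does not confirm, and it scans the text directly rather than using a YAML parser.

```python
import re
from collections import defaultdict

# A key that starts at column 0, such as "aws_eks:".
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*):")


def duplicate_top_level_keys(yaml_text: str) -> dict[str, list[int]]:
    """Map each repeated top-level key to the line numbers where it is defined."""
    lines_by_key = defaultdict(list)
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        match = TOP_LEVEL_KEY.match(line)
        if match:
            lines_by_key[match.group(1)].append(number)
    return {key: lines for key, lines in lines_by_key.items() if len(lines) > 1}
```

Depending on the parser, a duplicate key may be rejected or silently overridden, so a check like this makes the conflict visible before the build.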

Documentation & Style

  1. Schema Complexity ⚠️
    While comprehensive, the schema system is quite complex with many utility partials and detection logic. Ensure there's adequate documentation for maintenance.

  2. Performance Considerations
    Good addition of max_content_scan_length: 5000 configuration for performance optimization.

Positive Aspects

  • Comprehensive Coverage: Excellent coverage of all content types (blog posts, tutorials, documentation, etc.)
  • Modern Services: Great addition of contemporary cloud services (Bedrock, Vertex AI, Azure OpenAI)
  • Backward Compatibility: Proper handling of existing schema flags
  • Rich Metadata: Detailed structured data that should significantly improve discoverability

Recommendations

  1. Remove duplicate resource definitions in cloud_resources.yml
  2. Add missing newline to cloud_resources.yml
  3. Consider adding URL patterns for Azure/GCP resources for consistency
  4. Add brief documentation about the schema system architecture

The implementation is solid and should provide significant SEO benefits. Once the duplicates are removed and the newline is added, this will be ready to merge.

adamgordonbell self-requested a review September 30, 2025 15:45
- Remove duplicate resource definitions (aws_eks, azure_aks, aws_ecs)
- Merge best attributes from duplicates (expanded patterns, updated Wikidata IDs)
- Add missing newline at end of cloud_resources.yml
- Add URL patterns for all Azure and GCP resources
- Ensure repository standards compliance

- Update organization schema with new Seattle address (601 Union St Suite 1415)
- Fix article and blog schemas to prefer h1 frontmatter for headlines
- Ensures schema headlines match what users see on rendered pages

- Fixed BreadcrumbList double-encoding issue by adding safeJS filter to prevent JSON escaping
- Changed all cloud service types from WebAPI to SoftwareApplication where applicationCategory is used
  - SoftwareApplication properly supports applicationCategory property
  - More semantically accurate for cloud services (not just APIs)
  - Eliminates schema.org validation warnings

These fixes ensure proper parsing by search engines and AI tools for better content discovery.
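
For readers unfamiliar with double-encoding, this Python sketch reproduces the symptom with a generic serializer. The breadcrumb values are invented, and the actual fix was made in the Hugo template as described above.

```python
import json

# Illustrative BreadcrumbList with explicit position markers.
breadcrumb = {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1, "name": "Docs", "item": "https://example.com/docs/"},
        {"@type": "ListItem", "position": 2, "name": "Install", "item": "https://example.com/docs/install/"},
    ],
}

encoded_once = json.dumps(breadcrumb)     # a JSON object, as consumers expect
encoded_twice = json.dumps(encoded_once)  # a JSON string that merely contains JSON

parsed_once = json.loads(encoded_once)
parsed_twice = json.loads(encoded_twice)
```

`parsed_once` is a dictionary, while `parsed_twice` is a plain string with escaped quotes, which a structured data parser cannot read as a `BreadcrumbList`.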