
Cloudflare Worker Setup Guide

Intercept bot traffic and inject Schema.org markup automatically


What This Does

Intercepts all requests to your customer's website

Detects bot traffic via User-Agent (Google, Bing, GPT, Claude, etc.)

Fetches schema from your R2 bucket at schema.baselinelabs.ai/{domain}/{path}.json

Injects schema inline into HTML <head>

Passes through unmodified for human traffic

Worker Script Template

export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const userAgent = request.headers.get('User-Agent') || '';

    // Bot detection - comprehensive list of search engines, AI crawlers, and social bots
    // Updated for 2025-2026 bot landscape
    const botPatterns = [
      // Search Engines
      'googlebot',
      'bingbot',
      'yandexbot',
      'baiduspider',
      'duckduckbot',
      'applebot',
      // AI Crawlers (Updated 2025-2026)
      'gptbot',
      'google-extended',        // Gemini (2024+); a robots.txt token, not sent as a User-Agent
      'claudebot',
      'claude-web',
      'perplexitybot',
      'anthropic-ai',
      'chatgpt-user',
      'cohere-ai',
      'omgilibot',
      'omgili',
      'meta-externalagent',     // Meta's updated crawler (2024+)
      'facebookbot',            // Legacy Meta crawler
      'facebot',
      'bytespider',             // TikTok/ByteDance crawler
      'amazonbot',              // Amazon Alexa
      'applebot-extended',      // Apple Intelligence (2024+); a robots.txt token, also matched by 'applebot'
      'youbot',                 // You.com search
      'diffbot',                // Knowledge graph extraction
      'img2dataset',            // AI training scraper
      // Social Media & Messaging
      'twitterbot',
      'slackbot',
      'discordbot',
      'telegrambot',
      'whatsapp',
      'linkedinbot',
      'pinterestbot',
      // Common Crawlers
      'ccbot',
      'ia_archiver',
      'archive.org_bot'
    ];

    const isBot = botPatterns.some(bot =>
      userAgent.toLowerCase().includes(bot)
    );

    // If not a bot, pass through unchanged
    if (!isBot) {
      return fetch(request);
    }

    // Log bot detection for monitoring (with bot type identification)
    const botType = botPatterns.find(bot => userAgent.toLowerCase().includes(bot));
    console.log(`[geo-butler] Bot detected: ${botType} | User-Agent: ${userAgent.substring(0, 80)} | Path: ${url.pathname}`);

    // Fetch original page
    const response = await fetch(request);

    // Only process HTML responses
    const contentType = response.headers.get('content-type') || '';
    if (!contentType.includes('text/html')) {
      return response;
    }

    // Build schema URL with proper path normalization
    const domain = url.hostname;

    // Remove leading slash, strip trailing slashes, ignore query params
    let path = url.pathname.slice(1) || 'index';
    path = path.replace(/\/+$/, '') || 'index'; // Strip trailing slashes

    // Handle nested paths properly (e.g., /blog/post-1 -> blog/post-1)
    const schemaUrl = `https://schema.baselinelabs.ai/${domain}/${path}.json`;

    // Fetch schema from R2 bucket with timeout.
    // One 3-second budget covers both the page-specific and the fallback fetch.
    let schemaData = null;
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), 3000); // 3 second timeout

    try {
      const schemaResponse = await fetch(schemaUrl, {
        signal: controller.signal
      });

      if (schemaResponse.ok) {
        schemaData = await schemaResponse.json();
        console.log(`[geo-butler] Schema fetched successfully: ${schemaUrl}`);
      } else {
        console.log(`[geo-butler] Schema not found (HTTP ${schemaResponse.status}): ${schemaUrl}`);

        // Try fallback to domain-level default schema (same abort signal)
        const fallbackUrl = `https://schema.baselinelabs.ai/${domain}/default.json`;
        const fallbackResponse = await fetch(fallbackUrl, {
          signal: controller.signal
        });
        if (fallbackResponse.ok) {
          schemaData = await fallbackResponse.json();
          console.log(`[geo-butler] Using fallback schema: ${fallbackUrl}`);
        }
      }
    } catch (e) {
      console.log(`[geo-butler] Schema fetch failed: ${e.message}`);
      // Continue without schema - don't block the response
    } finally {
      // Clear the timer only after all schema fetching has finished
      clearTimeout(timeoutId);
    }

    // If no schema found, return original response
    if (!schemaData) {
      console.log(`[geo-butler] No schema available, returning original HTML`);
      return response;
    }

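    // Inject schema into HTML
    let html = await response.text();
    // Escape "<" so a "</script>" inside the schema data cannot close the tag early
    const schemaJson = JSON.stringify(schemaData, null, 2).replace(/</g, '\\u003c');
    const schemaScript = `
<script type="application/ld+json">
${schemaJson}
</script>`;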

    // Add verification marker for geo-butler
    const verificationMarker = '<!-- geo-butler-schema-active -->';
    const injection = `${verificationMarker}\n${schemaScript}`;

    // Try multiple injection strategies (case-insensitive).
    // Replacer functions keep "$" sequences in the schema JSON (e.g. "$&")
    // literal instead of letting String.replace expand them.

    // Strategy 1: Before </head> tag (preferred)
    if (/<\/head>/i.test(html)) {
      html = html.replace(/<\/head>/i, (match) => `${injection}\n${match}`);
    }
    // Strategy 2: After <head> opening tag (the \s guard avoids matching <header>)
    else if (/<head(\s[^>]*)?>/i.test(html)) {
      html = html.replace(/<head(\s[^>]*)?>/i, (match) => `${match}\n${injection}`);
    }
    // Strategy 3: After <body> opening tag
    else if (/<body(\s[^>]*)?>/i.test(html)) {
      html = html.replace(/<body(\s[^>]*)?>/i, (match) => `${match}\n${injection}`);
    }
    // Strategy 4: Prepend to entire HTML (fallback)
    else {
      html = `${injection}\n${html}`;
    }

    // Every strategy above injects, so this point is always reached with schema in place
    console.log(`[geo-butler] Schema injected successfully`);

    // Create new response with modified HTML
    // CRITICAL: Remove Content-Length header as body size has changed
    // (header names are case-insensitive, so a single delete is enough)
    const newHeaders = new Headers(response.headers);
    newHeaders.delete('Content-Length');

    return new Response(html, {
      status: response.status,
      statusText: response.statusText,
      headers: newHeaders
    });
  }
};

Customer Setup Instructions

1. Log into Cloudflare Dashboard

Go to dash.cloudflare.com

2. Create a New Worker

Navigate to Workers & Pages → Create Application → Create Worker

3. Name Your Worker

Name it baselinelabs-schema (or any name you prefer)

4. Replace Default Code

Delete the default worker code and paste the script template above

5. Save and Deploy

Click Save and Deploy to publish the worker

6. Configure Worker Route

Go to your website's zone → Workers Routes

Add route pattern:

*example.com/*

Select worker:

baselinelabs-schema

Done!

All bot traffic to the customer's site will now get schema injected automatically

Configuration Options

Add More Bot Patterns

The script already covers the major search, AI, and social crawlers. For specialized crawlers, extend the botPatterns array:

// Already included: googlebot, bingbot, gptbot, claudebot,
// perplexitybot, slackbot, linkedinbot, applebot, etc.

// Add custom patterns to the end of the existing array.
// Use lowercase: the User-Agent is lowercased before matching.
const botPatterns = [
  // ... existing patterns from the template ...
  'custom-bot',      // Your custom bot
  'partner-crawler', // Partner crawler
  // ... add more as needed
];
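
As a minimal sketch of how an extended list might be checked, the helper below lowercases the User-Agent once and returns the first matching pattern. The name detectBot and the sample entry partner-crawler are illustrative assumptions, not part of the template.

```javascript
// Hypothetical helper: returns the first matching pattern, or null for non-bot traffic.
function detectBot(userAgent, patterns) {
  const ua = (userAgent || '').toLowerCase();
  // Lowercase the patterns too, so a mixed-case entry cannot silently fail to match
  const match = patterns.find((p) => ua.includes(p.toLowerCase()));
  return match === undefined ? null : match;
}

const basePatterns = ['googlebot', 'gptbot', 'claudebot']; // subset of the template list
const customPatterns = ['partner-crawler'];                 // illustrative custom entry
const allPatterns = [...basePatterns, ...customPatterns];

console.log(detectBot('Mozilla/5.0 (compatible; Googlebot/2.1)', allPatterns));
console.log(detectBot('Partner-Crawler/1.0', allPatterns));
console.log(detectBot('Mozilla/5.0 (Windows NT 10.0) Chrome/120', allPatterns));
```

Returning the matched pattern rather than a boolean means the same call can drive both the bot check and the log line that names the bot type.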

Custom Schema URL

Change the schema base URL to point to a different CDN or R2 bucket:

const schemaUrl = `https://your-custom-cdn.com/${domain}/${path}.json`;

Logging and Debugging

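The template already logs bot detection, schema fetch results, and injection with a [geo-butler] prefix. To add more detail, place extra log statements where the values they use are already defined:

// Add after bot detection (schemaUrl is not defined yet at this point)
if (isBot) {
  console.log('Bot detected:', userAgent);
}

// Add after schemaUrl is built
console.log('Fetching schema from:', schemaUrl);

// Add after schema injection
if (schemaData) {
  console.log('Schema injected successfully');
}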

Fallback Schema (Built-in)

The worker automatically tries a domain-level fallback schema if page-specific schema isn't found:

// Automatic fallback sequence:
// 1. Try: schema.baselinelabs.ai/example.com/about.json
// 2. If not found, try: schema.baselinelabs.ai/example.com/default.json
// 3. If still not found, return original HTML without modification

// To use: Upload a default.json file to your domain folder in R2
// This will serve as a fallback for any pages without specific schema
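
The fallback sequence can be sketched as a small loop over candidate URLs that share one time budget. This is an illustration under stated assumptions, not the template's exact code: fetchFirstSchema and fakeFetch are hypothetical names, and the fetch function is passed in so the logic can be tried without a real R2 bucket.

```javascript
// Sketch: try each candidate URL in order under one shared time budget.
// `fetchImpl` is injectable so the logic can be exercised without a network.
async function fetchFirstSchema(urls, timeoutMs = 3000, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    for (const url of urls) {
      const res = await fetchImpl(url, { signal: controller.signal });
      if (res.ok) {
        return { url, data: await res.json() };
      }
    }
    return null; // nothing found: the caller returns the original HTML
  } catch (err) {
    return null; // timeout or network error: never block the page response
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for R2 (hypothetical): only default.json exists in this example.
const fakeFetch = async (url) =>
  url.endsWith('/default.json')
    ? { ok: true, status: 200, json: async () => ({ '@type': 'Organization' }) }
    : { ok: false, status: 404, json: async () => ({}) };

fetchFirstSchema(
  [
    'https://schema.baselinelabs.ai/example.com/about.json',
    'https://schema.baselinelabs.ai/example.com/default.json',
  ],
  3000,
  fakeFetch
).then((result) => console.log(result && result.url));
```

Sharing a single AbortController across both attempts keeps the worst-case delay at the configured timeout, rather than one timeout per candidate.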

How It Works

Request Flow

1. User/Bot requests example.com/about
2. Cloudflare Worker intercepts the request at the edge
3. Worker checks the User-Agent header against 35 bot patterns (including 2025-2026 AI crawlers)
4. If human: pass the request through to the origin unchanged (no schema fetch, no HTML rewriting)
5. If bot: fetch schema from schema.baselinelabs.ai/example.com/about.json (3s timeout)
6. If not found: try the fallback default.json
7. Inject schema as <script type="application/ld+json"> immediately before </head>
8. Return modified HTML to the bot with the Content-Length header removed
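
The injection step in the flow above can be sketched as a pure function. This is a simplified illustration (the name injectSchema is hypothetical, and only the preferred strategy and the last-resort fallback are shown), but it highlights two details worth getting right: escaping "<" in the serialized JSON, and using a replacer function so "$" sequences in the data stay literal.

```javascript
// Sketch of the injection step: serialize the schema safely and place it before </head>.
function injectSchema(html, schemaData) {
  // Escape "<" so a "</script>" inside a string value cannot end the block early
  const json = JSON.stringify(schemaData).replace(/</g, '\\u003c');
  const block = `<script type="application/ld+json">${json}</script>`;
  if (/<\/head>/i.test(html)) {
    // A replacer function keeps "$" sequences in the JSON literal
    return html.replace(/<\/head>/i, (closeTag) => `${block}\n${closeTag}`);
  }
  return `${block}\n${html}`; // no </head>: prepend as a last resort
}

const page = '<html><HEAD><title>About</title></HEAD><body>Hi</body></html>';
const schema = { '@context': 'https://schema.org', '@type': 'AboutPage', name: 'Price: $& up' };
console.log(injectSchema(page, schema));
```

The escaped form \u003c is a standard JSON string escape, so parsers reading the JSON-LD block recover the original "<" character.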

Example URL Mapping

example.com/ → schema.baselinelabs.ai/example.com/index.json
example.com/about → schema.baselinelabs.ai/example.com/about.json
example.com/blog/post-1 → schema.baselinelabs.ai/example.com/blog/post-1.json
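
The mapping can be reproduced with a short function that mirrors the template's path normalization. The function name toSchemaUrl and the baseUrl parameter are assumptions made for illustration; the template builds the URL inline with a hardcoded host.

```javascript
// Sketch: map a page URL to its schema URL, mirroring the template's normalization.
// `baseUrl` is a parameter here only for illustration; the template hardcodes it.
function toSchemaUrl(pageUrl, baseUrl = 'https://schema.baselinelabs.ai') {
  const url = new URL(pageUrl);
  // Drop the leading slash and any trailing slashes; the query string is ignored
  const path = url.pathname.slice(1).replace(/\/+$/, '') || 'index';
  return `${baseUrl}/${url.hostname}/${path}.json`;
}

for (const page of [
  'https://example.com/',
  'https://example.com/about',
  'https://example.com/blog/post-1/',
  'https://example.com/about?utm_source=newsletter',
]) {
  console.log(`${page} -> ${toSchemaUrl(page)}`);
}
```

Because the query string is dropped, /about and /about?utm_source=newsletter resolve to the same schema file, and a trailing slash does not change the result.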

Key Benefits

Minimal Performance Impact: For human visitors the Worker only checks the User-Agent and passes the request through, with no schema fetch or HTML changes
SEO Optimized: Search engines and AI bots get rich structured data
Automatic Updates: Change schema in R2 and bots receive it on their next request (subject to any caching in front of the bucket)
Edge Computing: Runs at Cloudflare's edge, minimal latency
No Code Changes: Works with any existing website, no modifications needed

Production Considerations

✓ Improvements in This Version

  • Content-Length Fix: Headers properly cleaned to prevent truncation
  • 3-Second Timeout: R2 fetches won't block bot crawls indefinitely
  • Path Normalization: Handles trailing slashes and nested paths correctly
  • Fallback Schema: Automatically tries domain-level default.json
  • 35 Bot Patterns: Updated for 2025-2026, includes Google-Extended, meta-externalagent, Bytespider, Amazonbot, and other AI crawlers
  • Logging Built-in: Monitor bot detection and schema injection in Cloudflare logs
  • Case-Insensitive HTML: Works with <HEAD>, <head>, or <Body>

⚠️ User-Agent Spoofing

Bot detection relies on User-Agent headers, which can be spoofed. This is acceptable for SEO purposes since Schema.org data is public information anyway. If you need to hide content from unauthorized users, do NOT rely on User-Agent alone—implement proper authentication instead.

📊 Monitoring Recommendations

Check Cloudflare Worker logs to monitor:

  • Which bots are accessing your site (User-Agent tracking)
  • Schema fetch success/failure rates from R2
  • Pages missing schema files (404s from R2)
  • Worker execution time and performance

🔧 Testing Your Setup

Before going live, test with curl:

# Test as Googlebot
curl -H "User-Agent: Googlebot" https://example.com/about

# Should see: <!-- geo-butler-schema-active -->
# And: <script type="application/ld+json">

# Test as human (should be unchanged)
curl https://example.com/about
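
To check a saved response programmatically rather than by eye, a small script along these lines may help. It is a sketch only: checkInjection is a hypothetical name, the sample HTML is made up, and it inspects just the first JSON-LD block it finds, which may be one the site already had.

```javascript
// Sketch: check an HTML response for the marker and a parseable JSON-LD block.
function checkInjection(html) {
  const hasMarker = html.includes('<!-- geo-butler-schema-active -->');
  const match = html.match(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/i);
  let schema = null;
  if (match) {
    try {
      schema = JSON.parse(match[1]);
    } catch (err) {
      schema = null; // block present but not valid JSON
    }
  }
  return { hasMarker, hasValidSchema: schema !== null, schema };
}

// Made-up sample of what a bot response might contain
const botHtml = [
  '<html><head>',
  '<!-- geo-butler-schema-active -->',
  '<script type="application/ld+json">',
  '{ "@context": "https://schema.org", "@type": "AboutPage" }',
  '</script>',
  '</head><body></body></html>',
].join('\n');

console.log(checkInjection(botHtml));
console.log(checkInjection('<html><head></head><body></body></html>'));
```

Parsing the block, rather than only searching for the marker, also catches the case where the schema file in R2 was served but the injected JSON-LD is not valid JSON.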