The Web Connector allows you to crawl and index content from any publicly accessible website, making it searchable in Omni.

Overview

What Gets Indexed

Source | Content
Web pages | Page text, headings, and metadata
Documents | PDFs and other linked documents

How It Works

  1. You provide a root URL as the starting point
  2. The crawler follows links to discover pages within the configured depth and limits
  3. Page content is extracted and indexed for search
  4. Subsequent syncs check for updated content
The connector only accesses publicly available content. It respects robots.txt by default.
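
As a rough illustration of these steps, the sketch below implements a breadth-first crawl with a depth limit, a page limit, and an optional robots.txt check. It is a minimal sketch of the general technique, not Omni's implementation; the requests and BeautifulSoup dependencies and the crawl function itself are assumptions made for illustration.

# Minimal sketch of a breadth-first crawl with depth and page limits (illustrative only).
from collections import deque
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests                   # assumed dependency for fetching pages
from bs4 import BeautifulSoup     # assumed dependency for link extraction

def crawl(root_url, max_depth=10, max_pages=100_000, respect_robots=True):
    robots = robotparser.RobotFileParser()
    if respect_robots:
        robots.set_url(urljoin(root_url, "/robots.txt"))
        robots.read()

    root_host = urlparse(root_url).netloc
    seen, pages = {root_url}, []
    queue = deque([(root_url, 0)])                # (url, depth) pairs
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if respect_robots and not robots.can_fetch("*", url):
            continue
        html = requests.get(url, timeout=10).text
        pages.append((url, html))                 # text extraction and indexing happen here
        if depth >= max_depth:
            continue
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            # stay on the root host; subdomain handling is covered later on this page
            if urlparse(target).netloc == root_host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages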

Prerequisites

Before setting up the Web Connector, ensure you have:
  • A publicly accessible target website URL (HTTP or HTTPS)
  • Permission to crawl the target website (check its terms of service)

Setup

Step 1: Navigate to Integrations

  1. Go to Settings → Integrations in Omni
  2. Click Add Source and select Web

Step 2: Configure the Crawler

Enter the following configuration:
Field | Required | Default | Description
Root URL | Yes | - | The starting URL for the crawler (must be HTTP or HTTPS)
Max Depth | No | 10 | How many levels deep to follow links from the root URL
Max Pages | No | 100,000 | Maximum number of pages to crawl
Respect Robots.txt | No | Yes | Whether to honor the site’s robots.txt rules
Include Subdomains | No | No | Whether to crawl subdomains of the root URL
Blacklist Patterns | No | Empty | URL patterns to exclude from crawling
User Agent | No | Default | Custom user agent string for HTTP requests
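
For reference, the values below show one plausible combination of these fields. The field names mirror the table above and the dictionary shape is purely illustrative; it is not a documented Omni API payload.

# Illustrative values for the crawler settings above (not a documented API payload).
web_connector_config = {
    "root_url": "https://docs.example.com",        # required; HTTP or HTTPS
    "max_depth": 10,                               # default
    "max_pages": 100_000,                          # default
    "respect_robots_txt": True,                    # default
    "include_subdomains": False,                   # default
    "blacklist_patterns": ["/api/*", "*.pdf"],     # empty by default
    "user_agent": "ExampleCrawler/1.0",            # hypothetical custom user agent
}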

Step 3: Connect

  1. Click Connect to start crawling
  2. A new web source is created with the name Web: <hostname>
  3. An initial sync is triggered automatically
Your Web Connector is now configured. You can monitor crawl progress on the Integrations page.
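
The hostname in the source name comes from the root URL; the snippet below shows how that name can be derived (the variable names are illustrative).

# Deriving the "Web: <hostname>" source name from a root URL (illustrative).
from urllib.parse import urlparse

root_url = "https://docs.example.com/getting-started"
source_name = "Web: " + urlparse(root_url).hostname   # -> "Web: docs.example.com"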

Configuration Tips

Start small: Begin with a lower Max Pages value (e.g., 100) to test the crawler before doing a full crawl.

Blacklist Patterns

Use blacklist patterns to exclude specific sections from crawling:
/api/*
/admin/*
/login/*
*.pdf
Common patterns to exclude:
  • Login and authentication pages (/login/*, /auth/*)
  • API documentation (/api/*, /swagger/*)
  • Admin sections (/admin/*, /dashboard/*)
  • Large binary files (*.zip, *.tar.gz)
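
The exact pattern semantics are not spelled out on this page; the sketch below assumes simple glob-style matching against the URL path, which is one common interpretation of patterns like these.

# Glob-style blacklist matching against the URL path (assumed semantics, illustrative only).
from fnmatch import fnmatch
from urllib.parse import urlparse

BLACKLIST = ["/api/*", "/admin/*", "/login/*", "*.pdf"]

def is_blacklisted(url):
    path = urlparse(url).path
    return any(fnmatch(path, pattern) for pattern in BLACKLIST)

is_blacklisted("https://example.com/api/v2/users")   # True
is_blacklisted("https://example.com/blog/post-1")    # False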

Subdomain Handling

Enable Include Subdomains only if the site uses subdomains for content you want to index:
  • blog.example.com
  • docs.example.com
  • help.example.com
Enabling subdomains can significantly increase the crawl scope. Monitor page counts carefully.
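
In scope terms, the setting can be thought of as a host check like the sketch below: the exact root host always matches, and subdomains match only when the option is enabled (the helper name is an assumption for illustration).

# Illustrative host check for the Include Subdomains option.
from urllib.parse import urlparse

def in_scope(url, root_host="example.com", include_subdomains=False):
    host = urlparse(url).hostname or ""
    if host == root_host:
        return True
    return include_subdomains and host.endswith("." + root_host)

in_scope("https://blog.example.com/post", include_subdomains=True)    # True
in_scope("https://blog.example.com/post", include_subdomains=False)   # False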

Managing the Integration

Viewing Sync Status

Navigate to Settings → Integrations to view:
  • Last sync time
  • Number of indexed pages
  • Crawl progress and any errors

Updating Configuration

  1. Navigate to Settings → Integrations
  2. Find your web source and click Configure
  3. Update settings as needed
  4. Click Save to apply changes
Changes take effect on the next sync cycle.

Removing the Integration

  1. Navigate to Settings → Integrations
  2. Find the web source and click Disconnect
  3. Confirm the removal

Troubleshooting

If pages are missing from the search index, common causes include:
  • Max Depth too low: Increase depth if pages are nested deep in the site structure
  • Blacklist too aggressive: Check if your patterns are excluding desired content
  • JavaScript-rendered content: The crawler may not execute JavaScript; static HTML is indexed
Try starting with a higher Max Depth and fewer blacklist patterns.
To speed up crawling:
  • Lower Max Pages to limit scope
  • Add Blacklist Patterns for large or irrelevant sections
  • Disable Include Subdomains if not needed
Large sites (100,000+ pages) may take several hours to fully index.
If the crawl fails or returns few pages, the target site may be blocking the crawler:
  • Verify the site allows web crawlers (check robots.txt; see the sketch after this list)
  • Some sites block automated access entirely
  • Try setting a custom User Agent string
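
A quick way to check the robots.txt side locally is Python's built-in robotparser, shown below. The user agent string is a placeholder, and this only tells you what robots.txt permits, not whether the site applies other forms of blocking.

# Local check of what robots.txt permits for a given user agent (placeholder agent name).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("ExampleCrawler", "https://example.com/docs/intro"))   # True or False
print(rp.crawl_delay("ExampleCrawler"))                                   # None if no Crawl-delay
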
If indexed content looks stale, remember that updates are only picked up during sync cycles:
  • Check when the last sync occurred
  • Trigger a manual sync if needed
  • New pages may not be discovered if they’re not linked from existing indexed pages

Security Considerations

  • Respect robots.txt: Keep this enabled to follow site owner preferences
  • Rate limiting: The crawler automatically rate-limits requests to avoid overloading servers (a minimal sketch follows this list)
  • Public content only: Only publicly accessible pages are indexed
  • No authentication: The crawler cannot access login-protected content
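
As a rough picture of what rate limiting means in practice, the sketch below waits a fixed delay between requests; the one-second delay, the helper name, and the user agent string are all illustrative, not Omni's actual behavior.

# Polite fetching with a fixed delay between requests (illustrative values only).
import time
import requests

DELAY_SECONDS = 1.0

def polite_get(url, user_agent="ExampleCrawler/1.0"):
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    time.sleep(DELAY_SECONDS)
    return response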

What’s Next