Overview
What Gets Indexed
| Source | Content |
|---|---|
| Web pages | Page text, headings, and metadata |
| Documents | PDFs and other linked documents |
How It Works
- You provide a root URL as the starting point
- The crawler follows links to discover pages within the configured depth and limits
- Page content is extracted and indexed for search
- Subsequent syncs check for updated content
The connector only accesses publicly available content. It respects robots.txt by default.
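The sketch below illustrates this crawl strategy: breadth-first link following from the root URL, bounded by depth and page limits, with a robots.txt check before each fetch. It is a simplified illustration, not the connector's actual implementation; the requests and beautifulsoup4 dependencies and the user agent string are assumptions.

```python
# Simplified illustration of a depth- and page-limited crawl (not the
# connector's actual implementation). Assumes the requests and
# beautifulsoup4 packages; the user agent string is a placeholder.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(root_url, max_depth=10, max_pages=100_000, user_agent="ExampleCrawler/1.0"):
    robots = RobotFileParser(urljoin(root_url, "/robots.txt"))
    robots.read()

    seen = {root_url}
    queue = deque([(root_url, 0)])   # (url, depth from root)
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue                 # honor robots.txt rules
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)   # page text to be indexed

        if depth >= max_depth:
            continue                 # stop following links past Max Depth
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            same_site = urlparse(target).netloc == urlparse(root_url).netloc
            if same_site and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages
```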
Prerequisites
Before setting up the Web Connector, ensure you have:
- A target website URL that is publicly accessible (HTTP or HTTPS)
- Permission to crawl the target website (check their terms of service)
Setup
Step 1: Navigate to Integrations
- Go to Settings → Integrations in Omni
- Click Add Source and select Web
Step 2: Configure the Crawler
Enter the following configuration:

| Field | Required | Default | Description |
|---|---|---|---|
| Root URL | Yes | - | The starting URL for the crawler (must be HTTP or HTTPS) |
| Max Depth | No | 10 | How many levels deep to follow links from the root URL |
| Max Pages | No | 100,000 | Maximum number of pages to crawl |
| Respect Robots.txt | No | Yes | Whether to honor the site’s robots.txt rules |
| Include Subdomains | No | No | Whether to crawl subdomains of the root URL |
| Blacklist Patterns | No | Empty | URL patterns to exclude from crawling |
| User Agent | No | Default | Custom user agent string for HTTP requests |
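For reference, the values below show one plausible combination of these settings. The connector is configured through the Omni UI; the dictionary keys here are informal labels for the fields above, not an API schema.

```python
# Illustrative settings only -- the connector is configured in the Omni UI,
# and these keys are informal labels, not an API schema.
example_crawl_config = {
    "root_url": "https://docs.example.com",   # required; HTTP or HTTPS
    "max_depth": 5,                           # default 10
    "max_pages": 20_000,                      # default 100,000
    "respect_robots_txt": True,               # default Yes
    "include_subdomains": False,              # default No
    "blacklist_patterns": ["/login/*", "/admin/*", "*.zip"],
    "user_agent": "ExampleCrawler/1.0",       # optional custom string
}
```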
Step 3: Connect
- Click Connect to start crawling
- A new web source is created with the name Web: <hostname>
- An initial sync is triggered automatically
Your Web Connector is now configured. You can monitor crawl progress on the Integrations page.
Configuration Tips
Blacklist Patterns
Use blacklist patterns to exclude specific sections from crawling:
- Login and authentication pages (`/login/*`, `/auth/*`)
- API documentation (`/api/*`, `/swagger/*`)
- Admin sections (`/admin/*`, `/dashboard/*`)
- Large binary files (`*.zip`, `*.tar.gz`)
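The connector's exact matching rules aren't documented here, but the sketch below shows how glob-style patterns like these can be applied to a URL path (assuming simple fnmatch-style matching).

```python
# Sketch of glob-style blacklist matching, assuming patterns are matched
# against the URL path; the connector's exact rules may differ.
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_blacklisted(url, patterns):
    path = urlparse(url).path or "/"
    return any(fnmatch(path, pattern) for pattern in patterns)

patterns = ["/login/*", "/auth/*", "/admin/*", "*.zip", "*.tar.gz"]
print(is_blacklisted("https://example.com/login/reset", patterns))     # True
print(is_blacklisted("https://example.com/docs/setup", patterns))      # False
print(is_blacklisted("https://example.com/files/data.zip", patterns))  # True
```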
Subdomain Handling
Enable Include Subdomains only if the site uses subdomains for content you want to index:
- blog.example.com
- docs.example.com
- help.example.com
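A hostname suffix check is one simple way to express this rule; the sketch below assumes that behavior and is not the connector's implementation.

```python
# Sketch of the subdomain rule: blog.example.com is in scope only when
# Include Subdomains is enabled and the root is example.com.
from urllib.parse import urlparse

def in_scope(url, root_url, include_subdomains=False):
    host = urlparse(url).hostname or ""
    root = urlparse(root_url).hostname or ""
    if host == root:
        return True
    return include_subdomains and host.endswith("." + root)

print(in_scope("https://blog.example.com/post", "https://example.com"))                            # False
print(in_scope("https://blog.example.com/post", "https://example.com", include_subdomains=True))   # True
```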
Managing the Integration
Viewing Sync Status
Navigate to Settings → Integrations to view:
- Last sync time
- Number of indexed pages
- Crawl progress and any errors
Updating Configuration
- Navigate to Settings → Integrations
- Find your web source and click Configure
- Update settings as needed
- Click Save to apply changes
Removing the Integration
- Navigate to Settings → Integrations
- Find the web source and click Disconnect
- Confirm the removal
Troubleshooting
Crawler not finding pages
Common causes:
- Max Depth too low: Increase depth if pages are nested deep in the site structure
- Blacklist too aggressive: Check if your patterns are excluding desired content
- JavaScript-rendered content: The crawler may not execute JavaScript; static HTML is indexed
Crawl taking too long
To speed up crawling:
- Lower Max Pages to limit scope
- Add Blacklist Patterns for large or irrelevant sections
- Disable Include Subdomains if not needed
403 or access denied errors
The target site may be blocking the crawler:
- Verify the site allows web crawlers (check robots.txt)
- Some sites block automated access entirely
- Try setting a custom User Agent string
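To diagnose a blocked crawl, it can help to check robots.txt and the site's response to a custom User-Agent by hand. The snippet below is a standalone diagnostic sketch, not part of the connector; the URL and user agent string are placeholders.

```python
# Diagnostic sketch: does robots.txt allow this user agent, and does the
# site respond to a plain request with a custom User-Agent header?
from urllib.robotparser import RobotFileParser

import requests  # assumed HTTP client

root = "https://example.com"
user_agent = "ExampleCrawler/1.0"   # placeholder custom string

robots = RobotFileParser(root + "/robots.txt")
robots.read()
print("robots.txt allows crawl:", robots.can_fetch(user_agent, root + "/docs/"))

resp = requests.get(root + "/docs/", headers={"User-Agent": user_agent}, timeout=30)
print("HTTP status:", resp.status_code)   # 403 here suggests the site blocks this agent
```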
Missing recent content
Content updates are picked up during sync cycles:
- Check when the last sync occurred
- Trigger a manual sync if needed
- New pages may not be discovered if they’re not linked from existing indexed pages
Security Considerations
- Respect robots.txt: Keep this enabled to follow site owner preferences
- Rate limiting: The crawler automatically rate-limits requests to avoid overloading servers
- Public content only: Only publicly accessible pages are indexed
- No authentication: The crawler cannot access login-protected content
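As an illustration of the rate-limiting idea (the connector's actual limiter and delay are not documented here), a polite crawler simply waits between requests:

```python
# Conceptual sketch of fixed-delay rate limiting between requests;
# the connector's actual limiter and delay are not documented here.
import time

import requests  # assumed HTTP client

def fetch_politely(urls, delay_seconds=1.0, user_agent="ExampleCrawler/1.0"):
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        yield url, resp.status_code
        time.sleep(delay_seconds)   # pause between requests to avoid overloading the server
```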