Overview
What Gets Indexed
| Source | Content |
|---|---|
| Web pages | Page text, headings, and metadata |
| Documents | PDFs and other linked documents |
How It Works
- You provide a root URL as the starting point
- The crawler follows links to discover pages within the configured depth and limits
- Page content is extracted and indexed for search
- Subsequent syncs check for updated content
The connector only accesses publicly available content. It respects robots.txt by default.
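The sketch below illustrates this crawl strategy: breadth-first link following from the root URL, bounded by depth and page limits, with a robots.txt check before each fetch. It is a simplified illustration, not the connector's actual implementation; the requests and beautifulsoup4 dependencies and the user agent string are assumptions.

```python
# Simplified illustration of a depth- and page-limited crawl (not the
# connector's actual implementation). Assumes the requests and
# beautifulsoup4 packages; the user agent string is a placeholder.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(root_url, max_depth=10, max_pages=100_000, user_agent="ExampleCrawler/1.0"):
    robots = RobotFileParser(urljoin(root_url, "/robots.txt"))
    robots.read()

    seen = {root_url}
    queue = deque([(root_url, 0)])   # (url, depth from root)
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue                 # honor robots.txt rules
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)   # page text to be indexed

        if depth >= max_depth:
            continue                 # stop following links past Max Depth
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            same_site = urlparse(target).netloc == urlparse(root_url).netloc
            if same_site and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages
```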
Prerequisites
Before setting up the Web Connector, ensure you have:
- A target website URL that is publicly accessible (HTTP or HTTPS)
- Permission to crawl the target website (check their terms of service)
Setup
Step 1: Navigate to Integrations
- Go to Settings → Integrations in Omni
- Click Add Source and select Web
Step 2: Configure the Crawler
Enter the following configuration:

| Field | Required | Default | Description |
|---|---|---|---|
| Root URL | Yes | - | The starting URL for the crawler (must be HTTP or HTTPS) |
| Max Depth | No | 10 | How many levels deep to follow links from the root URL |
| Max Pages | No | 100,000 | Maximum number of pages to crawl |
| Respect Robots.txt | No | Yes | Whether to honor the site’s robots.txt rules |
| Include Subdomains | No | No | Whether to crawl subdomains of the root URL |
| Blacklist Patterns | No | Empty | URL patterns to exclude from crawling |
| User Agent | No | Default | Custom user agent string for HTTP requests |
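For reference, the values below show one plausible combination of these settings. The connector is configured through the Omni UI; the dictionary keys here are informal labels for the fields above, not an API schema.

```python
# Illustrative settings only -- the connector is configured in the Omni UI,
# and these keys are informal labels, not an API schema.
example_crawl_config = {
    "root_url": "https://docs.example.com",   # required; HTTP or HTTPS
    "max_depth": 5,                           # default 10
    "max_pages": 20_000,                      # default 100,000
    "respect_robots_txt": True,               # default Yes
    "include_subdomains": False,              # default No
    "blacklist_patterns": ["/login/*", "/admin/*", "*.zip"],
    "user_agent": "ExampleCrawler/1.0",       # optional custom string
}
```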
Step 3: Connect
- Click Connect to start crawling
- A new web source is created with the name Web: <hostname>
- An initial sync is triggered automatically
Your Web Connector is now configured. You can monitor crawl progress on the Integrations page.
Configuration Tips
Blacklist Patterns
Use blacklist patterns to exclude specific sections from crawling:
- Login and authentication pages (`/login/*`, `/auth/*`)
- API documentation (`/api/*`, `/swagger/*`)
- Admin sections (`/admin/*`, `/dashboard/*`)
- Large binary files (`*.zip`, `*.tar.gz`)
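The connector's exact matching rules aren't documented here, but the sketch below shows how glob-style patterns like these can be applied to a URL path (assuming simple fnmatch-style matching).

```python
# Sketch of glob-style blacklist matching, assuming patterns are matched
# against the URL path; the connector's exact rules may differ.
from fnmatch import fnmatch
from urllib.parse import urlparse

def is_blacklisted(url, patterns):
    path = urlparse(url).path or "/"
    return any(fnmatch(path, pattern) for pattern in patterns)

patterns = ["/login/*", "/auth/*", "/admin/*", "*.zip", "*.tar.gz"]
print(is_blacklisted("https://example.com/login/reset", patterns))     # True
print(is_blacklisted("https://example.com/docs/setup", patterns))      # False
print(is_blacklisted("https://example.com/files/data.zip", patterns))  # True
```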
Subdomain Handling
Enable Include Subdomains only if the site uses subdomains for content you want to index:
- blog.example.com
- docs.example.com
- help.example.com
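A hostname suffix check is one simple way to express this rule; the sketch below assumes that behavior and is not the connector's implementation.

```python
# Sketch of the subdomain rule: blog.example.com is in scope only when
# Include Subdomains is enabled and the root is example.com.
from urllib.parse import urlparse

def in_scope(url, root_url, include_subdomains=False):
    host = urlparse(url).hostname or ""
    root = urlparse(root_url).hostname or ""
    if host == root:
        return True
    return include_subdomains and host.endswith("." + root)

print(in_scope("https://blog.example.com/post", "https://example.com"))                            # False
print(in_scope("https://blog.example.com/post", "https://example.com", include_subdomains=True))   # True
```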
Managing the Integration
Viewing Sync Status
Navigate to Settings → Integrations to view:
- Last sync time
- Number of indexed pages
- Crawl progress and any errors
Updating Configuration
- Navigate to Settings → Integrations
- Find your web source and click Configure
- Update settings as needed
- Click Save to apply changes
Removing the Integration
- Navigate to Settings → Integrations
- Find the web source and click Disconnect
- Confirm the removal
Troubleshooting
Crawler not finding pages
Common causes:
- Max Depth too low: Increase depth if pages are nested deep in the site structure
- Blacklist too aggressive: Check if your patterns are excluding desired content
- JavaScript-rendered content: The crawler may not execute JavaScript; static HTML is indexed
Crawl taking too long
To speed up crawling:
- Lower Max Pages to limit scope
- Add Blacklist Patterns for large or irrelevant sections
- Disable Include Subdomains if not needed
403 or access denied errors
The target site may be blocking the crawler:
- Verify the site allows web crawlers (check robots.txt)
- Some sites block automated access entirely
- Try setting a custom User Agent string
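To diagnose a blocked crawl, it can help to check robots.txt and the site's response to a custom User-Agent by hand. The snippet below is a standalone diagnostic sketch, not part of the connector; the URL and user agent string are placeholders.

```python
# Diagnostic sketch: does robots.txt allow this user agent, and does the
# site respond to a plain request with a custom User-Agent header?
from urllib.robotparser import RobotFileParser

import requests  # assumed HTTP client

root = "https://example.com"
user_agent = "ExampleCrawler/1.0"   # placeholder custom string

robots = RobotFileParser(root + "/robots.txt")
robots.read()
print("robots.txt allows crawl:", robots.can_fetch(user_agent, root + "/docs/"))

resp = requests.get(root + "/docs/", headers={"User-Agent": user_agent}, timeout=30)
print("HTTP status:", resp.status_code)   # 403 here suggests the site blocks this agent
```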
Missing recent content
Content updates are picked up during sync cycles:
- Check when the last sync occurred
- Trigger a manual sync if needed
- New pages may not be discovered if they’re not linked from existing indexed pages
Security Considerations
- Respect robots.txt: Keep this enabled to follow site owner preferences
- Rate limiting: The crawler automatically rate-limits requests to avoid overloading servers
- Public content only: Only publicly accessible pages are indexed
- No authentication: The crawler cannot access login-protected content
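As an illustration of the rate-limiting idea (the connector's actual limiter and delay are not documented here), a polite crawler simply waits between requests:

```python
# Conceptual sketch of fixed-delay rate limiting between requests;
# the connector's actual limiter and delay are not documented here.
import time

import requests  # assumed HTTP client

def fetch_politely(urls, delay_seconds=1.0, user_agent="ExampleCrawler/1.0"):
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
        yield url, resp.status_code
        time.sleep(delay_seconds)   # pause between requests to avoid overloading the server
```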