# Web Connector

> Crawl and index content from any public website

The Web Connector crawls any publicly accessible website and indexes its content so that it is searchable in Omni.

## Overview

### What Gets Indexed

| Source    | Content                           |
| --------- | --------------------------------- |
| Web pages | Page text, headings, and metadata |
| Documents | PDFs and other linked documents   |

### How It Works

1. You provide a root URL as the starting point
2. The crawler follows links to discover pages, up to the configured **Max Depth** and **Max Pages** limits
3. Page content is extracted and indexed for search
4. Subsequent syncs check for updated content
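
The discovery step above can be pictured as a breadth-first traversal bounded by a depth limit and a page limit. The following is a minimal sketch, not Omni's actual implementation: the `crawl` function and the in-memory `SITE_LINKS` graph are hypothetical, and it assumes the root URL counts as depth 0.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
SITE_LINKS = {
    "https://example.com/": ["https://example.com/docs", "https://example.com/blog"],
    "https://example.com/docs": ["https://example.com/docs/setup"],
    "https://example.com/blog": ["https://example.com/"],
    "https://example.com/docs/setup": ["https://example.com/docs/setup/advanced"],
    "https://example.com/docs/setup/advanced": [],
}


def crawl(root_url, links, max_depth=10, max_pages=100_000):
    """Breadth-first crawl from root_url; returns {url: depth} for visited pages."""
    visited = {}
    queue = deque([(root_url, 0)])  # assumption: the root URL is depth 0
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited[url] = depth
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links.get(url, []):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", SITE_LINKS, max_depth=2, max_pages=100)
for url, depth in pages.items():
    print(depth, url)
```

With `max_depth=2`, the page at `/docs/setup/advanced` is never reached because it sits three links away from the root. A page that no crawled page links to is never discovered either, whatever the limits are.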

<Note>
  The connector only accesses publicly available content. It respects robots.txt by default.
</Note>
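
To preview which URLs a site's robots.txt permits, you can evaluate the rules locally with Python's standard `urllib.robotparser` module. This is an illustrative sketch only: the robots.txt content, the `ExampleCrawler` user agent, and the `is_allowed` helper are made up for the example, and the connector's own rule evaluation may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "https://example.com/docs/intro"))
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "https://example.com/admin/users"))
```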

## Prerequisites

Before setting up the Web Connector, ensure you have:

* **Target website URL** that is publicly accessible (HTTP or HTTPS)
* **Permission to crawl** the target website (check their terms of service)

## Setup

### Step 1: Navigate to Integrations

1. Go to **Settings** → **Integrations** in Omni
2. Click **Add Source** and select **Web**

### Step 2: Configure the Crawler

Enter the following configuration:

| Field              | Required | Default | Description                                              |
| ------------------ | -------- | ------- | -------------------------------------------------------- |
| Root URL           | Yes      | -       | The starting URL for the crawler (must be HTTP or HTTPS) |
| Max Depth          | No       | 10      | How many levels deep to follow links from the root URL   |
| Max Pages          | No       | 100,000 | Maximum number of pages to crawl                         |
| Respect Robots.txt | No       | Yes     | Whether to honor the site's robots.txt rules             |
| Include Subdomains | No       | No      | Whether to crawl subdomains of the root URL              |
| Blacklist Patterns | No       | Empty   | URL patterns to exclude from crawling                    |
| User Agent         | No       | Default | Custom user agent string for HTTP requests               |
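
For planning or reviewing settings outside the UI, the table above can be mirrored as a small data structure. This is a sketch under assumptions: the `WebConnectorConfig` class and its field names are hypothetical and are not an Omni API. Only the defaults and the HTTP/HTTPS requirement come from the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class WebConnectorConfig:
    """Hypothetical mirror of the configuration fields, with the documented defaults."""

    root_url: str
    max_depth: int = 10
    max_pages: int = 100_000
    respect_robots_txt: bool = True
    include_subdomains: bool = False
    blacklist_patterns: List[str] = field(default_factory=list)
    user_agent: Optional[str] = None  # None stands for the connector's default

    def __post_init__(self):
        parsed = urlparse(self.root_url)
        # The Root URL must be HTTP or HTTPS and must contain a hostname.
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError("Root URL must be an HTTP or HTTPS URL")
        if self.max_depth < 0 or self.max_pages < 1:
            raise ValueError("Max Depth must be >= 0 and Max Pages must be >= 1")


# A small test configuration, in the spirit of starting with a low page limit.
test_config = WebConnectorConfig(
    root_url="https://docs.example.com",
    max_pages=100,
    blacklist_patterns=["/api/*", "/login/*"],
)
print(test_config)
```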

### Step 3: Connect

1. Click **Connect** to start crawling
2. A new web source is created with the name `Web: <hostname>`
3. An initial sync is triggered automatically
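
As a rough illustration of the naming pattern, the snippet below builds a `Web: <hostname>` label from a root URL. It assumes the hostname is taken from the Root URL without the port or path, which this page does not confirm. `source_name` is a hypothetical helper.

```python
from urllib.parse import urlparse


def source_name(root_url):
    """Build a 'Web: <hostname>' label from a root URL (assumed naming rule)."""
    hostname = urlparse(root_url).hostname
    if not hostname:
        raise ValueError("Root URL has no hostname")
    return f"Web: {hostname}"


print(source_name("https://docs.example.com/guides/"))
```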

<Check>
  Your Web Connector is now configured. You can monitor crawl progress on the Integrations page.
</Check>

***

## Configuration Tips

<Tip>
  **Start small**: Begin with a low `Max Pages` value (e.g., 100) to test the crawler before running a full crawl.
</Tip>

### Blacklist Patterns

Use blacklist patterns to exclude specific sections from crawling:

```
/api/*
/admin/*
/login/*
*.pdf
```

Common patterns to exclude:

* Login and authentication pages (`/login/*`, `/auth/*`)
* API documentation (`/api/*`, `/swagger/*`)
* Admin sections (`/admin/*`, `/dashboard/*`)
* Large binary files (`*.zip`, `*.tar.gz`)
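
To check a pattern list before saving it, you can approximate the matching locally. The sketch below assumes the patterns are shell-style globs matched against the URL path, which this page does not specify, so treat the results as a rough preview. `is_blacklisted` is a hypothetical helper built on Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

BLACKLIST = ["/api/*", "/admin/*", "/login/*", "*.pdf"]


def is_blacklisted(url, patterns):
    """Return True if the URL path matches any glob pattern (assumed semantics)."""
    path = urlparse(url).path or "/"
    return any(fnmatchcase(path, pattern) for pattern in patterns)


for url in [
    "https://example.com/api/v1/users",
    "https://example.com/files/report.pdf",
    "https://example.com/docs/intro",
    "https://example.com/api",
]:
    print(url, is_blacklisted(url, BLACKLIST))
```

Under these assumptions, `/api/*` excludes everything below `/api/` but not the bare `/api` path itself. Add a separate `/api` pattern if that page should be excluded too.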

### Subdomain Handling

Enable **Include Subdomains** only if the site hosts content you want to index on subdomains, for example:

* `blog.example.com`
* `docs.example.com`
* `help.example.com`

<Warning>
  Enabling subdomains can significantly increase the crawl scope. Monitor page counts carefully.
</Warning>
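
The effect of this setting on crawl scope can be sketched as a hostname check. This is an assumption about the behavior, not the connector's actual rule: a URL is treated as in scope when its hostname equals the root hostname or, with subdomains enabled, ends with `.` followed by the root hostname. `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlparse


def in_scope(url, root_url, include_subdomains=False):
    """Return True if url belongs to the root host, or to one of its subdomains."""
    host = urlparse(url).hostname or ""
    root_host = urlparse(root_url).hostname or ""
    if not host or not root_host:
        return False
    if host == root_host:
        return True
    # The leading dot prevents "notexample.com" from matching "example.com".
    return include_subdomains and host.endswith("." + root_host)


root = "https://example.com"
print(in_scope("https://blog.example.com/post", root))
print(in_scope("https://blog.example.com/post", root, include_subdomains=True))
print(in_scope("https://notexample.com/", root, include_subdomains=True))
```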

***

## Managing the Integration

### Viewing Sync Status

Navigate to **Settings** → **Integrations** to view the sync status for each source directly on the list, including last sync time, number of indexed items, and any errors. Click **Configure** on a source for more details.

### Updating Configuration

1. Navigate to **Settings** → **Integrations**
2. Find your web source and click **Configure**
3. Update settings as needed
4. Click **Save** to apply changes

Changes take effect on the next sync cycle.

### Removing the Integration

1. Navigate to **Settings** → **Integrations**
2. Find your web source and click **Configure**
3. Click **Delete Permanently**

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="Crawler not finding pages">
    Common causes:

    * **Max Depth too low**: Increase depth if pages are nested deep in the site structure
    * **Blacklist too aggressive**: Check if your patterns are excluding desired content
    * **JavaScript-rendered content**: The crawler may not execute JavaScript, so only content present in the static HTML is indexed

    Try starting with a higher Max Depth and fewer blacklist patterns.
  </Accordion>

  <Accordion title="Crawl taking too long">
    To speed up crawling:

    * Lower **Max Pages** to limit scope
    * Add **Blacklist Patterns** for large or irrelevant sections
    * Disable **Include Subdomains** if not needed

    Large sites (100,000+ pages) may take several hours to fully index.
  </Accordion>

  <Accordion title="403 or access denied errors">
    The target site may be blocking the crawler:

    * Verify the site allows web crawlers (check robots.txt)
    * Some sites block automated access entirely
    * Try setting a custom **User Agent** string
  </Accordion>

  <Accordion title="Missing recent content">
    Content updates are picked up during sync cycles:

    * Check when the last sync occurred
    * Trigger a manual sync if needed
    * New pages may not be discovered if they're not linked from existing indexed pages
  </Accordion>
</AccordionGroup>
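
When investigating 403 responses, it can help to check from your own machine how the site responds to a given user agent string. The sketch below uses Python's standard `urllib.request`. The `build_request` and `fetch_status` helpers and the `ExampleCrawler/1.0` string are hypothetical, and the connector's default user agent is not documented here, so this only approximates what the crawler sends.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="GET")


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the site sends for this user agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 when the site blocks the request


if __name__ == "__main__":
    # Replace the URL with the page that fails in the crawl.
    print(fetch_status("https://example.com/", "ExampleCrawler/1.0"))
```

If the status changes from 403 to 200 when you vary the user agent string, the site is probably filtering on that header. Make sure the site's terms permit crawling before you work around such a block.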

## Security Considerations

* **Respect robots.txt**: Keep this enabled to follow site owner preferences
* **Rate limiting**: The crawler automatically rate-limits requests to avoid overloading servers
* **Public content only**: Only publicly accessible pages are indexed
* **No authentication**: The crawler cannot access login-protected content

## What's Next

<CardGroup cols={3}>
  <Card title="Search Your Data" icon="magnifying-glass" href="/user-guide/search">
    Learn how to search across indexed web content
  </Card>

  <Card title="AI Assistant" icon="robot" href="/user-guide/ai-assistant">
    Ask questions about your indexed websites
  </Card>

  <Card title="Add More Connectors" icon="plug" href="/connectors/overview">
    Connect additional data sources
  </Card>
</CardGroup>
