
Web Crawler Connector

Crawl and index public or internal websites.

Overview

The Web Crawler allows you to:

  • Index website content
  • Follow links automatically
  • Respect robots.txt
  • Handle authentication

Prerequisites

  • Target website URL
  • Network access to site
  • (Optional) Authentication credentials

Configuration

| Setting | Description |
| --- | --- |
| Start URL | Beginning URL for the crawl |
| Depth | How many link levels to follow from the start URL |
| Max Pages | Maximum number of pages to index |
| Include Patterns | URL patterns to include |
| Exclude Patterns | URL patterns to skip |
| Respect robots.txt | Honor the site's robot rules |
| Stay on Domain | Restrict crawling to the starting domain |
| Include Subdomains | Follow links to subdomains of the starting domain |
| User Agent | Custom user agent string for requests |
| Download Files | Download and process linked files (e.g., PDFs) |
| Allowed File Types | File types to process when downloading (e.g., pdf, docx) |
| Headers | Custom HTTP headers to send with requests |
| Mode | Crawl mode: default or documentation |
| Doc Framework | Documentation framework preset (e.g., Docusaurus, MkDocs) |
| Sitemap URL | Sitemap URL for efficient page discovery |
| Incremental Sync | Only re-index changed pages on subsequent syncs |

Setup Steps

  1. Add Connector: Knowledge → Add Data Source → Web Crawler
  2. Enter Start URL: Base URL to begin crawling
  3. Configure Depth: Set link-following depth
  4. Set Patterns: Include/exclude URL patterns
  5. Test & Create: Verify and save
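Taken together, the steps above amount to a configuration like the sketch below. The field names are illustrative assumptions, not the connector's exact schema:

```python
# Illustrative crawler settings; field names are assumptions,
# not the connector's actual API schema.
crawl_config = {
    "start_url": "https://docs.example.com/",
    "depth": 3,
    "max_pages": 1000,
    "include_patterns": ["https://docs.example.com/*"],
    "exclude_patterns": ["*/login/*", "*/admin/*"],
    "respect_robots_txt": True,
    "stay_on_domain": True,
}

def validate(config: dict) -> list[str]:
    """Return a list of problems with the config (empty if valid)."""
    problems = []
    if not config.get("start_url", "").startswith(("http://", "https://")):
        problems.append("start_url must be an absolute http(s) URL")
    if config.get("depth", 0) < 0:
        problems.append("depth must be non-negative")
    if config.get("max_pages", 0) <= 0:
        problems.append("max_pages must be positive")
    return problems
```

A quick sanity check like `validate(crawl_config)` before saving catches the most common mistakes (relative start URL, zero page budget) early.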

URL Patterns

Include specific paths:

https://docs.example.com/*
https://example.com/help/*

Exclude paths:

*/login/*
*/admin/*
*.pdf
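Patterns use simple wildcard matching, where `*` matches any sequence of characters. A sketch of how include/exclude filtering typically combines, using Python's `fnmatch` (the precedence shown, exclude winning over include, is an assumption):

```python
from fnmatch import fnmatch

include = ["https://docs.example.com/*", "https://example.com/help/*"]
exclude = ["*/login/*", "*/admin/*", "*.pdf"]

def should_crawl(url: str) -> bool:
    """Crawl a URL only if it matches an include pattern
    and no exclude pattern (exclude takes precedence)."""
    if any(fnmatch(url, pat) for pat in exclude):
        return False
    return any(fnmatch(url, pat) for pat in include)
```

For example, `should_crawl("https://docs.example.com/admin/users")` is false even though the URL matches an include pattern, because the exclude pattern `*/admin/*` also matches.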

Crawl Settings

| Setting | Default | Description |
| --- | --- | --- |
| Depth | 3 | Link depth to follow |
| Max Pages | 1000 | Maximum pages |
| Delay | 1s | Delay between requests |
| Timeout | 30s | Page load timeout |
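The Depth, Max Pages, and Delay settings interact roughly as in this breadth-first crawl sketch. The connector's actual traversal strategy may differ; `get_links` here is a stand-in for page fetching and link extraction:

```python
import time
from collections import deque

def crawl(start_url, get_links, depth=3, max_pages=1000, delay=1.0):
    """Breadth-first crawl honoring Depth, Max Pages, and Delay.

    get_links(url) -> iterable of absolute URLs found on that page.
    Returns the list of pages visited, in crawl order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, distance from start)
    pages = []
    while queue and len(pages) < max_pages:
        url, d = queue.popleft()
        pages.append(url)
        if d < depth:                 # stop following links past Depth
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, d + 1))
        time.sleep(delay)             # politeness delay between requests
    return pages
```

Note that Depth bounds how far links are followed while Max Pages bounds the total, so whichever limit is hit first ends the crawl.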

Authentication

Basic Auth

For password-protected sites:

Username: crawler
Password: ****

For sites that require a login flow beyond Basic Auth, contact support.
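Basic Auth simply sends an `Authorization: Basic` header with each request, which you can reproduce to verify credentials outside the crawler (the URL and credentials below are hypothetical):

```python
import base64
import urllib.request

# Hypothetical credentials for a password-protected site.
username, password = "crawler", "s3cret"
token = base64.b64encode(f"{username}:{password}".encode()).decode()

req = urllib.request.Request(
    "https://internal.example.com/docs",  # hypothetical protected URL
    headers={"Authorization": f"Basic {token}"},
)
# urllib.request.urlopen(req)  # would send the Authorization header
```

If this request succeeds from the same network the crawler runs on, the credentials and network access are both in order.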

Documentation Mode

The Web Crawler supports a specialized Documentation Mode designed for crawling documentation sites. When the mode is set to documentation, the crawler applies framework-aware extraction logic optimized for sites built with tools like Docusaurus, MkDocs, GitBook, and similar platforms. This mode leverages sitemaps for efficient page discovery, understands navigation structures, and extracts content more cleanly by filtering out repeated navigation elements. Use the Doc Framework setting to select a preset for your documentation platform.
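The sitemap-based discovery this mode relies on can be illustrated with a minimal parser for the standard `<urlset>`/`<loc>` sitemap format (the sample XML below is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap in the standard sitemaps.org format.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/intro</loc></url>
  <url><loc>https://docs.example.com/guide/setup</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract page URLs from a sitemap's <loc> elements."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]
```

Reading the sitemap up front gives the crawler a complete page list in one request, instead of discovering pages link by link.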

Content Indexed

| Element | Indexed |
| --- | --- |
| Page text | Yes |
| Headings | Yes |
| Links | Metadata |
| Images | Alt text |
| Meta tags | Yes |

Best Practices

  1. Start with documentation sites
  2. Set reasonable depth limits
  3. Exclude login/admin pages
  4. Use appropriate rate limiting

Troubleshooting

Blocked by robots.txt: Review the site's robot rules, or disable the Respect robots.txt setting if you have permission to crawl

Timeout errors: Increase the page load timeout or reduce crawl concurrency

Missing pages: Verify that the include/exclude URL patterns match the pages you expect
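For the robots.txt case, you can check which rules apply before changing crawler settings, e.g. with Python's standard `urllib.robotparser` (the rules shown are a hypothetical example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally you would point at the live file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse hypothetical rules inline instead.
rp.parse("""\
User-agent: *
Disallow: /admin/
""".splitlines())

rp.can_fetch("MyCrawler", "https://example.com/admin/panel")  # False
rp.can_fetch("MyCrawler", "https://example.com/docs/intro")   # True
```

If `can_fetch` returns False for pages you expect to index, the site's rules are blocking the crawler's user agent for those paths.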