# Web Crawler Connector

Crawl and index public or internal websites.
## Overview

The Web Crawler allows you to:
- Index website content
- Follow links automatically
- Respect robots.txt
- Handle authentication
## Prerequisites
- Target website URL
- Network access to site
- (Optional) Authentication credentials
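Before adding the connector, it helps to confirm the target site is actually reachable from your network. A minimal Python sketch (the helper name and URL are illustrative, not part of the connector):

```python
import urllib.request

def reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        # DNS failure, refused connection, timeout, or HTTP error
        return False

print(reachable("https://docs.example.com/"))
```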
## Configuration
| Setting | Description |
|---|---|
| Start URL | Beginning URL for crawl |
| Depth | Number of link levels to follow from the Start URL |
| Max Pages | Maximum pages to index |
| Include Patterns | URL patterns to include |
| Exclude Patterns | URL patterns to skip |
| Respect robots.txt | Honor robot rules |
| Stay on Domain | Restrict crawling to starting domain |
| Include Subdomains | Follow links to subdomains of the starting domain |
| User Agent | Custom user agent string for requests |
| Download Files | Download and process linked files (e.g., PDFs) |
| Allowed File Types | File types to process when downloading (e.g., pdf, docx) |
| Headers | Custom HTTP headers to send with requests |
| Mode | Crawl mode: default or documentation |
| Doc Framework | Documentation framework preset (e.g., Docusaurus, MkDocs) |
| Sitemap URL | Sitemap URL for efficient page discovery |
| Incremental Sync | Only re-index changed pages on subsequent syncs |
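Put together, a crawl configuration using the settings above might look like the following illustrative JSON. The field names and values are examples only, not the connector's actual schema:

```json
{
  "start_url": "https://docs.example.com/",
  "depth": 3,
  "max_pages": 1000,
  "include_patterns": ["https://docs.example.com/*"],
  "exclude_patterns": ["*/login/*", "*/admin/*"],
  "respect_robots_txt": true,
  "stay_on_domain": true,
  "include_subdomains": false,
  "mode": "documentation",
  "doc_framework": "docusaurus",
  "sitemap_url": "https://docs.example.com/sitemap.xml",
  "incremental_sync": true
}
```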
## Setup Steps

1. Add Connector: Knowledge → Add Data Source → Web Crawler
2. Enter Start URL: the base URL to begin crawling
3. Configure Depth: set the link-following depth
4. Set Patterns: add include/exclude URL patterns
5. Test & Create: verify the configuration and save
## URL Patterns

Include specific paths:

```
https://docs.example.com/*
https://example.com/help/*
```

Exclude paths:

```
*/login/*
*/admin/*
*.pdf
```
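These glob-style patterns behave roughly like the following sketch using Python's `fnmatch`. The exclude-wins ordering is an assumption for illustration, not necessarily the connector's exact semantics:

```python
from fnmatch import fnmatch

INCLUDE = ["https://docs.example.com/*", "https://example.com/help/*"]
EXCLUDE = ["*/login/*", "*/admin/*", "*.pdf"]

def should_crawl(url: str) -> bool:
    # Exclude patterns take priority; otherwise at least one include must match.
    if any(fnmatch(url, p) for p in EXCLUDE):
        return False
    return any(fnmatch(url, p) for p in INCLUDE)

print(should_crawl("https://docs.example.com/api/intro"))    # True
print(should_crawl("https://docs.example.com/admin/users"))  # False
print(should_crawl("https://example.com/help/guide.pdf"))    # False
```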
## Crawl Settings
| Setting | Default | Description |
|---|---|---|
| Depth | 3 | Link depth to follow |
| Max Pages | 1000 | Maximum pages |
| Delay | 1s | Delay between requests |
| Timeout | 30s | Page load timeout |
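How Depth, Max Pages, and Delay interact can be sketched as a breadth-first crawl. This is a simplified illustration, not the connector's implementation; `links_of` stands in for the real fetch-and-parse step:

```python
import time
from collections import deque

def crawl(start, links_of, depth=3, max_pages=1000, delay=0.0):
    """Breadth-first crawl bounded by link depth and total page count."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, d = queue.popleft()
        order.append(url)          # fetch + index would happen here
        time.sleep(delay)          # politeness delay between requests
        if d < depth:
            for link in links_of(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, d + 1))
    return order

# Tiny in-memory link graph standing in for a real site.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": ["/d"], "/d": []}
print(crawl("/", lambda u: site.get(u, []), depth=2, max_pages=10))
# ['/', '/a', '/b', '/c']
```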
## Authentication

### Basic Auth

For password-protected sites, supply the account the crawler should use:

```
Username: crawler
Password: ****
```
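Under the hood, Basic Auth is just an HTTP header derived from the credentials (RFC 7617). A sketch with placeholder values:

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header sent with each crawl request."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("crawler", "s3cret"))
```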
### Cookie-based

For sites that require an interactive login, cookie-based authentication is available; contact support to set it up.
## Documentation Mode

The Web Crawler supports a specialized Documentation Mode for crawling documentation sites. When Mode is set to `documentation`, the crawler applies framework-aware extraction logic optimized for sites built with tools such as Docusaurus, MkDocs, and GitBook. This mode leverages sitemaps for efficient page discovery, understands navigation structures, and extracts content more cleanly by filtering out repeated navigation elements. Use the Doc Framework setting to select a preset for your documentation platform.
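Sitemap-based discovery boils down to reading `<loc>` entries from the sitemap XML, as in this sketch. The sample is inlined; a real crawler would fetch the file from the configured Sitemap URL:

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/intro</loc></url>
  <url><loc>https://docs.example.com/api/auth</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list:
    """Extract page URLs from a sitemap (https://www.sitemaps.org/ format)."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

print(sitemap_urls(SITEMAP))
# ['https://docs.example.com/intro', 'https://docs.example.com/api/auth']
```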
## Content Indexed
| Element | Indexed |
|---|---|
| Page text | Yes |
| Headings | Yes |
| Links | Metadata |
| Images | Alt text |
| Meta tags | Yes |
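The table above can be illustrated with a small extractor built on Python's standard `html.parser`, pulling headings and image alt text the way an indexer might (a simplified sketch, not the connector's actual pipeline):

```python
from html.parser import HTMLParser

class Extractor(HTMLParser):
    """Collect heading text and image alt text from a page."""
    def __init__(self):
        super().__init__()
        self.headings, self.alts, self._in_heading = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and data.strip():
            self.headings.append(data.strip())

p = Extractor()
p.feed('<h1>Guide</h1><p>Body</p><img src="x.png" alt="diagram">')
print(p.headings, p.alts)  # ['Guide'] ['diagram']
```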
## Best Practices

- Start with well-structured documentation sites before large general-purpose sites
- Set reasonable depth limits to bound the crawl size
- Exclude login and admin pages
- Use rate limiting appropriate to the target site
## Troubleshooting

- Blocked by robots.txt: check the site's robot rules, or disable the robots.txt check for sites you control
- Timeout errors: increase the timeout or reduce concurrency
- Missing pages: verify that the include/exclude URL patterns are correct
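To diagnose robots.txt blocks, Python's standard `urllib.robotparser` can replay the site's rules against a specific URL. The rules are inlined here for the example; `parse()` would normally receive the lines of the fetched robots.txt:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin/"])

# Check what the rules allow for a given user agent and URL.
print(rp.can_fetch("MyCrawler", "https://example.com/docs/intro"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/admin/users"))  # False
```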