# Web Crawler Connector
Crawl and index public or internal websites.
## Overview
The Web Crawler allows you to:
- Index website content
- Follow links automatically
- Respect robots.txt
- Handle authentication
## Prerequisites
- Target website URL
- Network access to the site (see the quick check below)
- (Optional) Authentication credentials
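Before creating the connector, it can help to confirm that the host running the crawl can actually reach the target site. A minimal check in Python, assuming the `requests` library and using a placeholder URL:

```python
# Minimal reachability check for the target site.
# The URL is a placeholder; substitute your own start URL.
import requests

resp = requests.get("https://docs.example.com", timeout=10)
print(resp.status_code)  # 200 indicates the site is reachable from this host
```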
## Configuration
| Setting | Description |
|---|---|
| Start URL | URL where the crawl begins |
| Depth | How many link levels to follow from the Start URL |
| Max Pages | Maximum number of pages to index |
| Include Patterns | URL patterns to include |
| Exclude Patterns | URL patterns to skip |
| Respect robots.txt | Honor the site's robots.txt rules |
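The settings above can be thought of as a single configuration object. The sketch below expresses them as a plain Python mapping; the key names are illustrative, not the connector's actual field names, and the values are the defaults and examples used elsewhere on this page.

```python
# Illustrative crawl configuration. Keys mirror the table above but are not
# the connector's real setting names.
CRAWL_CONFIG = {
    "start_url": "https://docs.example.com",
    "depth": 3,                       # link levels to follow from the start URL
    "max_pages": 1000,                # stop after indexing this many pages
    "include_patterns": ["https://docs.example.com/*"],
    "exclude_patterns": ["*/login/*", "*/admin/*", "*.pdf"],
    "respect_robots_txt": True,       # honor the site's robots.txt rules
    "delay_seconds": 1,               # politeness delay between requests
    "timeout_seconds": 30,            # per-page load timeout
}
```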
## Setup Steps
1. Add Connector: Knowledge → Add Data Source → Web Crawler
2. Enter Start URL: the base URL to begin crawling from
3. Configure Depth: set the link-following depth
4. Set Patterns: add include/exclude URL patterns
5. Test & Create: verify and save
## URL Patterns
Include specific paths:
- `https://docs.example.com/*`
- `https://example.com/help/*`

Exclude paths:
- `*/login/*`
- `*/admin/*`
- `*.pdf`
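The `*` wildcards behave like shell-style globs. Assuming the connector follows those semantics, the matching logic can be reproduced with Python's standard-library `fnmatch`, which is a convenient way to test patterns before saving the connector:

```python
from fnmatch import fnmatch

# Patterns copied from the examples above.
INCLUDE = ["https://docs.example.com/*", "https://example.com/help/*"]
EXCLUDE = ["*/login/*", "*/admin/*", "*.pdf"]

def should_crawl(url: str) -> bool:
    """Crawl a URL only if it matches an include pattern and no exclude pattern."""
    included = any(fnmatch(url, pat) for pat in INCLUDE)
    excluded = any(fnmatch(url, pat) for pat in EXCLUDE)
    return included and not excluded

print(should_crawl("https://docs.example.com/setup"))        # True
print(should_crawl("https://docs.example.com/admin/users"))  # False (excluded)
```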
## Crawl Settings
| Setting | Default | Description |
|---|---|---|
| Depth | 3 | Link depth to follow |
| Max Pages | 1000 | Maximum pages |
| Delay | 1s | Delay between requests |
| Timeout | 30s | Page load timeout |
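To show how these settings interact, here is a minimal breadth-first crawl loop in Python using the defaults above. It is a sketch of generic crawler behavior, not the connector's implementation, and it assumes the `requests` and `beautifulsoup4` libraries.

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

DEPTH, MAX_PAGES, DELAY, TIMEOUT = 3, 1000, 1.0, 30  # defaults from the table

def crawl(start_url: str) -> dict:
    """Breadth-first crawl that honors the depth, page, delay, and timeout limits."""
    queue = deque([(start_url, 0)])        # (url, link depth from the start URL)
    seen, pages = {start_url}, {}
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=TIMEOUT)
        except requests.RequestException:
            continue                       # skip unreachable or slow pages
        pages[url] = resp.text
        if depth < DEPTH:                  # only follow links within the depth limit
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(DELAY)                  # politeness delay between requests
    return pages
```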
## Authentication
### Basic Auth
For password-protected sites, supply HTTP Basic credentials:
- Username: `crawler`
- Password: `****`
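With Basic Auth configured, the crawler sends the credentials with every request. The equivalent in Python looks like this; the URL and password below are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

password = "********"  # placeholder; the real password is configured in the connector

# The URL is illustrative; any Basic-Auth-protected page behaves the same way.
resp = requests.get(
    "https://internal.example.com/docs",
    auth=HTTPBasicAuth("crawler", password),
    timeout=30,
)
print(resp.status_code)  # 200 when the credentials are accepted
```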
### Cookie-based
For sites requiring login (contact support).
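Conceptually, cookie-based crawling means logging in once and reusing the session cookie for every subsequent request. A generic sketch follows; the login URL and form field names are assumptions, and the actual configuration is handled with support.

```python
import requests

session = requests.Session()

# Hypothetical login form; the URL and field names are illustrative only.
session.post(
    "https://example.com/login",
    data={"username": "crawler", "password": "********"},
    timeout=30,
)

# The session now carries the authentication cookie on subsequent requests.
resp = session.get("https://example.com/help/getting-started", timeout=30)
print(resp.status_code)
```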
## Content Indexed
| Element | Indexed |
|---|---|
| Page text | Yes |
| Headings | Yes |
| Links | Metadata |
| Images | Alt text |
| Meta tags | Yes |
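As an illustration of the elements listed above, the sketch below extracts them from a page's HTML with BeautifulSoup. The parsing approach is an assumption for illustration; the connector's own extractor is not exposed.

```python
from bs4 import BeautifulSoup

def extract(html: str) -> dict:
    """Pull out the elements listed in the table above from one page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "text": soup.get_text(separator=" ", strip=True),
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])],
        "links": [{"href": a["href"], "text": a.get_text(strip=True)}
                  for a in soup.find_all("a", href=True)],
        "image_alt_text": [img["alt"] for img in soup.find_all("img", alt=True)],
        "meta_tags": {m["name"]: m.get("content", "")
                      for m in soup.find_all("meta", attrs={"name": True})},
    }
```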
## Best Practices
- Start with documentation sites
- Set reasonable depth limits
- Exclude login/admin pages
- Use appropriate rate limiting
## Troubleshooting
- Blocked by robots.txt: review the site's robots.txt rules (see the check below) or disable the Respect robots.txt setting
- Timeout errors: increase the timeout or reduce concurrency
- Missing pages: verify that the include/exclude URL patterns are correct
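If pages are being skipped because of robots.txt, you can reproduce the check with Python's standard-library `urllib.robotparser`, assuming the connector follows standard robots.txt semantics. The URLs below are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Check whether a generic crawler user agent may fetch a given URL.
print(rp.can_fetch("*", "https://docs.example.com/setup"))
```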