# Web Crawler Connector
Crawl and index public or internal websites.
## Overview
The Web Crawler allows you to:
- Index website content
- Follow links automatically
- Respect robots.txt
- Handle authentication
## Prerequisites
- Target website URL
- Network access to the site (see the quick check below)
- (Optional) Authentication credentials
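Before creating the connector, it can help to confirm that the host running the crawl can actually reach the target site. A minimal check in Python, assuming the `requests` library and using a placeholder URL:

```python
# Minimal reachability check for the target site.
# The URL is a placeholder; substitute your own start URL.
import requests

resp = requests.get("https://docs.example.com", timeout=10)
print(resp.status_code)  # 200 indicates the site is reachable from this host
```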
## Configuration
| Setting | Description |
|---|---|
| Start URL | URL where the crawl begins |
| Depth | How many link levels to follow from the Start URL |
| Max Pages | Maximum number of pages to index |
| Include Patterns | URL patterns to include |
| Exclude Patterns | URL patterns to skip |
| Respect robots.txt | Honor the site's robots.txt rules |
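The settings above can be thought of as a single configuration object. The sketch below expresses them as a plain Python mapping; the key names are illustrative, not the connector's actual field names, and the values are the defaults and examples used elsewhere on this page.

```python
# Illustrative crawl configuration. Keys mirror the table above but are not
# the connector's real setting names.
CRAWL_CONFIG = {
    "start_url": "https://docs.example.com",
    "depth": 3,                       # link levels to follow from the start URL
    "max_pages": 1000,                # stop after indexing this many pages
    "include_patterns": ["https://docs.example.com/*"],
    "exclude_patterns": ["*/login/*", "*/admin/*", "*.pdf"],
    "respect_robots_txt": True,       # honor the site's robots.txt rules
    "delay_seconds": 1,               # politeness delay between requests
    "timeout_seconds": 30,            # per-page load timeout
}
```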
## Setup Steps
1. Add Connector: Knowledge → Add Data Source → Web Crawler
2. Enter Start URL: the base URL to begin crawling from
3. Configure Depth: set the link-following depth
4. Set Patterns: add include/exclude URL patterns
5. Test & Create: verify and save
## URL Patterns
Include specific paths:
- `https://docs.example.com/*`
- `https://example.com/help/*`

Exclude paths:
- `*/login/*`
- `*/admin/*`
- `*.pdf`
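The `*` wildcards behave like shell-style globs. Assuming the connector follows those semantics, the matching logic can be reproduced with Python's standard-library `fnmatch`, which is a convenient way to test patterns before saving the connector:

```python
from fnmatch import fnmatch

# Patterns copied from the examples above.
INCLUDE = ["https://docs.example.com/*", "https://example.com/help/*"]
EXCLUDE = ["*/login/*", "*/admin/*", "*.pdf"]

def should_crawl(url: str) -> bool:
    """Crawl a URL only if it matches an include pattern and no exclude pattern."""
    included = any(fnmatch(url, pat) for pat in INCLUDE)
    excluded = any(fnmatch(url, pat) for pat in EXCLUDE)
    return included and not excluded

print(should_crawl("https://docs.example.com/setup"))        # True
print(should_crawl("https://docs.example.com/admin/users"))  # False (excluded)
```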
## Crawl Settings
| Setting | Default | Description |
|---|---|---|
| Depth | 3 | Link depth to follow |
| Max Pages | 1000 | Maximum pages |
| Delay | 1s | Delay between requests |
| Timeout | 30s | Page load timeout |
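To show how these settings interact, here is a minimal breadth-first crawl loop in Python using the defaults above. It is a sketch of generic crawler behavior, not the connector's implementation, and it assumes the `requests` and `beautifulsoup4` libraries.

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

DEPTH, MAX_PAGES, DELAY, TIMEOUT = 3, 1000, 1.0, 30  # defaults from the table

def crawl(start_url: str) -> dict:
    """Breadth-first crawl that honors the depth, page, delay, and timeout limits."""
    queue = deque([(start_url, 0)])        # (url, link depth from the start URL)
    seen, pages = {start_url}, {}
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=TIMEOUT)
        except requests.RequestException:
            continue                       # skip unreachable or slow pages
        pages[url] = resp.text
        if depth < DEPTH:                  # only follow links within the depth limit
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(DELAY)                  # politeness delay between requests
    return pages
```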
## Authentication
### Basic Auth
For password-protected sites, supply HTTP Basic credentials:
- Username: `crawler`
- Password: `****`
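With Basic Auth configured, the crawler sends the credentials with every request. The equivalent in Python looks like this; the URL and password below are placeholders:

```python
import requests
from requests.auth import HTTPBasicAuth

password = "********"  # placeholder; the real password is configured in the connector

# The URL is illustrative; any Basic-Auth-protected page behaves the same way.
resp = requests.get(
    "https://internal.example.com/docs",
    auth=HTTPBasicAuth("crawler", password),
    timeout=30,
)
print(resp.status_code)  # 200 when the credentials are accepted
```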
### Cookie-based
For sites requiring login (contact support).
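Conceptually, cookie-based crawling means logging in once and reusing the session cookie for every subsequent request. A generic sketch follows; the login URL and form field names are assumptions, and the actual configuration is handled with support.

```python
import requests

session = requests.Session()

# Hypothetical login form; the URL and field names are illustrative only.
session.post(
    "https://example.com/login",
    data={"username": "crawler", "password": "********"},
    timeout=30,
)

# The session now carries the authentication cookie on subsequent requests.
resp = session.get("https://example.com/help/getting-started", timeout=30)
print(resp.status_code)
```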
## Content Indexed
| Element | Indexed |
|---|---|
| Page text | Yes |
| Headings | Yes |
| Links | Metadata |
| Images | Alt text |
| Meta tags | Yes |
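As an illustration of the elements listed above, the sketch below extracts them from a page's HTML with BeautifulSoup. The parsing approach is an assumption for illustration; the connector's own extractor is not exposed.

```python
from bs4 import BeautifulSoup

def extract(html: str) -> dict:
    """Pull out the elements listed in the table above from one page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "text": soup.get_text(separator=" ", strip=True),
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])],
        "links": [{"href": a["href"], "text": a.get_text(strip=True)}
                  for a in soup.find_all("a", href=True)],
        "image_alt_text": [img["alt"] for img in soup.find_all("img", alt=True)],
        "meta_tags": {m["name"]: m.get("content", "")
                      for m in soup.find_all("meta", attrs={"name": True})},
    }
```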
## Best Practices
- Start with documentation sites
- Set reasonable depth limits
- Exclude login/admin pages
- Use appropriate rate limiting
## Troubleshooting
- Blocked by robots.txt: review the site's robots.txt rules (see the check below) or disable the Respect robots.txt setting
- Timeout errors: increase the timeout or reduce concurrency
- Missing pages: verify that the include/exclude URL patterns are correct
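If pages are being skipped because of robots.txt, you can reproduce the check with Python's standard-library `urllib.robotparser`, assuming the connector follows standard robots.txt semantics. The URLs below are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Check whether a generic crawler user agent may fetch a given URL.
print(rp.can_fetch("*", "https://docs.example.com/setup"))
```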