Web Crawler Connector

Crawl and index public or internal websites.

Overview

The Web Crawler allows you to:

  • Index website content
  • Follow links automatically
  • Respect robots.txt
  • Handle authentication

Prerequisites

  • Target website URL
  • Network access to site
  • (Optional) Authentication credentials

Configuration

Setting              Description
Start URL            Beginning URL for the crawl
Depth                How many levels of links to follow
Max Pages            Maximum number of pages to index
Include Patterns     URL patterns to include
Exclude Patterns     URL patterns to skip
Respect robots.txt   Whether to honor robots.txt rules
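
For illustration, these settings could be captured in a structure like the sketch below; the field names are hypothetical and do not reflect the connector's actual configuration schema:

# Hypothetical field names, shown only to illustrate the settings above.
crawler_config = {
    "start_url": "https://docs.example.com/",
    "depth": 3,                        # levels of links to follow
    "max_pages": 1000,                 # stop after this many pages
    "include_patterns": ["https://docs.example.com/*"],
    "exclude_patterns": ["*/login/*", "*/admin/*", "*.pdf"],
    "respect_robots_txt": True,
}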

Setup Steps

  1. Add Connector: Knowledge → Add Data Source → Web Crawler
  2. Enter Start URL: Base URL to begin crawling
  3. Configure Depth: Set link-following depth
  4. Set Patterns: Include/exclude URL patterns
  5. Test & Create: Verify and save

URL Patterns

Include specific paths:

https://docs.example.com/*
https://example.com/help/*

Exclude paths:

*/login/*
*/admin/*
*.pdf
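
Patterns use shell-style wildcards, where * matches any run of characters. Below is a minimal sketch of how include and exclude lists can be combined, using Python's fnmatch; the connector's actual matching rules may differ:

from fnmatch import fnmatch

include = ["https://docs.example.com/*", "https://example.com/help/*"]
exclude = ["*/login/*", "*/admin/*", "*.pdf"]

def should_crawl(url):
    # A URL is crawled only if it matches at least one include pattern
    # and none of the exclude patterns.
    if not any(fnmatch(url, pat) for pat in include):
        return False
    return not any(fnmatch(url, pat) for pat in exclude)

print(should_crawl("https://docs.example.com/setup"))        # True
print(should_crawl("https://docs.example.com/admin/users"))  # False
print(should_crawl("https://example.com/help/guide.pdf"))    # False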

Crawl Settings

Setting     Default   Description
Depth       3         Link depth to follow
Max Pages   1000      Maximum pages to index
Delay       1s        Delay between requests
Timeout     30s       Page load timeout
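
These settings combine roughly as in the sketch below: a breadth-first crawl that stops at the configured depth or page limit, waits between requests, and bounds each page load. This is an illustration, not the connector's implementation:

import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

DEPTH, MAX_PAGES, DELAY, TIMEOUT = 3, 1000, 1.0, 30  # defaults from the table above

def crawl(start_url):
    # Breadth-first crawl: stop after DEPTH link levels or MAX_PAGES pages,
    # sleep DELAY seconds between requests, give up on a page after TIMEOUT seconds.
    seen, queue, pages = {start_url}, deque([(start_url, 0)]), []
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=TIMEOUT).read().decode("utf-8", "ignore")
        except Exception:
            continue
        pages.append(url)
        if depth < DEPTH:
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(DELAY)
    return pages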

Authentication

Basic Auth

For password-protected sites:

Username: crawler
Password: ****

For sites that require a login flow beyond Basic Auth, contact support.
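
To confirm the credentials before saving the connector, you can make a quick request yourself; a minimal check with Python's requests library (the URL and credentials are placeholders):

import requests

# Placeholder URL and credentials; substitute your own.
resp = requests.get(
    "https://intranet.example.com/",
    auth=("crawler", "your-password"),
    timeout=30,
)
print(resp.status_code)  # 200 means the credentials were accepted; 401 means they were not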

Content Indexed

Element     Indexed
Page text   Yes
Headings    Yes
Links       Metadata
Images      Alt text
Meta tags   Yes
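
As a rough illustration of the elements above (not the connector's actual extraction pipeline), a BeautifulSoup sketch that pulls page text, headings, link URLs, image alt text, and named meta tags:

from bs4 import BeautifulSoup

def extract(html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "text": soup.get_text(" ", strip=True),                            # page text
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],          # headings
        "links": [a.get("href") for a in soup.find_all("a", href=True)],   # link metadata
        "image_alt": [img.get("alt", "") for img in soup.find_all("img")], # alt text
        "meta": {m.get("name"): m.get("content")
                 for m in soup.find_all("meta") if m.get("name")},         # named meta tags
    }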

Best Practices

  1. Start with documentation sites
  2. Set reasonable depth limits
  3. Exclude login/admin pages
  4. Use appropriate rate limiting

Troubleshooting

Blocked by robots.txt: Check the site's robots.txt rules, or disable the Respect robots.txt setting
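
To see whether a specific URL is allowed, you can test it against the site's robots.txt with Python's standard urllib.robotparser (the URL and user agent below are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()
# "*" stands in for the crawler's user agent string.
print(rp.can_fetch("*", "https://docs.example.com/help/getting-started"))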

Timeout errors: Increase timeout or reduce concurrency
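
To check whether a slow page is the cause, you can time a fetch yourself; a minimal check with the requests library (the URL is a placeholder):

import time
import requests

start = time.monotonic()
try:
    requests.get("https://docs.example.com/large-page", timeout=30)
    print(f"Loaded in {time.monotonic() - start:.1f}s")
except requests.Timeout:
    print("Page did not load within 30s; raise the Timeout setting")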

Missing pages: Verify URL patterns are correct