# @lobechat/web-crawler

LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.

## 📝 Introduction

`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.

## 🛠️ Core Features

- **Intelligent Content Extraction**: Identifies the main content using Mozilla's Readability algorithm
- **Multi-level Crawling Strategy**: Supports multiple crawling implementations, including basic crawling, Jina, and Browserless rendering
- **Custom URL Rules**: Handles site-specific crawling logic through a flexible rule system
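
The multi-level strategy can be pictured as a fallback chain. The sketch below is only an illustration and assumes that implementations are tried in order until one returns non-empty content, which this README does not spell out. `crawlWithFallback`, `CrawlImpl`, and the stub crawlers are hypothetical names, not the module's actual API.

```typescript
// Hypothetical names for illustration; not the module's real API.
type ImplName = 'naive' | 'jina' | 'browserless';

// A crawler resolves to Markdown, or undefined when it extracted nothing.
type CrawlImpl = (url: string) => Promise<string | undefined>;

interface CrawlResult {
  content: string;
  impl: ImplName;
}

// Try each implementation in the given order (assumed behavior) and
// return the first non-empty result. Errors move on to the next one.
async function crawlWithFallback(
  url: string,
  order: ImplName[],
  impls: Record<ImplName, CrawlImpl>,
): Promise<CrawlResult | undefined> {
  for (const name of order) {
    try {
      const content = await impls[name](url);
      if (content && content.trim().length > 0) {
        return { content, impl: name };
      }
    } catch {
      // This implementation failed; fall through to the next one.
    }
  }
  return undefined;
}

// Stub implementations so the sketch runs without network access.
const stubs: Record<ImplName, CrawlImpl> = {
  naive: async () => undefined, // pretend the plain fetch found nothing
  jina: async (url) => `# Page from ${url}`,
  browserless: async () => {
    throw new Error('rendering service unavailable');
  },
};

crawlWithFallback('https://example.com', ['naive', 'jina', 'browserless'], stubs).then(
  (result) => console.log(result?.impl),
);
```

With these stubs the `naive` crawler returns nothing, so the result comes from `jina` and `browserless` is never called.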
## 🤝 Contribution

Web page structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. The sections below describe how to add a rule and submit it.

### How to Contribute URL Rules

1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
2. Rule example:

```typescript
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',

    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',

    // Optional: specify crawling implementation, supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],

    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable Readability algorithm for filtering distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
```
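
To show how `urlPattern` and `urlTransform` might interact, here is a minimal sketch. It assumes the pattern is compiled with `new RegExp` and that `$1` in the transform refers to the first capture group, as in `String.prototype.replace`. `UrlRule` and `applyUrlRule` are hypothetical names, and the module's real matching logic may differ.

```typescript
// Hypothetical rule shape mirroring two fields from the rule example.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
}

// Returns the URL to crawl, or undefined when the rule does not match.
function applyUrlRule(url: string, rule: UrlRule): string | undefined {
  const pattern = new RegExp(rule.urlPattern);
  if (!pattern.test(url)) return undefined;
  // Without a transform, the original URL is crawled as-is.
  return rule.urlTransform ? url.replace(pattern, rule.urlTransform) : url;
}

const rule: UrlRule = {
  urlPattern: 'https://example.com/articles/(.*)',
  urlTransform: 'https://example.com/print/$1',
};

console.log(applyUrlRule('https://example.com/articles/hello-world', rule));
// prints "https://example.com/print/hello-world"
console.log(applyUrlRule('https://example.org/other', rule));
// prints undefined
```

A check like this is also a quick way to verify the example URLs you list as test cases in a pull request. Note that the pattern in this sketch is not anchored, so it can match in the middle of a longer URL.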

### Rule Submission Process

1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
2. Add or modify URL rules
3. Submit a Pull Request describing:

   - Target website characteristics
   - Problems solved by the rule
   - Test cases (example URLs)

## 📌 Note

This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.