@lobechat/web-crawler
LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
📝 Introduction
@lobechat/web-crawler is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
🛠️ Core Features
- Intelligent Content Extraction: Identifies the main content of a page using the Mozilla Readability algorithm
- Multi-level Crawling Strategy: Supports multiple crawling implementations including basic crawling, Jina, and Browserless rendering
- Custom URL Rules: Handles specific website crawling logic through a flexible rule system
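The multi-level strategy can be pictured as trying each implementation in turn until one returns content. The following is a minimal sketch under stated assumptions, not the module's actual API: `CrawlImpl`, `crawlWithFallback`, and the stub implementations are hypothetical names introduced only for illustration; only the implementation names 'naive', 'jina', and 'browserless' come from this README.

```typescript
// Hypothetical signature for one crawling implementation:
// resolves to Markdown content, or undefined if nothing usable was found.
type CrawlImpl = (url: string) => Promise<string | undefined>;

interface CrawlResult {
  content: string;
  impl: string;
}

// Try each named implementation in order and return the first non-empty result.
// An implementation that throws or returns nothing is skipped.
async function crawlWithFallback(
  url: string,
  impls: Record<string, CrawlImpl>,
  order: string[],
): Promise<CrawlResult | undefined> {
  for (const name of order) {
    const impl = impls[name];
    if (!impl) continue;
    try {
      const content = await impl(url);
      if (content) return { content, impl: name };
    } catch {
      // This implementation failed; fall through to the next one.
    }
  }
  return undefined;
}

// Stub implementations standing in for real crawlers (assumptions for the demo).
const stubImpls: Record<string, CrawlImpl> = {
  naive: async () => {
    throw new Error('blocked');
  },
  jina: async (url) => `# Content of ${url}`,
  browserless: async (url) => `# Rendered ${url}`,
};

crawlWithFallback('https://example.com', stubImpls, ['naive', 'jina', 'browserless']).then(
  (result) => console.log(result?.impl), // prints "jina"
);
```

In this sketch the cheapest implementation is listed first, so the heavier rendering path only runs when the earlier ones fail.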
🤝 Contribution
Web page structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can contribute as described below.
How to Contribute URL Rules
- Add new rules to the urlRules.ts file
- Rule example:
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',
    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',
    // Optional: specify crawling implementation, supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],
    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable Readability algorithm for filtering distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
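To make the interplay between `urlPattern` and `urlTransform` concrete, here is a minimal sketch of how such a rule could be matched and applied. It assumes the pattern is compiled with `new RegExp` and that `$1` refers to the first capture group, as in `String.prototype.replace`. The `UrlRule` and `RuleMatch` types and the `applyUrlRules` function are hypothetical names for illustration and may differ from the module's real implementation.

```typescript
// Hypothetical rule shape, mirroring the fields used in the example rule.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
  impls?: string[];
}

interface RuleMatch {
  transformedUrl: string;
  impls?: string[];
}

// Return the first rule whose pattern matches the URL.
// If the rule defines urlTransform, capture groups ($1, $2, ...) are substituted into it.
function applyUrlRules(url: string, rules: UrlRule[]): RuleMatch | undefined {
  for (const rule of rules) {
    const regex = new RegExp(rule.urlPattern);
    if (!regex.test(url)) continue;
    const transformedUrl = rule.urlTransform ? url.replace(regex, rule.urlTransform) : url;
    return { transformedUrl, impls: rule.impls };
  }
  return undefined;
}

const rules: UrlRule[] = [
  {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
    impls: ['naive'],
  },
];

const match = applyUrlRules('https://example.com/articles/hello-world', rules);
console.log(match?.transformedUrl); // prints "https://example.com/print/hello-world"
```

Returning the first match keeps rule order meaningful in this sketch: more specific patterns would be listed before more general ones.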
Rule Submission Process
- Fork the LobeChat repository
- Add or modify URL rules
- Submit a Pull Request describing:
  - Target website characteristics
  - Problems solved by the rule
  - Test cases (example URLs)
📌 Note
This is an internal module of LobeHub ("private": true), designed specifically for LobeChat and not published as a standalone package.