	@lobechat/web-crawler
LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
📝 Introduction
@lobechat/web-crawler is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
🛠️ Core Features
- Intelligent Content Extraction: Identifies the main content using Mozilla's Readability algorithm
- Multi-level Crawling Strategy: Supports multiple crawling implementations, including basic crawling, Jina, and Browserless rendering
- Custom URL Rules: Handles site-specific crawling logic through a flexible rule system
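
One way to picture the multi-level strategy is as an ordered fallback: try each implementation in turn and keep the first usable result. The sketch below is illustrative only and is not taken from the module's source. `CrawlImpl`, `crawlWithFallback`, and the stub implementations are hypothetical names, and the real crawlers may be selected or combined differently.

```typescript
// Hypothetical signature for a crawling implementation: URL in, Markdown out.
type CrawlImpl = (url: string) => Promise<string>;

// Illustrative helper (an assumption, not the module's API): tries each
// implementation in the given order and returns the first non-empty result.
async function crawlWithFallback(
  url: string,
  order: string[],
  impls: Record<string, CrawlImpl>,
): Promise<{ impl: string; content: string }> {
  const errors: string[] = [];
  for (const name of order) {
    const impl = impls[name];
    if (!impl) {
      errors.push(`${name}: not registered`);
      continue;
    }
    try {
      const content = await impl(url);
      if (content.trim().length > 0) return { impl: name, content };
      errors.push(`${name}: empty content`);
    } catch (error) {
      errors.push(`${name}: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  throw new Error(`All implementations failed for ${url}: ${errors.join('; ')}`);
}

// Stub implementations standing in for the real crawlers.
const stubImpls: Record<string, CrawlImpl> = {
  naive: async () => {
    throw new Error('blocked by site');
  },
  jina: async (url) => `# Page fetched from ${url}`,
  browserless: async (url) => `# Rendered ${url}`,
};

const fallbackDemo = crawlWithFallback(
  'https://example.com/articles/1',
  ['naive', 'jina', 'browserless'],
  stubImpls,
);
fallbackDemo.then((result) => console.log(result.impl)); // logs the name of the first implementation that succeeded
```

With these stubs, the `naive` attempt throws, so the result comes from the `jina` stub and `browserless` is never called.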
 
🤝 Contribution
Website structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can help improve the module as follows.
How to Contribute URL Rules
- Add new rules to the urlRules.ts file
- Rule example:
 
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',
    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',
    // Optional: specify crawling implementation, supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],
    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable Readability algorithm for filtering distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
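
To make the interaction between `urlPattern` and `urlTransform` concrete, here is a minimal sketch of how a rule like the one above could be applied to an incoming URL. It assumes the pattern is compiled as a regular expression and that `$1`, `$2`, and so on refer to its capture groups. `UrlRule`, `ResolvedUrl`, and `resolveUrl` are hypothetical names for illustration, and the module's actual matching logic may differ.

```typescript
// Hypothetical rule shape, mirroring the fields in the example above.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
  impls?: string[];
}

interface ResolvedUrl {
  url: string;
  impls?: string[];
}

// Illustrative helper (an assumption, not the module's API): finds the first
// rule whose pattern matches and applies its optional transform.
function resolveUrl(input: string, rules: UrlRule[]): ResolvedUrl {
  for (const rule of rules) {
    const match = new RegExp(rule.urlPattern).exec(input);
    if (!match) continue;
    if (!rule.urlTransform) return { url: input, impls: rule.impls };
    // Substitute $1, $2, ... with the corresponding capture groups.
    const url = rule.urlTransform.replace(
      /\$(\d+)/g,
      (_, index: string) => match[Number(index)] ?? '',
    );
    return { url, impls: rule.impls };
  }
  // No rule matched: crawl the URL unchanged.
  return { url: input };
}

const exampleRules: UrlRule[] = [
  {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
    impls: ['naive'],
  },
];

const resolved = resolveUrl('https://example.com/articles/hello-world', exampleRules);
console.log(resolved.url); // https://example.com/print/hello-world
```

Note that an unescaped `.` in a pattern string matches any character, so a stricter rule would write `example\\.com` in the string literal.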
Rule Submission Process
- Fork the LobeChat repository
- Add or modify URL rules
- Submit a Pull Request describing:
  - Target website characteristics
  - Problems solved by the rule
  - Test cases (example URLs)

📌 Note
This is an internal module of LobeHub ("private": true), designed specifically for LobeChat and not published as a standalone package.