WebBaseLoader. This approach is ideal for building chatbots or knowledge systems that rely on up-to-date web data.

Prerequisites
Make sure you have installed the community document loaders (see the install command below). You also need network access to fetch external URLs.
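One common way to install them (a minimal sketch; adjust to your own environment and package manager):

```bash
pip install -U langchain-community beautifulsoup4
```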
Step 1: Fetching Web Content with WebBaseLoader
LangChain’s `WebBaseLoader` retrieves the full text of a page along with rich metadata (title, description, source URL, and more). Here’s a simple example:
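A minimal sketch; the URL below is a placeholder (the metadata examples later in this section come from a Verge article about Meta’s AI assistant):

```python
from langchain_community.document_loaders import WebBaseLoader

# Point the loader at the page you want to ingest (placeholder URL).
loader = WebBaseLoader("https://www.example.com/some-article")

# load() fetches the page and parses it into a list of Document objects.
data = loader.load()
```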
`data` will be a list containing a single `Document` object:
- Raw text: `data[0].page_content`
- Metadata fields: `data[0].metadata["title"]`, `data[0].metadata["source"]`, etc.
Understanding the Loaded Document
Here’s a quick overview of the key metadata fields provided by `WebBaseLoader`:
| Metadata Field | Description | Example |
|---|---|---|
| `source` | The original URL of the webpage | https://www.theverge.com/2024/4/18/...-rival |
| `title` | The HTML `<title>` content | Meta releases new AI assistant powered by Llama 3 model |
| `description` | The page’s meta description (if available) | Meta’s AI assistant brings Llama 3 to ChatGPT competition. |
| `date` | Publication date (if parseable) | 2024-04-18 |
Step 2: Preparing for Text Splitting
Once the page is loaded into a `Document` object, the next step is to split its contents into manageable chunks. This enables efficient embedding, similarity search, and retrieval in downstream applications.
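Here is a sketch using `RecursiveCharacterTextSplitter`; the `chunk_size` and `chunk_overlap` values below are illustrative starting points, not prescriptions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split each Document into overlapping chunks of roughly 500 characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(data)

print(len(chunks))         # number of chunks produced
print(chunks[0].metadata)  # source metadata is carried over to every chunk
```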
Splitting too aggressively (very small chunks) can degrade context. Tune `chunk_size` and `chunk_overlap` according to your application’s needs.

Next Steps
With your webpage content properly loaded and chunked, you can:
- Generate embeddings for semantic search
- Build a conversational agent over the content
- Index and query using a vector database, as shown in the sketch below
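As a brief sketch of the embedding and indexing steps, the example below stores the chunks in a FAISS index with OpenAI embeddings; both the embedding model and the vector store are assumptions here, and any LangChain-compatible alternatives can be swapped in:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Requires: pip install faiss-cpu langchain-openai, plus an OPENAI_API_KEY.
# Embed each chunk and index it in an in-memory FAISS store.
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve the chunks most relevant to an illustrative question about the page.
for doc in vectorstore.similarity_search("What did Meta announce?", k=3):
    print(doc.metadata["source"], doc.page_content[:200])
```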