Loading Webpages
In this tutorial, you’ll learn how to fetch and process the contents of a live webpage using LangChain’s WebBaseLoader. This approach is ideal for building chatbots or knowledge systems that rely on up-to-date web data.
The Verge recently published an in-depth article on Meta’s AI assistant powered by the new Llama 3 model. We’ll use LangChain’s web loader to pull both the text and its metadata for downstream processing.
Prerequisites
Note
Make sure you have installed the community document loaders, along with BeautifulSoup, which WebBaseLoader uses to parse HTML:

```sh
pip install langchain_community beautifulsoup4
```
You also need network access to fetch external URLs.
Step 1: Fetching Web Content with WebBaseLoader
LangChain’s WebBaseLoader retrieves the full text of a page along with rich metadata (title, description, source URL, and more). Here’s a simple example:
```python
from langchain_community.document_loaders import WebBaseLoader

# The Verge article on Meta's AI assistant
URL = "https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival"

loader = WebBaseLoader(URL)
data = loader.load()  # returns a list of Document objects
```
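As a side note, WebBaseLoader also accepts a list of URLs and returns one Document per page; the second URL below is just an illustrative placeholder:

```python
# Load several pages in one call; returns one Document per URL
multi_loader = WebBaseLoader([URL, "https://www.theverge.com/"])
pages = multi_loader.load()
```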
After executing the code above, `data` will be a list containing a single `Document` object:
```python
# Confirm we have one document
len(data)   # => 1

# Inspect the first Document
print(data[0])
# Document(
#     page_content="Meta releases new AI assistant powered by Llama 3 model - The Verge ...",
#     metadata={
#         "source": "https://www.theverge.com/2024/4/18/...-rival",
#         "title": "Meta releases new AI assistant powered by Llama 3 model",
#         "description": "Meta’s AI assistant brings Llama 3 to ChatGPT competition.",
#         "language": "en",
#         ...
#     }
# )
```
You can access:

- Raw text: `data[0].page_content`
- Metadata fields: `data[0].metadata["title"]`, `data[0].metadata["source"]`, etc.
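For a quick sanity check, you can print a slice of the text and a couple of metadata fields (truncating page_content keeps the output readable):

```python
# Preview the first 300 characters of the article text
print(data[0].page_content[:300])

# Pull out individual metadata fields
print(data[0].metadata["title"])
print(data[0].metadata["source"])
```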
Understanding the Loaded Document
Here’s a quick overview of the key metadata fields provided by `WebBaseLoader`:
| Metadata Field | Description | Example |
|---|---|---|
| `source` | The original URL of the webpage | https://www.theverge.com/2024/4/18/...-rival |
| `title` | The HTML `<title>` content | Meta releases new AI assistant powered by Llama 3 model |
| `description` | The page’s meta description (if available) | Meta’s AI assistant brings Llama 3 to ChatGPT competition. |
| `language` | The page’s language, taken from the `<html lang>` attribute (if present) | en |
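To see exactly which fields the loader captured for your page, you can iterate over the metadata dictionary directly:

```python
# Print every metadata field the loader extracted
for key, value in data[0].metadata.items():
    print(f"{key}: {value}")
```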
Step 2: Preparing for Text Splitting
Once the page is loaded into a `Document` object, the next step is to split its contents into manageable chunks. This enables efficient embedding, similarity search, and retrieval in downstream applications.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on natural boundaries (paragraphs, then sentences) into ~1000-character
# chunks, with 200 characters of overlap so context isn't lost at chunk edges
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

docs = text_splitter.split_documents(data)
print(f"Split into {len(docs)} chunks")
```
Each chunk can now be embedded and stored in a vector database or used directly in a retrieval-augmented generation (RAG) pipeline.
Warning
Splitting too aggressively (very small chunks) can degrade context. Tune `chunk_size` and `chunk_overlap` according to your application’s needs.
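To make the downstream step concrete, here is a minimal sketch of embedding the chunks and running a similarity search. It assumes `langchain-openai` and `faiss-cpu` are installed and an `OPENAI_API_KEY` is set in the environment; any embedding model and vector store supported by LangChain would work the same way:

```python
# Minimal indexing sketch (assumes: pip install langchain-openai faiss-cpu,
# plus an OPENAI_API_KEY in the environment)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Embed each chunk and index it in an in-memory FAISS store
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())

# Retrieve the chunks most relevant to a question
for doc in vectorstore.similarity_search("What model powers Meta's AI assistant?", k=3):
    print(doc.page_content[:200])
```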
Next Steps
With your webpage content properly loaded and chunked, you can:
- Generate embeddings for semantic search
- Build a conversational agent over the content
- Index and query using a vector database