Loading Webpages

In this tutorial, you’ll learn how to fetch and process the contents of a live webpage using LangChain’s WebBaseLoader. This approach is ideal for building chatbots or knowledge systems that rely on up-to-date web data.

The Verge recently published an in-depth article on Meta’s AI assistant powered by the new Llama 3 model. We’ll use LangChain’s web loader to pull both the text and its metadata for downstream processing.

Prerequisites

Note

Make sure you have installed the community document loaders, along with BeautifulSoup, which WebBaseLoader uses to parse HTML:

pip install langchain_community beautifulsoup4

You also need network access to fetch external URLs.
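
Recent versions of WebBaseLoader also warn when no USER_AGENT environment variable is set, and some sites reject requests without one. One way to set it before loading (the identifier string here is just an example):

import os

# Identify your client to the sites you fetch (example value; choose your own)
os.environ["USER_AGENT"] = "langchain-tutorial/1.0"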

Step 1: Fetching Web Content with WebBaseLoader

LangChain’s WebBaseLoader retrieves the full text of a page along with rich metadata (title, description, source URL, and more). Here’s a simple example:

from langchain_community.document_loaders import WebBaseLoader

URL = "https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival"
loader = WebBaseLoader(URL)
data = loader.load()

After executing the code above, data will be a list containing a single Document object:

# Confirm we have exactly one Document
print(len(data))  # 1
# Inspect the first Document
print(data[0])
# Document(
#   page_content="Meta releases new AI assistant powered by Llama 3 model - The Verge ...",
#   metadata={
#     "source": "https://www.theverge.com/2024/4/18/24133808/...-rival",
#     "title": "Meta releases new AI assistant powered by Llama 3 model",
#     "description": "Meta’s AI assistant brings Llama 3 to ChatGPT competition.",
#     "date": "2024-04-18",
#     ...
#   }
# )

You can access:

  • Raw text: data[0].page_content
  • Metadata fields: data[0].metadata["title"], data[0].metadata["source"], etc.
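
For example, to print the article title and a short preview of the body text:

# Title from the page metadata, then the first 200 characters of the raw text
print(data[0].metadata["title"])
print(data[0].page_content[:200])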

Understanding the Loaded Document

Here’s a quick overview of the key metadata fields provided by WebBaseLoader:

Metadata Field | Description | Example
source | The original URL of the webpage | https://www.theverge.com/2024/4/18/...-rival
title | The HTML <title> content | Meta releases new AI assistant powered by Llama 3 model
description | The page’s meta description (if available) | Meta’s AI assistant brings Llama 3 to ChatGPT competition.
language | The page’s language code, from the HTML lang attribute (if present) | en
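
To see exactly which fields the loader returned for a given page, iterate over the metadata dictionary:

# Print every metadata key/value pair on the loaded Document
for key, value in data[0].metadata.items():
    print(f"{key}: {value}")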

Step 2: Preparing for Text Splitting

Once the page is loaded into a Document object, the next step is to split its contents into manageable chunks. This enables efficient embedding, similarity search, and retrieval in downstream applications.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200   # characters shared between adjacent chunks to preserve context
)
docs = text_splitter.split_documents(data)
print(f"Split into {len(docs)} chunks")

Each chunk can now be embedded and stored in a vector database or used directly in a retrieval-augmented generation (RAG) pipeline.
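
As a minimal sketch of that next step (assuming the langchain-openai and chromadb packages are installed and an OPENAI_API_KEY is set in the environment), the chunks can be embedded into a local Chroma vector store:

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed every chunk and index it in a local Chroma vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())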

Warning

Splitting too aggressively (very small chunks) can degrade context. Tune chunk_size and chunk_overlap according to your application needs.

Next Steps

With your webpage content properly loaded and chunked, you can:

  • Generate embeddings for semantic search
  • Build a conversational agent over the content
  • Index and query using a vector database (sketched below)
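
Continuing the Chroma sketch from Step 2, a similarity search over the indexed chunks might look like this (the query string is only an example):

# Retrieve the chunks most relevant to a natural-language question
results = vectorstore.similarity_search(
    "What platforms is Meta's AI assistant available on?",
    k=3,
)
print(results[0].page_content)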
