Add Web Scraping Capabilities to AI with LangChain

This guide focuses on BrowserlessLoader, which is useful for scraping and document ingestion. If you need full browser control or multi-step automation, start with BrowserQL or one of the agent-focused guides.

LangChain is a framework for developing applications powered by language models. By integrating Browserless with LangChain, you can give your AI applications web scraping and content processing capabilities without managing browser infrastructure.

Prerequisites

Python 3.8 or higher
An active Browserless API token (available in your account dashboard)
Basic understanding of LangChain concepts

Step-by-Step Setup

1. Get your API token

Go to your Browserless account dashboard and copy your API token.

Then set the BROWSERLESS_API_TOKEN environment variable, either in a .env file or on the command line.

.env file:

BROWSERLESS_API_TOKEN=your-token-here

Command line:

export BROWSERLESS_API_TOKEN=your-token-here
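A missing token is easy to overlook because os.getenv simply returns None. The sketch below shows one way to fail early with a clear message. The helper name require_token is hypothetical and not part of Browserless or LangChain.

```python
import os


def require_token(env=os.environ, name="BROWSERLESS_API_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or blank."""
    token = (env.get(name) or "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return token
```

Call it after load_dotenv() and pass the result as api_token when you build the loader.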

2. Create a virtual environment

Set up a Python virtual environment to manage your dependencies, using either venv or conda.

venv:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

conda:

conda create -n langchain-env python=3.8
conda activate langchain-env

3. Install required packages

Install LangChain and the other required packages, using either pip or Poetry.

pip:

pip install langchain-community python-dotenv

Poetry:

poetry add langchain-community python-dotenv

4. Create your first script

Create a file named scraper.py with the following complete code:

from dotenv import load_dotenv
import os
from langchain_community.document_loaders import BrowserlessLoader

def main():
    # Load environment variables
    load_dotenv()
    
    # Initialize the loader with your API token
    loader = BrowserlessLoader(
        api_token=os.getenv("BROWSERLESS_API_TOKEN"),
        urls=["https://example.com"],
        text_content=True  # Get text content instead of raw HTML
    )

    # Load and process the documents
    documents = loader.load()
    
    # Print the results
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")

if __name__ == "__main__":
    main()

5. Run your application

Run your application with the following command:

python scraper.py

You should see output showing the scraped content from the example website.

How It Works

1. Connection Setup: BrowserlessLoader connects to Browserless using your API token

2. Content Loading: The loader fetches and processes web content

3. Document Creation: Content is converted into LangChain Documents

4. Processing: Documents can be further processed with LangChain's tools
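The four steps above can be pictured with a small stand-alone sketch. This is not the library's implementation: the Document class is a stand-in for LangChain's Document, and fake_fetch is a hypothetical placeholder for the request the loader sends to Browserless.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_fetch(url, api_token):
    """Hypothetical placeholder for the HTTP call the loader makes to Browserless."""
    return f"Rendered text of {url}"


def load(urls, api_token, fetch=fake_fetch):
    # Steps 1-2: use the token to fetch each URL's content.
    # Step 3: wrap each result in a Document that records its source URL.
    return [Document(fetch(url, api_token), {"source": url}) for url in urls]


docs = load(["https://example.com"], api_token="your-token-here")
# Step 4: downstream code works with page_content and metadata.
print(docs[0].metadata["source"], len(docs[0].page_content))
```

The point of the sketch is the shape of the result: one Document per URL, with the URL kept under metadata["source"], which is what the scripts in this guide read back.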

Advanced Configuration

Multiple URLs

Process multiple websites in a single operation by passing a list of URLs (api_token here holds the token you loaded earlier):

loader = BrowserlessLoader(
    api_token=api_token,
    urls=[
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]
)
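When several URLs are loaded at once, it can help to look documents up by their source and to spot URLs that produced nothing. The following is a minimal sketch, assuming each document exposes page_content and a metadata dict with a "source" key, as in the script above. Both helper names are hypothetical.

```python
def index_by_source(documents):
    """Map each document's metadata['source'] to its page_content."""
    return {doc.metadata.get("source"): doc.page_content for doc in documents}


def missing_sources(urls, documents):
    """Return the requested URLs that have no corresponding document."""
    found = {doc.metadata.get("source") for doc in documents}
    return [url for url in urls if url not in found]
```

After documents = loader.load(), missing_sources(urls, documents) gives a list you can retry or log.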

Raw HTML Mode

Get raw HTML content instead of text:

loader = BrowserlessLoader(
    api_token=api_token,
    urls=["https://example.com"],
    text_content=False
)
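Raw HTML is useful when you need page structure that plain text discards, such as links. As an illustration, the sketch below pulls absolute link targets out of an HTML string with Python's standard html.parser module. The names LinkExtractor and extract_links are hypothetical and not part of LangChain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute href targets from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

With documents loaded in raw HTML mode, a call such as extract_links(doc.page_content, doc.metadata["source"]) would return the links on that page.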

Performance Optimization

Batch Processing
- Process multiple URLs in batches
- Implement proper error handling
- Use async/await for better performance

Resource Management
- Monitor memory usage
- Implement proper cleanup
- Handle timeouts appropriately
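Batching and error handling can be combined in a small wrapper. The sketch below is one possible approach, not a Browserless or LangChain feature: load_batch is a callable you supply (an assumption here), and a failing batch is recorded instead of stopping the whole run.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_in_batches(urls, load_batch, batch_size=5):
    """Call load_batch on each batch; collect results and record failed batches."""
    documents, failures = [], []
    for batch in batched(urls, batch_size):
        try:
            documents.extend(load_batch(batch))
        except Exception as exc:  # keep going so one bad batch does not stop the run
            failures.append((batch, exc))
    return documents, failures
```

In practice load_batch could be a small function that builds BrowserlessLoader(api_token=api_token, urls=batch) and returns its load() result. The failures list then tells you which URLs to retry.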

Security Best Practices

API Token Management
- Never commit tokens to version control
- Use environment variables
- Rotate tokens regularly

Input Validation
- Validate URLs before processing
- Implement rate limiting
- Handle sensitive data appropriately
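URL validation can be done with the standard library before anything is sent to the loader. The sketch below accepts only http and https URLs that have a hostname, with an optional host allowlist. The function name and the allowed_hosts parameter are hypothetical, and a real policy may need stricter rules.

```python
from urllib.parse import urlparse


def is_allowed_url(url, allowed_hosts=None):
    """Accept only http(s) URLs with a hostname, optionally limited to allowed_hosts."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    return allowed_hosts is None or parsed.hostname in allowed_hosts
```

Filtering with [u for u in urls if is_allowed_url(u)] before building the loader keeps schemes such as file:// out of your requests.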

Common Use Cases

News Aggregation

from langchain_community.document_loaders import BrowserlessLoader

def aggregate_news(api_token, news_sites):
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=news_sites,
        text_content=True
    )
    documents = loader.load()
    
    # Process and analyze the news content
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")

Content Analysis

from langchain_community.document_loaders import BrowserlessLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def analyze_content(api_token, url):
    # Load content
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=[url],
        text_content=True
    )
    documents = loader.load()
    
    # Split content into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = text_splitter.split_documents(documents)
    
    # Process chunks
    for chunk in chunks:
        print(f"Chunk: {chunk.page_content[:100]}...")
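To see what chunk_size and chunk_overlap mean, the sketch below splits a string into fixed-size windows in which neighbouring chunks share some characters. This is a simplified illustration only: RecursiveCharacterTextSplitter also tries to split on separators such as paragraphs and lines, which this hypothetical sliding_chunks function does not do.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]
```

The overlap means a sentence cut at one chunk boundary still appears intact at the start of the next chunk, at the cost of some duplicated text.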

For more advanced usage scenarios, refer to the LangChain and Browserless documentation.
