Add Web Scraping Capabilities to AI with LangChain
This guide focuses on BrowserlessLoader, which is useful for scraping and document ingestion. If you need full browser control or multi-step automation, start with BrowserQL or one of the agent-focused guides.
LangChain is a framework for developing applications powered by language models. By integrating Browserless with LangChain, you can give your AI applications web scraping and content processing capabilities without managing browser infrastructure.
Prerequisites
- Python 3.8 or higher
- An active Browserless API token (available in your account dashboard)
- Basic understanding of LangChain concepts
Step-by-Step Setup
Go to your Browserless account dashboard and copy your API token.
Then set the BROWSERLESS_API_TOKEN environment variable, either in a .env file or from the command line:
- .env file
    BROWSERLESS_API_TOKEN=your-token-here
- Command line
    export BROWSERLESS_API_TOKEN=your-token-here
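A missing or blank token tends to show up only later, as a failed request, so it can help to check for it up front. The following is a minimal sketch using only the standard library; the helper name require_token is hypothetical and is not part of Browserless or LangChain.

```python
import os


def require_token(name: str = "BROWSERLESS_API_TOKEN") -> str:
    """Return the token stored in the named environment variable.

    Raises RuntimeError if the variable is unset or contains only whitespace.
    """
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell"
        )
    return token
```

If you keep the token in a .env file, call load_dotenv() before require_token() so the variable is present in the environment.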
Set up a Python virtual environment to manage your dependencies:
- venv
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
- conda
    conda create -n langchain-env python=3.8
    conda activate langchain-env
Install LangChain and other required packages:
- pip
    pip install langchain-community python-dotenv
- Poetry
    poetry add langchain-community python-dotenv
Create a file named scraper.py with the following complete code:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import BrowserlessLoader


def main():
    # Load environment variables from the .env file
    load_dotenv()

    # Initialize the loader with your API token
    loader = BrowserlessLoader(
        api_token=os.getenv("BROWSERLESS_API_TOKEN"),
        urls=["https://example.com"],
        text_content=True,  # Get text content instead of raw HTML
    )

    # Load and process the documents
    documents = loader.load()

    # Print the source URL and the first 200 characters of each document
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")


if __name__ == "__main__":
    main()
Run your application with the following command:
python scraper.py
You should see output showing the scraped content from the example website.
How It Works
1. Connection Setup: BrowserlessLoader connects to Browserless using your API token
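2. Content Loading: The loader fetches each URL through Browserless and returns either the page text (text_content=True) or the raw HTML (text_content=False)
3. Document Creation: Each page is converted into a LangChain Document, with the content in page_content and the URL stored in metadata under source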
4. Processing: Documents can be further processed with LangChain's tools
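The steps above can be sketched without any network access. This is an illustration of the data flow only, not the loader's actual implementation: SimpleDocument is a stand-in for LangChain's Document, and fake_fetch is a hypothetical placeholder for the request that BrowserlessLoader sends to Browserless.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class SimpleDocument:
    """Stand-in for a LangChain Document: text plus metadata."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_documents(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> List[SimpleDocument]:
    """Fetch each URL and wrap the result in a document tagged with its source."""
    return [
        SimpleDocument(page_content=fetch(url), metadata={"source": url})
        for url in urls
    ]


def fake_fetch(url: str) -> str:
    # Placeholder for the network call; returns canned text instead of scraping.
    return f"Text scraped from {url}"


docs = load_documents(["https://example.com"], fake_fetch)
print(docs[0].metadata["source"], "->", docs[0].page_content)
```

Downstream LangChain tools read the same two fields, page_content and metadata, which is why the later examples in this guide only ever touch those attributes.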
Advanced Configuration
Multiple URLs
Process multiple websites in a single operation:
loader = BrowserlessLoader(
    api_token=api_token,
    urls=[
        "https://example1.com",
        "https://example2.com",
        "https://example3.com",
    ],
)
Raw HTML Mode
Get raw HTML content instead of text:
loader = BrowserlessLoader(
    api_token=api_token,
    urls=["https://example.com"],
    text_content=False,
)
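Raw HTML is mainly useful when you need markup that plain text does not carry, such as link targets. As a rough sketch, the standard library's html.parser can pull href values out of a page. The names LinkExtractor and extract_links are hypothetical, and sample_html stands in for the doc.page_content you would get from a loader created with text_content=False.

```python
from html.parser import HTMLParser
from typing import List


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return all anchor hrefs in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Stand-in for doc.page_content in raw HTML mode
sample_html = (
    '<p>See <a href="/docs">docs</a> and '
    '<a href="https://example.com">home</a>.</p>'
)
print(extract_links(sample_html))
```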
Performance Optimization
- Batch Processing
  - Process multiple URLs in batches
  - Implement proper error handling
  - Use async/await for better performance
- Resource Management
  - Monitor memory usage
  - Implement proper cleanup
  - Handle timeouts appropriately
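One way to combine batching with error handling is to load URLs in small groups and record the groups that fail instead of aborting the whole run. The sketch below assumes a caller-supplied load_batch function; batched and load_in_batches are hypothetical helper names, not part of LangChain or Browserless.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(urls: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of urls containing at most size items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def load_in_batches(
    urls: Sequence[str],
    load_batch: Callable[[List[str]], list],
    batch_size: int = 5,
) -> Tuple[list, List[Tuple[List[str], str]]]:
    """Load URLs batch by batch, collecting failures instead of stopping.

    Returns (documents, failures), where each failure is (batch, error message).
    """
    documents: list = []
    failures: List[Tuple[List[str], str]] = []
    for batch in batched(urls, batch_size):
        try:
            documents.extend(load_batch(batch))
        except Exception as exc:  # one bad batch should not stop the rest
            failures.append((batch, str(exc)))
    return documents, failures
```

With the loader from this guide, load_batch could be lambda batch: BrowserlessLoader(api_token=api_token, urls=batch).load(). Smaller batches limit how much work is lost when a single URL fails, at the cost of creating more loader instances.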
Security Best Practices
- API Token Management
  - Never commit tokens to version control
  - Use environment variables
  - Rotate tokens regularly
- Input Validation
  - Validate URLs before processing
  - Implement rate limiting
  - Handle sensitive data appropriately
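URL validation can be as simple as checking the scheme and host before a URL ever reaches the loader. This is a minimal sketch built on urllib.parse; is_valid_url and filter_urls are hypothetical helper names, and accepting only http and https is an assumption you may want to adjust.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    try:
        parsed = urlparse(url.strip())
        return parsed.scheme in ("http", "https") and bool(parsed.hostname)
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 hosts
        return False


def filter_urls(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split urls into (valid, rejected), preserving input order."""
    valid: List[str] = []
    rejected: List[str] = []
    for url in urls:
        (valid if is_valid_url(url) else rejected).append(url)
    return valid, rejected
```

Passing only the valid list to BrowserlessLoader keeps malformed input from consuming requests, and the rejected list can be logged for review.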
Common Use Cases
News Aggregation
from langchain_community.document_loaders import BrowserlessLoader


def aggregate_news(api_token, news_sites):
    # Load the text content of every news site in one call
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=news_sites,
        text_content=True,
    )
    documents = loader.load()

    # Process and analyze the news content
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")
Content Analysis
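from langchain_community.document_loaders import BrowserlessLoader
# Requires the langchain-text-splitters package; older LangChain releases
# expose the same class as langchain.text_splitter.RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter


def analyze_content(api_token, url):
    # Load content
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=[url],
        text_content=True,
    )
    documents = loader.load()

    # Split content into chunks of up to 1000 characters, overlapping by up to 200
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = text_splitter.split_documents(documents)

    # Process chunks
    for chunk in chunks:
        print(f"Chunk: {chunk.page_content[:100]}...")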
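To build intuition for chunk_size and chunk_overlap, here is a deliberately simplified sliding-window splitter. It is not the algorithm RecursiveCharacterTextSplitter uses (that class splits on separators such as paragraph and line breaks before falling back to smaller units); sliding_chunks is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk in the printed list begins with the last two characters of the previous one. The overlap plays the same role at the larger sizes used above: text that straddles a chunk boundary still appears intact in at least one chunk.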
For more advanced usage scenarios, please refer to: