Parsing Libraries

BrowserQL retrieves the rendered HTML from any page so you can hand it to parsing libraries like Beautiful Soup, Scrapy, or Cheerio. Use it to handle JavaScript rendering, bot protection, or any interaction required before the HTML you want to parse becomes available.

tip

Use the BQL Editor to build and test your queries before integrating them into code.

Example

The query below navigates to https://browserless.io/, clicks the Pricing link, and returns the page HTML.

mutation RetrieveHTML {
  goto(url: "https://browserless.io/") {
    status
  }
  click(selector: "a[href=\"/pricing\"]") {
    time
  }
  html {
    html
  }
}

Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML. Use it when you need to extract structured data from pages that require real browser interactions before the content is available.

import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
token = 'YOUR_API_TOKEN_HERE'
timeout = 5 * 60 * 1000  # five minutes, in milliseconds

# Raw string so the \" escapes survive Python's own escape
# processing and reach GraphQL intact.
query = r'''
mutation RetrieveHTML($url: String!) {
  goto(url: $url) {
    status
  }
  click(selector: "a[href=\"/pricing\"]") {
    time
  }
  html {
    html
  }
}
'''

variables = {"url": url}
endpoint = f'https://production-sfo.browserless.io/chromium/bql?timeout={timeout}&token={token}'

payload = {"query": query, "variables": variables}

response = requests.post(endpoint, json=payload, headers={'content-type': 'application/json'})
html_content = response.json()['data']['html']['html']

# Parse the rendered HTML and collect the pricing tags.
soup = BeautifulSoup(html_content, 'html.parser')
plans = [tag.text.strip() for tag in soup.find_all('div', class_='tag_price margin-bottom margin-large')]

print(plans)
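Indexing straight into `['data']['html']['html']` raises a `KeyError` whenever the mutation fails, because GraphQL reports failures in a top-level `errors` list instead of `data`. A minimal defensive sketch (the helper name and error messages are illustrative, not part of the BQL API):

```python
def extract_html(payload: dict) -> str:
    """Pull the rendered HTML out of a BQL GraphQL response,
    raising a descriptive error if the mutation failed."""
    # Per the GraphQL spec, failures appear in a top-level "errors" list.
    if payload.get('errors'):
        messages = '; '.join(e.get('message', 'unknown') for e in payload['errors'])
        raise RuntimeError(f'BQL mutation failed: {messages}')
    data = payload.get('data') or {}
    html_field = (data.get('html') or {}).get('html')
    if html_field is None:
        raise RuntimeError('BQL response contained no html field')
    return html_field

# Successful response shape:
print(extract_html({'data': {'html': {'html': '<html></html>'}}}))
```

Calling `extract_html(response.json())` in place of the direct indexing above surfaces BQL errors (bad selector, navigation timeout) as readable exceptions instead of `KeyError`s.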

Scrapy

Scrapy is a Python web crawling framework with a powerful selector and data pipeline system. Pair it with Browserless to scrape pages that block traditional crawlers.

import requests
from scrapy.selector import Selector

url = 'https://browserless.io/'
token = 'YOUR_API_TOKEN_HERE'
timeout = 5 * 60 * 1000  # five minutes, in milliseconds

# Raw string so the \" escapes survive Python's own escape
# processing and reach GraphQL intact.
query = r'''
mutation RetrieveHTML($url: String!) {
  goto(url: $url) {
    status
  }
  click(selector: "a[href=\"/pricing\"]") {
    time
  }
  html {
    html
  }
}
'''

variables = {"url": url}
endpoint = f'https://production-sfo.browserless.io/chromium/bql?timeout={timeout}&token={token}'

payload = {"query": query, "variables": variables}

response = requests.post(endpoint, json=payload, headers={'content-type': 'application/json'})
html_content = response.json()['data']['html']['html']

# Extract the pricing text with a CSS selector.
selector = Selector(text=html_content)
plans = selector.css('.tag_price.margin-bottom.margin-large::text').getall()

print(plans)

Cheerio

Cheerio is a jQuery-like HTML parser for Node.js. Use it to extract data from HTML returned by Browserless without running a second browser instance.

import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

const url = 'https://browserless.io/';
const token = 'YOUR_API_TOKEN_HERE';
const timeout = 5 * 60 * 1000; // five minutes, in milliseconds

const queryParams = new URLSearchParams({ timeout, token }).toString();

// Double backslashes so the \" escapes survive the template
// literal and reach GraphQL intact.
const query = `
mutation RetrieveHTML($url: String!) {
  goto(url: $url) {
    status
  }
  click(selector: "a[href=\\"/pricing\\"]") {
    time
  }
  html {
    html
  }
}
`;

const endpoint = `https://production-sfo.browserless.io/chromium/bql?${queryParams}`;

const options = {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ query, variables: { url } }),
};

(async () => {
  try {
    const response = await fetch(endpoint, options);
    const { data } = await response.json();

    // Load the rendered HTML and collect the pricing tags.
    const $ = cheerio.load(data.html.html);
    const plans = [];
    $('.tag_price.margin-bottom.margin-large').each((_, element) => {
      plans.push($(element).text().trim());
    });

    console.log(plans);
  } catch (error) {
    console.error('Error fetching or parsing HTML:', error);
  }
})();