Beautiful Soup (Python)
Beautiful Soup is one of the most popular scraping libraries for Python. It parses HTML strings so you can extract structured data. Use it alongside the /content API or BrowserQL to scrape any website.
Both APIs render the page in a browser before returning HTML. BrowserQL adds advanced stealth techniques to bypass bot detectors first.
Basic Usage
Like Cheerio, Beautiful Soup is only a parser: it does not fetch HTML on its own. To get HTML from a website, you would use the requests library to download the page:
import requests
from bs4 import BeautifulSoup
url = 'https://browserless.io/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all('a')
for entry in entries:
    print(entry.text.strip())
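Beautiful Soup offers more targeted lookups than find_all('a'): find_all can filter by tag attributes, and select accepts CSS selectors. A minimal sketch using a static HTML snippet in place of a downloaded page (the snippet and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# A static snippet stands in for downloaded HTML so the parsing
# logic can be shown without a network call.
html = """
<nav>
  <a href="/docs" class="nav-item">Docs</a>
  <a href="/pricing" class="nav-item">Pricing</a>
  <a href="/blog">Blog</a>
</nav>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all filters by tag attributes, here the class...
nav_links = soup.find_all('a', class_='nav-item')
print([a.text for a in nav_links])  # ['Docs', 'Pricing']

# ...and select reaches the same elements via a CSS selector.
print([a['href'] for a in soup.select('nav a.nav-item')])  # ['/docs', '/pricing']
```

The same pattern is what the examples below rely on: once the HTML is in hand, everything else is selector work.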
The problem: requests downloads only the HTML source, so it cannot return content that depends on JavaScript or user interaction to render.
The /content API solves this. It renders and evaluates the page in a browser before returning the HTML. You can call it with the requests library:
import requests
from bs4 import BeautifulSoup
response = requests.post(
    "https://production-sfo.browserless.io/content",
    params={"token": "YOUR_API_TOKEN_HERE"},
    json={
        "waitForTimeout": 5000,
        "url": "https://puppeteer.github.io/pptr.dev/"
    }
)
soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all("a", class_="pptr-sidebar-item")
for entry in entries:
    print(entry.text.strip())
This example targets the old Puppeteer docs site, which relies heavily on JavaScript to render its content. A plain requests or cURL call would only download the page's source: the JavaScript would never run, so the content would never appear in the response.
With the /content API you also get access to all of its options, stealth mode, our residential proxies and more. For the full list, check our OpenAPI reference.
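As a sketch of what extra options can look like, the payload from the example above can carry additional fields alongside url and waitForTimeout. The "gotoOptions" field shown here is an assumption based on common Browserless usage; treat the OpenAPI reference as authoritative:

```python
# Request payload for the /content API. "url" and "waitForTimeout"
# come from the example above; "gotoOptions" is an assumed option
# that would be forwarded to the browser's page navigation.
payload = {
    "url": "https://puppeteer.github.io/pptr.dev/",
    "waitForTimeout": 5000,
    "gotoOptions": {"waitUntil": "networkidle2"},  # assumed, verify in OpenAPI
}

print(sorted(payload))  # ['gotoOptions', 'url', 'waitForTimeout']
```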
Bypass bot-blockers using /unblock
In cases where websites implement aggressive bot-detection mechanisms, you can use the /unblock API to bypass them. The /unblock API uses a variety of tools and strategies to override and hide the footprints headless browsers leave behind, letting you access bot-protected websites from a remote interface.
Similar to the /content API, the /unblock API renders and evaluates the page in a browser, but with extra stealth features. This makes it ideal for scraping highly protected websites.
import json
import requests
from bs4 import BeautifulSoup
response = requests.post(
    "https://production-sfo.browserless.io/unblock",
    params={"token": "YOUR_API_TOKEN_HERE"},
    json={
        "waitForTimeout": 5000,
        "url": "https://puppeteer.github.io/pptr.dev/"
    }
)
# Unlike /content, /unblock returns JSON; the rendered HTML is in the "content" field
html_content = json.loads(response.text)['content']
soup = BeautifulSoup(html_content, 'html.parser')
entries = soup.find_all("a", class_="pptr-sidebar-item")
for entry in entries:
    print(entry.text.strip())
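Because /unblock wraps the HTML in a JSON body, it can help to isolate the decode-then-parse step in a small helper. A minimal sketch, where extract_links is a hypothetical name and the canned body stands in for a real API response (its "content" key matches the example above):

```python
import json

from bs4 import BeautifulSoup


def extract_links(unblock_body, selector="a.pptr-sidebar-item"):
    """Decode an /unblock JSON body and return link texts.

    Assumes the rendered HTML sits under the "content" key,
    as in the example above.
    """
    html = json.loads(unblock_body)['content']
    soup = BeautifulSoup(html, 'html.parser')
    return [a.text.strip() for a in soup.select(selector)]


# A canned body stands in for a real /unblock response.
body = json.dumps({
    "content": '<a class="pptr-sidebar-item" href="#api">API</a>'
               '<a class="pptr-sidebar-item" href="#faq"> FAQ </a>'
})
print(extract_links(body))  # ['API', 'FAQ']
```

Keeping the parsing in one function makes it easy to unit-test the scraper without hitting the API on every run.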