Data Scraping and Extraction

BrowserQL offers three main approaches to extracting data: grab the page HTML with the html mutation, map DOM elements to structured JSON with mapSelector or querySelectorAll, or intercept raw API responses with the response mutation. Choose the approach that fits your downstream processing needs.

Basic Usage

The html mutation returns the full page HTML. Wait for the page to load before extracting to avoid empty results.

mutation ExtractHTML {
  goto(url: "https://www.browserless.io/", waitUntil: domContentLoaded) {
    status
  }

  html {
    html
  }
}
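
Once the HTML comes back, further processing usually happens in your own code. The sketch below is a minimal Python example that pulls link targets out of the returned markup using only the standard library. It assumes the response mirrors the query shape (data, then html, then html), as GraphQL responses do. The LinkCollector class, the extract_links function, and the sample payload are hypothetical and are not real output from browserless.io.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(bql_response):
    # Assumption: the response mirrors the query, so the markup
    # lives at data -> html -> html.
    page_html = bql_response["data"]["html"]["html"]
    collector = LinkCollector()
    collector.feed(page_html)
    return collector.links


# Hypothetical sample payload for illustration only.
sample = {
    "data": {
        "goto": {"status": 200},
        "html": {
            "html": '<nav><a href="/docs">Docs</a><a href="/pricing">Pricing</a></nav>'
        },
    }
}
links = extract_links(sample)
```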

Targeting a Specific Element

Pass a selector to return HTML from a single element instead of the full page:

html(selector: ".navbar_container") {
  html
}

Cleaning the HTML

The clean argument strips non-text nodes (scripts, video, canvas), DOM attributes, and excess whitespace. It can reduce payload size by up to 1,000x:

html(clean: {
  removeAttributes: true
  removeNonTextNodes: true
}) {
  html
}

Creating JSON with mapSelector

mapSelector is designed for pages with repetitive, hierarchical structure: product listings, comment threads, search results, or any other repeating pattern. It iterates over the matched elements, much like looping over the NodeList returned by document.querySelectorAll, and returns a structured array of objects. Use it to extract attributes, text content, or nested elements.

The query below navigates to Hacker News and extracts the href of every post link:

mutation ScrapeHackerNews {
  goto(
    url: "https://news.ycombinator.com"
    waitUntil: firstContentfulPaint
  ) {
    status
  }

  posts: mapSelector(selector: ".submission .titleline > a", wait: true) {
    link: attribute(name: "href") {
      value
    }
  }
}

Response

{
  "data": {
    "posts": [
      { "link": { "value": "https://churchofturing.github.io/landscapeoflisp.html" } },
      { "link": { "value": "https://www.jjj.de/fxt/fxtbook.pdf" } },
      { "link": { "value": "https://ereader-swedish.fly.dev/" } }
    ]
  }
}
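
Each post in the response above is wrapped in two levels of nesting (link, then value), so most consumers will want to flatten it. The following is a minimal Python sketch that does so. The post_links function name is hypothetical, and the payload is copied from the response shown above.

```python
def post_links(bql_response):
    """Flatten the mapSelector result into a plain list of URLs."""
    # Treat a null result as an empty list so callers can always iterate.
    posts = bql_response["data"].get("posts") or []
    return [
        post["link"]["value"]
        for post in posts
        if post.get("link") and post["link"].get("value")
    ]


response = {
    "data": {
        "posts": [
            {"link": {"value": "https://churchofturing.github.io/landscapeoflisp.html"}},
            {"link": {"value": "https://www.jjj.de/fxt/fxtbook.pdf"}},
            {"link": {"value": "https://ereader-swedish.fly.dev/"}},
        ]
    }
}
urls = post_links(response)
```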

Nest mapSelector calls to traverse hierarchical DOM structures. The example below extracts post authors and scores from each submission:

mutation ScrapeHackerNewsMetadata {
  goto(url: "https://news.ycombinator.com") {
    status
  }

  posts: mapSelector(selector: ".subtext .subline") {
    author: mapSelector(selector: ".hnuser") {
      authorName: innerText
    }

    score: mapSelector(selector: ".score") {
      score: innerText
    }
  }
}

When elements match, mapSelector returns an array, whether one or many elements match. It returns null when no elements are found.
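
Because a nested mapSelector can yield either an array or null, downstream code needs to handle both. The sketch below is one way to do that in Python. It assumes the nested response mirrors the query aliases (posts, author, authorName, score). The helper names and the sample payload are hypothetical, not real Hacker News output.

```python
def first_text(matches, key):
    """Return the given field of the first match, or None if nothing matched."""
    if not matches:  # covers both null (None) and an empty list
        return None
    return matches[0].get(key)


def summarize_posts(bql_response):
    # Assumption: the result shape follows the aliases used in the query.
    posts = bql_response["data"].get("posts") or []
    return [
        {
            "author": first_text(post.get("author"), "authorName"),
            "score": first_text(post.get("score"), "score"),
        }
        for post in posts
    ]


# Hypothetical sample payload for illustration only.
sample = {
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"author": [{"authorName": "alice"}], "score": [{"score": "42 points"}]},
            {"author": None, "score": None},
        ],
    }
}
summary = summarize_posts(sample)
```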

Scraping Network Responses

The response mutation captures HTTP responses received by the browser, filtered by URL pattern, method, or resource type. BQL waits for the response automatically.

The example below captures the raw document response from a page load:

mutation DocumentResponses {
  goto(url: "https://example.com/", waitUntil: load) {
    status
  }

  response(type: document) {
    url
    body
    headers {
      name
      value
    }
  }
}

Combine the type and method filters to narrow the captured responses, and set operator to control how the filters are joined. The example below uses operator: and to capture only responses that are both XHR and GET:

mutation AJAXGetCalls {
  goto(url: "https://msn.com/", waitUntil: load) {
    status
  }

  response(type: xhr, method: GET, operator: and) {
    url
    type
    method
    body
    headers {
      name
      value
    }
  }
}
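
Captured XHR bodies are often JSON strings that still need decoding. The sketch below keeps only the entries whose content-type header mentions JSON and parses their bodies. It assumes the response field comes back as a list of entries with the url, body, and headers fields requested above. The function names, URLs, and sample payload are hypothetical.

```python
import json


def headers_to_dict(headers):
    """Turn the [{name, value}, ...] header list into a lowercase-keyed dict."""
    return {h["name"].lower(): h["value"] for h in headers or []}


def json_bodies(bql_response):
    # Assumption: data.response is a list; a single object is wrapped defensively.
    entries = bql_response["data"].get("response") or []
    if isinstance(entries, dict):
        entries = [entries]

    parsed = []
    for entry in entries:
        content_type = headers_to_dict(entry.get("headers")).get("content-type", "")
        if "json" not in content_type:
            continue
        try:
            parsed.append({"url": entry["url"], "data": json.loads(entry["body"])})
        except (TypeError, ValueError):
            # Skip bodies that are missing or not valid JSON.
            continue
    return parsed


# Hypothetical sample payload for illustration only.
sample = {
    "data": {
        "response": [
            {
                "url": "https://example.com/api/items",
                "body": '{"items": [1, 2, 3]}',
                "headers": [{"name": "Content-Type", "value": "application/json"}],
            },
            {
                "url": "https://example.com/fragment",
                "body": "<div>hello</div>",
                "headers": [{"name": "Content-Type", "value": "text/html"}],
            },
        ]
    }
}
results = json_bodies(sample)
```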

Using querySelectorAll

The querySelectorAll mutation returns an array of matched elements with their HTML properties. Use it when you need fast element extraction without the nested mapping of mapSelector.

mutation FindLinks {
  goto(url: "https://browserless.io") {
    status
  }

  links: querySelectorAll(selector: "a") {
    innerText
    outerHTML
  }
}

Each result includes innerHTML, innerText, outerHTML, id, className, and childElementCount. Use innerText to get visible text, or outerHTML to get the full element markup.
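
A selector as broad as "a" tends to match icon-only or empty anchors as well. The sketch below drops entries with no visible text and collapses whitespace in the rest. It assumes the result is a list under the links alias with the innerText and outerHTML fields requested above. The visible_links function and the sample payload are hypothetical.

```python
def visible_links(bql_response):
    """Keep only matches with visible text, with whitespace collapsed."""
    links = bql_response["data"].get("links") or []
    cleaned = []
    for link in links:
        text = " ".join((link.get("innerText") or "").split())
        if text:
            cleaned.append({"text": text, "html": link.get("outerHTML")})
    return cleaned


# Hypothetical sample payload for illustration only.
sample = {
    "data": {
        "links": [
            {"innerText": "  Docs\n", "outerHTML": '<a href="/docs">Docs</a>'},
            {"innerText": "", "outerHTML": '<a href="/"><svg></svg></a>'},
        ]
    }
}
kept = visible_links(sample)
```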

Processing Data with JavaScript

The evaluate mutation runs JavaScript in the browser context and returns the result. Use it when you need calculations, filtering, or transformations that go beyond what mapSelector or querySelectorAll support.

mutation CountHeadings {
  goto(url: "https://browserless.io") {
    status
  }

  headingCount: evaluate(
    content: "document.querySelectorAll('h2').length"
  ) {
    value
  }
}

The content field accepts any JavaScript expression or block. Wrap multi-step logic in a function body and use return to pass values back. For examples using await, external scripts, and complex transformations, see Multi-line JavaScript.
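
Embedding longer JavaScript directly in a query string means escaping quotes and newlines by hand. One way to avoid that is to pass the script as a standard GraphQL variable and let the JSON encoder handle the escaping. The sketch below only builds the request body and sends nothing. It assumes that the content argument is typed String and that your BQL endpoint accepts GraphQL variables. The $js variable name and the build_payload function are hypothetical.

```python
import json

# The script arrives through the $js variable instead of an inline string.
# Assumption: evaluate's content argument accepts a String variable.
QUERY = """
mutation CountHeadings($js: String!) {
  goto(url: "https://browserless.io") {
    status
  }

  headingCount: evaluate(content: $js) {
    value
  }
}
"""


def build_payload(javascript):
    """Build a GraphQL-over-HTTP JSON body; json.dumps escapes the script."""
    return json.dumps({"query": QUERY, "variables": {"js": javascript}})


payload = build_payload("document.querySelectorAll('h2').length")
```

The resulting string can be sent as the body of a POST request with a JSON content type to whichever BQL endpoint you use.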

Next Steps