Extraction rules

Extraction rules let you specify rules that are applied to the resulting HTML to extract values into JSON.

Rules are passed in the extract_rules parameter. Remember to URL-encode this parameter, as in the examples below.

Extract rules are an object in which each key maps to a selector in either simple or extended form. Instead of HTML, the response is a JSON object with the same keys as the input, whose values are extracted according to the corresponding rules.

By default, only the first matching element is returned. To extract all matching elements, use {"type": "all", ...}.

Simple form

Simple form is a CSS selector given as a string. It returns an array containing the text of every element matching the selector.

Extract an array of all paragraphs

The following example uses simple form to extract the text of every paragraph (<p>) on the scraped page:

curl -G \
--data-urlencode 'url=https://example.com' \
--data-urlencode 'extract_rules={"paragraphs": "p"}' \
'https://scraping.narf.ai/api/v1/?api_key=[your api key]'
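The same request can be assembled in Python. The sketch below only builds the encoded URL and does not send it. The endpoint and parameter names are copied from the curl example above; build_request_url is a hypothetical helper name introduced for illustration.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the curl example above.
API_ENDPOINT = "https://scraping.narf.ai/api/v1/"


def build_request_url(api_key: str, target_url: str, extract_rules: dict) -> str:
    """Return the GET URL with extract_rules serialized to JSON and URL-encoded.

    Hypothetical helper for illustration; not part of any official client.
    """
    params = {
        "api_key": api_key,
        "url": target_url,
        # The rules object is sent as a JSON string; urlencode then
        # percent-encodes the braces, quotes and other reserved characters.
        "extract_rules": json.dumps(extract_rules),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


request_url = build_request_url(
    "[your api key]",
    "https://example.com",
    {"paragraphs": "p"},
)
print(request_url)
```

Serializing with json.dumps before encoding keeps the rules valid JSON regardless of how deeply they are nested, so quoting does not have to be handled by hand.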

Extended form

Extended form is an object in which you can specify not only the selector (selector key) but also how the output should be processed (output key) and whether all matching elements or only the first one should be extracted (type key). Extract rules using extended form look like the following:

extract_rules={
  "something": {
    "type": "all",
    "selector": "p",
    "output": "text"
  }
}

type

Specifies whether to extract all or just the first element. Must be "all" or "first". Defaults to "first".

selector

Any CSS selector.

output

Specifies type of output. Must be one of:

  • "text" - extracts the elements' inner text only (without HTML tags). This is the default.
  • "html" - extracts the elements' inner HTML.
  • "table_json" - extracts the headers and rows of HTML table(s) to JSON, e.g. [{"header1": "row1_value1", "header2": "row1_value2"}, {"header1": "row2_value1", "header2": "row2_value2"}]. If there are no headers, the keys are incrementing integers: "0", "1" and so on.
  • "table_array" - extracts the rows of HTML table(s) to array(s), e.g. [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]].
  • "@<any_string>" - extracts the value of the named attribute, e.g. use @href to extract link targets.
  • Nested object - extracts data from inner elements. See: nested rules.
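To make the difference between "table_json" and "table_array" concrete, the sketch below converts rows in the "table_array" shape into the "table_json" shape locally. It only illustrates the two formats described above and is not the service's implementation; rows_to_table_json is a hypothetical helper name.

```python
from typing import Optional


def rows_to_table_json(
    rows: list[list[str]],
    headers: Optional[list[str]] = None,
) -> list[dict[str, str]]:
    """Turn table_array-style rows into table_json-style objects (illustration only)."""
    records = []
    for row in rows:
        # Without headers, keys are incrementing integers as strings: "0", "1", ...
        keys = headers if headers is not None else [str(i) for i in range(len(row))]
        records.append(dict(zip(keys, row)))
    return records


table_array = [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]]

with_headers = rows_to_table_json(table_array, ["header1", "header2"])
without_headers = rows_to_table_json(table_array)

print(with_headers)
print(without_headers)
```

"table_json" is convenient when columns are looked up by name, while "table_array" keeps the original column order and works for tables without a header row.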

Here's how you can extract all links from a scraped page:

curl -G \
--data-urlencode 'url=https://example.com' \
--data-urlencode 'extract_rules={"links": {"type": "all", "selector": "a", "output": "@href"}}' \
'https://scraping.narf.ai/api/v1/?api_key=[your api key]'
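When rules are generated in code, a small constructor can keep the three extended-form keys consistent. The following is a minimal sketch: extended_rule is a hypothetical helper, and its defaults simply mirror the documented ones ("first" and "text").

```python
import json


def extended_rule(selector: str, output="text", match: str = "first") -> dict:
    """Build one extended-form rule (hypothetical helper, not an official API).

    `match` becomes the "type" key; `output` may be a string such as "text"
    or "@href", or a nested rules object.
    """
    if match not in ("all", "first"):
        raise ValueError('type must be "all" or "first"')
    return {"type": match, "selector": selector, "output": output}


# Same rules as in the curl example above: every link's href attribute.
links_rules = {"links": extended_rule("a", output="@href", match="all")}

# JSON string to pass (URL-encoded) as the extract_rules parameter.
encoded = json.dumps(links_rules)
print(encoded)
```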

Nested rules

To create a nested output structure, set the "output" key to another extract rules object. Rules can be nested recursively. Example:

extract_rules={
  "items": {
    "type": "all",
    "selector": ".item",
    "output": {
      "price": ".price",
      "date": ".date",
      "details": {
        "selector": ".details",
        "output": {
          "title": ".title",
          "description": ".description"
        }
      }
    }
  }
}
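Because nested rules are recursive, a client-side sanity check can walk them with a recursive function before the request is sent. The sketch below checks rules against the simple and extended forms described on this page. It assumes the "selector" key is required in extended form, and validate_rules is a hypothetical helper that does not reproduce whatever validation the service itself performs.

```python
# Output values documented above; "@<attribute>" and nested objects are handled separately.
SCALAR_OUTPUTS = {"text", "html", "table_json", "table_array"}


def validate_rules(rules: dict, path: str = "extract_rules") -> None:
    """Raise ValueError if `rules` does not match the documented forms (client-side check only)."""
    for key, rule in rules.items():
        where = f"{path}.{key}"
        if isinstance(rule, str):
            continue  # simple form: a bare CSS selector
        # Assumption: extended form must be an object with a string "selector".
        if not isinstance(rule, dict) or not isinstance(rule.get("selector"), str):
            raise ValueError(f"{where}: expected a selector string or an object with a 'selector' key")
        if rule.get("type", "first") not in ("all", "first"):
            raise ValueError(f"{where}: type must be 'all' or 'first'")
        output = rule.get("output", "text")
        if isinstance(output, dict):
            validate_rules(output, where)  # nested rules: recurse into the inner object
        elif not (isinstance(output, str) and (output in SCALAR_OUTPUTS or output.startswith("@"))):
            raise ValueError(f"{where}: unsupported output {output!r}")


nested_rules = {
    "items": {
        "type": "all",
        "selector": ".item",
        "output": {
            "price": ".price",
            "date": ".date",
            "details": {
                "selector": ".details",
                "output": {"title": ".title", "description": ".description"},
            },
        },
    }
}

validate_rules(nested_rules)  # passes silently for the nested example above
```

The path argument is threaded through the recursion so that an error message names the exact nested key that failed, for example extract_rules.items.details.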