Extraction rules

Extraction rules let you define rules that are applied to the resulting HTML to extract values in JSON format.

Rules are passed with the extract_rules parameter.

The extract_rules parameter is an object in which each key is an output name of your choice and each value is a rule, in either simple or extended form, built around a CSS selector. Instead of HTML, the response is a JSON object whose keys match those in the extract_rules query parameter and whose values are extracted according to the corresponding rules.

Simple form

In simple form, a rule is a CSS selector given as a string.

Example: Extract the first paragraph

The following example uses the simple form to extract the text content of the first paragraph (<p>) of the scraped page:

GET /api/v1/
import requests
import json

payload = {
  "api_key": "[your API key]",
  "url": "https://example.com",
  "extract_rules": json.dumps({"paragraph": "p"}),
}

response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)

In the response, you receive JSON with the text content of the first <p> element under the "paragraph" key:

Response

{
  "paragraph": "This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission."
}
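The value above keeps the newline and indentation from the page source. If you would rather normalize it on the client side than use the clean key of the extended form, the idea can be sketched as follows. This is only a local approximation: the helper name collapse_whitespace is hypothetical, and the service's own cleaning rules are not specified on this page and may differ.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space.

    Assumption: this approximates the described "clean" behavior; the exact
    server-side rules are not documented here.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace.
    return " ".join(text.split())


# Value copied from the example response above.
raw = (
    "This domain is for use in illustrative examples in documents. "
    "You may use this\n    domain in literature without prior coordination "
    "or asking for permission."
)

print(collapse_whitespace(raw))
```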

Extended form

Extended form is an object that specifies the selector (selector key), how the output should be processed (output key), whether redundant whitespace should be removed from the text value (clean key), and whether all matching elements or only the first one should be extracted (type key). Extraction rules in extended form have the following structure:

{
  "extract_rules": {
    "something": {
      "type": "all",
      "selector": "p",
      "output": "text",
      "clean": true
    }
  }
}
  • Name
    type
    Type
    "all" | "first"
    Description
    Specifies whether to extract all or just the first element. Defaults to "first".
  • Name
    selector
    Type
    string
    Description
    A valid CSS selector.
  • Name
    output
    Type
    "text" | "html" | "table_json" | "table_array" | "@<any_string>" | object
    Description

    Specifies the type of output. Defaults to "text".

    • "text" - extracts the element's inner text only, without HTML tags.
    • "html" - extracts the element's inner HTML.
    • "table_json" - extracts the headers and rows of HTML table(s) to JSON, e.g. [{"header1": "row1_value1", "header2": "row1_value2"}, {"header1": "row2_value1", "header2": "row2_value2"}]. If there are no headers, the keys are incrementing integers as strings: "0", "1", and so on.
    • "table_array" - extracts the rows of HTML table(s) to array(s), e.g. [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]].
    • "@<any_string>" - extracts the value of the named attribute, e.g. use "@href" to extract link URLs.
    • Nested object - extracts data from inner elements. See: Nested rules.
  • Name
    clean
    Type
    bool
    Description
    Specifies whether redundant whitespace and newline characters should be removed from the extracted text. Defaults to false.
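
To make the difference between the "table_json" and "table_array" output shapes concrete, here is a small local sketch that builds both from the same rows. The functions table_json and table_array are hypothetical helpers written for illustration only; the actual extraction happens on the service side, and this code merely mirrors the shapes described above.

```python
from typing import Dict, List, Optional


def table_array(rows: List[List[str]]) -> List[List[str]]:
    """Return each row as a plain list, mirroring the "table_array" shape."""
    return [list(row) for row in rows]


def table_json(
    rows: List[List[str]], headers: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Return each row as an object keyed by header, mirroring "table_json".

    When no headers are given, keys fall back to "0", "1", and so on,
    as described for header-less tables.
    """
    result = []
    for row in rows:
        keys = headers if headers else [str(i) for i in range(len(row))]
        result.append(dict(zip(keys, row)))
    return result


rows = [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]]

print(table_json(rows, ["header1", "header2"]))
print(table_json(rows))  # no headers: keys are "0", "1"
print(table_array(rows))
```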

Here's how you can extract all links from the scraped website:

GET /api/v1/
import requests
import json

payload = {
  "api_key": "[your API key]",
  "url": "https://example.com",
  "extract_rules": json.dumps({
    "links": {"type": "all", "selector": "a", "output": "@href"}
  }),
}

response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)

In the response, you receive JSON with a list of all links from the scraped page under the "links" key:

Response

{"links":["https://www.iana.org/domains/example"]}
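
Since the response body is JSON, it is usually more convenient to decode it into Python objects than to print the raw bytes. A minimal sketch, using the body shown above as a literal so that it runs without a network call; parse_links is a hypothetical helper name. With requests, the same bytes come from response.content, and response.json() performs the decoding in one step.

```python
import json
from typing import List


def parse_links(body: bytes) -> List[str]:
    """Decode a JSON response body and return the list under the "links" key."""
    data = json.loads(body)  # json.loads accepts bytes as well as str
    return data["links"]


# Response body copied from the example above.
body = b'{"links":["https://www.iana.org/domains/example"]}'

for link in parse_links(body):
    print(link)
```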

Nested rules

To create a nested output structure, set the "output" key to another extraction rules object. Inner rules extract data from within the element matched by the outer selector, and rules can be nested recursively. Example:

{
  "extract_rules": {
    "items": {
      "type": "all",
      "selector": ".item",
      "output": {
        "price": ".price",
        "date": ".date",
        "details": {
          "selector": ".details",
          "output": {
            "title": ".title",
            "description": ".description"
          }
        }
      }
    }
  }
}
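
Deeply nested rules are easy to mistype, so it can help to check them locally before sending a request. The following is a rough client-side sketch based only on the structure described on this page; it is not the service's own validation. The names validate_rules and validate_rule are hypothetical, and treating "selector" as required in extended form is an assumption.

```python
from typing import Any

NAMED_OUTPUTS = {"text", "html", "table_json", "table_array"}


def validate_rule(rule: Any, path: str) -> None:
    """Check one rule (simple or extended form); raise ValueError on problems."""
    if isinstance(rule, str):
        return  # simple form: a CSS selector string
    if not isinstance(rule, dict):
        raise ValueError(f"{path}: rule must be a string or an object")
    # Assumption: extended form always carries a string "selector".
    if not isinstance(rule.get("selector"), str):
        raise ValueError(f"{path}: 'selector' must be a string")
    if rule.get("type", "first") not in ("all", "first"):
        raise ValueError(f"{path}: 'type' must be \"all\" or \"first\"")
    if not isinstance(rule.get("clean", False), bool):
        raise ValueError(f"{path}: 'clean' must be a boolean")
    output = rule.get("output", "text")
    if isinstance(output, dict):
        # Nested rules: recurse into the inner rules object.
        validate_rules(output, f"{path}.output")
    elif isinstance(output, str):
        is_attribute = output.startswith("@") and len(output) > 1
        if output not in NAMED_OUTPUTS and not is_attribute:
            raise ValueError(f"{path}: unknown 'output' value {output!r}")
    else:
        raise ValueError(f"{path}: 'output' must be a string or an object")


def validate_rules(rules: Any, path: str = "extract_rules") -> None:
    """Check a rules object (output name -> rule) recursively."""
    if not isinstance(rules, dict):
        raise ValueError(f"{path}: expected an object of named rules")
    for name, rule in rules.items():
        validate_rule(rule, f"{path}.{name}")


# The nested rules from the example above.
nested_rules = {
    "items": {
        "type": "all",
        "selector": ".item",
        "output": {
            "price": ".price",
            "date": ".date",
            "details": {
                "selector": ".details",
                "output": {"title": ".title", "description": ".description"},
            },
        },
    }
}

validate_rules(nested_rules)  # raises ValueError if the structure is off
print("rules look well-formed")
```

If the check passes, the rules can be serialized with json.dumps and passed as the extract_rules parameter in the same way as in the earlier examples.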