Extraction rules
Extraction rules let you specify rules that are applied to the resulting HTML to extract values in JSON format. Rules are passed with the extract_rules parameter. Remember to encode this parameter as shown in the examples below.
The extract_rules parameter is an object in which each key is an output name and each value is a rule: a CSS selector in either simple or extended form. The response is JSON instead of HTML. Its keys are the same as in the extract_rules query parameter, and its values are extracted according to the provided rules.
By default, only the first matching element is returned. To extract all matching elements, use {"type": "all", ...}.
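As a minimal sketch of what "encoding" means here: the rules object is first serialized to a JSON string and then URL-encoded together with the other query parameters. The snippet uses only the Python standard library; the variable names are illustrative, and the round trip at the end is only there to show that nothing is lost.

```python
import json
from urllib.parse import urlencode, parse_qs

# Rules object in simple form, as used in the examples of this page.
rules = {"paragraph": "p"}

params = {
    "api_key": "[your API key]",
    "url": "https://example.com",
    # Step 1: serialize the rules object to a JSON string.
    "extract_rules": json.dumps(rules),
}

# Step 2: URL-encode all query parameters (braces, quotes, colons, spaces).
query_string = urlencode(params)
print(query_string)

# Decoding the query string gives back the original rules object.
decoded = parse_qs(query_string)
roundtrip = json.loads(decoded["extract_rules"][0])
```

HTTP clients such as requests perform the URL-encoding step themselves when the parameters are passed as a dictionary, so usually only the json.dumps step has to be written by hand.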
Simple form
Simple form is a CSS selector in the form of a string.
Example: Extract the first paragraph
The following example uses the simple extraction rule form to extract text content from the first paragraph (<p>) out of the scraped page:
import requests
import json

payload = {
    "api_key": "[your API key]",
    "url": "https://example.com",
    # The rules object must be serialized to a JSON string.
    "extract_rules": json.dumps({"paragraph": "p"}),
}
# requests URL-encodes the query parameters.
response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)
In the response, you receive a JSON with the contents of the first <p> element in the "paragraph" key:
Response
{
  "paragraph": "This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission."
}
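A short sketch of how such a response could be consumed. It parses a hard-coded copy of the sample body instead of calling the API, and then collapses whitespace locally; the variable names are illustrative only.

```python
import json

# Sample body copied from the response shown above. With a live request,
# response.json() from the requests library would return the same dict.
sample_body = (
    '{"paragraph": "This domain is for use in illustrative examples in '
    'documents. You may use this\\n domain in literature without prior '
    'coordination or asking for permission."}'
)

data = json.loads(sample_body)
paragraph = data["paragraph"]

# The extracted text keeps the page's original newline and indentation.
# Splitting on whitespace and re-joining collapses them to single spaces.
collapsed = " ".join(paragraph.split())
print(collapsed)
```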
Extended form
Extended form is an object that specifies not only the selector (selector key) but also how the output should be processed (output key), whether redundant whitespace should be removed from the text value (clean key), and whether all or only the first matching element should be extracted (type key).
Extraction rules using extended form have the following structure:
{
  "extract_rules": {
    "something": {
      "type": "all",
      "selector": "p",
      "output": "text",
      "clean": true
    }
  }
}
- Name: type
  Type: "all" | "first"
  Description: Specifies whether to extract all matching elements or just the first one. Defaults to "first".
- Name: selector
  Type: string
  Description: A valid CSS selector.
- Name: output
  Type: "text" | "html" | "table_json" | "table_array" | "@<any_string>" | object
  Description: Specifies the type of output. Defaults to "text".
    - "text": extracts the element's inner text only, without HTML tags.
    - "html": extracts the element's inner HTML.
    - "table_json": extracts the headers and rows of HTML table(s) to JSON, e.g. [{"header1": "row1_value1", "header2": "row1_value2"}, {"header1": "row2_value1", "header2": "row2_value2"}]. If there are no headers, the keys are incrementing integers: "0", "1", and so on.
    - "table_array": extracts the rows of HTML table(s) to array(s), e.g. [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]].
    - "@<any_string>": extracts an attribute value of the element, e.g. use @href to extract link values.
    - Nested object: extracts data from inner elements. See: nested rules.
- Name: clean
  Type: bool
  Description: Specifies whether redundant whitespace and newline characters should be removed from the extracted text. Defaults to false.
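The two table outputs differ only in shape. The sketch below is not the service's implementation; it only reproduces the documented shapes from a list of rows and an optional list of headers, so the difference is easy to see. The function names table_array and table_json are hypothetical helpers chosen to mirror the option names.

```python
from typing import Dict, List, Optional


def table_array(rows: List[List[str]]) -> List[List[str]]:
    """Shape of "table_array": one inner list per table row."""
    return [list(row) for row in rows]


def table_json(
    rows: List[List[str]], headers: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Shape of "table_json": one object per row, keyed by header.

    Without headers, keys are incrementing integers as strings.
    """
    result = []
    for row in rows:
        keys = headers if headers else [str(i) for i in range(len(row))]
        result.append(dict(zip(keys, row)))
    return result


rows = [["row1_value1", "row1_value2"], ["row2_value1", "row2_value2"]]
with_headers = table_json(rows, headers=["header1", "header2"])
without_headers = table_json(rows)
as_array = table_array(rows)
```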
Example: Extract all links
Here's how you can extract all links from the scraped website:
import requests
import json

payload = {
    "api_key": "[your API key]",
    "url": "https://example.com",
    "extract_rules": json.dumps({
        "links": {"type": "all", "selector": "a", "output": "@href"}
    }),
}
response = requests.get("https://scraping.narf.ai/api/v1/", params=payload)
print(response.content)
In the response, you receive a JSON with a list of all links from the scraped website in the "links" key:
Response
{"links":["https://www.iana.org/domains/example"]}
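To make the difference between "first" and "all" concrete for an attribute output such as "@href", here is a rough local emulation built on Python's standard html.parser module. It is an illustration under simplifying assumptions, not how the service works internally: it matches by tag name only rather than by full CSS selector, it skips elements that lack the attribute, and the names AttrCollector and extract_attr are made up for this sketch.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the value of one attribute from every matching tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value is not None:
                self.values.append(value)


def extract_attr(html, tag, attr, type_="first"):
    """Return all attribute values, or only the first one (the default)."""
    collector = AttrCollector(tag, attr)
    collector.feed(html)
    if type_ == "all":
        return collector.values
    return collector.values[0] if collector.values else None


sample_html = (
    '<p><a href="https://www.iana.org/domains/example">More</a> '
    '<a href="/about">About</a></p>'
)
first_link = extract_attr(sample_html, "a", "href")
all_links = extract_attr(sample_html, "a", "href", type_="all")
```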
Nested rules
If you want to create a nested output structure, set the "output" key to a nested extraction rules object in a recursive manner.
Example:
{
  "extract_rules": {
    "items": {
      "type": "all",
      "selector": ".item",
      "output": {
        "price": ".price",
        "date": ".date",
        "details": {
          "selector": ".details",
          "output": {
            "title": ".title",
            "description": ".description"
          }
        }
      }
    }
  }
}
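Because nested rules are recursive, it can help to check a rules object locally before sending it. The sketch below walks the structure and checks each rule against the fields described on this page. The helper names validate_rule and validate_rules are hypothetical, and two of the checks are assumptions rather than documented behavior: that extended-form rules must contain a selector, and that unknown keys should be rejected.

```python
import json

ALLOWED_KEYS = {"type", "selector", "output", "clean"}
OUTPUT_KEYWORDS = {"text", "html", "table_json", "table_array"}


def validate_rule(name, rule):
    """Check one rule in simple (string) or extended (object) form."""
    if isinstance(rule, str):
        if not rule.strip():
            raise ValueError(f"{name}: selector must not be empty")
        return
    if not isinstance(rule, dict):
        raise ValueError(f"{name}: rule must be a string or an object")
    unknown = set(rule) - ALLOWED_KEYS  # assumption: reject unknown keys
    if unknown:
        raise ValueError(f"{name}: unknown keys {sorted(unknown)}")
    if not isinstance(rule.get("selector"), str):  # assumption: required
        raise ValueError(f"{name}: selector must be a string")
    if rule.get("type", "first") not in ("all", "first"):
        raise ValueError(f'{name}: type must be "all" or "first"')
    if not isinstance(rule.get("clean", False), bool):
        raise ValueError(f"{name}: clean must be a boolean")
    output = rule.get("output", "text")
    if isinstance(output, dict):
        validate_rules(output, prefix=name)  # nested rules: recurse
    elif isinstance(output, str):
        is_attr = output.startswith("@") and len(output) > 1
        if output not in OUTPUT_KEYWORDS and not is_attr:
            raise ValueError(f"{name}: unsupported output {output!r}")
    else:
        raise ValueError(f"{name}: output must be a string or an object")


def validate_rules(rules, prefix=""):
    """Check every named rule in a rules object."""
    for key, rule in rules.items():
        validate_rule(f"{prefix}.{key}" if prefix else key, rule)


# The nested rules from the example above.
nested_rules = {
    "items": {
        "type": "all",
        "selector": ".item",
        "output": {
            "price": ".price",
            "date": ".date",
            "details": {
                "selector": ".details",
                "output": {"title": ".title", "description": ".description"},
            },
        },
    }
}

validate_rules(nested_rules)
# Serialized value for the extract_rules query parameter.
encoded = json.dumps(nested_rules)
```

Error messages include the dotted path of the offending rule (for example items.details), which makes mistakes in deeply nested rules easier to locate.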