Web Scraping Benchmark

We developed a benchmark to test selected Web Scraping APIs. It involves scraping various web pages that are commonly targeted in web scraping workflows. The results let us evaluate Web Scraping APIs in terms of reliability, proxy quality, speed and cost.

The Python script we used to run the benchmark is publicly available in a GitHub repository. It can also be used to run a scraping job with the Scraping Fish API by providing an input file with a list of URLs to scrape.

Methodology overview

The benchmark includes URLs from 5 categories:

  1. Alexa: URLs from the top 1,000 Alexa rank
  2. Amazon: Amazon product URLs
  3. Google: Google search queries
  4. Instagram: the top 10 Instagram profiles (as of 2022)
  5. Similarweb: websites from the Similarweb ranking (excluding adult and Russian websites)

For each category, we made 1,000 requests and recorded:

  • Successful requests
  • Failed requests
  • Blocked requests
  • Average request processing time (seconds per URL)
  • Cost of running the benchmark (1,000 requests)
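As a rough illustration of how the recorded values above could be turned into the per-category figures, here is a minimal sketch. The `Result` type, the status labels, and the choice to average the time over all requests (rather than only successful ones) are assumptions made for this example; the actual benchmark script in the repository may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    status: str     # hypothetical labels: "success", "failed", or "blocked"
    seconds: float  # wall-clock processing time for one URL


def summarize(results):
    """Return percentage shares per outcome and the mean time per URL."""
    n = len(results)

    def share(status):
        return 100 * sum(r.status == status for r in results) / n

    return {
        "successful_pct": share("success"),
        "failed_pct": share("failed"),
        "blocked_pct": share("blocked"),
        # Assumption: time is averaged over every request, not only successes.
        "avg_seconds_per_url": mean(r.seconds for r in results),
    }


# Tiny made-up sample, only to show the shape of the output.
sample = [
    Result("success", 2.0),
    Result("success", 3.0),
    Result("failed", 4.0),
    Result("blocked", 1.0),
]
summary = summarize(sample)
print(summary)
```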

More details on the methodology and instructions for reproducing the results are provided in the GitHub repository.

Results

Scraping Fish

Test         Successful   Failed   Blocked   Processing time (s/URL)   Cost
Alexa        99.9%        0.1%     0%        2.63                      $2
Amazon       100%         0%       0%        3.37                      $2
Google       100%         0%       0%        1.63                      $2
Instagram    99.9%        0.1%     0%        1.9                       $2
Similarweb   100%         0%       0%        2.50                      $2
Total        99.96%       0.04%    0%        2.4                       $10

Cost: $0.002 per successfully scraped URL. Scraping Fish had the highest overall success rate and the shortest processing time.
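The Total row appears to be the unweighted mean of the five category rows, which is reasonable because each category had the same number of requests. The short check below reproduces the totals and the per-URL cost from the Scraping Fish figures; the variable names are made up for this sketch.

```python
from statistics import mean

# (success rate in %, processing time in s/URL) per category, from the table above.
scraping_fish = {
    "Alexa": (99.9, 2.63),
    "Amazon": (100.0, 3.37),
    "Google": (100.0, 1.63),
    "Instagram": (99.9, 1.9),
    "Similarweb": (100.0, 2.50),
}

total_success = mean(rate for rate, _ in scraping_fish.values())
total_time = mean(seconds for _, seconds in scraping_fish.values())

# $2 per 1,000 requests, five categories of 1,000 requests each.
total_cost_usd = 2 * len(scraping_fish)
cost_per_url_usd = total_cost_usd / (1000 * len(scraping_fish))

print(f"success: {total_success:.2f}%")
print(f"time: {total_time:.1f} s/URL")
print(f"cost: ${total_cost_usd} total, ${cost_per_url_usd:.3f} per URL")
```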

Other web scraping APIs

Scraping Ant

Benchmarks were run using the --api "https://api.scrapingant.com/v1/general/?proxy_type=residential&" parameter, and the code was adjusted to pass the API key as a header instead of a query parameter.

Test         Successful   Failed   Blocked   Processing time (s/URL)   Cost
Alexa        100%         0%       0%        6.92                      $19
Amazon       98%          2%       0%        9.84                      $19
Google       95%          5%       0%        13.8                      $19
Instagram    99.5%        0.5%     0%        6.76                      $19
Similarweb   96%          4%       0%        7.40                      $19
Total        97.7%        2.3%     0%        8.94                      $49

A $49 Startup subscription is required to scrape 5,000 URLs in total (each consuming 50 or 250 API credits) using 5 concurrent connections.
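For the Scraping Ant run, the adjustment of passing the API key as a header rather than a query parameter might look roughly like the sketch below. It only builds the request object and sends nothing. The header name `x-api-key` and the query parameter names `url` and `api_key` are assumptions for illustration, and `build_request` is a hypothetical helper, not part of the benchmark script.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(api_base, target_url, api_key, key_in_header):
    """Build (but do not send) a request to a scraping API endpoint.

    api_base is expected to end with "?" or "&", as in the --api values
    quoted in this benchmark, so extra parameters can be appended directly.
    """
    params = {"url": target_url}  # assumed parameter name
    headers = {}
    if key_in_header:
        headers["x-api-key"] = api_key  # assumed header name
    else:
        params["api_key"] = api_key  # assumed parameter name
    return Request(api_base + urlencode(params), headers=headers)


base = "https://api.scrapingant.com/v1/general/?proxy_type=residential&"
req = build_request(base, "https://example.com", "YOUR_API_KEY", key_in_header=True)
print(req.full_url)
print(req.header_items())
```

Keeping the key out of the query string also keeps it out of any logged URLs, which is a common reason to prefer the header form.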

ScrapingBee

Benchmarks were run using the --api "https://app.scrapingbee.com/api/v1/?premium_proxy=true&" parameter, with the custom_google parameter set to true for the Google benchmark.

Test         Successful   Failed   Blocked   Processing time (s/URL)   Cost
Alexa        81%          18%      1%        4.86                      $99
Amazon       99%          1%       0%        11.48                     $99
Google       100%         0%       0%        3.74                      $99
Instagram    99%          1%       0%        18.52                     $59
Similarweb   90%          8%       2%        4.70                      $99
Total        93.8%        5.6%     0.6%      8.66                      $99

A $99 Startup subscription is required to scrape 5,000 URLs in total (each consuming 10, 20, or 25 API credits) using 5 concurrent connections.

ScraperAPI

Benchmarks were run using the --api "http://api.scraperapi.com/?premium=true" parameter.

Test         Successful   Failed   Blocked   Processing time (s/URL)   Cost
Alexa        95.5%        4.5%     0%        7.19                      $49
Amazon       96%          4%       0%        10.97                     $49
Google       100%         0%       0%        4.5                       $49
Instagram*   0%           100%     0%        0                         -
Similarweb   90%          8%       2%        4.70                      $49
Total        76.3%        23.3%    0.4%      6.84                      $49

* ScraperAPI does not allow scraping Instagram and returns a 403 status code.

A $49 Hobby subscription is required to scrape 5,000 URLs in total (each consuming 10 or 25 API credits) using 5 concurrent connections.

Conclusions

Scraping Fish 🐟 achieved the highest total success rate of 99.96% with the best average processing time of 2.4 seconds/URL, as reported in the results table. Moreover, thanks to the Scraping Fish API's simple and transparent pricing, the total cost of running the benchmark was roughly 5 to 10 times lower than for the other tested APIs.
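The cost comparison can be checked with simple arithmetic on the Total rows of the result tables. The sketch below divides each API's total benchmark cost by the Scraping Fish total; the dictionary and variable names are made up for this example.

```python
# Total benchmark cost in USD per API, copied from the result tables.
total_cost_usd = {
    "Scraping Fish": 10,
    "Scraping Ant": 49,
    "ScrapingBee": 99,
    "ScraperAPI": 49,
}

baseline = total_cost_usd["Scraping Fish"]

# How many times more each other API cost for the same 5,000 URLs.
ratios = {
    name: cost / baseline
    for name, cost in total_cost_usd.items()
    if name != "Scraping Fish"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Scraping Fish cost")
```

These ratios compare subscription prices only; they do not account for unused credits left over on the other plans.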

Try Scraping Fish API

To run the scraping script for your use case, you can get a starter pack of 1,000 API requests for only $2.
