State of Web Scraping 2023 Results

Welcome to the inaugural 2023 Web Scraping Survey!

This survey is our attempt to understand the evolving landscape of web scraping, capturing insights from a diverse pool of developers, data scientists, and hobbyists. We've collected responses on a broad range of topics, from technical skills and tool preferences to the ethical considerations and financial aspects of web scraping.

In the spirit of openness and community learning, we are excited to share the raw data containing all responses. For those interested in a deeper dive, you can access the data at this link.

Our goal with this survey is to paint a detailed picture of the current state of web scraping, the prevailing trends, practices, challenges, and opportunities associated with it. The rich insights garnered from these responses should help inform and stimulate valuable discussions around the future of web scraping. Now, let's dive into the findings and see what the data has to say!

Demographics of Survey Participants

We collected responses from 136 participants in total. Most identified as Software Developers (58.8%) or Web Scrapers (47.1%); the two shares sum to more than 100%, so some respondents selected both roles.
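Because the total number of respondents is known, each reported percentage can be turned back into an approximate headcount. The sketch below is only an illustration of that arithmetic: the helper names are hypothetical, and it assumes every percentage is a share of all 136 respondents, which the text implies but does not state for each question.

```python
TOTAL_RESPONDENTS = 136  # total number of participants stated in the text


def share_to_count(percent: float, total: int = TOTAL_RESPONDENTS) -> int:
    """Convert a reported percentage to the nearest whole number of respondents."""
    return round(percent * total / 100)


def count_to_share(count: int, total: int = TOTAL_RESPONDENTS) -> float:
    """Convert a headcount back to a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Role shares as reported in the survey text.
roles = {"Software Developer": 58.8, "Web Scraper": 47.1}
counts = {role: share_to_count(percent) for role, percent in roles.items()}

for role, count in counts.items():
    print(f"{role}: about {count} of {TOTAL_RESPONDENTS} respondents")

# If both shares refer to the same 136 people, the two groups must overlap
# by at least (sum of counts - total) respondents.
min_overlap = max(0, sum(counts.values()) - TOTAL_RESPONDENTS)
print(f"At least {min_overlap} respondents selected both roles")
```

Rounding the counts back with `count_to_share` reproduces the published figures, which is a quick sanity check that the percentages are consistent with a base of 136.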

Geographically, the largest group of respondents came from the United States (26.5%).

Participants were mostly in the 25-34 (40.4%) and 35-44 (33.8%) age groups, which together account for 74.2% of respondents.

A sizable share of participants (41.9%) hold a degree in a related field, indicating that although web scraping is accessible, many practitioners have formal education in the field.

Web Scraping Expertise and Experience

Participants mostly identified themselves as "Advanced" (39.7%) or "Intermediate" (37.5%) in their web scraping expertise, revealing a largely experienced community.

Interestingly, about a quarter of respondents (26.5%) don't work in web scraping, suggesting an interest that goes beyond professional involvement.

Although the majority (52.9%) spend 5 hours or less per week on web scraping, a dedicated group (10.3%) does it full-time.

Programming Languages and Tools for Web Scraping

Python is the leading language for web scraping, used by 62.5% of respondents, with JavaScript trailing at 34.6%. This underscores Python's continued dominance in the web scraping domain.

For development tools, Selenium leads with a 39.7% usage rate, followed closely by Scrapy (31.6%) and Playwright (28.7%). This reveals a variety of tool preferences, demonstrating the diversity in the web scraping field.

Proxy and Web Scraping API Usage

A notable 49.3% of respondents do not use proxies for web scraping, while those who do showed a preference for Residential (31.6%) and Datacenter (30.1%) proxies. Interestingly, 16.9% utilize custom-built solutions, suggesting a degree of technical proficiency in the community.

Geotargeting isn't a major concern for over half the respondents (52.9%). Those who do need geotargeting mainly use IPs originating from the United States (29.4%).

When it comes to Web Scraping APIs, a large number of respondents (75.7%) do not use them. Among the users, ScrapingBee (7.4%) and Scraping Fish (6.6%) are the most popular choices.

Ethics in Web Scraping

The importance of ethics in web scraping appears to be polarizing: equal shares of respondents rated it very important (18.4%) and not important at all (18.4%).

A similar split appears for the ethics of how IPs for web scraping are sourced: 21.3% care very much and 22.8% do not care at all.

This highlights differing perspectives on ethical considerations within the community, suggesting a need for more dialogue and consensus on the topic.

:::note

Ethics in sourcing IP addresses for web scraping is a very important topic for Scraping Fish. You can read more about our view on this here: How IPs for web scraping are sourced

:::

Making Money in Web Scraping

While a considerable proportion of respondents (36.0%) don't earn from web scraping, others have found various ways to monetize their skills. The most common approaches are building products based on scraped data (36.8%) and being employed as a software engineer involved in web scraping (27.9%).

It's interesting to note that more than half of respondents (52.9%) operate solo, and the largest share (46.3%) report not making any money on web scraping. For those who do generate income, earnings are quite spread out, but 11.8% earn a modest $1,000 - $10,000 annually.

Types of Scraped Websites and Data

Respondents show a wide range of interests, with the largest group (36.8%) scraping all kinds of data. E-commerce (28.7%) and public government data (24.3%) are also popular targets for web scraping.

When it comes to the complexity of the websites being scraped, the largest group of respondents (34.6%) rate it as a 4 out of 6 in terms of difficulty. This suggests a moderate level of challenge.

Regarding dynamic websites, the largest group of respondents (45.6%) scrape them less than 50% of the time, and a significant number (19.9%) do not scrape them at all. We expect this to change in the future as more and more websites load content dynamically.

Sources of Information About Web Scraping

YouTube (50.7%) and Stack Overflow (48.5%) are the most popular sources of information about web scraping among the respondents. Blog posts (39.7%) and the r/WebScraping subreddit (22.8%) are also frequently used resources. Some respondents (3.7%) even turn to ChatGPT for advice on web scraping.

We find the diverse range of information sources particularly interesting. In fact, we're preparing a detailed blog post to delve into these findings. We aim to provide further insights and perhaps introduce some unexpected resources for web scraping enthusiasts. Stay tuned!

Summary and Conclusions

The 2023 Web Scraping Survey has provided us with illuminating insights into the behaviors, preferences, and challenges that our community navigates in the dynamic field of web scraping. Our participants, predominantly solo developers fluent in Python, displayed varying levels of web scraping expertise and an interest in a wide array of data, with particular attention to e-commerce and public government data.

While some community members have successfully monetized their web scraping efforts, a considerable portion do not currently earn money from it. Of note was how divided respondents were on the ethics of web scraping, with as many rating it very important as not important at all. In terms of technological preferences, Python emerged as the predominant language for web scraping, with Selenium and Scrapy being the tools of choice for many.

Acknowledging the limitations of the survey, primarily the sample size, we would like to express our profound gratitude to all participants for their valuable input. Your feedback forms the backbone of our understanding of the current web scraping landscape.
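To give a rough sense of what a sample of 136 means for the figures above, the sketch below computes an approximate 95% margin of error for a reported percentage. This is not part of the survey's own analysis: it uses the standard normal approximation for a proportion and assumes a simple random sample, which a self-selected online survey is not, so the real uncertainty is likely larger. The function name and the default `z` value are illustrative choices.

```python
import math

SAMPLE_SIZE = 136  # total number of participants stated in the text


def margin_of_error(percent: float, n: int = SAMPLE_SIZE, z: float = 1.96) -> float:
    """Approximate margin of error, in percentage points, for a reported share.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which assumes
    a simple random sample (an assumption, not a property of this survey).
    """
    p = percent / 100
    return 100 * z * math.sqrt(p * (1 - p) / n)


# Python usage share as reported in the survey text.
python_share = 62.5
moe = margin_of_error(python_share)
print(f"Python usage: {python_share}% +/- {moe:.1f} percentage points")

# The margin is widest for a share of 50%, which gives a worst-case bound.
worst_case = margin_of_error(50.0)
print(f"Worst case at n={SAMPLE_SIZE}: +/- {worst_case:.1f} percentage points")
```

Under these assumptions, differences of only a few percentage points between two options are within the noise and should be read with caution.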

Looking ahead, we aim to dive deeper into these trends in our 2024 edition to chart the evolution of the web scraping community. In line with that, we are open to suggestions on areas you would like to explore further in the upcoming survey. Connect with us on 𝕏 @ScrapingF and share your thoughts.

We encourage broader participation in next year's survey to enrich our collective understanding. Through these efforts, we hope to continue shaping the future of web scraping, ensuring it remains a dynamic, ethical, and thriving discipline.