Scraping Instagram (updated for 2024)
This blog post is a comprehensive tutorial for scraping public Instagram profile information and posts using Scraping Fish API. We will be scraping posts from a profile that lists old houses for sale to find the best deal.
We prepared an accompanying Python notebook shared in a GitHub repository: instagram-scraping-fish. To be able to run it and actually scrape the data, you will need a Scraping Fish API key which you can get here: Scraping Fish Requests Packs. A starter pack of 1,000 API requests costs only $2 and will let you run this tutorial and play with the API on your own ⛹️. Without a Scraping Fish API key you are likely to get blocked instantly ⛔️.
It’s important to point out that we are using Instagram’s private (undocumented) API for scraping. This blog post and the code were last updated in February 2024. If Instagram changes something in the API that we rely on, this tutorial may no longer work and will have to be adjusted. If you experience any problems, feel free to open an issue on GitHub and we will investigate.
Scraping use case
As an example to test Scraping Fish capabilities to scrape Instagram, we will fetch and parse data from posts shared by the public profile Stare domy 🏚 (Old Houses). It is an aggregated listing of old houses for sale in Poland. Post descriptions in this profile provide fairly structured data about each property, including location, price, size, etc.
Instagram profile endpoint
The first endpoint that we need to call for our profile is:
https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz
We also have to include a custom header: "x-ig-app-id": "936619743392459".
The response gives us a JSON with:
- user identifier needed for the next requests,
- general profile information (not used in this blog post but available in the response),
- the first page of posts, and
- next page cursor to retrieve the next batch of posts.
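Below is a minimal sketch of how this first request can be issued through Scraping Fish, assuming the requests library and an SF_API_KEY variable holding your API key (the same call is part of the complete scraping function later in this post):

import json
import requests

params = {
    "api_key": SF_API_KEY,
    "url": "https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz",
    "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
}
response = requests.get("https://scraping.narf.ai/api/v1/", params=params, timeout=95)
response.raise_for_status()
profile_json = response.json()  # JSON with user id, profile info, first page of posts, and next page cursor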
Pagination with Instagram GraphQL API
For the following requests, to obtain the next pages with posts, we will use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}
The query_id parameter is fixed to 17888483320059182. You do not need to change it; the value stays the same regardless of the Instagram profile and the page of user posts.
You can play with the value of the first query parameter to retrieve more posts per page than the 24 set in this tutorial and reduce the total number of requests. Keep in mind, however, that a value that is too large might look suspicious and result in an Instagram login prompt instead of valid JSON.
To get the next page information from the JSON response, we can use the following function:

from typing import Any, Dict, List, Optional, Union

def parse_page_info(response_json: Dict[str, Any]) -> Dict[str, Union[Optional[bool], Optional[str]]]:
    # the top-level key is "graphql" for the profile endpoint and "data" for the GraphQL endpoint
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    page_info = user_data.get("edge_owner_to_timeline_media", {}).get("page_info", {})
    return page_info
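A short usage sketch, assuming response_json holds one of the responses described above; the end_cursor is what drives the pagination loop later in this post and becomes empty on the last page:

page_info = parse_page_info(response_json=response_json)
end_cursor = page_info.get("end_cursor")  # falsy when there are no more pages to fetch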
Parsing posts from JSON response
Now that we have the endpoints for the user profile and pages with posts figured out, we can write a function that parses the JSON response and retrieves the post information we need.
The response structure from both endpoints described above is roughly the same, but the top-level object key for the profile response is graphql, whereas for the GraphQL query it is data. We already accounted for this in the next page info parsing code.
A function that retrieves basic post information is all about accessing relevant keys in the response JSON:
def parse_posts(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    post_edges = user_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
    posts = []
    for node in post_edges:
        post_json = node.get("node", {})
        shortcode = post_json.get("shortcode")
        image_url = post_json.get("display_url")
        caption_edges = post_json.get("edge_media_to_caption", {}).get("edges", [])
        description = caption_edges[0].get("node", {}).get("text") if len(caption_edges) > 0 else None
        n_comments = post_json.get("edge_media_to_comment", {}).get("count")
        likes_key = "edge_liked_by" if "edge_liked_by" in post_json else "edge_media_preview_like"
        n_likes = post_json.get(likes_key, {}).get("count")
        timestamp = post_json.get("taken_at_timestamp")
        posts.append({
            "shortcode": shortcode,
            "image_url": image_url,
            "description": description,
            "n_comments": n_comments,
            "n_likes": n_likes,
            "timestamp": timestamp,
        })
    return posts
It returns a list of dictionaries representing posts that contain:
- shortcode: you can use it to access the post at https://www.instagram.com/p/<shortcode>/
- image_url 🏞
- description: post text 📝
- n_comments: number of comments 💬
- n_likes: number of likes 👍
- timestamp: when the post was created ⏰
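For example, assuming response_json holds one of the responses described above, you can parse the posts and build a direct link to each one from its shortcode:

posts = parse_posts(response_json=response_json)
for post in posts:
    print(f"https://www.instagram.com/p/{post['shortcode']}/", post["n_likes"], "likes")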
Complete Instagram scraping logic
Now we are ready to put all the pieces together and implement the complete Instagram user profile scraping logic that retrieves all user posts page by page:
import json

import requests

def scrape_ig_profile(username: str, sf_api_key: str) -> List[Dict[str, Any]]:
    params = {
        "api_key": sf_api_key,
        "url": f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
        "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
    }

    def request_json(url, params) -> Dict[str, Any]:
        response = requests.get(url, params=params, timeout=95)
        response.raise_for_status()
        return response.json()

    response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)
    # get user_id from the response to request next pages with posts
    user_id = response_json.get("data", {}).get("user", {}).get("id")
    if not user_id:
        print(f"User {username} not found.")
        return []
    # parse the first batch of posts from the user profile response
    posts = parse_posts(response_json=response_json)
    page_info = parse_page_info(response_json=response_json)
    # get the next page cursor
    end_cursor = page_info.get("end_cursor")
    while end_cursor:
        params = {
            "api_key": sf_api_key,
            "url": f"https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}",
        }
        response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)
        posts.extend(parse_posts(response_json=response_json))
        page_info = parse_page_info(response_json=response_json)
        end_cursor = page_info.get("end_cursor")
    return posts
And that’s it. We can use this function to scrape all posts from an arbitrary public Instagram profile. In our case, it will be staredomynasprzedaz:
SF_API_KEY = "YOUR_SCRAPING_FISH_API_KEY"
posts = scrape_ig_profile(username="staredomynasprzedaz", sf_api_key=SF_API_KEY)
With Scraping Fish API, it should take about 2 seconds per page, so a profile with 300 posts (around 13 pages of 24 posts each) will be scraped in about 25 seconds.
Since the returned posts are structured dictionaries in a list, we can create a pandas data frame from them for easier data processing:

import pandas as pd

df = pd.DataFrame(posts)
| | shortcode | image_url | description | n_comments | n_likes | timestamp |
|---|---|---|---|---|---|---|
| 0 | CbrYIabMBXS | link | Różany, Gronowo Elbląskie, woj. warmińsko-mazu... | 14 | 475 | 1648535479 |
| 1 | CbnRiJwsxsc | link | Komorów, Michałowice, woj. mazowieckie \nCena:... | 28 | 761 | 1648397802 |
| 2 | CbhVQU3MtTR | link | Pomorowo, Lidzbark Warmiński, woj. warmińsko-m... | 14 | 526 | 1648198427 |
| 3 | CbakX60Me4r | link | Smyków, Radgoszcz, woj. małopolskie \nCena: 37... | 10 | 264 | 1647971472 |
| 4 | CbXGs-JNK0U | link | Dębowa Łęka, Wschowa, woj. lubuskie\nCena: 389... | 3 | 436 | 1647855253 |
| ... | ... | ... | ... | ... | ... | ... |
Parsing property features from post description
From a structured part of the post description, we can parse more detailed information about properties:
- location (address and province) 📍
- price in PLN 💰
- house size in m² 🏠
- plot area in m² 📐
For a function that implements description parsing based on regular expressions, refer to the notebook accompanying this blog post: https://github.com/mateuszbuda/instagram-scraping-fish/blob/master/instagram-tutorial.ipynb
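If you prefer not to open the notebook, here is a rough sketch of the idea. It is not the notebook's exact implementation: the regular expressions and the "Cena:" label are assumptions based on the description format visible in the tables below, and the notebook extracts house size and plot area in a similar way.

import re
from typing import Dict, Optional

def parse_price(description: str) -> Optional[float]:
    # e.g. "Cena: 175 000 zł" -> 175000.0 (label and number format assumed from the post descriptions)
    match = re.search(r"Cena:\s*([\d\s]+)\s*zł", description)
    return float(re.sub(r"\s", "", match.group(1))) if match else None

def parse_location(description: str) -> Dict[str, Optional[str]]:
    # the first line looks like "Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie"
    first_line = description.splitlines()[0].strip() if description else ""
    match = re.search(r"^(?P<address>.+?),?\s*woj\.\s*(?P<province>.+)$", first_line)
    if not match:
        return {"address": None, "province": None}
    return {
        "address": match.group("address").strip(" ,"),
        "province": match.group("province").strip(),
    }

Functions like these can be applied to the description column to produce the address, province, and price columns used in the examples below.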
From this information, we can simply compute additional derived features, e.g., price per m² of the house and plot:
df["price_per_house_m2"] = df["price"].div(df["house_size"])
df["price_per_plot_m2"] = df["price"].div(df["plot_area"])
Data exploration
Based on the data frame that we created, we can extract some useful stats, e.g., the number of houses in each province and the mean price per m²:
df.groupby("province").agg({"price_per_house_m2": ["mean", "count"]}).sort_values(by=("price_per_house_m2", "mean"))
| province | price_per_house_m2 (mean) | price_per_house_m2 (count) |
|---|---|---|
| opolskie | 1643.662176 | 3 |
| dolnośląskie | 1749.198578 | 40 |
| zachodniopomorskie | 1873.457787 | 22 |
| lubuskie | 1997.868677 | 10 |
| podkarpackie | 2283.578380 | 46 |
| łódzkie | 2444.755891 | 7 |
| podlaskie | 2689.675717 | 46 |
| warmińsko - mazurskie | 2733.333333 | 1 |
| lubelskie | 2781.316515 | 18 |
| małopolskie | 2879.480040 | 47 |
| śląskie | 2969.714365 | 25 |
| świętokrzyskie | 3005.367271 | 2 |
| wielkopolskie | 3084.229161 | 7 |
| warmińsko-mazurskie | 3099.135703 | 29 |
| pomorskie | 3135.444546 | 8 |
| kujawsko-pomorskie | 3885.582011 | 6 |
| mazowieckie | 4167.280252 | 20 |
We can also filter the data to find houses that we might be interested in. The example below searches for houses with a price below 200,000 PLN and a size between 100 m² and 200 m². Here is a link to one of them based on its shortcode: https://www.instagram.com/p/CYv93e8Nvwh/
df[(df["price"] < 200000.0) & (df["house_size"] < 200.0) & (df["house_size"] > 100.0)]
| | shortcode | image_url | description | n_comments | n_likes | timestamp | address | province | price | house_size | plot_area | price_per_house_m2 | price_per_plot_m2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | CYv93e8Nvwh | link | Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie ... | 23 | 701 | 1642247030 | Ponikwa, Bystrzyca Kłodzka | dolnośląskie | 165000.0 | 120.00 | 3882.0 | 1375.000000 | 42.503864 |
| 57 | CXGiAE6sw6y | link | Gadowskie Holendry, Tuliszków, woj. wielkopols... | 7 | 462 | 1638709205 | Gadowskie Holendry, Tuliszków | wielkopolskie | 199000.0 | 111.00 | 3996.0 | 1792.792793 | 49.799800 |
| 122 | CTSiKe4sGsx | link | Gotówka, Ruda - Huta, woj. lubelskie\nCena: 18... | 3 | 189 | 1630522009 | Gotówka, Ruda - Huta | lubelskie | 186000.0 | 120.00 | 1832.0 | 1550.000000 | 101.528384 |
| 149 | CR040yusVyV | link | Leżajsk, woj. podkarpackie \nCena: 175 000 zł... | 26 | 547 | 1627379773 | Leżajsk | podkarpackie | 175000.0 | 108.00 | 912.0 | 1620.370370 | 191.885965 |
| 181 | CQvirvJM0vi | link | Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ... | 4 | 239 | 1625052909 | Rząśnik, Świerzawa | dolnośląskie | 199000.0 | 160.00 | 1900.0 | 1243.750000 | 104.736842 |
| 190 | CQhJJTFsyPt | link | Szymbark, Gorlice, woj. małopolskie \nCena: 19... | 2 | 222 | 1624569758 | Szymbark, Gorlice | małopolskie | 199000.0 | 136.00 | 9574.0 | 1463.235294 | 20.785461 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Conclusion
I hope you now feel more confident in scraping. As you can see, it is super easy to scrape publicly available data with Scraping Fish API, even from websites as challenging as Instagram. In a similar way, you can scrape other user profiles as well as other websites that contain information relevant to you or your business 📈.
Let's talk about your use case 💼
Feel free to reach out using our contact form. We can assist you in integrating Scraping Fish API into your existing scraping workflow or help you set up a scraping system for your use case.