Scraping Instagram (updated for 2024)

Mateusz Buda
Co-Founder

Paweł Kobojek
Co-Founder

This blog post is a comprehensive tutorial for scraping public Instagram profile information and posts using Scraping Fish API. We will be scraping posts from a profile that lists old houses for sale to find the best deal.

We prepared an accompanying Python notebook shared in the GitHub repository: instagram-scraping-fish. To run it and actually scrape the data, you will need a Scraping Fish API key, which you can get here: Scraping Fish Requests Packs. A starter pack of 1,000 API requests costs only $2 and will let you run this tutorial and play with the API on your own ⛹️. Without a Scraping Fish API key, you are likely to get blocked instantly ⛔️.

Scraping use case

As an example to test Scraping Fish capabilities on Instagram, we will fetch and parse data from posts shared by the public profile Stare domy 🏚 (Old Houses). It is an aggregated listing of old houses for sale in Poland. Post descriptions in this profile provide fairly structured data about each property, including location, price, size, etc.

Instagram profile endpoint

The first endpoint that we need to call for our profile is: https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz. We also have to include a custom header "x-ig-app-id": "936619743392459". The response gives us a JSON with:

  • user identifier needed for next requests,

  • general profile information (not used in this blog post but available in the response),

  • the first page of posts and

  • next page cursor to retrieve the next batch of posts.
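For reference, a single profile request routed through Scraping Fish can look like the minimal sketch below (the API key is a placeholder, and profile_json/user_id are just our working variable names); the same call is wrapped in a reusable function later in this post:

import json

import requests

# Scraping Fish fetches the target URL passed in the `url` parameter;
# the custom x-ig-app-id header is required by this Instagram endpoint.
params = {
    "api_key": "YOUR_SCRAPING_FISH_API_KEY",
    "url": "https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz",
    "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
}
response = requests.get("https://scraping.narf.ai/api/v1/", params=params, timeout=95)
response.raise_for_status()
profile_json = response.json()
user_id = profile_json["data"]["user"]["id"]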

Pagination with Instagram GraphQL API

For the following requests, to obtain the next pages of posts, we will use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}

The query_id parameter is fixed to 17888483320059182. You do not need to change it; the value stays the same regardless of the Instagram profile and the page of user posts.

You can play with the value of the first query parameter to retrieve more posts per page than the 24 used in this tutorial and reduce the total number of requests. Keep in mind, however, that a value that is too large might look suspicious and result in an Instagram login prompt instead of valid JSON.
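If you want to experiment with the page size, you can extract the URL construction into a small helper like the sketch below (build_graphql_url is our own hypothetical name, not part of any API):

# Hypothetical helper for building the paginated GraphQL URL used in this tutorial.
def build_graphql_url(user_id: str, end_cursor: str, first: int = 24) -> str:
    # query_id is the fixed value mentioned above; `first` controls the page size.
    return (
        "https://instagram.com/graphql/query/"
        f"?query_id=17888483320059182&id={user_id}&first={first}&after={end_cursor}"
    )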

To get the next page information from JSON response we can use the following function:

from typing import Any, Dict, List, Optional, Union

def parse_page_info(response_json: Dict[str, Any]) -> Dict[str, Union[Optional[bool], Optional[str]]]:
    # The top-level key differs between endpoints ("graphql" vs "data"), so handle both.
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    page_info = user_data.get("edge_owner_to_timeline_media", {}).get("page_info", {})
    return page_info
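Applied to the profile response from the earlier sketch, it returns the page_info object, which typically holds a has_next_page flag and the end_cursor needed to request the next page:

page_info = parse_page_info(response_json=profile_json)
# Expected shape (may vary): {"has_next_page": True, "end_cursor": "..."}
end_cursor = page_info.get("end_cursor")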

Parsing posts from JSON response

Now that we have the endpoints for the user profile and the pages with posts figured out, we can write a function that parses the JSON response and retrieves the post information we need.

The response structure from both endpoints described above is roughly the same, but the top-level object key can be either graphql or data depending on the endpoint, which we already accounted for in the next page info parsing code.

A function that retrieves basic post information is all about accessing relevant keys in the response JSON:

def parse_posts(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    # The top-level key differs between endpoints ("graphql" vs "data"), so handle both.
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    post_edges = user_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
    posts = []
    for node in post_edges:
        post_json = node.get("node", {})
        shortcode = post_json.get("shortcode")
        image_url = post_json.get("display_url")
        # The caption is nested in an edge list and may be missing entirely.
        caption_edges = post_json.get("edge_media_to_caption", {}).get("edges", [])
        description = caption_edges[0].get("node", {}).get("text") if len(caption_edges) > 0 else None
        n_comments = post_json.get("edge_media_to_comment", {}).get("count")
        # The key holding the like count differs between response variants.
        likes_key = "edge_liked_by" if "edge_liked_by" in post_json else "edge_media_preview_like"
        n_likes = post_json.get(likes_key, {}).get("count")
        timestamp = post_json.get("taken_at_timestamp")
        posts.append({
            "shortcode": shortcode,
            "image_url": image_url,
            "description": description,
            "n_comments": n_comments,
            "n_likes": n_likes,
            "timestamp": timestamp,
        })
    return posts

It returns a list of dictionaries representing posts that contain:

  • shortcode: you can use it to access the post at https://www.instagram.com/p/<shortcode>/

  • image_url 🏞

  • description: post text 📝

  • n_comments: number of comments 💬

  • n_likes: number of likes 👍

  • timestamp: when the post was created ⏰
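For example, applied to the profile response from the earlier sketch, the parsed dictionaries can be used directly, e.g. to rebuild a post URL from its shortcode and to convert the Unix timestamp into a readable date:

from datetime import datetime, timezone

posts = parse_posts(response_json=profile_json)
if posts:
    first_post = posts[0]
    # Posts are reachable at https://www.instagram.com/p/<shortcode>/
    print(f"https://www.instagram.com/p/{first_post['shortcode']}/")
    # taken_at_timestamp is a Unix timestamp in seconds
    print(datetime.fromtimestamp(first_post["timestamp"], tz=timezone.utc))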

Complete Instagram scraping logic

Now we are ready to put all the pieces together to implement a complete Instagram user profile scraping logic that retrieves all user posts page by page:

import json

import requests

def scrape_ig_profile(username: str, sf_api_key: str) -> List[Dict[str, Any]]:
    params = {
        "api_key": sf_api_key,
        "url": f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
        "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
    }

    def request_json(url, params) -> Dict[str, Any]:
        response = requests.get(url, params=params, timeout=95)
        response.raise_for_status()
        return response.json()

    response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)

    # get user_id from response to request next pages with posts
    user_id = response_json.get("data", {}).get("user", {}).get("id")
    if not user_id:
        print(f"User {username} not found.")
        return []
    # parse the first batch of posts from user profile response
    posts = parse_posts(response_json=response_json)
    page_info = parse_page_info(response_json=response_json)
    # get next page cursor
    end_cursor = page_info.get("end_cursor")
    while end_cursor:
        params = {
            "api_key": sf_api_key,
            "url": f"https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}",
        }
        response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)
        posts.extend(parse_posts(response_json=response_json))
        page_info = parse_page_info(response_json=response_json)
        end_cursor = page_info.get("end_cursor")
    return posts

And that's it. We can use this function to scrape all posts from any public Instagram profile. In our case, it will be staredomynasprzedaz:

SF_API_KEY = "YOUR_SCRAPING_FISH_API_KEY"
posts = scrape_ig_profile(username="staredomynasprzedaz", sf_api_key=SF_API_KEY)

With Scraping Fish API, it should take about 2 seconds per page, so a profile with 300 posts (around 13 pages of 24) will be scraped in roughly 25-30 seconds.

Since the returned posts are structured dictionaries in a list, we can create a pandas DataFrame from them for easier data processing:

import pandas as pd

df = pd.DataFrame(posts)
    | shortcode   | image_url | description                                        | n_comments | n_likes | timestamp
0   | CbrYIabMBXS | link      | Różany, Gronowo Elbląskie, woj. warmińsko-mazu...  | 14         | 475     | 1648535479
1   | CbnRiJwsxsc | link      | Komorów, Michałowice, woj. mazowieckie \nCena:...  | 28         | 761     | 1648397802
2   | CbhVQU3MtTR | link      | Pomorowo, Lidzbark Warmiński, woj. warmińsko-m...  | 14         | 526     | 1648198427
3   | CbakX60Me4r | link      | Smyków, Radgoszcz, woj. małopolskie \nCena: 37...  | 10         | 264     | 1647971472
4   | CbXGs-JNK0U | link      | Dębowa Łęka, Wschowa, woj. lubuskie\nCena: 389...  | 3          | 436     | 1647855253
... | ...         | ...       | ...                                                | ...        | ...     | ...

Parsing property features from post description

From a structured part of the post description, we can parse more detailed information about properties:

  • location (address and province) 📍

  • price in PLN 💰

  • house size in m² 🏠

  • plot area in m² 📐

For a function that implements description parsing based on regular expressions, refer to the notebook accompanying this blog post: https://github.com/mateuszbuda/instagram-scraping-fish/blob/master/instagram-tutorial.ipynb
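If you only need a rough idea of what such parsing can look like, here is a minimal sketch. The label names and number formats (e.g. "Cena: 375 000 zł") are assumptions based on the description snippets shown above, not the notebook's exact regular expressions; the notebook's full parser also extracts house_size and plot_area, which are used in the next step.

import re
from typing import Any, Dict, Optional

def parse_description(description: Optional[str]) -> Dict[str, Any]:
    # Illustrative parser; label names and formats are assumptions about the post layout.
    if not description:
        return {}
    result: Dict[str, Any] = {}
    # The first line usually looks like "Town, Municipality, woj. <province>".
    first_line = description.splitlines()[0]
    location = re.match(r"(?P<address>.+?),?\s*woj\.\s*(?P<province>[\w\s-]+)", first_line)
    if location:
        result["address"] = location.group("address").strip().rstrip(",")
        result["province"] = location.group("province").strip()
    # Price like "Cena: 375 000 zł" (spaces as thousands separators).
    price = re.search(r"Cena:\s*([\d\s]+)zł", description)
    if price:
        result["price"] = float(price.group(1).replace(" ", ""))
    # house_size and plot_area can be extracted analogously from their labels.
    return result

df = df.join(df["description"].apply(parse_description).apply(pd.Series))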

From this information, we can simply compute additional derived features, e.g., price per m² of the house and plot:

df["price_per_house_m2"] = df["price"].div(df["house_size"])
df["price_per_plot_m2"] = df["price"].div(df["plot_area"])

Data exploration

Based on the data frame that we created, we can extract some useful stats, e.g., the number of houses in each province and the mean price per m²:

df.groupby("province").agg({"price_per_house_m2": ["mean", "count"]}).sort_values(by=("price_per_house_m2", "mean"))
province              | price_per_house_m2 (mean) | price_per_house_m2 (count)
opolskie              | 1643.662176               | 3
dolnośląskie          | 1749.198578               | 40
zachodniopomorskie    | 1873.457787               | 22
lubuskie              | 1997.868677               | 10
podkarpackie          | 2283.578380               | 46
łódzkie               | 2444.755891               | 7
podlaskie             | 2689.675717               | 46
warmińsko - mazurskie | 2733.333333               | 1
lubelskie             | 2781.316515               | 18
małopolskie           | 2879.480040               | 47
śląskie               | 2969.714365               | 25
świętokrzyskie        | 3005.367271               | 2
wielkopolskie         | 3084.229161               | 7
warmińsko-mazurskie   | 3099.135703               | 29
pomorskie             | 3135.444546               | 8
kujawsko-pomorskie    | 3885.582011               | 6
mazowieckie           | 4167.280252               | 20

We can also filter the data to find houses that we might be interested in. The example below searches for houses with a price below 200,000 PLN and a size between 100 m² and 200 m². Here is a link to one of them, based on its shortcode: https://www.instagram.com/p/CYv93e8Nvwh/

df[(df["price"] < 200000.0) & (df["house_size"] < 200.0) & (df["house_size"] > 100.0)]
    | shortcode   | image_url | description                                        | n_comments | n_likes | timestamp  | address                       | province      | price    | house_size | plot_area | price_per_house_m2 | price_per_plot_m2
31  | CYv93e8Nvwh | link      | Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie ...  | 23         | 701     | 1642247030 | Ponikwa, Bystrzyca Kłodzka    | dolnośląskie  | 165000.0 | 120.00     | 3882.0    | 1375.000000        | 42.503864
57  | CXGiAE6sw6y | link      | Gadowskie Holendry, Tuliszków, woj. wielkopols...  | 7          | 462     | 1638709205 | Gadowskie Holendry, Tuliszków | wielkopolskie | 199000.0 | 111.00     | 3996.0    | 1792.792793        | 49.799800
122 | CTSiKe4sGsx | link      | Gotówka, Ruda - Huta, woj. lubelskie\nCena: 18...  | 3          | 189     | 1630522009 | Gotówka, Ruda - Huta          | lubelskie     | 186000.0 | 120.00     | 1832.0    | 1550.000000        | 101.528384
149 | CR040yusVyV | link      | Leżajsk, woj. podkarpackie \nCena: 175 000 zł...   | 26         | 547     | 1627379773 | Leżajsk                       | podkarpackie  | 175000.0 | 108.00     | 912.0     | 1620.370370        | 191.885965
181 | CQvirvJM0vi | link      | Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ...  | 4          | 239     | 1625052909 | Rząśnik, Świerzawa            | dolnośląskie  | 199000.0 | 160.00     | 1900.0    | 1243.750000        | 104.736842
190 | CQhJJTFsyPt | link      | Szymbark, Gorlice, woj. małopolskie \nCena: 19...  | 2          | 222     | 1624569758 | Szymbark, Gorlice             | małopolskie   | 199000.0 | 136.00     | 9574.0    | 1463.235294        | 20.785461
... | ...         | ...       | ...                                                | ...        | ...     | ...        | ...                           | ...           | ...      | ...        | ...       | ...                | ...

Conclusion

I hope you now feel more confident about scraping. As you can see, it is easy to scrape publicly available data with Scraping Fish API, even from websites as challenging as Instagram. In a similar way, you can scrape other user profiles as well as other websites that contain information relevant to you or your business 📈.

Let's talk about your use case 💼

Feel free to reach out using our contact form. We can assist you in integrating Scraping Fish API into your existing scraping workflow or help you set up a scraping system for your use case.