Scraping Instagram (updated for 2024)
This blog post is a comprehensive tutorial for scraping public Instagram profile information and posts using Scraping Fish API. We will be scraping posts from a profile that lists old houses for sale to find the best deal.
We prepared an accompanying Python notebook shared in a GitHub repository: instagram-scraping-fish. To be able to run it and actually scrape the data, you will need a Scraping Fish API key which you can get here: Scraping Fish Requests Packs. A starter pack of 1,000 API requests costs only $2 and will let you run this tutorial and play with the API on your own ⛹️. Without a Scraping Fish API key you are likely to get blocked instantly ⛔️.
It’s important to point out that we are using Instagram’s private (undocumented) API for scraping. This blog post and the code were last updated in February 2024. If Instagram changes something in the API that we rely on, this tutorial may no longer work and will have to be adjusted. If you experience any problems, feel free to open an issue on GitHub and we will investigate.
Scraping use case
As an example to test Scraping Fish capabilities to scrape Instagram, we will fetch and parse data from posts shared by the public profile Stare domy 🏚 (Old Houses). It is an aggregated listing of old houses for sale in Poland. Post descriptions in this profile provide fairly structured data about each property, including location, price, size, etc.
Instagram profile endpoint
The first endpoint that we need to call for our profile is:
https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz
We also have to include a custom header: "x-ig-app-id": "936619743392459".
The response gives us a JSON with:
- user identifier needed for the next requests,
- general profile information (not used in this blog post but available in the response),
- the first page of posts, and
- next page cursor to retrieve the next batch of posts.
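Below is a minimal sketch of how this first request can be issued through Scraping Fish, assuming the requests library and an SF_API_KEY variable holding your API key (the same call is part of the complete scraping function later in this post):

import json
import requests

params = {
    "api_key": SF_API_KEY,
    "url": "https://i.instagram.com/api/v1/users/web_profile_info/?username=staredomynasprzedaz",
    "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
}
response = requests.get("https://scraping.narf.ai/api/v1/", params=params, timeout=95)
response.raise_for_status()
profile_json = response.json()  # JSON with user id, profile info, first page of posts, and next page cursor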
Pagination with Instagram GraphQL API
For the following requests, to obtain the next pages with posts, we will use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}
The query_id parameter is fixed to 17888483320059182. You do not need to change it; the value stays the same regardless of the Instagram profile and the page of user posts.
You can play with the value of the first query parameter to retrieve more posts per page than the 24 set in this tutorial and reduce the total number of requests. Keep in mind, however, that a value that is too large might look suspicious and result in an Instagram login prompt instead of valid JSON.
To get the next page information from the JSON response, we can use the following function:

from typing import Any, Dict, List, Optional, Union

def parse_page_info(response_json: Dict[str, Any]) -> Dict[str, Union[Optional[bool], Optional[str]]]:
    # the top-level key is "graphql" for the profile endpoint and "data" for the GraphQL endpoint
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    page_info = user_data.get("edge_owner_to_timeline_media", {}).get("page_info", {})
    return page_info
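A short usage sketch, assuming response_json holds one of the responses described above; the end_cursor is what drives the pagination loop later in this post and becomes empty on the last page:

page_info = parse_page_info(response_json=response_json)
end_cursor = page_info.get("end_cursor")  # falsy when there are no more pages to fetch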
Parsing posts from JSON response
Now that we have the endpoints for the user profile and pages with posts figured out, we can write a function that parses the JSON response and retrieves the post information we need.
The response structure from both endpoints described above is roughly the same, but the top-level object key for the profile response is graphql, whereas for the GraphQL query it is data. We already accounted for this in the next page info parsing code.
A function that retrieves basic post information is all about accessing relevant keys in the response JSON:
def parse_posts(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    post_edges = user_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
    posts = []
    for node in post_edges:
        post_json = node.get("node", {})
        shortcode = post_json.get("shortcode")
        image_url = post_json.get("display_url")
        caption_edges = post_json.get("edge_media_to_caption", {}).get("edges", [])
        description = caption_edges[0].get("node", {}).get("text") if len(caption_edges) > 0 else None
        n_comments = post_json.get("edge_media_to_comment", {}).get("count")
        likes_key = "edge_liked_by" if "edge_liked_by" in post_json else "edge_media_preview_like"
        n_likes = post_json.get(likes_key, {}).get("count")
        timestamp = post_json.get("taken_at_timestamp")
        posts.append({
            "shortcode": shortcode,
            "image_url": image_url,
            "description": description,
            "n_comments": n_comments,
            "n_likes": n_likes,
            "timestamp": timestamp,
        })
    return posts
It returns a list of dictionaries representing posts that contain:
- shortcode: you can use it to access the post at https://www.instagram.com/p/<shortcode>/
- image_url 🏞
- description: post text 📝
- n_comments: number of comments 💬
- n_likes: number of likes 👍
- timestamp: when the post was created ⏰
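For example, assuming response_json holds one of the responses described above, you can parse the posts and build a direct link to each one from its shortcode:

posts = parse_posts(response_json=response_json)
for post in posts:
    print(f"https://www.instagram.com/p/{post['shortcode']}/", post["n_likes"], "likes")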
Complete Instagram scraping logic
Now we are ready to put all the pieces together and implement the complete Instagram user profile scraping logic that retrieves all user posts page by page:
import json

import requests

def scrape_ig_profile(username: str, sf_api_key: str) -> List[Dict[str, Any]]:
    params = {
        "api_key": sf_api_key,
        "url": f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
        "headers": json.dumps({"x-ig-app-id": "936619743392459"}),
    }

    def request_json(url, params) -> Dict[str, Any]:
        response = requests.get(url, params=params, timeout=95)
        response.raise_for_status()
        return response.json()

    response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)
    # get user_id from the response to request next pages with posts
    user_id = response_json.get("data", {}).get("user", {}).get("id")
    if not user_id:
        print(f"User {username} not found.")
        return []
    # parse the first batch of posts from the user profile response
    posts = parse_posts(response_json=response_json)
    page_info = parse_page_info(response_json=response_json)
    # get the next page cursor
    end_cursor = page_info.get("end_cursor")
    while end_cursor:
        params = {
            "api_key": sf_api_key,
            "url": f"https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}",
        }
        response_json = request_json(url="https://scraping.narf.ai/api/v1/", params=params)
        posts.extend(parse_posts(response_json=response_json))
        page_info = parse_page_info(response_json=response_json)
        end_cursor = page_info.get("end_cursor")
    return posts
And that’s it. We can use this function to scrape all posts from an arbitrary public Instagram profile. In our case, it will be staredomynasprzedaz:
SF_API_KEY = "YOUR_SCRAPING_FISH_API_KEY"
posts = scrape_ig_profile(username="staredomynasprzedaz", sf_api_key=SF_API_KEY)
With Scraping Fish API, it should take about 2 seconds per page, so a profile with 300 posts (around 13 pages of 24 posts each) will be scraped in about 25 seconds.
Since the returned posts are structured dictionaries in a list, we can create a pandas data frame from them for easier data processing:

import pandas as pd

df = pd.DataFrame(posts)
| | shortcode | image_url | description | n_comments | n_likes | timestamp |
|---|---|---|---|---|---|---|
| 0 | CbrYIabMBXS | link | Różany, Gronowo Elbląskie, woj. warmińsko-mazu... | 14 | 475 | 1648535479 |
| 1 | CbnRiJwsxsc | link | Komorów, Michałowice, woj. mazowieckie \nCena:... | 28 | 761 | 1648397802 |
| 2 | CbhVQU3MtTR | link | Pomorowo, Lidzbark Warmiński, woj. warmińsko-m... | 14 | 526 | 1648198427 |
| 3 | CbakX60Me4r | link | Smyków, Radgoszcz, woj. małopolskie \nCena: 37... | 10 | 264 | 1647971472 |
| 4 | CbXGs-JNK0U | link | Dębowa Łęka, Wschowa, woj. lubuskie\nCena: 389... | 3 | 436 | 1647855253 |
| ... | ... | ... | ... | ... | ... | ... |
Parsing property features from post description
From a structured part of the post description, we can parse more detailed information about properties:
- location (address and province) 📍
- price in PLN 💰
- house size in m² 🏠
- plot area in m² 📐
For a function that implements description parsing based on regular expressions, refer to the notebook accompanying this blog post: https://github.com/mateuszbuda/instagram-scraping-fish/blob/master/instagram-tutorial.ipynb
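If you prefer not to open the notebook, here is a rough sketch of the idea. It is not the notebook's exact implementation: the regular expressions and the "Cena:" label are assumptions based on the description format visible in the tables below, and the notebook extracts house size and plot area in a similar way.

import re
from typing import Dict, Optional

def parse_price(description: str) -> Optional[float]:
    # e.g. "Cena: 175 000 zł" -> 175000.0 (label and number format assumed from the post descriptions)
    match = re.search(r"Cena:\s*([\d\s]+)\s*zł", description)
    return float(re.sub(r"\s", "", match.group(1))) if match else None

def parse_location(description: str) -> Dict[str, Optional[str]]:
    # the first line looks like "Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie"
    first_line = description.splitlines()[0].strip() if description else ""
    match = re.search(r"^(?P<address>.+?),?\s*woj\.\s*(?P<province>.+)$", first_line)
    if not match:
        return {"address": None, "province": None}
    return {
        "address": match.group("address").strip(" ,"),
        "province": match.group("province").strip(),
    }

Functions like these can be applied to the description column to produce the address, province, and price columns used in the examples below.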
From this information, we can simply compute additional derived features, e.g., price per m² of the house and plot:
df["price_per_house_m2"] = df["price"].div(df["house_size"])
df["price_per_plot_m2"] = df["price"].div(df["plot_area"])
Data exploration
Based on the data frame that we created, we can extract some useful stats, e.g., the number of houses in each province and the mean price per m²:
df.groupby("province").agg({"price_per_house_m2": ["mean", "count"]}).sort_values(by=("price_per_house_m2", "mean"))
| province | price_per_house_m2 (mean) | price_per_house_m2 (count) |
|---|---|---|
| opolskie | 1643.662176 | 3 |
| dolnośląskie | 1749.198578 | 40 |
| zachodniopomorskie | 1873.457787 | 22 |
| lubuskie | 1997.868677 | 10 |
| podkarpackie | 2283.578380 | 46 |
| łódzkie | 2444.755891 | 7 |
| podlaskie | 2689.675717 | 46 |
| warmińsko - mazurskie | 2733.333333 | 1 |
| lubelskie | 2781.316515 | 18 |
| małopolskie | 2879.480040 | 47 |
| śląskie | 2969.714365 | 25 |
| świętokrzyskie | 3005.367271 | 2 |
| wielkopolskie | 3084.229161 | 7 |
| warmińsko-mazurskie | 3099.135703 | 29 |
| pomorskie | 3135.444546 | 8 |
| kujawsko-pomorskie | 3885.582011 | 6 |
| mazowieckie | 4167.280252 | 20 |
We can also filter the data to find houses that we might be interested in. The example below searches for houses with a price below 200,000 PLN and a size between 100 m² and 200 m². Here is a link to one of them based on its shortcode: https://www.instagram.com/p/CYv93e8Nvwh/
df[(df["price"] < 200000.0) & (df["house_size"] < 200.0) & (df["house_size"] > 100.0)]
| | shortcode | image_url | description | n_comments | n_likes | timestamp | address | province | price | house_size | plot_area | price_per_house_m2 | price_per_plot_m2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | CYv93e8Nvwh | link | Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie ... | 23 | 701 | 1642247030 | Ponikwa, Bystrzyca Kłodzka | dolnośląskie | 165000.0 | 120.00 | 3882.0 | 1375.000000 | 42.503864 |
| 57 | CXGiAE6sw6y | link | Gadowskie Holendry, Tuliszków, woj. wielkopols... | 7 | 462 | 1638709205 | Gadowskie Holendry, Tuliszków | wielkopolskie | 199000.0 | 111.00 | 3996.0 | 1792.792793 | 49.799800 |
| 122 | CTSiKe4sGsx | link | Gotówka, Ruda - Huta, woj. lubelskie\nCena: 18... | 3 | 189 | 1630522009 | Gotówka, Ruda - Huta | lubelskie | 186000.0 | 120.00 | 1832.0 | 1550.000000 | 101.528384 |
| 149 | CR040yusVyV | link | Leżajsk, woj. podkarpackie \nCena: 175 000 zł... | 26 | 547 | 1627379773 | Leżajsk | podkarpackie | 175000.0 | 108.00 | 912.0 | 1620.370370 | 191.885965 |
| 181 | CQvirvJM0vi | link | Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ... | 4 | 239 | 1625052909 | Rząśnik, Świerzawa | dolnośląskie | 199000.0 | 160.00 | 1900.0 | 1243.750000 | 104.736842 |
| 190 | CQhJJTFsyPt | link | Szymbark, Gorlice, woj. małopolskie \nCena: 19... | 2 | 222 | 1624569758 | Szymbark, Gorlice | małopolskie | 199000.0 | 136.00 | 9574.0 | 1463.235294 | 20.785461 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Conclusion
I hope you now feel more confident in scraping. As you can see, it is super easy to scrape publicly available data with Scraping Fish API, even from websites as challenging as Instagram. In a similar way, you can scrape other user profiles as well as other websites that contain information relevant to you or your business 📈.
Let's talk about your use case 💼
Feel free to reach out using our contact form. We can assist you in integrating Scraping Fish API into your existing scraping workflow or help you set up a scraping system for your use case.