Scraping Instagram

Mateusz Buda
Paweł Kobojek

This blog post is a comprehensive tutorial on scraping public Instagram profile information and posts using the Scraping Fish API. We will scrape posts from a profile that lists old houses for sale to find the best deal.

We prepared an accompanying Python notebook, shared in the GitHub repository instagram-scraping-fish. To run it and actually scrape the data, you will need a Scraping Fish API key, which you can get here: Scraping Fish Request Packs. A starter pack of 1,000 API requests costs only $2 and will let you run this tutorial and play with the API on your own ⛹️. Without a Scraping Fish API key, you are likely to get blocked instantly ⛔️.

It’s important to point out that we are using Instagram's private (undocumented) API for scraping, and the code we share works as of April 2022. If Instagram changes something in the API that we rely on, this tutorial may no longer work and will have to be adjusted. If you experience any problems, feel free to open an issue on GitHub and we will investigate it.

Scraping use case

To test how well Scraping Fish handles Instagram, we will fetch and parse data from posts shared by the public profile Stare domy 🏚 (Old Houses). It is an aggregated listing of old houses for sale in Poland. Post descriptions in this profile provide fairly structured data about each property, including location, price, size, etc.

Instagram profile endpoint

The first endpoint that we need to call for our profile is: https://www.instagram.com/staredomynasprzedaz/?__a=1. It gives us a JSON with:

  • user identifier needed for next requests,
  • general profile information (not used in this blog post but available in response),
  • the first page of posts and
  • next page cursor to retrieve the next batch of posts.
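To make the list above concrete, here is a minimal sketch of where those pieces sit in the response. The dictionary is a hand-written, heavily trimmed stand-in for the real payload: the id and cursor values are placeholders, and only the key names are taken from the parsing functions used later in this post.

```python
from typing import Any, Dict

# Hand-written, trimmed stand-in for the profile response.
# The id and cursor are placeholders, not real Instagram values.
sample_profile_response: Dict[str, Any] = {
    "graphql": {
        "user": {
            "id": "1234567890",
            "edge_owner_to_timeline_media": {
                "page_info": {"end_cursor": "PLACEHOLDER_CURSOR"},
                "edges": [{"node": {"shortcode": "CbrYIabMBXS"}}],
            },
        }
    }
}

user = sample_profile_response["graphql"]["user"]
timeline = user["edge_owner_to_timeline_media"]

user_id = user["id"]                              # needed for the next requests
first_page_posts = timeline["edges"]              # the first page of posts
end_cursor = timeline["page_info"]["end_cursor"]  # cursor for the next batch

print(user_id, len(first_page_posts), end_cursor)
```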

Pagination with Instagram GraphQL API

To obtain the next pages of posts, we will use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}

The query_id parameter is fixed to 17888483320059182. You do not need to change it: the value stays the same regardless of the Instagram profile and of the page of posts being requested.

You can adjust the value of the first query parameter to retrieve more than the 24 posts per page used in this tutorial and reduce the total number of requests. Keep in mind, however, that a value that is too large might look suspicious and result in an Instagram login prompt instead of valid JSON.
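As a rough sketch of how the page size could be made configurable, the helper below assembles the GraphQL URL from its parts. The name build_posts_url and the placeholder id and cursor are our own illustration, not part of the notebook. It uses urlencode, so any special characters in the cursor are percent-encoded, whereas the plain f-string used later in this post inserts the cursor unchanged.

```python
from urllib.parse import urlencode

QUERY_ID = "17888483320059182"  # fixed value used throughout this tutorial


def build_posts_url(user_id: str, end_cursor: str, first: int = 24) -> str:
    """Build the GraphQL URL for one page of posts (hypothetical helper)."""
    params = {"query_id": QUERY_ID, "id": user_id, "first": first, "after": end_cursor}
    return f"https://instagram.com/graphql/query/?{urlencode(params)}"


# Placeholder id and cursor; 48 posts per page halves the number of requests.
print(build_posts_url(user_id="1234567890", end_cursor="PLACEHOLDER_CURSOR", first=48))
```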

To get the next page information from the JSON response, we can use the following function:

from typing import Any, Dict, List, Optional, Union


def parse_page_info(response_json: Dict[str, Any]) -> Dict[str, Union[Optional[bool], Optional[str]]]:
    # profile responses use the "graphql" top-level key, GraphQL query responses use "data"
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    page_info = user_data.get("edge_owner_to_timeline_media", {}).get("page_info", {})
    return page_info

Parsing posts from JSON response

Now that we have the endpoints for the user profile and for the pages of posts figured out, we can write a function that parses the JSON response to retrieve the post information we need.

The response structure from both endpoints described above is roughly the same, but the top-level object key is graphql for the profile response and data for the GraphQL query. We already accounted for this in the next page info parsing code.

A function that retrieves basic post information is all about accessing the relevant keys in the response JSON:

def parse_posts(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    top_level_key = "graphql" if "graphql" in response_json else "data"
    user_data = response_json[top_level_key].get("user", {})
    post_edges = user_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
    posts = []
    for node in post_edges:
        post_json = node.get("node", {})
        shortcode = post_json.get("shortcode")
        image_url = post_json.get("display_url")
        caption_edges = post_json.get("edge_media_to_caption", {}).get("edges", [])
        description = caption_edges[0].get("node", {}).get("text") if len(caption_edges) > 0 else None
        n_comments = post_json.get("edge_media_to_comment", {}).get("count")
        likes_key = "edge_liked_by" if "edge_liked_by" in post_json else "edge_media_preview_like"
        n_likes = post_json.get(likes_key, {}).get("count")
        timestamp = post_json.get("taken_at_timestamp")
        posts.append({
            "shortcode": shortcode,
            "image_url": image_url,
            "description": description,
            "n_comments": n_comments,
            "n_likes": n_likes,
            "timestamp": timestamp,
        })
    return posts

It returns a list of dictionaries representing posts that contain:

  • shortcode: you can use it to access the post at https://www.instagram.com/p/<shortcode>/
  • image_url 🏞
  • description: post text 📝
  • n_comments: number of comments 💬
  • n_likes: number of likes 👍
  • timestamp: when the post was created ⏰
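As a small illustration of how these fields can be used, the sketch below builds the post URL from the shortcode and converts the timestamp to a datetime. It assumes the timestamp is a Unix epoch value in seconds, which is consistent with the values shown later in this post; describe_post is a hypothetical helper and not part of the notebook.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def describe_post(post: Dict[str, Any]) -> Dict[str, Any]:
    """Add a post URL and a UTC datetime to a parsed post (hypothetical helper)."""
    return {
        **post,
        "url": f"https://www.instagram.com/p/{post['shortcode']}/",
        # assumption: the timestamp is expressed in seconds since the Unix epoch
        "created_at": datetime.fromtimestamp(post["timestamp"], tz=timezone.utc),
    }


post = describe_post({"shortcode": "CbrYIabMBXS", "timestamp": 1648535479})
print(post["url"], post["created_at"].date())
```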

Complete Instagram scraping logic

Now we are ready to put all the pieces together and implement the complete scraping logic for an Instagram user profile, which retrieves all user posts page by page:

import requests
from urllib.parse import quote_plus


def scrape_ig_profile(username: str, url_prefix: str = "") -> List[Dict[str, Any]]:
    # url in Scraping Fish API must be encoded: https://scrapingfish.com/docs/scraping-urls-with-query-params
    ig_profile_url = quote_plus(f"https://www.instagram.com/{username}/?__a=1")

    def request_json(url: str) -> Dict[str, Any]:
        response = requests.get(url)
        response.raise_for_status()
        return response.json()

    response_json = request_json(f"{url_prefix}{ig_profile_url}")

    # get user_id from response to request next pages with posts
    user_id = response_json.get("graphql", {}).get("user", {}).get("id")
    if not user_id:
        print(f"User {username} not found.")
        return []
    # parse the first batch of posts from user profile response
    posts = parse_posts(response_json=response_json)
    page_info = parse_page_info(response_json=response_json)
    # get next page cursor
    end_cursor = page_info.get("end_cursor")
    while end_cursor:
        posts_url = quote_plus(
            f"https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}"
        )
        response_json = request_json(f"{url_prefix}{posts_url}")
        posts.extend(parse_posts(response_json=response_json))
        page_info = parse_page_info(response_json=response_json)
        end_cursor = page_info.get("end_cursor")
    return posts

And that’s it. We can use this function to scrape all posts from an arbitrary public Instagram profile. In our case, it will be staredomynasprzedaz:

url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&url="
posts = scrape_ig_profile(username="staredomynasprzedaz", url_prefix=url_prefix)
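The url_prefix ends with &url=, so the Instagram URL travels as the value of a query parameter and has to be percent-encoded. The sketch below shows what quote_plus does to the GraphQL URL. The API key, user id and cursor are placeholders chosen for illustration.

```python
from urllib.parse import quote_plus, unquote_plus

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key
url_prefix = f"https://scraping.narf.ai/api/v1/?api_key={API_KEY}&url="

# Placeholder user id and cursor.
target_url = (
    "https://instagram.com/graphql/query/"
    "?query_id=17888483320059182&id=1234567890&first=24&after=PLACEHOLDER_CURSOR"
)
encoded_target = quote_plus(target_url)

# Without encoding, "&id=...", "&first=..." and "&after=..." would be parsed as
# parameters of the Scraping Fish request instead of the Instagram URL.
request_url = f"{url_prefix}{encoded_target}"

print(request_url)
print(unquote_plus(encoded_target) == target_url)
```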

With Scraping Fish API, it should take about 2 seconds per page, so a profile with 300 posts will be scraped in about 25 seconds.

Since the returned posts are structured dictionaries in a list, we can create a pandas data frame from them for easier data processing:
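The estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes every request, including the first profile request, returns 24 posts and takes the 2 seconds quoted above. The function name is ours.

```python
import math


def estimate_scrape_seconds(n_posts: int, posts_per_page: int = 24, seconds_per_page: float = 2.0) -> float:
    """Rough scraping time: number of pages times the time per page."""
    n_pages = math.ceil(n_posts / posts_per_page)
    return n_pages * seconds_per_page


# 300 posts at 24 per page means 13 requests.
print(estimate_scrape_seconds(300))
```

For 300 posts this gives 13 pages and 26 seconds, in line with the rough figure above.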

import pandas as pd

df = pd.DataFrame(posts)

     shortcode    image_url                                          description                                        n_comments  n_likes   timestamp
0    CbrYIabMBXS  https://scontent-frt3-1.cdninstagram.com/v/t51...  Różany, Gronowo Elbląskie, woj. warmińsko-mazu...          14      475  1648535479
1    CbnRiJwsxsc  https://scontent-frt3-2.cdninstagram.com/v/t51...  Komorów, Michałowice, woj. mazowieckie \nCena:...          28      761  1648397802
2    CbhVQU3MtTR  https://scontent-frx5-2.cdninstagram.com/v/t51...  Pomorowo, Lidzbark Warmiński, woj. warmińsko-m...          14      526  1648198427
3    CbakX60Me4r  https://scontent-frt3-2.cdninstagram.com/v/t51...  Smyków, Radgoszcz, woj. małopolskie \nCena: 37...          10      264  1647971472
4    CbXGs-JNK0U  https://scontent-frt3-1.cdninstagram.com/v/t51...  Dębowa Łęka, Wschowa, woj. lubuskie\nCena: 389...           3      436  1647855253
...

Parsing property features from post description

From a structured part of the post description, we can parse more detailed information about properties:

  • location (address and province) 📍
  • price in PLN 💰
  • house size in m² 🏠
  • plot area in m² 📐

For a function that implements description parsing based on regular expressions, refer to the notebook accompanying this blog post: https://github.com/mateuszbuda/instagram-scraping-fish/blob/master/instagram-tutorial.ipynb
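To give a flavor of the approach without reproducing the notebook, here is a minimal sketch that extracts only the location and the price. It is not the notebook's implementation. It assumes the caption format visible in the sample descriptions in this post: a first line of the form "<address>, woj. <province>" followed by a line such as "Cena: 175 000 zł". House size and plot area are left out because their labels do not appear in the excerpts shown here.

```python
import re
from typing import Any, Dict, Optional

# Assumed caption format: "<address>, woj. <province>" on the first line.
LOCATION_RE = re.compile(r"\s*(?P<address>.+?),\s*woj\.\s*(?P<province>[^\n]*?)\s*(?:\n|$)")
# Assumed price format: "Cena: 175 000 zł" (digits grouped with whitespace).
PRICE_RE = re.compile(r"Cena:\s*(?P<price>\d[\d\s]*)zł")


def parse_location_and_price(description: Optional[str]) -> Dict[str, Any]:
    """Extract address, province and price in PLN from a caption (hypothetical helper)."""
    result: Dict[str, Any] = {"address": None, "province": None, "price": None}
    if not description:
        return result
    location = LOCATION_RE.match(description)
    if location:
        result["address"] = location.group("address").strip()
        result["province"] = location.group("province")
    price = PRICE_RE.search(description)
    if price:
        # drop the whitespace used as a thousands separator before converting
        result["price"] = float(re.sub(r"\s", "", price.group("price")))
    return result


print(parse_location_and_price("Leżajsk, woj. podkarpackie \nCena: 175 000 zł\n"))
```

Fields that cannot be found stay None, so a caption that deviates from the assumed format does not raise an exception.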

From this information, we can simply compute additional derived features, e.g., price per m² of the house and plot:

df["price_per_house_m2"] = df["price"].div(df["house_size"])
df["price_per_plot_m2"] = df["price"].div(df["plot_area"])
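As a worked example of the same computation outside pandas, the snippet below uses the numbers of one listing from this profile (Ponikwa, Bystrzyca Kłodzka: 165,000 PLN, a 120 m² house and a 3,882 m² plot).

```python
price = 165000.0    # PLN
house_size = 120.0  # m²
plot_area = 3882.0  # m²

price_per_house_m2 = price / house_size
price_per_plot_m2 = price / plot_area

print(round(price_per_house_m2, 6), round(price_per_plot_m2, 6))
```

In the data frame version, a row with a missing price, house size or plot area yields NaN in the corresponding derived column.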

Data exploration

Based on the data frame that we created, we can extract some useful stats, e.g., the number of houses in each province and the mean price per m²:

df.groupby("province").agg({"price_per_house_m2": ["mean", "count"]}).sort_values(by=("price_per_house_m2", "mean"))

                       price_per_house_m2
province                      mean  count
opolskie               1643.662176      3
dolnośląskie           1749.198578     40
zachodniopomorskie     1873.457787     22
lubuskie               1997.868677     10
podkarpackie           2283.578380     46
łódzkie                2444.755891      7
podlaskie              2689.675717     46
warmińsko - mazurskie  2733.333333      1
lubelskie              2781.316515     18
małopolskie            2879.480040     47
śląskie                2969.714365     25
świętokrzyskie         3005.367271      2
wielkopolskie          3084.229161      7
warmińsko-mazurskie    3099.135703     29
pomorskie              3135.444546      8
kujawsko-pomorskie     3885.582011      6
mazowieckie            4167.280252     20
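Note that the grouping above lists "warmińsko - mazurskie" (1 house) separately from "warmińsko-mazurskie" (29 houses), which suggests that one caption spells the province with spaces around the hyphen. A possible fix, sketched below with a hypothetical normalize_province helper, is to normalize the names before grouping. Lowercasing is an extra assumption on our side.

```python
import re


def normalize_province(province: str) -> str:
    """Remove whitespace around hyphens and lowercase the name (hypothetical helper)."""
    return re.sub(r"\s*-\s*", "-", province.strip()).lower()


print(normalize_province("warmińsko - mazurskie"))
```

With a data frame, this could be applied as df["province"] = df["province"].map(normalize_province) before calling groupby, provided the column has no missing values.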

We can also filter the data to find houses that we might be interested in. The example below searches for houses with a price below 200,000 PLN and a size between 100 m² and 200 m². Here is a link to one of them based on its shortcode: https://www.instagram.com/p/CYv93e8Nvwh/

df[(df["price"] < 200000.0) & (df["house_size"] < 200.0) & (df["house_size"] > 100.0)]

     shortcode    image_url                                          description                                        n_comments  n_likes   timestamp  address                        province          price  house_size  plot_area  price_per_house_m2  price_per_plot_m2
31   CYv93e8Nvwh  https://scontent-frt3-1.cdninstagram.com/v/t51...  Ponikwa, Bystrzyca Kłodzka, woj. dolnośląskie ...          23      701  1642247030  Ponikwa, Bystrzyca Kłodzka     dolnośląskie   165000.0      120.00     3882.0         1375.000000          42.503864
57   CXGiAE6sw6y  https://scontent-frx5-1.cdninstagram.com/v/t51...  Gadowskie Holendry, Tuliszków, woj. wielkopols...           7      462  1638709205  Gadowskie Holendry, Tuliszków  wielkopolskie  199000.0      111.00     3996.0         1792.792793          49.799800
122  CTSiKe4sGsx  https://scontent-frx5-1.cdninstagram.com/v/t51...  Gotówka, Ruda - Huta, woj. lubelskie\nCena: 18...           3      189  1630522009  Gotówka, Ruda - Huta           lubelskie      186000.0      120.00     1832.0         1550.000000         101.528384
149  CR040yusVyV  https://scontent-vie1-1.cdninstagram.com/v/t51...  Leżajsk, woj. podkarpackie \nCena: 175 000 zł\...          26      547  1627379773  Leżajsk                        podkarpackie   175000.0      108.00      912.0         1620.370370         191.885965
181  CQvirvJM0vi  https://scontent-vie1-1.cdninstagram.com/v/t51...  Rząśnik, Świerzawa, woj. dolnośląskie \nCena: ...           4      239  1625052909  Rząśnik, Świerzawa             dolnośląskie   199000.0      160.00     1900.0         1243.750000         104.736842
190  CQhJJTFsyPt  https://scontent-vie1-1.cdninstagram.com/v/t51...  Szymbark, Gorlice, woj. małopolskie \nCena: 19...           2      222  1624569758  Szymbark, Gorlice              małopolskie    199000.0      136.00     9574.0         1463.235294          20.785461
...

Conclusion

We hope you now feel more confident in scraping. As you can see, with Scraping Fish API it is easy to scrape publicly available data even from websites as challenging as Instagram. In a similar way, you can scrape other user profiles as well as other websites that contain relevant information for you or your business 📈.

Let's talk about your use case 💼

Feel free to reach out using our contact form. We can assist you in integrating Scraping Fish API into your existing scraping workflow or help you set up a scraping system for your use case.