How to Scrape Amazon.com Reviews using Python?

We search for lots of things on the internet every day: to buy something, or to compare one product with another. And how do we decide that one particular product is better than another? We go straight to the reviews and see how many stars and how much positive feedback the product has received, right!!
In this blog, we’re going to scrape reviews from amazon.com. Not just the review text, but also how many stars each review got, who posted it, and more.
We will save the data in a CSV spreadsheet. Here are the data fields we are going to extract:
1. Review Title
2. Rating
3. Reviewer Name
4. Review Description/Review Content
5. Helpful Count
So let’s get started.
We prefer Scrapy – a Python framework for large-scale web scraping. Along with it, a few other packages are required to scrape Amazon reviews:
  • Requests – to send requests to a URL
  • pandas – to export data to CSV
  • pymysql – to connect to a MySQL server and store data there
  • math – for mathematical operations
As you know, you can always install such packages with pip or conda, like below.
pip install scrapy
OR
conda install -c conda-forge scrapy
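The remaining packages install the same way (math ships with Python’s standard library, so it needs no install; sqlalchemy is what pandas uses under the hood to write to MySQL, as we’ll see in the final script):
pip install requests pandas pymysql sqlalchemy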

Let’s define the Start URL to extract reviews

Let’s first see what it’s like to scrape reviews for one product.
We are taking URL: https://www.amazon.com/dp/B07N9255CG
It will look like the image below.
Now if we go to the review section, it’ll look like the image below (the reviews and names you see may differ).
But if you closely inspect the requests going on in the background while the page loads, and play a little with the next and previous review pages, you might notice a POST request that loads all of the review content on the page.
Here we’ll look at the payload and headers required for a successful response. If you have properly inspected all the pages, you’ll see how moving between pages is reflected in the requests. (A minimal sketch that replicates this request follows the two listings below.)
NEXT PAGE --- PAGE 2
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_next_2
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest
Payload:
reviewerType: all_reviews
pageNumber: 2
shouldAppend: undefined
reftag: cm_cr_arp_d_paging_btm_next_2
pageSize: 10
asin: B07N9255CG
PREVIOUS PAGE --- PAGE 1
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_prev_1
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
referer: https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest
Payload:
reviewerType: all_reviews
pageNumber: 1
shouldAppend: undefined
reftag: cm_cr_getr_d_paging_btm_prev_1
pageSize: 10
asin: B07N9255CG
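To verify these captured values before writing the full script, a minimal sketch like this (using requests, with the headers trimmed to the essentials) should get a 200 response back. Amazon may still answer with a captcha page, so treat it as a starting point, not a guarantee:
import requests

url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_next_2'
head = {
    'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
payload = {
    'reviewerType': 'all_reviews',
    'pageNumber': 2,
    'shouldAppend': 'undefined',
    'reftag': 'cm_cr_arp_d_paging_btm_next_2',
    'pageSize': 10,
    'asin': 'B07N9255CG',
}
# Form-encoded body, exactly as captured in the browser's network tab.
res = requests.post(url, headers=head, data=payload)
print(res.status_code, len(res.text))  # expect 200 and a non-empty body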

The Main Part: Code/Script

There are two different ways to build the script:
1. Create a whole Scrapy project
2. Just create a bunch of files in a folder to keep the project small
In the last tutorial we showed you a whole Scrapy project and the details of creating and modifying it. This time we’re going the most narrowed-down way possible: just a bunch of files, and all the Amazon reviews will be right there!!
As we are using Scrapy and Python to extract the reviews, it’s easy, or rather convenient, to take the XPath route.
The most important part of XPath is capturing a pattern. Copying the same XPath from the browser’s inspect window and pasting it is pretty simple, but it’s very old school and not always efficient.
Here’s what we’re going to do: we’ll observe the XPath for the same field, say “Review Title”, across reviews and see how it forms a pattern we can use to narrow the XPath down.
There are two examples of a similar XPath below.
(Review-1)
(Review-2)
As you can see, the tags containing the “Review Title” information share similar attributes.
Hence, the resulting XPath for Review Title will be:
  • //a[contains(@class,"review-title-content")]/span/text()
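A quick way to sanity-check the pattern is to load the reviews page and run the XPath through scrapy’s Selector. A minimal sketch (Amazon may answer with a captcha instead of the page):
import requests
from scrapy.selector import Selector

html = requests.get(
    'https://www.amazon.com/product-reviews/B07N9255CG',
    headers={'user-agent': 'Mozilla/5.0'},
).text
sel = Selector(text=html)
# Print the first few review titles the pattern matches.
print(sel.xpath('//a[contains(@class,"review-title-content")]/span/text()').extract()[:3])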
Just like this we’ve listed all xpaths for all the fields we are going to scrape.
  • Review Title : //a[contains(@class,"review-title-content")]/span/text()
  • Rating : //a[contains(@title,"out of 5 stars")]/@title
  • Reviewer Name : //div[@id="cm_cr-review_list"]//span[@class="a-profile-name"]/text()
  • Review Description/Review Content : //span[contains(@class,"review-text-content")]/span/text()
  • Helpful Count : //span[contains(@class,"cr-vote-text")]/text()
Obviously, for some XPaths, stripping and joining the extracted results is important in order to get clean data. Also, don’t forget to remove extra whitespace; a small helper for that is sketched below.
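For example, a tiny helper like this (hypothetical, not part of the final script) normalizes a list of extracted text fragments:
# Join xpath text fragments and collapse runs of whitespace into single spaces.
def clean(parts):
    return ' '.join(' '.join(parts).split())

print(clean(['Great camera,\n', '  works well with Alexa  ']))
# -> 'Great camera, works well with Alexa'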
Alright,
Now that we have seen how to move across the pages and how to extract information from them, it’s time to assemble it all!!
Below is the whole code for extracting all the reviews for one product!!!
import math
import requests
import pandas as pd
from scrapy.http import HtmlResponse
from sqlalchemy import create_engine

# pandas' to_sql() needs an SQLAlchemy engine for MySQL; pymysql is the
# driver behind the mysql+pymysql:// URL. Adjust credentials to your server.
con = create_engine('mysql+pymysql://root:password@localhost/database')

raw_dataframe = []

res = requests.get('https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG?ie=UTF8&reviewerType=all_reviews')
response = HtmlResponse(url=res.url, body=res.content)

product_name = response.xpath('//h1/a/text()').extract_first(default='').strip()
# "Showing 1-10 of 1,234 reviews" -> "1,234" -> "1234"
total_reviews = response.xpath('//span[contains(text(),"Showing")]/text()').extract_first(default='').strip().split()[-2].replace(',', '')
total_pages = math.ceil(int(total_reviews) / 10)

# Page 1 is the product-reviews page itself; the AJAX endpoint serves pages 2+.
for page in range(2, total_pages + 1):
    url = f'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_next_{page}'
    head = {
        'accept': 'text/html,*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
        'origin': 'https://www.amazon.com',
        'referer': response.url,
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
    }
    payload = {
        'reviewerType': 'all_reviews',
        'pageNumber': page,
        'shouldAppend': 'undefined',
        'reftag': f'cm_cr_arp_d_paging_btm_next_{page}',
        'pageSize': 10,
        'asin': 'B07N9255CG',
    }
    # The endpoint expects a form-encoded body, so pass the dict directly.
    res = requests.post(url, headers=head, data=payload)
    response = HtmlResponse(url=res.url, body=res.content)

    for part in response.xpath('//div[contains(@class,"a-section review")]'):
        review_title = part.xpath('.//a[contains(@class,"review-title-content")]/span/text()').extract_first(default='').strip()
        rating = part.xpath('.//a[contains(@title,"out of 5 stars")]/@title').extract_first(default='0').strip().split()[0]
        reviewer_name = part.xpath('.//span[@class="a-profile-name"]/text()').extract_first(default='').strip()
        description = ' '.join(part.xpath('.//span[contains(@class,"review-text-content")]/span/text()').extract()).strip()
        helpful_text = part.xpath('.//span[contains(@class,"cr-vote-text")]/text()').extract_first(default='').strip()
        helpful_count = helpful_text.split()[0] if helpful_text else '0'
        raw_dataframe.append([product_name, review_title, rating, reviewer_name, description, helpful_count])

df = pd.DataFrame(raw_dataframe, columns=['Product Name', 'Review Title', 'Review Rating', 'Reviewer Name', 'Description', 'Helpful Count'])

# inserting into MySQL table
df.to_sql('review_table', if_exists='append', con=con)

# exporting CSV
df.to_csv('amazon_reviews.csv', index=False)

Pressure Points while scraping Amazon Reviews

  • The whole process looks very easy to implement, but there can be issues while executing it, such as response failures and captchas. To bypass these, you should always keep some proxies or VPNs handy, so that the process runs a whole lot more smoothly.
  • Also, websites sometimes change their structure. If your extraction is going to run for a long time, you should always keep error logs in your script (an error alert would also work), so that you’re aware the moment the structure changes. A sketch of both ideas follows this list.
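A hedged sketch of both ideas, assuming a placeholder proxy URL and Python’s standard logging module:
import logging
import requests

logging.basicConfig(filename='scrape_errors.log', level=logging.WARNING)

# Replace with your own proxy endpoint; this one is a placeholder.
proxies = {'https': 'http://user:pass@proxy.example.com:8000'}

def fetch(url, **kwargs):
    try:
        res = requests.get(url, proxies=proxies, timeout=30, **kwargs)
        # A captcha page usually means your parsing will silently break too.
        if res.status_code != 200 or 'captcha' in res.text.lower():
            logging.warning('Blocked or unexpected response (%s) for %s', res.status_code, url)
        return res
    except requests.RequestException as exc:
        logging.error('Request failed for %s: %s', url, exc)
        raise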

Conclusion

Any kind of review scraping is very helpful. Why? Consider the cases below.
  • To monitor customers’ views on your products if you are a seller on the same website
  • To monitor third-party sellers
  • To create a dataset for research, whether for academic or industrial purposes

Visit Us: www.xbyte.io 
