How to Scrape Yellow Pages using Python & LXML?
In this scraping tutorial, we will show you how to scrape YellowPages.com using Python and LXML, extracting business details for a given category and city from Yellow Pages.
To demonstrate the scraper, we will search Yellow Pages for restaurants in a city and then extract the business details from the first page of results.
What Data Do We Extract?
Here are the data fields that we will scrape:
- Rankings
- Business Name
- Business Pages
- Phone Numbers
- Website
- Street’s Name
- Category
- Ratings
- Region
- Locality
- URL
- Zip Code
Here is a screenshot of the information we will scrape from Yellow Pages.
How to Find Data?
Before we start building the Yellow Pages scraper, we need to find where the data sits in the page's HTML tags. To do that, you should understand how the page's content is laid out in HTML.
If you already know Python and HTML, this will be easier; you don't need advanced programming skills for most of this tutorial.
Let's examine the HTML of the web page and find where the data is located. Here is what we are going to do:
- Find the HTML tags that enclose the list of links we need data from
- Extract the links from those tags and scrape the data
Reviewing the HTML
Why do we need to inspect the elements? To locate any element on the web page using an XPath expression.
Open a web browser (we have used Google Chrome here) and visit https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston.
Then right-click on the page and select Inspect Element. A toolbar will open, showing the HTML content of the web page in a well-structured format.
The image here shows the data we need to scrape, inside a DIV tag. If you look closely, it has an attribute named 'class' with the value 'result'. That DIV contains the data fields we have to scrape.
Let's find the HTML tag(s) that hold the links we need to scrape. You can right-click on a link title in the browser and choose Inspect Element. This will open the HTML content and highlight the tag that holds the data you right-clicked on. See the image below for the data fields, well-structured.
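Before writing the full scraper, it is worth confirming that an XPath expression actually matches these listing blocks. Below is a minimal sketch of such a check; the class names 'search-results organic' and 'v-card' are the ones used in the full code later in this tutorial, and Yellow Pages may change its markup (or block plain requests) at any time:

import requests
from lxml import html

url = "https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston"
response = requests.get(url)
parser = html.fromstring(response.text)

# Count the listing containers matched by the XPath; a non-zero count
# means the expression still lines up with the page's markup.
listings = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(listings))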
How to Set Up Your Computer for Web Scraping Development?
We will use Python 3 for this web scraping tutorial; the code will not run if you use Python 2.7. To begin, your system needs Python 3 and PIP installed.
Most UNIX operating systems, including Mac OS and Linux, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.
To verify your Python version, open a terminal (on Mac OS and Linux) or Command Prompt (on Windows) and type:
python --version
Then press the Enter key. If the output looks like Python 3.x.x, you have Python 3 installed; if it looks like Python 2.x.x, you have Python 2. If it prints an error instead, you probably don't have Python installed at all.
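Since the script also needs PIP, you can check for it the same way; depending on how Python was installed, the command may be pip or pip3:

pip3 --version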
Installing Python 3 with Pip
You can follow this guide to install Python 3 on Linux –
http://docs.python-guide.org/en/latest/starting/install3/linux/
If you are a Mac user, you can follow the guide at – http://docs.python-guide.org/en/latest/starting/install3/osx/
Package Installation
Use Python Requests to make requests and download the HTML content of the pages (see http://docs.python-requests.org/en/master/user/install/).
Use Python LXML to parse the HTML tree structure with XPaths (learn more at http://lxml.de/installation.html).
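The code below also uses the unicodecsv package to write the CSV output. Assuming pip points at your Python 3 installation (it may be pip3 on your system), you can install all three packages at once:

pip install requests lxml unicodecsv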
The Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse

import requests
import unicodecsv as csv
from lxml import html


def parse_listing(keyword, place):
    """
    Process a Yellow Pages listing page.
    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.yellowpages.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    }

    # XPaths for the listing containers and the fields inside each one
    XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
    XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
    XPATH_STREET = ".//div[@class='street-address']//text()"
    XPATH_LOCALITY = ".//div[@class='locality']//text()"
    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
    XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

    # Retry a few times, since a listing page occasionally fails to load
    for retry in range(10):
        try:
            # verify=False skips SSL certificate verification
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # make all links on the page absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)

                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []

                for results in listings:
                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)

                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ','.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
                    region = ''.join(raw_region).strip() if raw_region else None
                    zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None

                    # The locality text usually looks like "Boston, MA 02116";
                    # split it to recover the city, region and zip code when
                    # the itemprop fields above are missing.
                    if locality and ',' in locality:
                        locality, _, locality_parts = locality.partition(',')
                        parts = locality_parts.split()
                        if len(parts) == 2:
                            region = region or parts[0]
                            zipcode = zipcode or parts[1]

                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url
                    }
                    scraped_results.append(business_details)
                return scraped_results
            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # no need to retry for a non-existent page
                break
            else:
                print("Failed to process page, retrying")
        except requests.exceptions.RequestException:
            print("Failed to process page, retrying")
    return []


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')
    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        # unicodecsv writes bytes, so the file is opened in binary mode
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category',
                          'website', 'rating', 'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)
Save the code into a file (for example, yellow_pages.py). To see how to run it, type the script name followed by -h in a terminal or command prompt:
usage: yellow_pages.py [-h] keyword place

positional arguments:
  keyword     Search Keyword
  place       Place Name

optional arguments:
  -h, --help  show this help message and exit
The positional argument keyword is the business category, and place is the location to search for businesses. As an example, let's get the business details of restaurants in Boston, MA. Since the place name contains a comma and a space, it must be quoted; the script would be run like this:
python3 yellow_pages.py restaurants "Boston, MA"
You should see a file named restaurants-Boston, MA-yellowpages-scraped-data.csv in the same folder as the script, containing the extracted data. Here is some sample business data scraped from YellowPages.com.
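As a quick sanity check, here is a minimal sketch that reads the generated CSV back and prints the first few rows; the filename assumed below is the one produced by the example run above:

import csv

# filename produced by the example run above
with open('restaurants-Boston, MA-yellowpages-scraped-data.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for i, row in enumerate(reader):
        if i >= 3:
            break
        print(row['rank'], row['business_name'], row['telephone'])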
For our scraping services and a free quote, contact us through the link below.
https://www.xbyte.io/contact-us/
Some Limitations
This code should be able to scrape business details for most locations. However, if you need professional assistance with scraping websites, feel free to contact us.