How to Scrape Yellow Pages using Python & LXML?

In this scraping tutorial, we will show you how to scrape YellowPages.com using Python and LXML to extract business information based on a category and city.

To demonstrate this Yellow Pages scraper, we will search for restaurants in a city and then extract the business information from the first page of results.

What Data Do We Extract?


Here are the data fields that we will scrape:

  • Rankings
  • Business Name
  • Business Pages
  • Phone Numbers
  • Website
  • Street Name
  • Category
  • Ratings
  • Region
  • Locality
  • URL
  • Zip Code

Here is a screenshot of the information we will scrape from Yellow Pages.

[Screenshot: the data fields highlighted on a Yellow Pages listing]
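
Concretely, the scraper we build below collects each listing into a Python dictionary with one key per field above. Here is the shape of one record; every value shown is a made-up example, not real scraped data:

business_details = {
    'business_name': "Joe's Diner",   # hypothetical example values throughout
    'telephone': '(555) 123-4567',
    'business_page': 'https://www.yellowpages.com/boston-ma/mip/joes-diner',
    'rank': '1',
    'category': 'Restaurants,Diners',
    'website': 'http://www.joesdiner.example',
    'rating': '4.5',
    'street': '1 Main St',
    'locality': 'Boston',
    'region': 'MA',
    'zipcode': '02101',
    'listing_url': 'https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston'
}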

How to Find Data?

Before we start building the Yellow Pages scraper, we need to find where the data sits in the page's HTML tags. To do that, you should understand the HTML structure of the page's content.

If you already know Python and HTML, this will be easier for you. Advanced programming skills are not required for most of this tutorial.

Let's examine the HTML of the web page and find where the data is located. Here is what we are going to do:

Find the HTML tags that enclose the list of links we need data from

Follow those links and scrape the data

Reviewing the HTML


Why do we need to inspect the elements? – To locate any element on a web page using an XPath expression.
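
If XPath is new to you, here is a minimal, self-contained sketch of how lxml evaluates an XPath expression against HTML. The sample markup is invented for illustration; it only mimics the structure we will find on Yellow Pages:

from lxml import html

# A tiny invented HTML sample that mimics a Yellow Pages listing
snippet = """
<div class="result">
  <a class="business-name" href="/biz/1">Joe's Diner</a>
  <div class="phones phone primary">(555) 123-4567</div>
</div>
"""

tree = html.fromstring(snippet)
# Select the text of every <a> tag whose class is 'business-name'
print(tree.xpath("//a[@class='business-name']/text()"))          # ["Joe's Diner"]
print(tree.xpath("//div[@class='phones phone primary']/text()"))  # ['(555) 123-4567']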

Open a web browser (we have used Google Chrome here) and visit https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston.

Then right-click on a listing on the page and select 'Inspect'. A panel will open showing the HTML content of this web page in a well-structured format.

[Screenshot: inspecting a listing's DIV tag in the browser developer tools]

The image above shows the data we need to scrape enclosed in a DIV tag. If you look closely, it has a 'class' attribute with the value 'result'. This DIV contains the data fields we have to scrape.

Next, let's find the HTML tag(s) that contain the links we need to scrape. Right-click a link title in the browser and choose 'Inspect'. The developer tools will open and highlight the tag that holds the data you right-clicked on. See the image below for the data fields inside that DIV.

[Screenshot: the data fields inside the listing's DIV]
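
Before writing the full scraper, you can confirm this structure from Python. The short sketch below downloads the search page and counts the listing blocks using the same XPath the scraper will use (the class names 'search-results organic' and 'v-card' reflect the page markup at the time of writing and may change):

import requests
from lxml import html

url = "https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response = requests.get(url, headers=headers)
parser = html.fromstring(response.text)

# Each business listing sits in a div with class 'v-card'
listings = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(listings), "listings found on the first page")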

How to Set Your Computer for Web Scraping Development?

We will use Python 3 for this web scraping tutorial; the code will not run if you use Python 2.7. To begin, your system needs Python 3 and PIP installed.

Most UNIX-like operating systems, including macOS and Linux, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.

To check your Python version, open a terminal (on macOS and Linux) or Command Prompt (on Windows) and type:

python --version 

Then press the Enter key. If your output looks like Python 3.x.x, you have Python 3 installed; if it is Python 2.x.x, you have Python 2. If it prints an error, you probably don't have Python installed.
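
On systems where both versions coexist, the plain python command often points to Python 2; in that case, check for Python 3 explicitly:

python3 --version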

Installing Python 3 with Pip

You can use this guide for installing Python 3 for Linux –
http://docs.python-guide.org/en/latest/starting/install3/linux/

If you are a Mac user, you can follow the guide at – http://docs.python-guide.org/en/latest/starting/install3/osx/

Package Installation

Use Python Requests to make requests and download the HTML content of the pages (installation guide: http://docs.python-requests.org/en/master/user/install/).

Use Python LXML to parse the HTML tree structure with XPaths (learn more at http://lxml.de/installation.html).
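
The script below also uses the unicodecsv package to write the CSV output. Assuming PIP is set up, you can install all three packages in one step:

pip3 install requests lxml unicodecsv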

The Code

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import html
import unicodecsv as csv
import argparse


def parse_listing(keyword, place):
    """
    Process a Yellow Pages listing page.
    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)

    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
               }
    # Adding retries
    for retry in range(10):
        try:
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # making links absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)

                XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []

                for results in listings:
                    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                    XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
                    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
                    XPATH_ADDRESS = ".//div[@class='info']//div//p[@itemprop='address']"
                    XPATH_STREET = ".//div[@class='street-address']//text()"
                    XPATH_LOCALITY = ".//div[@class='locality']//text()"
                    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
                    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
                    XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    # address = results.xpath(XPATH_ADDRESS)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)

                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ','.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
                    # Split a locality such as "Boston, MA 02101" into city, region and zip code,
                    # guarding against listings with a missing or unusually formatted address
                    region = zipcode = None
                    if locality and ',' in locality:
                        locality, locality_parts = locality.split(',', 1)
                        address_parts = locality_parts.strip().split(' ')
                        if len(address_parts) == 2:
                            region, zipcode = address_parts

                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url
                    }
                    scraped_results.append(business_details)

                return scraped_results

            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # no need to retry for a non-existent page
                break
            else:
                # transient failure (e.g. a 5xx response) -- retry
                print("Failed to process page, retrying")

        except requests.exceptions.RequestException:
            # network error -- retry
            print("Request failed, retrying")

    # all retries exhausted, or the place was not found
    return []


if __name__ == "__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')

    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)

                

To see how to run the complete code, type the script name followed by -h in a terminal or command prompt:

usage: yellow_pages.py [-h] keyword place
positional arguments:
  keyword     Search Keyword
  place       Place Name

optional arguments:
  -h, --help  show this help message and exit

The positional argument keyword represents the business category, and place is the location in which to search for businesses. As an example, let's get the business information for restaurants in Boston, MA. The script would be run like this (note the quotes around the place name, since it contains a comma and a space):

python3 yellow_pages.py restaurants "Boston, MA"
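
Based on the print statements in the script, a successful run produces console output along these lines:

retrieving  https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston, MA
parsing page
Writing scraped data to restaurants-Boston, MA-yellowpages-scraped-data.csv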

You should see a file named restaurants-Boston, MA-yellowpages-scraped-data.csv in the same folder as the script, containing the extracted data. Here is some sample business data scraped from YellowPages.com.

[Screenshot: sample scraped business data in the output CSV]

Click the link below to contact us for services and a free quote.

https://www.xbyte.io/contact-us/

Some Limitations

This code should be able to scrape business details for most locations. However, if you need any professional assistance in scraping websites, feel free to contact us.
