How to Scrape LinkedIn for Public Company Data
At X-Byte Enterprise Crawling, we are glad you have visited our page on how to scrape LinkedIn for public company data, and you won’t be disappointed!
In this tutorial, we will show you how to scrape public LinkedIn company pages. If you have landed here without a clear idea of why you would scrape LinkedIn company data, here are a few reasons:
Automating LinkedIn search: You want to work for a company that meets particular criteria, and the candidates are not the usual suspects. You may have a shortlist, but it is closer to a long list. You need a tool like Google Finance that can filter companies by the criteria they publish on LinkedIn. You can take that “long list”, scrape the data into a well-structured format, and build an analysis tool on top of it.
Curiosity: You are interested in the companies on LinkedIn and want to collect a good set of data to satisfy that curiosity.
Tinkering: You like to tinker, have decided to learn Python, and need a useful project to start with.
Whatever your reason, you have come to the right place!
This tutorial walks through the basic steps for scraping data from LinkedIn using Python.
Prerequisites:
In this tutorial, we will use basic Python and two Python packages: LXML and Requests. We won’t use a heavier package like Scrapy for something this simple.
You need to install the following:
Python 2.7, available here (https://www.python.org/downloads/)
Python Requests, available here (http://docs.python-requests.org/en/master/user/install/). You may need Python pip to install it, available here (https://pip.pypa.io/en/stable/installing/)
Python LXML (see how to install it here: http://lxml.de/installation.html)
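Before running the scraper, it can help to confirm that both packages are importable. The snippet below is a small sketch of such a check; the helper name missing_packages is hypothetical and not part of the tutorial's code.

```python
import importlib


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers packages that are not installed at all.
            missing.append(name)
    return missing
```

Calling missing_packages(["lxml", "requests"]) should return an empty list once both prerequisites are installed.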
The code for scraping LinkedIn is embedded below. If you cannot see it in your browser, it can be downloaded from the GIST here.
from lxml import html
import json
import requests


def linkedin_companies_parser(url):
    # Try up to five times, since LinkedIn may answer with a captcha or a login page.
    for i in range(5):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
            }
            print("Fetching: " + url)
            response = requests.get(url, headers=headers, verify=False)
            # The company data sits inside HTML comments, so strip the comment markers first.
            formatted_response = response.content.replace(b'<!--', b'').replace(b'-->', b'')
            doc = html.fromstring(formatted_response)
            datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
            content_about = doc.xpath('//code[@id="stream-about-section-embed-id-content"]')
            if not content_about:
                content_about = doc.xpath('//code[@id="stream-footer-embed-id-content"]')
            if content_about:
                pass
                # json_text = content_about[0].html_content().replace('<code id="stream-footer-embed-id-content"><!--','').replace('<code id="stream-about-section-embed-id-content"><!--','').replace('--></code>','')
            if datafrom_xpath:
                try:
                    json_formatted_data = json.loads(datafrom_xpath[0])
                    # dict.get() returns None when a key is absent.
                    company_name = json_formatted_data.get('companyName')
                    size = json_formatted_data.get('size')
                    industry = json_formatted_data.get('industry')
                    description = json_formatted_data.get('description')
                    follower_count = json_formatted_data.get('followerCount')
                    year_founded = json_formatted_data.get('yearFounded')
                    website = json_formatted_data.get('website')
                    company_type = json_formatted_data.get('companyType')
                    specialities = json_formatted_data.get('specialties')
                    if 'headquarters' in json_formatted_data:
                        headquarters = json_formatted_data['headquarters']
                        city = headquarters.get('city')
                        country = headquarters.get('country')
                        state = headquarters.get('state')
                        street1 = headquarters.get('street1')
                        street2 = headquarters.get('street2')
                        zip_code = headquarters.get('zip')
                        # Either street part may be missing, so fall back to an empty string.
                        street = (street1 or '') + ', ' + (street2 or '')
                    else:
                        city = country = state = street = zip_code = None
                    data = {
                        'company_name': company_name,
                        'size': size,
                        'industry': industry,
                        'description': description,
                        'follower_count': follower_count,
                        'founded': year_founded,
                        'website': website,
                        'type': company_type,
                        'specialities': specialities,
                        'city': city,
                        'country': country,
                        'state': state,
                        'street': street,
                        'zip': zip_code,
                        'url': url
                    }
                    return data
                except Exception:
                    print("cant parse page " + url)
            # Retry in case of captcha or login page redirection
            if len(response.content) < 2000 or "trk=login_reg_redirect" in url:
                if response.status_code == 404:
                    print("linkedin page not found")
                else:
                    raise ValueError('redirecting to login page or captcha found')
        except Exception:
            print("retrying: " + url)


def readurls():
    companyurls = ['https://www.linkedin.com/company/tata-consultancy-services']
    extracted_data = []
    for url in companyurls:
        extracted_data.append(linkedin_companies_parser(url))
    # Write every scraped record to data.json in the current directory.
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)


if __name__ == "__main__":
    readurls()
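The parser relies on one idea: the company data arrives as JSON wrapped in an HTML comment inside a code element. The sketch below isolates that step. It uses a regular expression instead of LXML only so that it runs without extra packages. The sample page source and the name embedded_company_json are hypothetical, and real LinkedIn pages may no longer be laid out this way.

```python
import json
import re

# Hypothetical, trimmed-down page source in the shape the parser expects:
# a <code> element whose content is an HTML comment holding JSON.
SAMPLE_HTML = (
    '<html><body>'
    '<code id="stream-promo-top-bar-embed-id-content">'
    '<!--{"companyName": "Example Corp", "followerCount": 42}-->'
    '</code>'
    '</body></html>'
)

CODE_BLOCK = re.compile(
    r'<code id="stream-promo-top-bar-embed-id-content">(.*?)</code>',
    re.DOTALL,
)


def embedded_company_json(page_source):
    """Return the JSON object found in the promo <code> block, or None."""
    # Same first step as the parser: drop the comment markers.
    uncommented = page_source.replace("<!--", "").replace("-->", "")
    match = CODE_BLOCK.search(uncommented)
    if match is None:
        return None
    return json.loads(match.group(1))
```

For the sample above, embedded_company_json(SAMPLE_HTML) yields a dictionary with the keys companyName and followerCount.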
To scrape a different company, change the URL in the companyurls list, or add more URLs to that list, separated by commas.
Save the file and run it with Python: python filename.py
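For a longer list, the URLs could also be kept in a text file, one per line, instead of being edited into the script. This is only a sketch of that option. The helper name load_company_urls and the idea of a separate URL file are assumptions, not part of the original code.

```python
def load_company_urls(path):
    """Read one company URL per line, skipping blanks, comments and duplicates."""
    urls = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            url = line.strip()
            # Ignore empty lines, lines starting with '#', and repeated URLs.
            if not url or url.startswith("#") or url in seen:
                continue
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list can be assigned to companyurls in place of the hard-coded one.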
The result will be in a file named data.json in the same directory and will look something like this:
{
"website": "https://www.xbyte.io",
"description": "X-Byte Enterprise Crawling is among the best web scraping companies in the world for the reason.\r\n We won’t leave you with the \"self-service\" screen for building your individual scrapers.\r\n We have the real humans, which will chat to you inside hours of the request as well as help you in your requirement.\r\n Although we are the leading providers in this field, our investment in the automation has helped us in providing a totally \"full service\" at affordable prices.\r\n Contact us at www.xbyte.io and experience our amazing customer service ",
"founded": 2012,
"street": "Houston",
"specialities": [
"Web Scraping Service",
"Website Scraping",
"Screen scraping",
"Data scraping",
"Web crawling",
"Data as a Service",
"Data extraction API",
"Scrapy",
"Python",
"DaaS"
],
"size": "100-150 employees",
"city": "Houston",
"zip": "TX-770143",
"url": "https://www.linkedin.com/company/xbyte-crawling",
"country": "USA",
"industry": "Computer Software",
"state": "Texas",
"company_name": "X-Byte Enterprise Crawling",
"follower_count": 2262,
"type": "Privately Held"
}
Or, if you run it for Cisco:
companyurls = ['https://www.linkedin.com/company/cisco']
The result will look like this:
{
"website": "http://www.cisco.com",
"description": "Cisco (NASDAQ: CSCO) allows people to create powerful connections-- in education, philanthropy, business, or imagination. Cisco software, hardware, and services offerings are utilized for creating the Internet solutions, which make networks possible--offering easy use to data anywhere an time. \r\n\r\n Cisco was initiated in 1984 by the small group of computer professionals from Stanford University. Ever since the company's origin, Cisco engineers are the leaders in development of the Internet Protocol (IP)-based networking skills. Today, having over 71,000 employees globally, this practice of revolution continues with the industry-leading solutions and products in company's key development areas of switching and routing, and with advanced technologies like IP telephony, home networking, security, optical networking, wireless technology, and storage area networking. Besides its products, Cisco offers an extensive range of service offerings like advanced services and technical support. \r\n\r\n Cisco sells its services and products, both directly using its individual sales force or using the channel partners, commercial businesses, larger enterprises, consumers, and service providers.",
"founded": 1984,
"street": "Tasman Way, ",
"specialities": [
"Networking",
"Wireless",
"Security",
"Unified Communication",
"Telepresence",
"Collaboration",
"Data Center",
"Virtualization",
"Unified Computing Systems"
],
"size": "10,001+ employees",
"city": "San Jose",
"zip": "95134",
"url": "https://www.linkedin.com/company/cisco",
"country": "United States",
"industry": "Computer Networking",
"state": "CA",
"company_name": "Cisco",
"follower_count": 1201541,
"type": "Public Company"
}
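Once records like these are in data.json, they can be filtered by the criteria mentioned at the start of this tutorial. The function below is a minimal sketch of that step. The name filter_companies and its parameters are hypothetical, and it assumes the file holds a list of records with the keys shown above.

```python
def filter_companies(records, industry=None, min_followers=0):
    """Keep records that match the industry and have enough followers."""
    selected = []
    for record in records:
        if record is None:
            # The parser returns None when every retry failed.
            continue
        if industry is not None and record.get("industry") != industry:
            continue
        # Treat a missing follower count as zero.
        if (record.get("follower_count") or 0) < min_followers:
            continue
        selected.append(record)
    return selected
```

For example, after loading data.json with json.load, calling filter_companies(companies, industry="Computer Networking", min_followers=100000) would keep only large networking companies.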
Warning: Because LinkedIn asks you to log in whenever you open the website, this code might not work for you.
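The scraper's own retry check hints at how a login wall or captcha can be recognized. The sketch below restates that check as a standalone function. The name classify_response is hypothetical. The 2000-byte threshold and the trk=login_reg_redirect marker come from the code in this tutorial, and there is no guarantee they match what LinkedIn currently returns.

```python
def classify_response(status_code, final_url, body_length):
    """Label a fetch result as 'not_found', 'blocked' or 'ok'."""
    if status_code == 404:
        return "not_found"
    # A very short body or a login redirect marker suggests a login wall or captcha.
    if body_length < 2000 or "trk=login_reg_redirect" in final_url:
        return "blocked"
    return "ok"
```

With Requests, the three arguments would typically be response.status_code, response.url and len(response.content).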
You can easily change the fields or the URLs you want to scrape. Contact us for scraping public company data from LinkedIn!
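One way to make the fields easy to change is to keep them in a single mapping from output names to the keys in LinkedIn's embedded JSON. This is a sketch only. FIELD_MAP and extract_fields are hypothetical names, and the JSON keys listed are the ones the tutorial's code already reads.

```python
# Output field name -> key in the embedded JSON (keys taken from the tutorial's code).
FIELD_MAP = {
    "company_name": "companyName",
    "industry": "industry",
    "follower_count": "followerCount",
    "founded": "yearFounded",
    "website": "website",
}


def extract_fields(payload, field_map=FIELD_MAP):
    """Build an output record, using None for any key missing from the payload."""
    return {out_key: payload.get(src_key) for out_key, src_key in field_map.items()}
```

Adding or removing a field then means editing one line of FIELD_MAP rather than the parser body.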