Web scraping is a method for transforming unstructured data on the web into machine-readable, structured data for analysis. In general, web scraping is a complex process, but the Python programming language makes it easy and effective. Python libraries such as Selenium, Beautiful Soup, and Pandas are commonly used for web scraping.
Practising any of the new data-related technologies requires well-designed data sets. Many users believe they have to collect their own data, but that is simply not true.
There are hundreds of open data sets accessible, ready to be used and analyzed by anyone willing to look for them. Below is a list of some of the most globally interesting open data websites.
1. US Census Bureau http://www.census.gov/data.html
3. European Union Open Data Portal http://open-data.europa.eu/en/data/
4. Data.gov.uk http://data.gov.uk/ Data from the UK Government.
5. UNICEF offers statistics on the situation of women and children worldwide.
Many of these open data sets come from governments and public organisations, which bury the data in drill-down links and tables. Finding the specific data users are looking for often requires best-guess navigation. Scraping the data with Python and saving it as JSON is all users need to do to get started.
Here’s Where Selenium Comes In
import os
import re
from io import StringIO
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
Selenium will now begin a browser session. For Selenium to work, it must be able to find the browser driver, for example on the system PATH or, as assumed here, in the same directory as the Python script. Drivers are available for Chrome, Firefox, Edge, and Safari.
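Before starting a session, it can help to check whether a driver executable is actually visible to the script. The following is a minimal sketch, assuming the driver is expected on the system PATH and is named chromedriver; the helper name find_driver is hypothetical and not part of Selenium.

```python
import shutil


def find_driver(executable="chromedriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(executable)


driver_path = find_driver()
if driver_path is None:
    print("chromedriver not found on PATH")
else:
    print("chromedriver found at", driver_path)
```

If the check reports that nothing was found, the driver most likely needs to be downloaded or its folder added to PATH before Selenium can start the browser.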
The sample code below uses Chrome:
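url = "http:// website name/division/sub_division.format"  # placeholder URL
# create a new Chrome session
driver = webdriver.Chrome()
driver.get(url)
# python_button must first be located with driver.find_element(...)
python_button.click()  # click $$$$$ link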
Handing It Over To Beautiful Soup
Beautiful Soup is a convenient way to traverse the DOM (Document Object Model) and scrape the data. After declaring an empty list and a counter variable, it is time to ask Beautiful Soup to grab all the links on the page that match a regular expression.
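# Selenium hands the page source to Beautiful Soup
soup_level1 = BeautifulSoup(driver.page_source, "html.parser")
datalist = []  # empty list
x = 0  # counter
for link in soup_level1.find_all("a", id=re.compile("^##file_location##")):
    pass  # code to execute in the for loop goes here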
# Beautiful Soup grabs all the specified links
for link in soup_level1.find_all("variable", id=re.compile("^##data_set path")):
    # Selenium visits each specified page ("##variable" stands for the locator strategy, e.g. "id")
    python_button = driver.find_element("##variable", "##path" + str(x))
    python_button.click()  # click link
    # Selenium hands off the source of the specific page to Beautiful Soup
    soup_level2 = BeautifulSoup(driver.page_source, "html.parser")
    # Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all("table")[0]
    # give the HTML table to pandas to put in a dataframe object (StringIO comes from io)
    df = pd.read_html(StringIO(str(table)), header=0)
    # store the dataframe in a list
    datalist.append(df[0])
    # ask Selenium to click the back button
    driver.back()
    # increment the counter variable before starting the loop over
    x += 1
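The link-matching step of the loop can be tried without a browser. The sketch below assumes a small inline HTML string in place of driver.page_source; the ids, link text, and the helper name matching_link_ids are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical listing page; the ids and link text are invented for illustration.
SAMPLE_HTML = """
<html><body>
  <a id="dataset_link_0" href="/data/0">Population</a>
  <a id="dataset_link_1" href="/data/1">Housing</a>
  <a id="nav_home" href="/">Home</a>
  <a href="/about">About</a>
</body></html>
"""


def matching_link_ids(html, prefix):
    """Return the ids of all <a> tags whose id starts with the given prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # Anchor the pattern at the start of the id and escape the prefix literally.
    pattern = re.compile("^" + re.escape(prefix))
    return [link["id"] for link in soup.find_all("a", id=pattern)]


link_ids = matching_link_ids(SAMPLE_HTML, "dataset_link_")
print(link_ids)
```

Only the two links whose id begins with the prefix are returned; links with a different id, or with no id at all, are skipped. This is why the loop can rebuild each element's id from the shared prefix plus the counter.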
Passing On To Pandas
Beautiful Soup passes the results to Pandas. Pandas uses its read_html function to read the HTML table data into a data frame. The data frame is appended to the previously defined empty list. Before the loop's code block ends, Selenium needs to click the back button in the browser so that the next link in the loop is available to click on the listing page.
By the time the for/in loop has completed, Selenium has visited every specified link, Beautiful Soup has retrieved the table from each page, and Pandas has stored the data from each table in a data frame. Each data frame is an item in the data list. The individual data frames are then merged into one large data frame, and the data is converted to JSON format.
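# loop has completed
# end the Selenium browser session
driver.quit()
# combine all pandas dataframes in the list into one giant dataframe
result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))], ignore_index=True)
# convert the pandas dataframe to JSON
json_records = result.to_json(orient="records")
# get current working directory
path = os.getcwd()
# open, write, and close the file ("##file_name##.json" is a placeholder name)
with open(os.path.join(path, "##file_name##.json"), "w") as f:
    f.write(json_records)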
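The merge and JSON conversion can also be checked in isolation. This is a minimal sketch that assumes two tiny, invented data frames standing in for the scraped tables; the column names and values are illustrative only.

```python
import json

import pandas as pd

# Two small stand-ins for tables scraped from separate pages (invented data).
datalist = [
    pd.DataFrame({"name": ["a", "b"], "value": [1, 2]}),
    pd.DataFrame({"name": ["c"], "value": [3]}),
]

# Merge the per-page frames into one frame with a fresh 0..n-1 index.
result = pd.concat(datalist, ignore_index=True)

# orient="records" yields a JSON array with one object per row.
json_records = result.to_json(orient="records")

# Parse the string back to confirm that it is valid JSON.
records = json.loads(json_records)
print(records)
```

Passing ignore_index=True matters here: without it, each frame keeps its own row labels, so the combined frame would contain duplicate index values.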
A Quick Way
The automated web scraping method described above completes quickly. Selenium opens a browser window that users can watch while it runs. This lets developers show users a screen capture of how fast the process is: the script follows a link, fetches the data, goes back, and clicks the next link. It dramatically shortens the process of retrieving data from hundreds of links.