Skip to content
Snippets Groups Projects
Commit d1c993b4 authored by Francesco Bailo's avatar Francesco Bailo
Browse files

Update beautifulsoup-03-scraping-a-web-page.ipynb

parent f5ac9b4e
No related merge requests found
%% Cell type:markdown id: tags:
## 1. Import necessary libraries
%% Cell type:code id: tags:
``` python
#import the library used to query a website
from bs4 import BeautifulSoup
import urllib.request
#specify the url
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
#Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page)
```
%% Cell type:markdown id: tags:
## 2. Use function "prettify" to look at nested structure of HTML page
%% Cell type:code id: tags:
``` python
print(soup.prettify())
```
%% Cell type:markdown id: tags:
Here you will see the structure of the HTML tags. This will help you to know about different available tags and how can you play with these to extract information.
## 3. Work with HTML tags
a. `soup.<tag>`: Return content between opening and closing tag including tag.
%% Cell type:code id: tags:
``` python
soup.title
```
%% Cell type:markdown id: tags:
b. `soup.<tag>.string`: Return string within given tag
%% Cell type:code id: tags:
``` python
soup.title.string
```
%% Cell type:markdown id: tags:
c. Find all the links within page's `<a>` tags: We know that, we can tag a link using tag `<a>`. So, we should go with option `soup.a` and it should return the links available in the web page. Let's do it.
%% Cell type:code id: tags:
``` python
soup.a
```
%% Cell type:markdown id: tags:
Above, you can see that, we have only one output. Now to extract all the links within `<a\>`, we will use `find_all()`. Doing this will show all links including titles, links and other information.
To show only links, we need to iterate over each a tag and then return the link using attribute "href" with get.
## 4. Find the right table
As we are seeking a table to extract information about state capitals, we should identify the right table first. Let's write the command to extract information within all `table` tags.
%% Cell type:code id: tags:
``` python
all_tables = soup.find_all('table')
```
%% Cell type:markdown id: tags:
Now to identify the right table, we will use attribute "class" of table and use it to filter the right table. In chrome, you can check the class name by right click on the required table of web page, then "Inspect element", and "Copy" the class name OR go through the output of above command find the class name of right table.
%% Cell type:code id: tags:
``` python
right_table = soup.find('table', attrs={'class':'wikitable sortable plainrowheaders'})
right_table
```
%% Cell type:markdown id: tags:
Above, we are able to identify the right table.
%% Cell type:markdown id: tags:
## 5. Extract the information to DataFrame
Here, we need to iterate through each row (`tr`) and then assign each element of `tr` (`td`) to a variable and append it to a list. Let's first look at the HTML structure of the table (I am not going to extract information for table heading `<th>`)
Above, you can notice that second element of `<tr>` is within tag `<th>` not `<td>` so we need to take care for this. Now to access value of each element, we will use `find(text=True)` option with each element. Let's look at the code:
%% Cell type:code id: tags:
``` python
#Generate lists
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.findAll("tr"):
cells = row.findAll('td')
states=row.findAll('th') #To store second column data
if len(cells)==6: #Only extract table body not heading
A.append(cells[0].find(text=True))
B.append(states[0].find(text=True))
C.append(cells[1].find(text=True))
D.append(cells[2].find(text=True))
E.append(cells[3].find(text=True))
F.append(cells[4].find(text=True))
G.append(cells[5].find(text=True))
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df
```
......
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment