# Query the website and store the returned HTML in the variable 'page'
page = urllib.request.urlopen(wiki)
# Import the BeautifulSoup class used to parse the data returned from the website
from bs4 import BeautifulSoup
# Parse the HTML in 'page' and store it as a BeautifulSoup object
# (naming the parser explicitly avoids the "no parser specified" warning)
soup = BeautifulSoup(page, "html.parser")
```
%% Cell type:markdown id: tags:
## 2. Use the `prettify` function to look at the nested structure of the HTML page
%% Cell type:code id: tags:
``` python
print(soup.prettify())
```
%% Cell type:markdown id: tags:
Here you can see the structure of the HTML tags. This helps you learn which tags are available and how you can work with them to extract information.
## 3. Work with HTML tags
a. `soup.<tag>`: Returns the content between the opening and closing tags, including the tags themselves.
%% Cell type:code id: tags:
``` python
soup.title
```
%% Cell type:markdown id: tags:
b. `soup.<tag>.string`: Returns the string within the given tag.
%% Cell type:code id: tags:
``` python
soup.title.string
```
%% Cell type:markdown id: tags:
c. Find all the links within the page's `<a>` tags: Links are marked up with the `<a>` tag, so `soup.a` looks like the natural way to get the links available in the web page. Let's try it.
%% Cell type:code id: tags:
``` python
soup.a
```
%% Cell type:markdown id: tags:
As you can see above, this returns only one result: the first `<a>` tag on the page. To extract all the `<a>` tags, we will use `find_all()`. This shows every link along with its title and other attributes.
To show only the links, we need to iterate over each `<a>` tag and return the link from its `href` attribute using `get`.
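
The two steps above can be sketched as follows. This is a minimal illustration on a made-up HTML snippet (the `html` string and the `demo_soup` name are assumptions for the example, not part of the page we downloaded), so it can be run on its own:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page.
html = """
<html><body>
  <a href="/wiki/Main_Page" title="Main Page">Main page</a>
  <a href="/wiki/India">India</a>
  <a name="anchor-without-href">No link here</a>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, not just the first one.
all_links = demo_soup.find_all("a")

# get("href") returns None when the attribute is missing, so skip those tags.
hrefs = [link.get("href") for link in all_links if link.get("href") is not None]
print(hrefs)
```

On the real page, the same loop would run over `soup.find_all("a")` instead of `demo_soup`.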
## 4. Find the right table
Since we want a table from which to extract information about state capitals, we should first identify the right table. Let's write the command to extract everything within the page's `table` tags.
%% Cell type:code id: tags:
``` python
all_tables=soup.find_all('table')
```
%% Cell type:markdown id: tags:
To identify the right table, we will filter on the table's `class` attribute. In Chrome, you can find the class name by right-clicking the required table on the web page, choosing "Inspect", and copying the class name. Alternatively, go through the output of the command above and look for the class name of the right table.
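
As a rough sketch of that filtering step, the example below lists the classes of each table and then picks one by class. The HTML and the class names (`infobox`, `wikitable`, `sortable`) are made up for illustration; substitute the class name you actually find on the page. The selected table plays the role of the `right_table` variable used later.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second one is of interest.
html = """
<table class="infobox"><tr><td>Not this one</td></tr></table>
<table class="wikitable sortable"><tr><th>State</th><td>Capital</td></tr></table>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Inspect the class attribute of every table; "class" can hold several
# values, so Beautiful Soup returns it as a list.
table_classes = [table.get("class") for table in demo_soup.find_all("table")]
print(table_classes)

# Select the first table whose class list contains "wikitable".
demo_table = demo_soup.find("table", class_="wikitable")
print(demo_table.td.string)
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.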
Here, we need to iterate through each row (`tr`), assign each element of the row (`td`) to a variable, and append it to a list. Let's first look at the HTML structure of the table (I am not going to extract the table headings in `<th>`).
Notice that the second element of each `<tr>` is inside a `<th>` tag rather than a `<td>` tag, so we need to handle it separately. To access the value of each element, we will call `find(text=True)` on it. Let's look at the code:
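
Before the full loop, here is a small self-contained sketch of the same idea on a made-up two-row table (the HTML, the column count, and the `demo_` names are assumptions for illustration). It uses `string=True`, which is the newer spelling of `text=True` in Beautiful Soup 4.4 and later; both return the first text node inside the element.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a heading row, then a body row whose second
# column is marked up with <th> instead of <td>.
html = """
<table>
  <tr><th>No.</th><th>State</th><th>Capital</th></tr>
  <tr><td>1</td><th><a href="/wiki/X">Examplestate</a></th><td>Exampleville</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")

demo_rows = []
for row in demo_soup.find_all("tr"):
    cells = row.find_all("td")
    headers = row.find_all("th")
    # The heading row has no <td> cells, so this check skips it.
    if len(cells) == 2:
        demo_rows.append((
            cells[0].find(string=True),    # first column
            headers[0].find(string=True),  # second column, stored in <th>
            cells[1].find(string=True),    # third column
        ))

print(demo_rows)
```

Note that `find(string=True)` looks through nested tags, so the state name is found even though it sits inside an `<a>` element.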
%% Cell type:code id: tags:
``` python
#Generate lists
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th')  # To store second column data
    if len(cells) == 6:  # Only extract table body, not heading