Update beautifulsoup-03-scraping-a-web-page.ipynb

d1c993b4 · Francesco Bailo · f5ac9b4e · d1c993b4
Commit d1c993b4 authored 5 years ago by Francesco Bailo
--- a/web-scraping-with-python/beautifulsoup-03-scraping-a-web-page.ipynb
+++ b/web-scraping-with-python/beautifulsoup-03-scraping-a-web-page.ipynb
@@ -10,11 +10,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "#import the library used to query a website\n",
    "from bs4 import BeautifulSoup\n",
+    "import urllib.request\n",
    "\n",
    "#specify the url\n",
    "wiki = \"https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India\"\n",
@@ -39,7 +42,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "print(soup.prettify())"
@@ -59,7 +64,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "soup.title"
@@ -93,7 +100,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "soup.a"
@@ -133,7 +142,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "right_table = soup.find('table', attrs={'class':'wikitable sortable plainrowheaders'})\n",
@@ -161,7 +172,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "#Generate lists\n",

 %% Cell type:markdown id: tags:

 ## 1. Import necessary libraries

 %% Cell type:code id: tags:

 ``` python
 #import the library used to query a website
 from bs4 import BeautifulSoup
+import urllib.request

 #specify the url
 wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

 #Query the website and return the html to the variable 'page'
 page = urllib.request.urlopen(wiki)

 #Parse the html in the 'page' variable with an explicit parser, and store it in Beautiful Soup format
 soup = BeautifulSoup(page, "html.parser")
 ```

 %% Cell type:markdown id: tags:

 ## 2. Use the `prettify` function to look at the nested structure of the HTML page

 %% Cell type:code id: tags:

 ``` python
 print(soup.prettify())
 ```

 %% Cell type:markdown id: tags:

 Here you will see the structure of the HTML tags. This shows which tags are available and how you can use them to extract information.

 ## 3. Work with HTML tags

 a. `soup.<tag>`: Returns the first matching element: its content together with the opening and closing tags.

 %% Cell type:code id: tags:

 ``` python
 soup.title
 ```

 %% Cell type:markdown id: tags:

 b. `soup.<tag>.string`: Returns the string within the given tag.

 %% Cell type:code id: tags:

 ``` python
 soup.title.string
 ```

 %% Cell type:markdown id: tags:

 c. Find all the links within the page's `<a>` tags: links are marked up with the `<a>` tag, so it is tempting to expect `soup.a` to return all the links available in the web page. Let's try it.

 %% Cell type:code id: tags:

 ``` python
 soup.a
 ```

 %% Cell type:markdown id: tags:

 Above, there is only one output, because `soup.a` returns just the first `<a>` tag. To extract all the `<a>` tags, we will use `find_all()`. Its output shows every link, including titles, URLs and other information.

 To show only the URLs, we need to iterate over each `<a>` tag and read its `href` attribute with `get()`.
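
The iteration over `<a>` tags can be sketched as follows. This is a minimal example under stated assumptions: `sample_html`, `sample_soup`, `all_links` and `hrefs` are hypothetical names, and the inline HTML is a stand-in for the downloaded page so the sketch runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<p>
  <a href="/wiki/Page_A" title="Page A">Page A</a>
  <a href="/wiki/Page_B" title="Page B">Page B</a>
  <a name="anchor-without-href">No link here</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag, not just the first one.
all_links = sample_soup.find_all("a")

# .get("href") returns None when the attribute is missing, so skip those tags.
hrefs = [link.get("href") for link in all_links if link.get("href") is not None]
print(hrefs)
```

With the real page, the same loop would run on `soup` instead of `sample_soup`.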

 ## 4. Find the right table

 As we are seeking a table to extract information about state capitals, we should identify the right table first. Let's write the command to extract information within all `table` tags.

 %% Cell type:code id: tags:

 ``` python
 all_tables = soup.find_all('table')
 ```

 %% Cell type:markdown id: tags:

 Now, to identify the right table, we will filter on the table's `class` attribute. In Chrome, you can find the class name by right-clicking the required table on the web page, choosing "Inspect", and copying the class name. Alternatively, go through the output of the command above to find the class name of the right table.
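
One way to go through the tables without reading the full HTML is to print only the `class` attribute of each one. The sketch below assumes a small inline document (`sample_html`, a hypothetical stand-in for the real page) and hypothetical variable names.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
sample_html = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>Data</td></tr></table>
<table><tr><td>No class</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "class" is a multi-valued attribute, so get() returns a list of class names
# (or None when the table has no class attribute at all).
table_classes = [table.get("class") for table in sample_soup.find_all("table")]
for classes in table_classes:
    print(classes)

# Passing the full class string selects the table whose class attribute matches it.
candidate = sample_soup.find("table", attrs={"class": "wikitable sortable plainrowheaders"})
print(candidate.td.string)
```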

 %% Cell type:code id: tags:

 ``` python
 right_table = soup.find('table', attrs={'class':'wikitable sortable plainrowheaders'})
 right_table
 ```

 %% Cell type:markdown id: tags:

 Above, we are able to identify the right table.

 %% Cell type:markdown id: tags:

 ## 5. Extract the information to DataFrame

 Here, we need to iterate through each row (`tr`), assign each cell of the row (`td`) to a variable, and append it to a list. Let's first look at the HTML structure of the table (I am not going to extract the table headings, which are in `<th>` cells).

 Above, notice that the second element of each `<tr>` is within a `<th>` tag, not a `<td>` tag, so we need to handle it separately. To access the value of each element, we will use the `find(text=True)` option on each element. Let's look at the code:

 %% Cell type:code id: tags:

 ``` python
 #Generate lists
 A=[]
 B=[]
 C=[]
 D=[]
 E=[]
 F=[]
 G=[]
 for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th') #To store second column data
    if len(cells) == 6: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        B.append(states[0].find(text=True))
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
        G.append(cells[5].find(text=True))

 #import pandas to convert list to data frame
 import pandas as pd
 df=pd.DataFrame(A,columns=['Number'])
 df['State/UT']=B
 df['Admin_Capital']=C
 df['Legislative_Capital']=D
 df['Judiciary_Capital']=E
 df['Year_Capital']=F
 df['Former_Capital']=G
 df
 ```
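
The `<th>`/`<td>` handling in the loop can be checked on a toy table before running it against the live page. This is only a sketch: the three-column table, its placeholder values, and the names `sample_html`, `table` and `rows` are assumptions made for illustration, and `string=True` is used as the newer spelling of `text=True` (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Hypothetical three-column table; the second cell of each body row is a <th>,
# mirroring the structure described for the real table.
sample_html = """
<table>
  <tr><th>No.</th><th>State</th><th>Capital</th></tr>
  <tr><td>1</td><th><a href="/wiki/State_A">State A</a></th><td>Capital A</td></tr>
  <tr><td>2</td><th>State B</th><td>Capital B</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    headers = row.find_all("th")
    # The heading row has no <td> cells, so this check skips it.
    if len(cells) == 2:
        rows.append((
            cells[0].find(string=True),    # first column (<td>)
            headers[0].find(string=True),  # second column (<th>), even when wrapped in <a>
            cells[1].find(string=True),    # third column (<td>)
        ))
print(rows)
```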