Update beautifulsoup-02-writing-your-scrape-script.ipynb

Commit e5957bc9 authored 3 years ago by Francesco Bailo
--- a/web-scraping-with-python/beautifulsoup-02-writing-your-scrape-script.ipynb
+++ b/web-scraping-with-python/beautifulsoup-02-writing-your-scrape-script.ipynb
@@ -90,7 +90,7 @@
    "\n",
    "Remember the unique layers of our data? `BeautifulSoup()` can help us get into these layers and extract the content easily by using `find()`. In this case, since the HTML tag containing the title of the movie (`h1`) is unique on this page, we can simply query `<h1>`:\n",
    "\n",
-    "![](https://cloudstor.aarnet.edu.au/plus/s/kwhmcCROYRayye5/download)"
+    "![](https://cloudstor.aarnet.edu.au/plus/s/Z3EHXJmRHvgXqtq/download)"
   ]
  },
  {
@@ -129,7 +129,7 @@
   "metadata": {},
   "source": [
    "Similarly, we can get the movie score too. In this case, however, we want to be more precise: instead of just searching for the tag `span`, we also want to specify that the attribute `itemprop` has the value `ratingValue`. \n",
-    "![](https://cloudstor.aarnet.edu.au/plus/s/1tGYxebthzgTjHd/download)"
+    "![](https://cloudstor.aarnet.edu.au/plus/s/yPB2LzRZnODbuwl/download)"
   ]
  },
  {
@@ -154,7 +154,7 @@
    "\n",
    "How would you get the duration of the movie?\n",
    "\n",
-    "![](https://cloudstor.aarnet.edu.au/plus/s/WaqoM0wrJtlMHQ2/download)"
+    "![](https://cloudstor.aarnet.edu.au/plus/s/yPB2LzRZnODbuwl/download)"
   ]
  },
  {

 %% Cell type:markdown id: tags:

 Now that we know where our data is, we can start coding our web scraper. You can follow this tutorial either in Jupyter or by executing the code on your own computer.

 First, we need to import all the libraries that we are going to use.

 %% Cell type:code id: tags:

 ``` python
 from bs4 import BeautifulSoup
 import urllib.request
 ```

 %% Cell type:markdown id: tags:

 If you get an error, make sure you have:
 1. Correctly installed `BeautifulSoup`: see [here](https://dsm-uts.github.io/data-code/web-scraping-with-python/getting-started)
 2. Restarted the Kernel after completing the installation with "Kernel" -> "Restart".

 Next, declare a variable for the URL of the *Apocalypse Now* page on the *Internet Movie Database* (IMDb).

 %% Cell type:code id: tags:

 ``` python
 # specify the url
 imdb_page = 'https://www.imdb.com/title/tt0078788/'
 ```

 %% Cell type:markdown id: tags:

 Then, use Python's **urllib** library to get the HTML page at the URL we declared.

 %% Cell type:code id: tags:

 ``` python
 # query the website and return the html to the variable 'page'
 page = urllib.request.urlopen(imdb_page)
 ```
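
 %% Cell type:markdown id: tags:

 The call above relies on the defaults of `urllib`. As a minimal sketch that is not part of the original notebook: some websites may reject requests that do not carry a browser-like `User-Agent` header, and any network call can fail. The helper names `build_request` and `fetch_html`, the `User-Agent` string, and the 10-second timeout below are all illustrative assumptions.

 %% Cell type:code id: tags:

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (tutorial example)"):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {err}")
        return None
```

 With these helpers, `page = fetch_html(imdb_page)` could stand in for the `urlopen` call, provided you check that the result is not `None` before parsing it.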

 %% Cell type:markdown id: tags:

 Finally, parse the page into a `BeautifulSoup` object so we can search it.

 %% Cell type:code id: tags:

 ``` python
 # parse the HTML using Beautiful Soup and store it in the variable `soup`
 soup = BeautifulSoup(page, 'html.parser')
 ```
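
 %% Cell type:markdown id: tags:

 To see what the parsing step produces without depending on a live website, here is a small sketch that parses a hand-written HTML string instead. The HTML snippet and the names `sample_html` and `sample_soup` are made up for illustration and do not reflect the real IMDb page.

 %% Cell type:code id: tags:

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML document used only for illustration
sample_html = (
    "<html><head><title>Sample page</title></head>"
    "<body><h1> Apocalypse Now </h1></body></html>"
)

# BeautifulSoup accepts a string as well as an open file-like object
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)  # name of the parsed object's class
print(sample_soup.title.text)      # text inside the <title> tag
```

 The parsed object can be navigated by tag name, as `sample_soup.title` does here, or searched with `find()`.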

 %% Cell type:markdown id: tags:

 Now we have a variable `soup` containing the HTML of the page. Here's where we can start coding the part that extracts the data.

 Remember the unique layers of our data? `BeautifulSoup()` can help us get into these layers and extract the content easily by using `find()`. In this case, since the HTML tag containing the title of the movie (`h1`) is unique on this page, we can simply query `<h1>`:

-![](https://cloudstor.aarnet.edu.au/plus/s/kwhmcCROYRayye5/download)
+![](https://cloudstor.aarnet.edu.au/plus/s/Z3EHXJmRHvgXqtq/download)

 %% Cell type:code id: tags:

 ``` python
 # Find the <h1> tag that contains the movie title
 movie_title = soup.find('h1')
 ```

 %% Cell type:markdown id: tags:

 Once we have the tag, we can read its content through its `text` attribute.

 %% Cell type:code id: tags:

 ``` python
 movie_title = movie_title.text.strip() # strip() removes leading and trailing whitespace
 print(movie_title)
 ```
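
 %% Cell type:markdown id: tags:

 One thing to keep in mind: `find()` returns `None` when no matching tag exists, so calling `.text` on the result would raise an `AttributeError`. The sketch below guards against that case. The helper name `first_text` and the sample HTML are hypothetical and only serve the example.

 %% Cell type:code id: tags:

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.text.strip()


# Invented HTML with extra whitespace around the title
sample_soup = BeautifulSoup("<body><h1>\n  Apocalypse Now \n</h1></body>", "html.parser")

print(first_text(sample_soup, "h1"))  # the title without surrounding whitespace
print(first_text(sample_soup, "h2"))  # there is no <h2> in the sample
```

 A guard like this is useful if the page layout changes and the tag you expect is no longer there.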

 %% Cell type:markdown id: tags:

 Similarly, we can get the movie score too. In this case, however, we want to be more precise: instead of just searching for the tag `span`, we also want to specify that the attribute `itemprop` has the value `ratingValue`.
-![](https://cloudstor.aarnet.edu.au/plus/s/1tGYxebthzgTjHd/download)
+![](https://cloudstor.aarnet.edu.au/plus/s/yPB2LzRZnODbuwl/download)

 %% Cell type:code id: tags:

 ``` python
 # get the score
 score_box = soup.find('span', attrs={'itemprop':'ratingValue'})
 score = score_box.text
 print(score)
 ```
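
 %% Cell type:markdown id: tags:

 The difference between searching by tag alone and searching by tag plus attribute can be seen on a small, invented snippet. The HTML and the values in it are made up for illustration. The sketch also converts the extracted text to a `float`, which assumes the score is written as a plain decimal number.

 %% Cell type:code id: tags:

```python
from bs4 import BeautifulSoup

# Invented HTML with several <span> tags; only one holds the rating
sample_html = """
<div>
  <span class="label">Rating</span>
  <span itemprop="ratingValue">8.4</span>
  <span itemprop="ratingCount">650,000</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Without attrs, find() returns the first <span> in document order
first_span = sample_soup.find("span")

# With attrs, find() returns the first <span> whose itemprop is ratingValue
rating_span = sample_soup.find("span", attrs={"itemprop": "ratingValue"})

# The text is a string, so convert it before doing arithmetic with it
rating = float(rating_span.text.strip())

print(first_span.text)
print(rating)
```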

 %% Cell type:markdown id: tags:

 ## Exercise

 How would you get the duration of the movie?

-![](https://cloudstor.aarnet.edu.au/plus/s/WaqoM0wrJtlMHQ2/download)
+![](https://cloudstor.aarnet.edu.au/plus/s/yPB2LzRZnODbuwl/download)

 %% Cell type:code id: tags:

 ``` python
 duration_box = soup.find('_____')
 duration = duration_box._____.strip()
 print(duration)
 ```