Now that we know where our data is, we can start coding our web scraper. You can follow this tutorial either in Jupyter or by executing the code on your own computer.
First, we need to import all the libraries that we are going to use.
%% Cell type:code id: tags:
``` python
from bs4 import BeautifulSoup
import urllib.request
```
%% Cell type:markdown id: tags:
If you get an error, make sure you have:
1. Correctly installed `BeautifulSoup`: see [here](https://dsm-uts.github.io/data-code/web-scraping-with-python/getting-started), or the quick install sketch just after this list.
2. Restarted the Kernel after completing the installation with "Kernel" -> "Restart".
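If you have not installed it yet, here is a quick sketch of doing so from inside the notebook, assuming you use pip (otherwise follow the guide linked above):
%% Cell type:code id: tags:
``` python
# one-off step: install Beautiful Soup into the environment used by this kernel
%pip install beautifulsoup4
```
%% Cell type:markdown id: tags: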
Next, declare a variable for the URL of the *Apocalypse Now* page on the *Internet Movie Database* (IMDb).
%% Cell type:code id: tags:
``` python
# specify the URL
imdb_page = 'https://www.imdb.com/title/tt0078788/'
```
%% Cell type:markdown id: tags:
Then, use the **Python urllib** module to fetch the HTML page at the URL we declared.
%% Cell type:code id: tags:
``` python
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(imdb_page)
```
%% Cell type:markdown id: tags:
Finally, parse the page into a `BeautifulSoup` object so we can work with it:
%% Cell type:code id: tags:
``` python
# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')
```
%% Cell type:markdown id: tags:
Now we have a variable `soup` containing the HTML of the page. Here's where we can start coding the part that extracts the data.
Remember the unique layers of our data? `BeautifulSoup()` can help us get into these layers and easily extract the content using `find()`. In this case, since the HTML tag containing the title of the movie (`h1`) is unique on this page, we can simply query `<h1>`:
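For example, we can call `find()` on `soup` and store the result; here the variable is named `movie_title`, since that is the name the next cell uses:
%% Cell type:code id: tags:
``` python
# take the first (and only) <h1> tag on the page
movie_title = soup.find('h1')
```
%% Cell type:markdown id: tags: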
Once we have the tag, we can extract the data from its `text`.
%% Cell type:code id: tags:
``` python
movie_title = movie_title.text.strip()  # strip() removes leading and trailing whitespace
print(movie_title)
```
%% Cell type:markdown id: tags:
Similarly, we can get the movie score. In this case, however, we want to be more precise: instead of just searching for the tag `span`, we also want to specify that the attribute `itemprop` has the value `ratingValue`.
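A sketch of this step, assuming the page still uses this markup (the variable name `movie_score` is just an illustrative choice):
%% Cell type:code id: tags:
``` python
# find the <span> whose itemprop attribute is "ratingValue"
movie_score = soup.find('span', attrs={'itemprop': 'ratingValue'})
print(movie_score.text.strip())
```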