Mariah Dominique Rucker

Scraping Data: Harry Potter Wiki in Python, Beautiful Soup, and Requests

This tutorial shows you how to collect information from the Harry Potter Wiki using Python and two libraries: Beautiful Soup and Requests.

Web scraping is the process by which a program collects data from websites. Bots or crawlers fetch the pages and pull the information into a structured format such as a spreadsheet or database, much as a robot carries goods from one warehouse to another. It is as if you were waving a magic wand and pulling whatever information you want out of the Harry Potter Wiki. By the end of this tutorial, you will have turned a website into meaningful, structured information.

Scraping data from the Harry Potter wiki:

Getting data from the Harry Potter Wiki is as simple as gathering the ingredients for a magic potion. The Python library beautifulsoup4 helps you grab the three crucial magical ingredients of this story: the characters, the spells, and the houses.

  1. Import the requests and Beautiful Soup libraries.
  2. Specify the URLs of the character, spell, and house pages we are about to scrape.
  3. Make GET requests to the URLs and retrieve the HTML content.
  4. Use BeautifulSoup to parse the HTML.
  5. Locate the target tables to be scraped.
  6. Retrieve the rows from each table.
  7. Pull the raw data out of each row's columns.
  8. Pick out the name and description in each row (character, spell, or house).
  9. Append each name and description pair to the matching characters, spells, or houses list.
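
Before pointing these steps at the live wiki, it can help to try the row-extraction logic offline. The sketch below runs it against a small hand-written table; `SAMPLE_HTML` and the helper name `extract_pairs` are made up for illustration, and the real pages may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a wiki page; the real pages
# are larger and their layout may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Description</th></tr>
  <tr><td> Harry Potter </td><td>The Boy Who Lived</td></tr>
  <tr><td>Hermione Granger</td><td>Brightest witch of her age</td></tr>
  <tr><td colspan="2">Section divider</td></tr>
</table>
"""


def extract_pairs(html):
    """Return (name, description) tuples from the first wikitable in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:  # the page has no matching table
        return []
    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 1:  # skips header rows (th only) and one-cell rows
            pairs.append((cells[0].get_text(strip=True),
                          cells[1].get_text(strip=True)))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Returning an empty list when no table is found is one possible choice; it avoids the IndexError that indexing an empty `find_all` result would raise.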

import requests
from bs4 import BeautifulSoup

# Define the URLs for the different pages we want to scrape
character_url = "https://harrypotter.fandom.com/wiki/List_of_characters"
spell_url = "https://harrypotter.fandom.com/wiki/List_of_spells"
house_url = "https://harrypotter.fandom.com/wiki/Hogwarts_Houses"

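# Make a GET request to each URL and get the HTML content
# (the timeout keeps the script from hanging if the site does not respond)
character_content = requests.get(character_url, timeout=10).content
spell_content = requests.get(spell_url, timeout=10).content
house_content = requests.get(house_url, timeout=10).content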

# Parse the HTML content using BeautifulSoup
character_soup = BeautifulSoup(character_content, "html.parser")
spell_soup = BeautifulSoup(spell_content, "html.parser")
house_soup = BeautifulSoup(house_content, "html.parser")

# Find the first "wikitable" on each page; this assumes every page has one
# (the [0] index raises IndexError otherwise)
character_table = character_soup.find_all("table", class_="wikitable")[0]
spell_table = spell_soup.find_all("table", class_="wikitable")[0]
house_table = house_soup.find_all("table", class_="wikitable")[0]

# Get the rows from the tables
character_rows = character_table.find_all("tr")
spell_rows = spell_table.find_all("tr")
house_rows = house_table.find_all("tr")

# Extract the data from the rows
characters = []
for row in character_rows[1:]:
    columns = row.find_all("td")
    if len(columns) > 1:
        character_name = columns[0].text.strip()
        character_description = columns[1].text.strip()
        characters.append((character_name, character_description))

spells = []
for row in spell_rows[1:]:
    columns = row.find_all("td")
    if len(columns) > 1:
        spell_name = columns[0].text.strip()
        spell_description = columns[1].text.strip()
        spells.append((spell_name, spell_description))

houses = []
for row in house_rows[1:]:
    columns = row.find_all("td")
    if len(columns) > 1:
        house_name = columns[0].text.strip()
        house_description = columns[1].text.strip()
        houses.append((house_name, house_description))
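
Text scraped from a wiki may carry leftovers such as bracketed footnote markers, stray line breaks, or repeated entries. The sketch below shows one way to tidy the (name, description) tuples before storing them. The helper name `clean_rows` and the sample rows are illustrative, and whether these artifacts actually appear depends on the page.

```python
import re

# Matches bracketed footnote markers such as "[1]" or "[12]".
FOOTNOTE = re.compile(r"\[\d+\]")


def clean_rows(rows):
    """Strip footnote markers, collapse whitespace, drop blanks and duplicates."""
    seen = set()
    cleaned = []
    for name, description in rows:
        name = " ".join(FOOTNOTE.sub("", name).split())
        description = " ".join(FOOTNOTE.sub("", description).split())
        if not name or name in seen:
            continue  # skip empty names and repeated entries
        seen.add(name)
        cleaned.append((name, description))
    return cleaned


sample = [
    ("Harry Potter[1]", "The Boy\nWho Lived"),
    ("Harry Potter", "duplicate entry"),
    ("", "row with no name"),
]
print(clean_rows(sample))
```

Applied to the lists built above, this would look like `characters = clean_rows(characters)`.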

Storing the data in a SQLite database:

Using the sqlite3 library to create and query a SQLite database is the equivalent of casting spells with a magic wand in the wizarding world of Harry Potter. The library is the wand, and the actions you take with it (storing, retrieving, changing, or deleting a data point in the database) are the spells. Put simply, the sqlite3 library lets you carry out these operations on the information you store.
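
The four operations can be tried in a few lines. This is a minimal sketch using a throwaway in-memory database; the spell rows are placeholder data, not scraped values.

```python
import sqlite3

# ":memory:" creates a throwaway database that disappears when closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spells (name TEXT, description TEXT)")

# Store
conn.execute("INSERT INTO spells VALUES (?, ?)", ("Lumos", "Lights the wand tip"))
conn.execute("INSERT INTO spells VALUES (?, ?)", ("Nox", "placeholder"))

# Change
conn.execute("UPDATE spells SET description = ? WHERE name = ?",
             ("Extinguishes wand light", "Nox"))

# Delete
conn.execute("DELETE FROM spells WHERE name = ?", ("Lumos",))

# Retrieve
remaining = conn.execute("SELECT name, description FROM spells").fetchall()
print(remaining)
conn.close()
```

The `?` placeholders let sqlite3 bind the values safely instead of building SQL strings by hand.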

  1. Import the sqlite3 module.
  2. Connect to the database.
  3. Create the tables for characters, spells, and houses.
  4. Insert the scraped data into the tables.
  5. Commit the changes and close the connection.

import sqlite3

# Create a connection to the database
conn = sqlite3.connect("harrypotter.db")

# Create the tables for characters, spells, and houses
conn.execute("CREATE TABLE IF NOT EXISTS characters (name TEXT, description TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS spells (name TEXT, description TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS houses (name TEXT, description TEXT)")

# Insert the data into the tables
conn.executemany("INSERT INTO characters (name, description) VALUES (?, ?)", characters)
conn.executemany("INSERT INTO spells (name, description) VALUES (?, ?)", spells)
conn.executemany("INSERT INTO houses (name, description) VALUES (?, ?)", houses)

# Commit the changes and close the connection
conn.commit()
conn.close()
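
One thing to watch: running the script above twice inserts every row twice, because the tables have no uniqueness constraint. The sketch below shows one possible way around that, using a PRIMARY KEY on name together with INSERT OR IGNORE. The helper `save_rows` is hypothetical, and treating name as unique is an assumption that may not hold for every wiki table.

```python
import sqlite3


def save_rows(conn, table, rows):
    """Insert (name, description) rows, ignoring names already stored."""
    # Table names cannot be bound as parameters, so restrict them to a known set.
    if table not in {"characters", "spells", "houses"}:
        raise ValueError(f"unexpected table: {table}")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(name TEXT PRIMARY KEY, description TEXT)"
    )
    # 'with conn' commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} (name, description) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")  # swap for "harrypotter.db" to persist
houses = [("Gryffindor", "Bravery"), ("Slytherin", "Ambition")]
save_rows(conn, "houses", houses)
save_rows(conn, "houses", houses)  # the second run adds nothing
count = conn.execute("SELECT COUNT(*) FROM houses").fetchone()[0]
print(count)
```

If harrypotter.db already contains tables created without the key, CREATE TABLE IF NOT EXISTS leaves them unchanged, so the file would need to be recreated for the constraint to take effect.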

Querying the database:

Next, you can analyze the data stored in the database. Just as Dumbledore collects his memories and thoughts so he can examine them, once the data is in the database you can make sense of it and discover valuable information.

  1. Import the sqlite3 library.
  2. Connect to the database named harrypotter.db.
  3. Query all characters sorted by name and store them in a variable called characters.
  4. Print the name and description of each character in characters.
  5. Query all spells sorted by name and store them in a variable called spells.
  6. Print the name and description of each spell in spells.
  7. Query all houses sorted by name and store them in a variable called houses.
  8. Print the name and description of each house in houses.
  9. Close the connection to the database.

import sqlite3

# Create a connection to the database
conn = sqlite3.connect("harrypotter.db")

# Query the database to get all the characters sorted by name
characters = conn.execute("SELECT name, description FROM characters ORDER BY name").fetchall()
for character in characters:
    print(f"{character[0]}: {character[1]}")

# Query the database to get all the spells sorted by name
spells = conn.execute("SELECT name, description FROM spells ORDER BY name").fetchall()
for spell in spells:
    print(f"{spell[0]}: {spell[1]}")

# Query the database to get all the houses sorted by name
houses = conn.execute("SELECT name, description FROM houses ORDER BY name").fetchall()
for house in houses:
    print(f"{house[0]}: {house[1]}")

# Query the database to get all the characters sorted by the length of their description
characters = conn.execute("SELECT name, description FROM characters ORDER BY length(description)").fetchall()
for character in characters:
    print(f"{character[0]}: {character[1]}")

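# Query the database to get the most common first names among the characters.
# Appending a space before INSTR keeps single-word names (e.g. "Dobby") from
# collapsing into an empty string, and the -1 drops the trailing space.
first_names = conn.execute("SELECT SUBSTR(name, 1, INSTR(name || ' ', ' ') - 1) AS first_name, COUNT(*) AS count FROM characters GROUP BY first_name ORDER BY count DESC").fetchall()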
for first_name in first_names:
    print(f"{first_name[0]}: {first_name[1]}")

# Close the connection to the database
conn.close()
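
Indexing rows by position (`character[0]`, `character[1]`) works, but it gets harder to read as queries grow. As an optional variation, sqlite3 can return rows that support access by column name. The sketch below uses an in-memory database with placeholder rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now support access by column name
conn.execute("CREATE TABLE characters (name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO characters VALUES (?, ?)",
    [("Ron Weasley", "Harry's best friend"), ("Dobby", "A free elf")],
)

rows = conn.execute(
    "SELECT name, description FROM characters ORDER BY name"
).fetchall()
lines = [f"{row['name']}: {row['description']}" for row in rows]
print("\n".join(lines))
conn.close()
```

Setting `row_factory` on the connection to harrypotter.db would allow the same named access in the queries above.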

Querying a database can feel like stepping into a magical world. Just as Harry Potter casts spells with magic, you run queries to uncover hidden facts in the data. And just as Harry combines several spells into more powerful ones, in SQL you can use joins, filtering, and aggregate functions to make your queries more powerful.

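Joining the characters and houses tables gives you every character together with the house they belong to. Note that the tables created earlier hold only a name and a description, so this query assumes the schema has been extended with an id column on houses and a house_id column on characters, and that the database connection is still open.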

# Query the database to get all the characters and their corresponding house
characters = conn.execute("SELECT c.name, c.description, h.name FROM characters c JOIN houses h ON c.house_id = h.id ORDER BY c.name").fetchall()
for character in characters:
    print(f"{character[0]}: {character[1]} ({character[2]})")
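
A minimal sketch of a schema that would support this join is shown below. The id and house_id columns, the sample rows, and the way characters are assigned to houses are all assumptions made for illustration; the scraping code earlier does not collect house membership.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE,
        description TEXT
    );
    CREATE TABLE characters (
        name TEXT,
        description TEXT,
        house_id INTEGER REFERENCES houses(id)
    );
""")
conn.executemany(
    "INSERT INTO houses (id, name, description) VALUES (?, ?, ?)",
    [(1, "Gryffindor", "Bravery"), (2, "Ravenclaw", "Wit")],
)
conn.executemany(
    "INSERT INTO characters (name, description, house_id) VALUES (?, ?, ?)",
    [("Harry Potter", "The Boy Who Lived", 1),
     ("Luna Lovegood", "Quibbler editor's daughter", 2)],
)

joined = conn.execute(
    "SELECT c.name, c.description, h.name "
    "FROM characters c JOIN houses h ON c.house_id = h.id "
    "ORDER BY c.name"
).fetchall()
for name, description, house in joined:
    print(f"{name}: {description} ({house})")
```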

In the characters table, you can apply a filter to display only the characters who belong to a certain house.

# Query the database to get all the characters from Gryffindor (assumes its id is 1)
characters = conn.execute("SELECT name, description FROM characters WHERE house_id = 1 ORDER BY name").fetchall()
for character in characters:
    print(f"{character[0]}: {character[1]}")
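
Hard-coding `house_id = 1` relies on knowing which id Gryffindor received. One alternative, sketched below, is to filter on the house name through a bound parameter. The function `characters_in_house`, the id and house_id columns, and the sample rows are assumptions for illustration.

```python
import sqlite3


def characters_in_house(conn, house_name):
    """Return (name, description) rows for characters in the named house."""
    return conn.execute(
        "SELECT c.name, c.description "
        "FROM characters c JOIN houses h ON c.house_id = h.id "
        "WHERE h.name = ? ORDER BY c.name",
        (house_name,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE characters (name TEXT, description TEXT, house_id INTEGER);
    INSERT INTO houses VALUES (1, 'Gryffindor'), (2, 'Slytherin');
    INSERT INTO characters VALUES
        ('Neville Longbottom', 'Herbology enthusiast', 1),
        ('Draco Malfoy', 'Harry''s rival', 2),
        ('Harry Potter', 'The Boy Who Lived', 1);
""")
gryffindors = characters_in_house(conn, "Gryffindor")
print(gryffindors)
```

Passing the name as a parameter also means the same function works for any house without editing the SQL.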

Additionally, you can apply aggregate functions such as COUNT() to find out how many characters belong to each house.

# Query the database to count the number of characters in each house
houses = conn.execute("SELECT h.name, COUNT(*) FROM characters c JOIN houses h ON c.house_id = h.id GROUP BY h.name").fetchall()
for house in houses:
    print(f"{house[0]}: {house[1]}")
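
An inner join leaves out any house that has no characters. If those houses should appear with a count of zero, a LEFT JOIN is one option. The sketch below uses an in-memory database with placeholder rows and the same assumed id and house_id columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE characters (name TEXT, house_id INTEGER);
    INSERT INTO houses VALUES (1, 'Gryffindor'), (2, 'Hufflepuff');
    INSERT INTO characters VALUES ('Harry Potter', 1), ('Ron Weasley', 1);
""")

# COUNT(c.name) counts only matched characters, so an empty house reports 0;
# COUNT(*) would count the house's own row and report 1.
counts = conn.execute(
    "SELECT h.name, COUNT(c.name) "
    "FROM houses h LEFT JOIN characters c ON c.house_id = h.id "
    "GROUP BY h.name ORDER BY h.name"
).fetchall()
print(counts)
```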

This tutorial showed you how to use Python to fetch pages from the Harry Potter Wiki with Requests and parse the HTML with the BeautifulSoup library to locate the useful information. You also saw how to use for loops to walk through the table rows for characters, spells, and houses, store the results in a SQLite database, and query them.

GitHub: github.com/mariahrucker
LinkedIn: linkedin.com/in/mariahrucker
Instagram: instagram.com/techmariah
Other: linktr.ee/mariahrucker
