Viren Bhagat

My Notes: SQL (CS50)

5/12/2020

More Python, and SQL (Structured Query Language) today

A spreadsheet is a basic form of a database (Excel, Google Sheets, etc.)

We are creating a form, asking for favorite show and genre.

In a Google form, the results can be displayed in a spreadsheet. It gives you a three columns (time stamp, show, genre).

Within Google Sheets, we can analyze the data. We can download the sheet as a .csv file.

Google Sheet of Form Results

We can open the CSV file in our IDE. Then we can try to create a program in Python to help analyze our data.

Let's try to find the most popular show. We can tally each occurrence of tv shows.

We can create a dictionary, with key-value pairs to store that information.

We will loop through, add a title to the dictionary if it is the first time. If it is repeating, we can add to the count of that tv show.

import csv

counts = {}

with open("CS50.csv", "r") as file: # "r" is read mode
    reader = csv.DictReader(file)
    
    for row in reader:
    
        title = row["title"]
        
        if title in counts:
            counts[title] += 1 # if its in the dict, add 1
        else:
            counts[title] = 1       

def f(item):
    return item[1]          

for title, count in sorted(counts.items(), key=f): 
    print(title, count, sep=" | ")      

We define a function f(item) in order to sort the results by number of counts (highest to lowest).

We can also create lambda function and avoid defining f(item) at all. It is like creating an anonymous function.

for title, count in sorted(counts.items(), key=lambda item: item[1])
    print(title, count, sep=" | ")

One other thing which was affecting our data set was capitalizations and/or abbreviations. Some users may have wrote 'Game Of Thrones', others 'Game of Thrones', and so on. These were all represented separately in the dictionary.

In our for loop above, we can force the title's to lower case and our results improve.

title = row["title"].lower()

SQL Time

In terminal, run sqlite3

It is a command line program which lets you use SQL interactively. It will start the program.

sqlite3 favorites.db
sqlite> .mode csv
sqlite> .import "CS50.csv" favorites

The last line in the terminal, imported the csv data into a table within the favorites database. You can now run SQL commands in the terminal and retrieve the favorite tv data.

It is helpful in data retrieval. We do not have to create a Python program. Here we can just write a quick query to the table in the database.

Relational Databases

CSV files hold data but it is not organized at all. We have to parse and sort through it to find what we want.

Relational databases store data and provide faster access. There are GUIs for databases, we don't just have to work in the command line.

.schema

A schema is how the database table is set up, what data and data types it contains.

C.R.U.D.

  • Create
  • Read
  • Update
  • Delete

The main (CRUD) commands of SQL are

  • INSERT
  • SELECT
  • UPDATE
  • DELETE

The basic syntax is something like this: CREATE TABLE table (column, type, ...);

There are only a handful dozen of selected words/functions

SQL MAIN DATA TYPES

  • BLOB (binary large object)
  • INTEGER

    • smallint
    • integer
    • bigint
  • NUMERIC

    • boolean
    • date
    • datatime
    • numeric(scale,precision)
    • time
    • timestamp
  • REAL

    • real
    • double precision
  • TEXT

    • char(n)
    • varchar(n)
    • text

To discuss a problem we had in Python, capitalization or lack thereof, in SQL, we can use a keyword, LIKE

SELECT title FROM favorites WHERE title LIKE "%office%";

With the % %, anything can come before or after so it will return "The Office, "The office", "Office", "the office".

And if we are just looking for the number of Office favorites,

SELECT COUNT(title) FROM favorites WHERE title LIKE "%office%";

The general syntax for SELECT is: SELECT columns FROM table WHERE condition;

The general syntax for UPDATE is: UPDATE table SET column=value WHERE condition;

The general syntax for DELETE is: DLETE FROM table WHERE condition;

⚠️ This can be very dangerous as it will delete the table and data.

Here is how to clean up all the "Office" results with our wildcard search, using UPDATE:

UPDATE favorites SET title = "The Office" WHERE title LIKE "%office%";

All "The Office" entries are now uniform.


Working with IMDB Data (TSV)

Tab separated values, not comma separated values (.csv).

TSV Data format

Here is a Python script on reading through the .tsv file and writing a .csv file:

import csv

with open("title.basics.tsv", "r") as titles:
    reader = csv.DictReader(titles, delimiter="\t") 
    
    with open ("shows0.csv", "w") as shows:
    
        writer = csv.writer(shows)
        writer.writerow(["tconst", "primaryTitle", "startYear", "genres"])
        
        for row in reader:
        
            if row["startYear"] != "\\N":
            
                year = int(row["startYear"])
                if year >= 1970:
            
                    if row["titleType"] == "tvSeries" and row ["isAdult"] == "0":
            
                        writer.writerow([row["tconst"], row["primaryTitle"], row["startYear"], row["genres"]])

Working with Python, SQL, & IMDB Data

We are now going to write a Python script to take IMDB data from .tsv file and save it to an SQL file.

import cs50
import csv

# Create empty file
open("shows3.db", "w").close()

# Open file for SQLite
db = cs50.SQL("sqlite:///shows3.db")

# Create table in our database - SQL commands
db.execute("CREATE TABLE shows (tconst TEXT, primaryTitle TEXT, startYear NUMERIC, genres TEXT)")

# Open tsv file
with open("title.basics.tsv", "r") as titles:

    # Create DictReader
    reader = csv.DictReader(titles, delimiter="\t")

    for row in reader:
    
        if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
            
            if row["startYear"] != "\\N":
                
                startYear = int(row["startYear"])
                
                if startYear >= 1970:
                
                    tconst = row["tconst"]
                    primaryTitle = row["primaryTitle"]
                    genres = row["genres"]
                    
                    db.execute("INSERT INTO shows (tconst, primaryTitle, startYear, genres) VALUES(?, ?, ?, ?), 
                    tconst, primaryTitle, startYear, genres"

So, for every non adult tv show made after 1970, we will insert it into our database.

We can run many different queries on this table now.

A New Concept...Primary Keys

import cs50
import csv


open("shows4.db", "w").close()
db = cs50.SQL("sqlite:///shows4.db")

# Create tables
db.execute("CREATE TABLE shows (id INT, title TEXT, year NUMERIC, PRIMARY KEY(id))")
db.execute("CREATE TABLE genres (show_id INT, genre TEXT, FOREIGN KEY(show_id) REFERENES shows(id))")

# Open tsv file
with open("title.basics.tsv", "r") as titles:

    # Create DictReader
    reader = csv.DictReader(titles, delimiter="\t")

    for row in reader:
    
        if row["titleType"] == "tvSeries" and row["isAdult"] == "0":
            
            if row["startYear"] != "\\N":
                
                startYear = int(row["startYear"])
                
                if startYear >= 1970:
                
                    tconst = row["tconst"]
                    primaryTitle = row["primaryTitle"]
                    genres = row["genres"]
                    
                    db.execute("INSERT INTO shows (tconst, primaryTitle, startYear, genres) VALUES(?, ?, ?, ?), 
                    tconst, primaryTitle, startYear, genres"
db.execute("CREATE TABLE shows (id INT, title TEXT, year NUMERIC, PRIMARY KEY(id))")
db.execute("CREATE TABLE genres (show_id INT, genre TEXT, FOREIGN KEY(show_id) REFERENES shows(id))")

Here we are creating two tables, which are related to each other. The genres are stored in a separate table. We can take the show id and search the genres table to find out the show's genre.

Here is a SQL query nested example:

SELECT * FROM shows WHERE id IN (SELECT show_id FROM genres WHERE genre = "Comedy") AND year = 2019;

One to many Relationship

A one-to-many relationship (A show can have 0-many genres).

CS50 team has a made a shows database with six tables (genres, people, shows, stars, writers, ratings) shown below. It is been opened in a GUI instead of the command line.

GUI of tables in our database

You can join tables as well based on the common ID. SELECT * FROM shows JOIN genres ON shows.id = genres.show_id;

All of Ellen's tv shows. We can start by selecting Ellen from the people table:

SELECT * FROM people WHERE name = "Ellen DeGeneres";

It returns her name and id (1122).

With knowing her id, we can select other information from other tables.

Static version: SELECT * FROM stars WHERE person_id = 1122;

Dynamic version: SELECT * FROM stars WHERE person_id = (SELECT id FROM people WHERE name = "Ellen DeGeneres");

This returns a list of show_id where Ellen starred in.

Alternatively, we can kind of join all that into one query to get the shows instead just the show id's. Will save us a step.

SELECT * FROM shows WHERE id IN (SELECT show_id FROM stars WHERE person_id = (SELECT id FROM people WHERE name = "Ellen DeGeneres"));

The above returns the actual show titles (along with the year and episodes) now.

Same query for Ellen using JOINs:

SELECT title FROM
people JOIN stars ON people.id = stars.person_id JOIN
shows ON stars.show_id = shows.id
WHERE name = "Ellen DeGeneres";

Searching & Race Conditions

  • PRIMARY KEY: Unique identifier
  • FOREIGN KEY: Same identifier in another table
  • UNIQUE
  • INDEX

Indexing tables...searches more efficiently.

Race conditions -

A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly.

Source

Basically want to try to avoid having two things done at the same time. A human example is if you check the fridge for milk, there is none, so you will go buy some. If your roommate checks for milk ten seconds after you, notices the same, he/she will go buy milk as well. You don't want this happening.

Likes on Social Media

rows = db.execute("SELECT likes FROM posts WHERE id=?", id);
likes = rows[0]["likes"]
db.execute("UPDATE posts SET likes = ?", likes + 1);

Resources, Sources, & Useful Links

Lecture (on YouTube)

sqlite

CS50 IDE

CS50 Course on edX

cs50, sql