Homework 5 (SQL) -- solutions

20 Nov 2011

This homework will use the following SQLite databases (same as we used in class):

  • Several tables of UN data, from UNdata.

    Each table contains data for all (or almost all) countries for a range of years. The tables can be linked through the country and year fields. All fields are in the obvious units, except for precip, which is in millions of cubic meters.

  • 100k movie ratings, from MovieLens. The database link will be removed after the course is over, but this script will generate the sqlite database from the original MovieLens data set.

    The database contains tables movies, ratings, and users. Each user may have ranked many movies. The ratings table is linked to the others through the movie_id and user_id fields.

See the week 10 agenda for information on what software we used to query the databases.

Using the UN database, answer the following questions in a text file. Include the SQL statements you used to find the answer. The questions don't state this explicitly, but assume they always mean "given the available data" (since some countries have data missing in some years).

  1. What was the smallest total population in the data set? Which country?

    select min(male + female) from population;
    

    min(male + female)
    44

    select male+female as total, country from population 
    order by total asc limit 1;
    

    total country
    44 Pitcairn Islands

  2. What were the top 3 countries in terms of precipitation in 2000?

    select country, precip from precip 
    where year = 2000 
    order by precip desc limit 3;
    

    country precip
    China 6009200
    Colombia 2529872
    Venezuela 1360000

  3. Which country starting with "A" had the largest population in 1995?

    select country, male+female as total from population
    where country like 'A%' and year = 1995
    order by total desc limit 1;
    

    country total
    Argentina 34834902

  4. Which countries in the year 2000 had a total population of over 1 million and a ratio of males to females of over 1.2? Note: to compute the ratio, you'll have to use the expression 1.0 * male / female, because otherwise the fractional part of the ratio is dropped due to integer division.

    select country, male + female as total, 1.0 * male / female as ratio
    from population 
    where total > 1e6 and ratio > 1.2 and year = 2000;
    

    country total ratio
    Saudi Arabia 20806589 1.24039237952202
    United Arab Emirates 3247219 2.0717658766571
    Oman 2402184 1.37724668230265
    Kuwait 2228362 1.55105809354268

Use the movie database to answer the next set of questions:

  1. What are all the movie titles that were classified as both a thriller and a musical?

    select title from movies where genre_thriller = 1 and genre_musical = 1;
    

    title
    Muppet Treasure Island (1996)

  2. What movies are listed with titles that start with the word "Dead"?

    select title from movies where title like 'Dead%';
    

    title
    Dead Man Walking (1995)
    Dead Poets Society (1989)
    Dead Man (1995)
    Dead Presidents (1995)

  3. How many males and females provided ratings?

    select gender, count(*) from users group by gender;
    

    gender count(*)
    F 273
    M 670

  4. What occupation was the most common among the users?

    select occupation, count(*) as count from users
    group by occupation order by count desc;
    

    occupation count
    student 196
    other 105
    educator 95
    administrator 79
    engineer 67
    programmer 66
    librarian 51
    writer 45
    executive 32
    scientist 31
    artist 28
    technician 27
    marketing 26
    entertainment 18
    healthcare 16
    retired 14
    lawyer 12
    salesman 12
    none 9
    doctor 7
    homemaker 7

  5. What were the top 3 movies in terms of average rating? Your query should look like (you have to fill in the ??? with the correct field names):

    select ???, avg(???) as avg_rating from movies natural join ratings 
        group by ??? order by avg_rating desc limit 3;
    
    select title, avg(rating) as avg_rating from movies natural join ratings 
        group by title order by avg_rating desc limit 3;
    

    avg_rating
    Aiqing wansui (1994) 5.0
    Entertaining Angels: The Dorothy Day Story (1996) 5.0
    Great Day in Harlem, A (1994) 5.0

    (Your answers might differ, because there are more than 3 movies that had a perfect 5.0 average.)

  6. What were the top 3 movies in terms of average rating that had at least 10 ratings? Your query should look like (you have to fill in the ??? with the correct field names):

    select ???, avg(???) as avg_rating, count(*) as num_ratings 
        from movies natural join ratings 
        group by ??? having num_ratings >= 10 order by avg_rating desc limit 3;
    
    select title, avg(rating) as avg_rating, count(*) as num_ratings 
        from movies natural join ratings 
        group by title having num_ratings >= 10 order by avg_rating desc limit 3;
    


    title avg_rating num_ratings
    Close Shave, A (1995) 4.49107142857143 112
    Schindler's List (1993) 4.46644295302013 298
    Wrong Trousers, The (1993) 4.46610169491525 118