Homework 5 (SQL) -- solutions
20 Nov 2011This homework will use the following SQLite databases (same as we used in class):
Several tables of UN data, from UNdata.
Each table contains data for all (or almost all) countries for a range of years. The tables can be linked through the country and year fields. All fields are in the obvious units, except for
precip
, which is in millions of cubic meters.100k movie ratings, from MovieLens. The database link will be removed after the course is over, but this script will generate the sqlite database from the original MovieLens data set.
The database contains tables
movies
,ratings
, andusers
. Each user may have ranked many movies. The ratings table is linked to the others through themovie_id
anduser_id
fields.
See the week 10 agenda for information on what software we used to query the databases.
Using the UN database, answer the following questions in a text file. Include the SQL statements you used to find the answer. The questions don't state this explicitly, but assume they always mean "given the available data" (since some countries have data missing in some years).
What was the smallest total population in the data set? Which country?
select min(male + female) from population;
min(male + female) 44 select male+female as total, country from population order by total asc limit 1;
total country 44 Pitcairn Islands What were the top 3 countries in terms of precipitation in 2000?
select country, precip from precip where year = 2000 order by precip desc limit 3;
country precip China 6009200 Colombia 2529872 Venezuela 1360000 Which country starting with "A" had the largest population in 1995?
select country, male+female as total from population where country like 'A%' and year = 1995 order by total desc limit 1;
country total Argentina 34834902 Which countries in the year 2000 had a total population of over 1 million and a ratio of males to females of over 1.2? Note: to compute the ratio, you'll have to use the expression
1.0 * male / female
, because otherwise the fractional part of the ratio is dropped due to integer division.select country, male + female as total, 1.0 * male / female as ratio from population where total > 1e6 and ratio > 1.2 and year = 2000;
country total ratio Saudi Arabia 20806589 1.24039237952202 United Arab Emirates 3247219 2.0717658766571 Oman 2402184 1.37724668230265 Kuwait 2228362 1.55105809354268
Use the movie database to answer the next set of questions:
What are all the movie titles that were classified as both a thriller and a musical?
select title from movies where genre_thriller = 1 and genre_musical = 1;
title Muppet Treasure Island (1996) What movies are listed with titles that start with the word "Dead"?
select title from movies where title like 'Dead%';
title Dead Man Walking (1995) Dead Poets Society (1989) Dead Man (1995) Dead Presidents (1995) How many males and females provided ratings?
select gender, count(*) from users group by gender;
gender count(*) F 273 M 670 What occupation was the most common among the users?
select occupation, count(*) as count from users group by occupation order by count desc;
occupation count student 196 other 105 educator 95 administrator 79 engineer 67 programmer 66 librarian 51 writer 45 executive 32 scientist 31 artist 28 technician 27 marketing 26 entertainment 18 healthcare 16 retired 14 lawyer 12 salesman 12 none 9 doctor 7 homemaker 7 What were the top 3 movies in terms of average rating? Your query should look like (you have to fill in the
???
with the correct field names):select ???, avg(???) as avg_rating from movies natural join ratings group by ??? order by avg_rating desc limit 3; select title, avg(rating) as avg_rating from movies natural join ratings group by title order by avg_rating desc limit 3;
avg_rating Aiqing wansui (1994) 5.0 Entertaining Angels: The Dorothy Day Story (1996) 5.0 Great Day in Harlem, A (1994) 5.0 (Your answers might differ, because there are more than 3 movies that had a perfect 5.0 average.)
What were the top 3 movies in terms of average rating that had at least 10 ratings? Your query should look like (you have to fill in the
???
with the correct field names):select ???, avg(???) as avg_rating, count(*) as num_ratings from movies natural join ratings group by ??? having num_ratings >= 10 order by avg_rating desc limit 3; select title, avg(rating) as avg_rating, count(*) as num_ratings from movies natural join ratings group by title having num_ratings >= 10 order by avg_rating desc limit 3;
title avg_rating num_ratings Close Shave, A (1995) 4.49107142857143 112 Schindler's List (1993) 4.46644295302013 298 Wrong Trousers, The (1993) 4.46610169491525 118