## Homework 5 (SQL) -- solutions

20 Nov 2011

This homework will use the following SQLite databases (same as we used in class):

• Each table contains data for all (or almost all) countries for a range of years. The tables can be linked through the country and year fields. All fields are in the obvious units, except for precip, which is in millions of cubic meters.

• 100k movie ratings, from MovieLens. The database link will be removed after the course is over, but this script will generate the sqlite database from the original MovieLens data set.

The database contains tables movies, ratings, and users. Each user may have ranked many movies. The ratings table is linked to the others through the movie_id and user_id fields.

See the week 10 agenda for information on what software we used to query the databases.

Using the UN database, answer the following questions in a text file. Include the SQL statements you used to find the answer. The questions don't state this explicitly, but assume they always mean "given the available data" (since some countries have data missing in some years).

1. What was the smallest total population in the data set? Which country?

select min(male + female) from population;


min(male + female)
44

select male+female as total, country from population
order by total asc limit 1;


total country
44 Pitcairn Islands

2. What were the top 3 countries in terms of precipitation in 2000?

select country, precip from precip
where year = 2000
order by precip desc limit 3;


country precip
China 6009200
Colombia 2529872
Venezuela 1360000

3. Which country starting with "A" had the largest population in 1995?

select country, male+female as total from population
where country like 'A%' and year = 1995
order by total desc limit 1;


country total
Argentina 34834902

4. Which countries in the year 2000 had a total population of over 1 million and a ratio of males to females of over 1.2? Note: to compute the ratio, you'll have to use the expression 1.0 * male / female, because otherwise the fractional part of the ratio is dropped due to integer division.

select country, male + female as total, 1.0 * male / female as ratio
from population
where total > 1e6 and ratio > 1.2 and year = 2000;


country total ratio
Saudi Arabia 20806589 1.24039237952202
United Arab Emirates 3247219 2.0717658766571
Oman 2402184 1.37724668230265
Kuwait 2228362 1.55105809354268

Use the movie database to answer the next set of questions:

1. What are all the movie titles that were classified as both a thriller and a musical?

select title from movies where genre_thriller = 1 and genre_musical = 1;


title
Muppet Treasure Island (1996)

select title from movies where title like 'Dead%';


title

3. How many males and females provided ratings?

select gender, count(*) from users group by gender;


gender count(*)
F 273
M 670

4. What occupation was the most common among the users?

select occupation, count(*) as count from users
group by occupation order by count desc;


occupation count
student 196
other 105
educator 95
engineer 67
programmer 66
librarian 51
writer 45
executive 32
scientist 31
artist 28
technician 27
marketing 26
entertainment 18
healthcare 16
retired 14
lawyer 12
salesman 12
none 9
doctor 7
homemaker 7

5. What were the top 3 movies in terms of average rating? Your query should look like (you have to fill in the ??? with the correct field names):

select ???, avg(???) as avg_rating from movies natural join ratings
group by ??? order by avg_rating desc limit 3;

select title, avg(rating) as avg_rating from movies natural join ratings
group by title order by avg_rating desc limit 3;


avg_rating
Aiqing wansui (1994) 5.0
Entertaining Angels: The Dorothy Day Story (1996) 5.0
Great Day in Harlem, A (1994) 5.0

(Your answers might differ, because there are more than 3 movies that had a perfect 5.0 average.)

6. What were the top 3 movies in terms of average rating that had at least 10 ratings? Your query should look like (you have to fill in the ??? with the correct field names):

select ???, avg(???) as avg_rating, count(*) as num_ratings
from movies natural join ratings
group by ??? having num_ratings >= 10 order by avg_rating desc limit 3;

select title, avg(rating) as avg_rating, count(*) as num_ratings
from movies natural join ratings
group by title having num_ratings >= 10 order by avg_rating desc limit 3;


title avg_rating num_ratings
Close Shave, A (1995) 4.49107142857143 112
Schindler's List (1993) 4.46644295302013 298
Wrong Trousers, The (1993) 4.46610169491525 118