Homework 5 (SQL) -- due Nov. 6
26 Oct 2011This homework will use the following SQLite databases (same as we used in class):
Several tables of UN data, from UNdata.
Each table contains data for all (or almost all) countries for a range of years. The tables can be linked through the country and year fields. All fields are in the obvious units, except for
precip
, which is in millions of cubic meters.100k movie ratings, from MovieLens. The database link will be removed after the course is over, but this script will generate the sqlite database from the original MovieLens data set.
The database contains tables
movies
,ratings
, andusers
. Each user may have ranked many movies. The ratings table is linked to the others through themovie_id
anduser_id
fields.
See the week 10 agenda for information on what software we used to query the databases.
Using the UN database, answer the following questions in a text file. Include the SQL statements you used to find the answer. The questions don't state this explicitly, but assume they always mean "given the available data" (since some countries have data missing in some years).
- What was the smallest total population in the data set? Which country?
- What were the top 3 countries in terms of precipitation in 2000?
- Which country starting with "A" had the largest population in 1995?
- Which countries in the year 2000 had a total population of over 1 million and
a ratio of males to females of over 1.2? Note: to compute the ratio, you'll have
to use the expression
1.0 * male / female
, because otherwise the fractional part of the ratio is dropped due to integer division.
Use the movie database to answer the next set of questions:
- What are all the movie titles that were classified as both a thriller and a musical?
- What movies are listed with titles that start with the word "Dead"?
- How many males and females provided ratings?
- What occupation was the most common among the users?
What were the top 3 movies in terms of average rating? Your query should look like (you have to fill in the
???
with the correct field names):select ???, avg(???) as avg_rating from movies natural join ratings group by ??? order by avg_rating desc limit 3;
What were the top 3 movies in terms of average rating that had at least 10 ratings? Your query should look like (you have to fill in the
???
with the correct field names):select ???, avg(???) as avg_rating, count(*) as num_ratings from movies natural join ratings group by ??? having num_ratings >= 10 order by avg_rating desc limit 3;