Programming Assignment #7: File Input/Ouput

Due: at the beginnning of recitation, Week of 10/30/2006
There is no recitation for the week of 10/23/2006 (Fall Break).

Overview

For this programming assignment, you have your choice between the following two options:
  1. Text File I/O: Mini Netflix Challenge, a movie ratings predictor (100 points)
  2. Image File I/O: The Goslingator, an image manipulator (100 points)

For general information on Text or Image File I/O in Java, click on the above links or visit the ENGR 131 webpage for more details.

Even though you have two weeks for this assignment, start early and pace yourself! (In particular, it would probably be a mistake to think you could read everything you need to and write the programs the night before your recitation; a reasonable plan would have you reading this document asap, making your choice of A or B and downloading all applicable files by Wed. 25 OCT, and finishing the minimum requirements for it by Sun. 29 OCT.) Another hint is that once you have satisfied the minimum requirements for your choice, you should save those working programs and results (in separate files/directories!) before trying something more sophisticated.

In either case you choose, your life will be greatly simplified if you build upon or modify the following example programs provided in the ENGR 131 Code Repository:

GLRatings.java, ImageFun.java

This type of development, where you write or modify a minimal/similar program and always proceed by making small changes to obtain other working programs, is known as iterated development or successive refinement. This is common practice among top programmers for building complex applications. In fact, it was used by Paul Buchheit — a retired-by-30 Case grad — to create Gmail and heavily advocated in his talk at Case last Thursday. "Task breakdowns" are provided for the programs below to help you learn this style of development. For more on this topic, you should read the document entitled Program Development Plan linked here and on the course webpage.

And above all, have fun!

Part A: Mini Netflix Challenge (100 points)

Background

Netflix recently announced the $1 million Netflix Prize, which aims to "substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences."

Netflix currently has its own prediction program, Cinematch. The prize is for a program that achieves performance better than Cinematch (with $50,000 awarded for a 1% improvement and the cool $1 million for a 10% improvement). Programming teams who sign up to compete for the Netflix prize will gain access to datafiles containing 100 million movie ratings, spanning 50,000 movie titles and 500,000 users (which we will call the "training dataset"). Predictions are tested by using another set of data (the "test dataset"). Improvements are measured with respect to root mean squared error (RMSE) of the program's predicted rating for a user-movie pair, pred(u,m) versus the actual rating, actual(u,m), in the test dataset. The RMSE can be computed from the sum of squared errors (SSE) as follows:

SSE = sumover all u,m pairs in test dataset ( actual(u,m) - pred(u,m) )2

RMSE = sqrt( (1/T) SSE ),
where T is the number of items the SSE has been summed over.

In this programming assignment, you will be using (part of) a smaller data set, consisting of only 100,000 entries [Source: GroupLens Research Project, http://www.grouplens.org/] The "README" file for this data describes it as follows:

u.data ---  The full u data set, 100000 ratings by 943 users on 1682 items.
            Each user has rated at least 20 movies.  Users and items are
            numbered consecutively from 1.  The data is randomly
            ordered. This is a tab separated list of 
	       user_id | item_id | rating | timestamp. 
            The time stamps are unix seconds since 1/1/1970 UTC 
So, "user_id" is an integer number (from 1 to 943), and the "item_id" is an integer number (from 1 to 1682), which you can think of as a stand-in for its title/year. The rating is an integer number between 1 and 5, inclusive.

For the purposes of this assignment, this data has been (randomly) split into three pieces:

The first two files, as well as some smaller analogous ones that might be easier for de-bugging purposes, are available for download in the hw7 folder of the ENGR Code Repository. Each of these files contains ASCII text data in the same format as that used in "u.data" described above. However, it is now possible that some movies will not have been rated and that you will find that some users have not reviewed 20 movies (or even none at all).

The minimum assignment requirements are as follows:

What you do after that is up to you! You can use a weighted combination of the averages, take into account the number of ratings a movie gets, use the timestamp (guess the rating will be the same as the previous rating from that user), etc. to create a better predictor. The best predictor program will receive a prize (details below). You only have to perform all your calculations using a Java program written by you.

Example Task Breakdown:

Don't try to write the whole program all at once or all by yourself. Instead, start with the working program (GLRatings.java) and then modify it successively until you've met all the requirements. For example, you might follow a task breakdown that looks roughly like this:

Note: the above is an example task breakdown, and there are other ways to breakdown or approach the problem. For example, things can be done differently if you store various data (like predicted and actual ratings) in arrays. In that case, different while or for loops can be used for each piece.

Turn In

Contest:

Each TA will pick 1 or 2 of the best programs (judged by RMSE on the test10000.txt dataset and your description of the algorithm) in their section to be forwarded to Prof. Branicky for checking against the "secret" test data. However, only those which improve on the best of the two baseline algorithms by at least 1% will be considered for forwarding. The student with the best performing program (lowest RMSE error on the secret test dataset) will win their choice of the following DVDs that we have played clips from during the semester:
The Matrix (if-then); Lord of the Rings: Return of the King (while); The Matrix Reloaded (for); i, Robot (2-D Arrays); or Ice Age (Files 1)
Other well-performing programs might be imortalized on a class Leaderboard.

However, any programs which access or otherwise use the full 100,000 record data set will be disqualified.

Part B: The Goslingator (100 points)

Background

The class has been enlisted by Comedy Central to help produce a television segment in honor of Java. As part of the tribute, they wish to depict Java's creator, James Gosling, in a variety of funny/corny ways (that's their job).

They have collected a variety of Gosling images, as well as a variety of images to "paste" him into that might prove amusing. (All images are available in the hw7 folder of the ENGR 131 Code Repository.) They have split the work into 7 sections (one per recitation), as follows:

The minimum assignment requirements are as follows:

What you do after that is up to you! You can use color matching, shapes other than rectangles, etc. The best images in each category will receive fame and fortune (details in the "Contest" section below). You only have to perform all your manipulations using a Java program written by you.

Examine the writeup on Image File I/O on the course webpage and the Java program ImageFun.java in the Code Repository to provide the basis for your program. Besides reading and writing the image files, you will need to use nested for loops and the following two methods to perform the (minimum) assignment

Example Task Breakdown:

Again, don't try to write the whole program all at once or all by yourself. Instead, start with the working program (ImageFun.java) and then modify it successively until you've met all the requirements. For example, you might follow a task breakdown that looks something like this: Note: this is an example task breakdown and there are other ways to breakdown or approach the problem. For example, one different way to approach the problem is to store the cut rectangle from source1 in an array that is then pasted into source2 using a separate nested for loop. Array storage may be particularly useful if you want to try some advanced color matching/replacement (e.g., replace every bluish background pixel in the source1 rectangle with a sample background color from source2.)

Turn In

Contest:

Each TA will pick the best Goslingated image for their section. Electronic copies of the winning images will be displayed on the course website, along with the name of their creators.

Survey

Please reply to the following survey questions with regard to this assignment:
  1. Rate the difficulty for you on a scale of 1 to 10 (1 = Very easy for me, ..., 10 = Very difficult for me).
  2. Rate your understanding of the requirements on a scale of 1 to 10 (1 = Did not understand anything, ..., 10 = Understood everything).
  3. How long did it take you (total hours spent) to complete the minimum portion of the assignment?
  4. How long did you spend (hours) working on extensions?
  5. Comment on anything you particularly enjoyed or had problems with.

Created: 2006-10-14. Last Modified: 2006-10-16. © Michael S. Branicky