Programming Assignment #7: File Input/Ouput
Due: at the beginnning of recitation, Week of 10/30/2006
There is no recitation for the week of 10/23/2006 (Fall Break).
Overview
For this programming assignment, you have your choice
between the following two options:
- Text File I/O:
Mini Netflix Challenge, a movie ratings predictor (100 points)
- Image File I/O:
The Goslingator, an image manipulator (100 points)
For general information on Text or Image File I/O in Java,
click on the above links or visit the
ENGR 131 webpage for more details.
Even though you have two weeks for this assignment,
start early and pace yourself!
(In particular, it would probably be a mistake to
think you could read everything you need to and write the programs
the night before your recitation; a reasonable plan
would have you reading this document asap, making your choice of A
or B and downloading all applicable
files by Wed. 25 OCT, and finishing the minimum requirements for it
by Sun. 29 OCT.)
Another hint is that once you have satisfied the minimum requirements
for your choice, you should save those
working programs and results (in separate files/directories!) before
trying something more sophisticated.
In either case you choose, your life will be greatly simplified if
you build upon or modify the following example programs provided
in the ENGR 131 Code Repository:
GLRatings.java,
ImageFun.java
This type of development,
where you write or modify a minimal/similar program and always proceed by
making small changes to obtain other working programs, is
known as iterated development or successive
refinement.
This is common practice among top programmers for building
complex applications. In fact, it was used
by
Paul Buchheit
— a retired-by-30 Case grad —
to create Gmail
and heavily advocated in his talk at Case last Thursday.
"Task breakdowns" are provided for the programs
below to help you learn this style of
development. For more on this topic, you should read the document entitled
Program Development Plan
linked here and on the course webpage.
And above all, have fun!
Part A: Mini Netflix Challenge (100 points)
Background
Netflix recently
announced the $1 million
Netflix Prize,
which aims to "substantially improve the accuracy of predictions about how much
someone is going to love a movie based on their movie preferences."
Netflix currently has its own prediction program, Cinematch. The prize is
for a program that achieves performance better than Cinematch
(with $50,000 awarded for a 1% improvement and the cool $1 million for a 10% improvement).
Programming teams who sign up to compete for the Netflix prize will gain access to
datafiles containing 100 million movie ratings, spanning 50,000 movie titles
and 500,000 users (which we will call the "training dataset").
Predictions are tested by using another set of data
(the "test dataset").
Improvements are measured with respect to root mean squared error (RMSE)
of the program's predicted rating for a user-movie pair,
pred(u,m) versus the
actual rating,
actual(u,m),
in the test dataset. The RMSE can be computed from
the sum of squared errors (SSE) as follows:
SSE = sumover all u,m pairs in test dataset
( actual(u,m) - pred(u,m) )2
RMSE = sqrt( (1/T) SSE ),
where T is the number of items the SSE has been summed over.
In this programming assignment, you will be using (part of) a smaller data set,
consisting of only 100,000 entries
[Source: GroupLens Research Project,
http://www.grouplens.org/]
The "README" file for this data describes it as follows:
u.data --- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user_id | item_id | rating | timestamp.
The time stamps are unix seconds since 1/1/1970 UTC
So, "user_id" is an integer number (from 1 to 943), and
the "item_id" is an integer number (from 1 to 1682), which you can think of as
a stand-in for its title/year. The rating is an integer number between 1
and 5, inclusive.
For the purposes of this assignment, this data has been (randomly)
split into three pieces:
- 80,000 entries to be used as a training dataset:
train80000.txt
- 10,000 entries to be used as a test dataset for you to test
your predictions on:
test10000.txt
- 10,000 entries to be used as a "secret" test dataset
on high-performing programs after all assignments have been turned in
(see details in the "Contest" section below).
The first two files, as well as some smaller analogous ones that might be easier
for de-bugging purposes, are available for download in the
hw7 folder of the ENGR
Code Repository.
Each of these files contains ASCII text data in the
same format as that used in "u.data" described above.
However, it is now possible that some movies will not have
been rated and that you will find that some users have
not reviewed 20 movies (or even none at all).
The minimum assignment requirements are as follows:
- Read in train80000.txt, computing the average rating for
each user and the average rating for each movie (both double's).
You'll have to do something special here to deal with computing the average
when the user or item "N"'s are zero.
- Write your average user scores and average movie scores to
separate text files, using a single line per id and the following
format:
id | average
- Read in test10000.txt, predicting two different ratings
using only the user_id and item_id:
- the average rating of that user as your first prediction, and
- the average rating for that movie as your second prediction.
Note: you will have to use an if statement and do something
special in the case where no such was available (because the "N" in
one or both of the averages was zero.)
- Compute a running SSE calculation for each of the rating
prediction methods.
- At the end of processing compute
and report the RMSE (not the SSE) to the screen.
What you do after that is up to you! You can use a weighted combination
of the averages, take into account the number of ratings a movie gets,
use the timestamp (guess the rating will be the same as the previous
rating from that user), etc. to create a better predictor.
The best predictor program
will receive a prize (details below).
You only have to perform all your calculations using a Java
program written by you.
Example Task Breakdown:
Don't try to write the whole program all at once or all by yourself.
Instead, start with the working program (GLRatings.java) and then
modify it successively until you've met all the requirements.
For example, you might follow a task breakdown that looks roughly like this:
- Copy the relevant .txt files into a directory/folder on your
machine. Also, copy over the file
u.data (available in the Code
Repository).
- Copy the program
GLRatings.java
into the same directory.
Also, you must copy over the program
TextFileIO.java
into the same directory/project, because we use
its methods to do text file input/output.
Now, compile and run GLRatings.java.
Check out its output files (using some text editing/printing program).
-
Modify the program to read train80000.txt instead and add appropriate
lines to deal with possible divisions by zero in the movie average.
Compile and run. Test/debug. Save.
- Modify the
program to compute averages (instead of just N) for each
user as well as for each movie.
Again, deal with division by zero.
You can also change the name of the user output
file to be more appropriate.
Again, keep
going until you have a working program.
- Next: Open the test data file using the openRead().
You'll have to come up with a name for the new Scanner here.
Implement another while loop on the new Scanner,
that reads in the data. Add a line to close the test data input file.
- [Hopefully, you get the idea. Let's use bigger steps
now....]
- Add the code to make one of the predictions.
- Add the SSE calculation to the while loop.
- Add its RMSE calculation after the while loop.
- Repeat the above three steps to add the other prediction method.
Note: the above is an example task breakdown, and there are other ways to
breakdown or approach the problem. For example, things can be
done differently if you store various data (like predicted and actual
ratings) in arrays. In that case, different while or for
loops can be used for each piece.
Turn In
- Your commented program (75 points).
- A written description of how your program works, including high-level
pseudo-code for the entire program and a detailed description
of your prediction algorithm (15 points).
- Your commented output (5 points).
Specifically, produce the RMSE output and comment on both it and
the average rating data.
Do not print out the average
user/movie ratings files.
- Also, complete and turn in the survey below for this part (5 points).
- Note: Hardcopies and electronic versions of your code, writeups, and results
should be turned in following the procedures set by your TA.
Again, do not produce hardcopy of the average user/movie ratings files.
Do produce it for the RMSE data and turn in as requested.
Contest:
Each TA will pick 1 or 2 of the best programs (judged
by RMSE on the test10000.txt dataset and your description of the algorithm)
in their section to be forwarded to Prof.
Branicky for checking against the "secret" test data.
However, only those which improve on the best of the two baseline algorithms by
at least 1% will be considered for forwarding.
The student
with the best performing program (lowest RMSE error on the secret test dataset)
will win their choice of the
following DVDs that we have played clips from during the semester:
The Matrix (if-then);
Lord of the Rings: Return of the King (while);
The Matrix Reloaded (for);
i, Robot (2-D Arrays);
or Ice Age (Files 1)
Other well-performing programs might be imortalized on a class
Leaderboard.
However, any programs which access or otherwise use the full 100,000 record data set
will be disqualified.
Part B: The Goslingator (100 points)
Background
The class has been enlisted by Comedy Central to help produce
a television segment in honor of Java. As part of the tribute,
they wish to
depict Java's creator, James
Gosling, in a variety of
funny/corny ways (that's their job).
They have collected a variety of Gosling images, as well as a variety
of images to "paste" him into that might prove amusing.
(All images are available in the
hw7 folder of the ENGR
131 Code Repository.)
They have split the work into 7 sections (one per recitation), as
follows:
-
T 10:00am (Alex): Make Gosling look like a U.S. President (use one of the pres?.jpg images)
-
T 1:15pm (James): Make Gosling look like an astronaut (use one of the astro?.jpg images)
-
T 2:45pm (David A.): Make Gosling look like an American icon (use one of the icon?.jpg images)
-
T 4:30pm (Kyle): Make Gosling look like a wizard (use one of the wizard?.jpg images)
-
W 9:30pm (Roman P.): Make Gosling look like a high-powered businessman
(use one of the trump?.jpg images)
-
W 11:30pm (Dave R.): Make Gosling look like a starship captain (use one of
the kirk?.jpg images)
-
W 11:30pm (BarbaraJoy): Make Gosling look like a hip-hop star (use one of
the diddy?.jpg images)
The minimum assignment requirements are as follows:
- Pick two images: one of Gosling (which we'll refer to as "source1") and one from your choices
outlined above (henceforth, "source2").
- Have your program read in both source images and print out
the starting indices and sizes (width and height) of each
image.
- Have your program grab an appropriate rectangle from source1 and place it at an
appropriate position in source2 to fabricate the appropriate funny
image.
- Write the resulting image ("destination") to a .jpg file. As a file name,
use your Case network ID, followed by the source2 image you used, e.g.,
msb11astro3.jpg
What you do after that is up to you! You can use color matching,
shapes other than rectangles, etc.
The best images in each category
will receive fame and fortune (details in the "Contest" section below).
You only have to perform all your manipulations using a Java
program written by you.
Examine
the writeup on
Image File I/O
on the course webpage
and the Java program
ImageFun.java
in the Code Repository to provide the basis for your program.
Besides reading and writing the image files,
you will need to use nested for loops
and the following two methods to perform the (minimum) assignment
- setRGB(i,j,color) can be used to set the color of the
(i,j) pixel of a image to the value specified in the int color.
Here,
- i is counted from its minimum value (MinX, usually 0) up to (MinX+Width) along
the horizontal dimension of the image from left to right.
- j is counted from its minimum value (MinY, usually 0) up to (MinX+Height) along
the vertical dimension of the image from top to bottom;
-
The integer value 0 represents the color black.
Other colors can be achieved using the techniques described in the handout on
Image File I/O, Example 2.
- getRGB(i,j) returns the color value of the
(i,j) pixel of a image.
Example Task Breakdown:
Again, don't try to write the whole program all at once or all by yourself.
Instead, start with the working program (ImageFun.java) and then
modify it successively until you've met all the requirements.
For example, you might follow a task breakdown that looks something like this:
- Copy the relevant jpg files into a directory/folder on your
machine.
- Copy the Java program
ImageFun.java
and the image it uses onto your machine
from the ENGR 131 Code Repository onto your machine.
- Compile and run ImageFun.java. View its output image file by opening
it using a browser or other application (note: you should be able to
simply just double-click on its icon).
- Modify the code so that it reads in one of your chosen images,
and outputs its starting indices and size. (You can comment out
the for loop and file writing at this point.)
Make sure you change the filename, and--if you wish--the File and BufferedImage
object names and the corresponding method calls throughout the program
appropriately.
- Modify the code so that it reads in both of your chosen images,
and outputs their starting indices and size.
You'll now have to use new File and BufferedImage
object names and corresponding method calls throughout appropriately.
- Guess or use a ruler (knowing the width/height in pixels and using the
well-known concept of a "ratio") to estimate the starting and ending indices
of the rectangle you wish to cut from source1 and place in source2.
- Modify the code so that it now also colors your source1 rectangle
black and outputs that file. Make sure the rectangle
you choose is "in bounds". Examine the output image. Tweak the
start/end numbers, compile, run, examine. Repeat as necessary.
- Modify the code so that the same for loops now also color your source2
rectangle black, but instead outputs that file.
Hint: offset the (i,j) indices in the method call to setRGB
for the source2 image differently now! E.g., setRGB(i+90, j-8, BLACK)
Examine the output image. Tweak the
offsets, compile, run, examine. Repeat as necessary.
- Modify the code so that it places grabbed pixels from the
source1 (using getRGB) into the source2 rectangle (using
setRGB).
Examine the output image. Tweak.
Note: this is an example task breakdown and there are other ways to
breakdown or approach the problem. For example, one different way
to approach the problem is to store the cut rectangle from source1
in an array that is then pasted into source2 using a separate
nested for loop. Array storage may be particularly useful
if you want to try some advanced color matching/replacement (e.g.,
replace every bluish background pixel in the source1 rectangle
with a sample background color from source2.)
Turn In
- Your commented program (75 points).
- A written description of how your program works, including high-level
pseudo-code for the entire program and a detailed description
of your how you performed the image manipulation (15 points).
- Your final product image (5 points). If hardcopy is requested by your
TA, attempt to print your result on a color printer.
- Also, complete and turn in the survey below for this part (5 points).
- Note: Hardcopies and electronic versions of your code, writeups, and results
should be turned in following the procedures set by your TA.
Contest:
Each TA will pick the best Goslingated image for their section.
Electronic copies of the winning images will be displayed
on the course website, along with the name of their creators.
Survey
Please reply to the following survey questions with regard to this assignment:
- Rate the difficulty for you on a scale of 1 to 10 (1 = Very easy for me,
..., 10 = Very difficult for me).
- Rate your understanding of the requirements on a scale of 1 to 10 (1 = Did
not understand anything, ..., 10 = Understood everything).
- How long did it take you (total hours spent) to complete the minimum
portion of the assignment?
- How long did you spend (hours) working on extensions?
- Comment on anything you particularly enjoyed or had problems with.
Created: 2006-10-14.
Last Modified: 2006-10-16. © Michael S. Branicky