TopCoder problem "ResultsPredictor" used in Predictive 1 (Division I Level One)

Problem Statement

In this problem we will be predicting the outcomes of TopCoder Component competitions. For each test case, your code will receive a dataset representing the history of all component competitions that have ended before a certain date. Based on this, your code must predict the outcome of all competitions following this date. For each competition you will be given various information relating to that event, and the coders who participated in it. For each component project, you will be given all of the following pieces of data:

component_id -- a unique identifier for each component
component_version_id -- a unique identifier for the version of this component
project_id -- each component typically has two projects: design and development
catalog -- each component goes into a catalog, such as Java
component -- the name of the component
version -- the version number
project_category -- design or development
project_status -- describes the current state of the project. This will be non-null in the examples only.
posting_time -- time the component competition began
end_time -- end time for the contest
scorecard_id -- a unique identifier for the scorecard used to review the project
num_final_fixes -- the number of rounds of fixes required to fix the final project
prize -- the amount paid to the winner
is_rated -- self-explanatory
is_dr -- is this component part of the digital run
dr_points -- the number of points the winner gets in the digital run competition

In addition, you will be given a brief textual description of the problem, along with a list of keywords, and a list of technologies involved in the project.

In addition to information about the competition itself, you will be given some statistics about all the competitors who register for the competition. For each competitor, you will be given the following, at the time of the competition:

coder_id -- a unique id for each member
rating -- their TopCoder rating in the respective competition category
reliability -- their TopCoder reliability rating
auto_screening_result -- whether the project passed, passed with warnings, or failed in automatic screening
screening_score -- the initial score of the competitor prior to review
passed_screening -- whether or not the submission was passed into review
score_before_appeals -- the score before the competitor submitted appeals
score_after_appeals -- the score after appeals were processed
passed_review -- whether or not the submission passed review. This is what you want to predict for the prediction cases.
num_appeals -- self-explanatory
successful_appeals -- self-explanatory

Your task is to learn from past data so that you can predict coders' performance. Thus your program will first be given a dataset with the results of many past contests. You will then be asked to make predictions about a number of contests for which the results will not be given.

Implementation details

Your train method will be given a String[], each element of which represents one competition. Within each element, the data will be formatted with competition data on the first four lines, and competitor data on the remaining lines, one competitor per line. The first line will contain the following, in order, separated by commas: component_id, component_version_id, project_id, catalog, component, version, project_category, project_status, posting_time, end_time, scorecard_id, num_final_fixes, prize, is_rated, is_dr, dr_points. The next line will contain a textual description of the component. The third line will contain a list of related keywords while the fourth line will contain a list of technologies used. Each of the remaining lines will contain information about a coder's submission in the following order: coder_id, rating, reliability, auto_screening_results, screening_score, passed_screening, score_before_appeals, score_after_appeals, passed_review, num_appeals, successful_appeals. For the tests that you are supposed to make predictions about, the data will be formatted the same way, except that when you are given registrants' data, you will only be given the first three fields pertaining to each coder: coder_id, rating and reliability. Below is an example:

7339708,7339713,10003777,Java,Data Paging Tag,1.0,Development,Cancelled - Failed Review,2004-06-01 09:00:00.0,2004-06-30 00:00:00.0,4,1,400,Yes,Off,null
The Data Paging Tag Component is a JSP Tag that accepts a collection of data for display within a view and facilitates splitting the information into pages.
JSP,collection,pagination,paging,tag
Java,JSP,Custom Tag
310233,1118,1.0,null,0.0,null,69.29,69.29,null,0,0
278460,0,0.0,null,0.0,null,0.0,0.0,null,0,0
9981727,0,0.0,null,0.0,null,0.0,0.0,null,0,0
7400447,0,0.0,null,0.0,null,0.0,0.0,null,0,0
7436876,0,1.0,null,0.0,null,0.0,0.0,null,0,0
266149,1227,1.0,null,0.0,null,59.43,59.43,null,0,0
283991,0,0.0,null,0.0,null,0.0,0.0,null,0,0
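
The layout described above can be read with a few lines of code. The following is a minimal sketch, not a reference parser: the function and constant names are hypothetical, and it assumes that lines within one element are newline-separated, that no field contains an embedded comma, and that the literal text "null" marks a missing value.

```python
# Field names in the order given by the problem statement.
COMPETITION_FIELDS = [
    "component_id", "component_version_id", "project_id", "catalog",
    "component", "version", "project_category", "project_status",
    "posting_time", "end_time", "scorecard_id", "num_final_fixes",
    "prize", "is_rated", "is_dr", "dr_points",
]
CODER_FIELDS = [
    "coder_id", "rating", "reliability", "auto_screening_result",
    "screening_score", "passed_screening", "score_before_appeals",
    "score_after_appeals", "passed_review", "num_appeals",
    "successful_appeals",
]


def _fields(names, line):
    # ASSUMPTION: fields never contain commas; "null" means missing.
    values = [None if v == "null" else v for v in line.split(",")]
    # zip stops at the shorter sequence, so test records that carry only
    # coder_id, rating and reliability are handled as well.
    return dict(zip(names, values))


def parse_competition(record):
    """Split one competition element into its four header lines and coders."""
    lines = record.splitlines()
    return {
        "info": _fields(COMPETITION_FIELDS, lines[0]),
        "description": lines[1],
        "keywords": lines[2].split(","),
        "technologies": lines[3].split(","),
        "coders": [_fields(CODER_FIELDS, ln) for ln in lines[4:] if ln.strip()],
    }


# The example above, shortened to its first two coder lines.
example = "\n".join([
    "7339708,7339713,10003777,Java,Data Paging Tag,1.0,Development,"
    "Cancelled - Failed Review,2004-06-01 09:00:00.0,"
    "2004-06-30 00:00:00.0,4,1,400,Yes,Off,null",
    "The Data Paging Tag Component is a JSP Tag that accepts a collection "
    "of data for display within a view and facilitates splitting the "
    "information into pages.",
    "JSP,collection,pagination,paging,tag",
    "Java,JSP,Custom Tag",
    "310233,1118,1.0,null,0.0,null,69.29,69.29,null,0,0",
    "278460,0,0.0,null,0.0,null,0.0,0.0,null,0,0",
])
parsed = parse_competition(example)
```

All values are kept as strings here; converting ratings, scores and timestamps to numeric types is left to the caller.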

Your task is to implement three methods. The first, train, will allow you to train your prediction model, given data as formatted above. The second, testWithoutCoders, will ask you to predict the number of passing submissions, given the component data, without the coder data. The third, testWithCoders, will ask you to predict the number of passing submissions given the component data and list of registered coders. You will always be asked to make your prediction without the coder data first. Your score for a test case will be the sum of your squared errors. That is, if you predict 2.7 and the correct number of passing submissions is 3, your score will be 0.09 from that prediction. Summing over all your predictions (both with and without coder data) gives your overall score for the test.
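
The sum of squared errors can be computed offline to compare candidate models. This is a small sketch of the formula as stated; the function name is hypothetical and the official tester may differ in details such as rounding.

```python
def squared_error_score(predictions, actuals):
    """Sum of squared differences between predicted and actual pass counts."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals))


# The single prediction from the text: predict 2.7 when 3 submissions passed.
# The result is approximately 0.09 (up to floating-point rounding).
single = squared_error_score([2.7], [3])
```

Lower is better here, and predictions made with and without coder data go into the same sum.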

Evaluation

There will be only one test case, which will use all competitions through 2007 as training data, and all competitions in 2008 as test data. After the contest is over, new data will be gathered from contests that have not yet started, and your submission will be run on those. Leaderboard scores are inversely proportional to error: your score will be 1000 / YOUR_ERROR.
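
As a quick illustration of the leaderboard formula, the sketch below converts a total error into a displayed score. The function name is hypothetical, and the handling of a zero error is an assumption, since the statement does not say what happens in that case.

```python
import math


def leaderboard_score(total_error):
    """Leaderboard score as stated: 1000 divided by the total error."""
    if total_error < 0:
        raise ValueError("a sum of squared errors cannot be negative")
    if total_error == 0:
        # ASSUMPTION: the statement does not define this case.
        return math.inf
    return 1000.0 / total_error


# Halving the total error doubles the score.
score_a = leaderboard_score(500.0)  # 2.0
score_b = leaderboard_score(250.0)  # 4.0
```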

Data

For offline training purposes, the training set is available at http://www.topcoder.com/contest/problem/ResultsPredictor/train.txt. The test data (without results) is available at http://www.topcoder.com/contest/problem/ResultsPredictor/test_w.txt and http://www.topcoder.com/contest/problem/ResultsPredictor/test_wo.txt, for the tests with and without the registered coder data.

Definition

Class:	ResultsPredictor
Method:	train
Parameters:	String[]
Returns:	int
Method signature:	int train(String[] s)

Method:	testWithCoders
Parameters:	String
Returns:	double
Method signature:	double testWithCoders(String s)

Method:	testWithoutCoders
Parameters:	String
Returns:	double
Method signature:	double testWithoutCoders(String s)
(be sure your methods are public)
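
To show how the three methods fit together, here is a baseline sketch that always predicts the mean number of passing submissions seen in training. It is written in Python for brevity and is only an illustration of the interface, not a competitive solution. The encoding of passed_review is not shown in the example data (every value there is null), so the set of values treated as "passed" below is a guess and is labeled as such.

```python
class ResultsPredictor:
    # ASSUMPTION: passed_review values meaning "passed". The statement does
    # not document the encoding, so adjust this after inspecting train.txt.
    PASS_VALUES = frozenset({"yes", "true", "1"})
    PASSED_REVIEW_INDEX = 8  # ninth field on each coder line

    def __init__(self):
        self.mean_passing = 0.0

    def _count_passing(self, record):
        """Count coder lines whose passed_review field marks a pass."""
        count = 0
        for line in record.splitlines()[4:]:  # skip the four header lines
            fields = line.split(",")
            if len(fields) > self.PASSED_REVIEW_INDEX:
                value = fields[self.PASSED_REVIEW_INDEX].strip().lower()
                if value in self.PASS_VALUES:
                    count += 1
        return count

    def train(self, s):
        """Store the average number of passing submissions per competition."""
        if s:
            self.mean_passing = sum(self._count_passing(r) for r in s) / len(s)
        return 0  # the return value is not described in the statement

    def testWithoutCoders(self, s):
        return self.mean_passing

    def testWithCoders(self, s):
        # A real model would use the registrants' rating and reliability.
        return self.mean_passing


# Synthetic records for illustration only; the "Yes"/"No" values are made up.
header = "1,2,3,Java,Demo,1.0,Design,null,t0,t1,4,1,400,Yes,Off,null"
rec_a = "\n".join([header, "desc", "kw", "Java",
                   "10,1500,1.0,null,90.0,Yes,88.0,89.0,Yes,2,1",
                   "11,1200,0.5,null,85.0,Yes,80.0,80.0,Yes,0,0",
                   "12,0,0.0,null,0.0,null,0.0,0.0,null,0,0"])
rec_b = "\n".join([header, "desc", "kw", "Java",
                   "13,900,0.0,null,60.0,Yes,55.0,55.0,No,0,0"])
predictor = ResultsPredictor()
predictor.train([rec_a, rec_b])
baseline = predictor.testWithoutCoders("\n".join([header, "desc", "kw", "Java"]))
```

A constant prediction like this gives a reference error that any model using the coder data should beat.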

Notes

The time limit is 9 minutes.

The memory limit is 1024M.

Provisional results should be taken with a grain of salt. The real test data will be from future contests.

The tests will be randomly divided into examples and provisional tests, where a test is an example with probability 0.5 and a provisional test otherwise.

Reasonable requests for additional, related data will be entertained, and may be posted to the forums or emailed to lbackstrom@topcoder.com.

Examples

"train"

Returns: ""

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=9763

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=13499&pm=9763

Writer:

Unknown

Testers:

Problem categories:

Simulation