TopCoder problem "FundingPrediction" used in FundingPrediction (Division I Level One)

Problem Statement

In the loan industry, a lead aggregator is a company which takes loan applications and sends them to various funding institutions (like banks). The process is as follows:

An interested consumer comes to a website and fills out a loan application.
The lead aggregator matches the application with eligible lenders, usually up to five (occasionally more), according to criteria provided by the lenders.
Some (or possibly none) of the matched lenders outline details of a loan to the customer.
Finally, the consumer picks one of the loan opportunities, or elects not to take any of the loans offered.

It is possible that none of the lenders will offer terms to the customer, or that the customer will turn down all of the loans offered. In fact, the lead aggregator has no way of knowing the details of this process, but is only notified if a lead ends up being funded. Thus from the lead aggregator's point of view, the process is to send the lead on to lenders, and then simply wait. If one of them ends up funding the lead, the aggregator will be notified.

The lead aggregator's income comes primarily from funded loans -- ones where the consumer elects to take a loan at the end of the day. Thus, an important problem for a lead aggregator is to try to match leads to the lenders who will end up funding them. To do this, we need to know how likely each lender is to fund a particular lead. This is your task.

You will be given training data from a large number of past leads. Each record in the training data will correspond to one (lead,lender) pair. It will include information derived from the loan application, information about the lender, and the final outcome (funded or not). Using this data, your task is to construct a model which predicts whether or not a lead will end up being funded by a particular lender.

Your model will initially be trained on historical data from 2005 through the end of 2007. The data will represent the information available at the end of 2007, and thus leads which were submitted in 2007 but funded in 2008 will be marked as unfunded. Your model will then be asked to give predictions for leads which arrived in the month of January, 2008. After you have made these predictions, you will receive additional training data from January, 2008. For any leads which were submitted before or during January, 2008 and were funded in January, 2008, you will be given the exact funding date. Additionally, you will be given training data for the new leads that arrived in January, 2008. Once you have performed any additional training on your model based on this new information, you will be asked to make predictions about February, 2008. After you make predictions you will receive another update, and so on. The last month you will make predictions about is September, 2008.

A detailed description of the data can be found at http://www.topcoder.com/contest/problem/FundingPrediction/data.html.

Your model will be evaluated based on the predictions it makes for the January-September, 2008 leads. During the provisional testing phase of the contest, a smaller subset of the leads will be used as training and testing data. During the final testing phase, a larger, disjoint subset will be used.

Your error for a single record will be the difference between your prediction and the true outcome (1 for funded, 0 for not funded). Your score for that record will be 1-error². Thus, if you predict 0.8 for a record that is funded, your score for that record will be 1-0.2² = 0.96. On the other hand, a prediction of 0.8 for a record that is not funded will result in a score of 1-0.8² = 0.36. Your overall score for a set of records will simply be the sum of your scores on all the individual records.
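The scoring rule above is easy to reproduce offline. The following is a minimal sketch, not the official scorer: the class and method names (ScoreSketch, clamp, recordScore, totalScore) are made up for illustration, and the clamping step follows the note below about predictions outside the range 0 to 1.

```java
// Illustrative re-implementation of the scoring rule; not the official tester.
final class ScoreSketch {
    // Clamp a prediction into [0, 1], mirroring how out-of-range predictions are treated.
    static double clamp(double p) {
        return Math.max(0.0, Math.min(1.0, p));
    }

    // Score for one record: 1 - error^2, where the outcome is 1 if funded and 0 otherwise.
    static double recordScore(double prediction, boolean funded) {
        double error = clamp(prediction) - (funded ? 1.0 : 0.0);
        return 1.0 - error * error;
    }

    // Overall score: the sum of the per-record scores.
    static double totalScore(double[] predictions, boolean[] funded) {
        if (predictions.length != funded.length) {
            throw new IllegalArgumentException("predictions and outcomes differ in length");
        }
        double total = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            total += recordScore(predictions[i], funded[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        // The two worked examples from the statement (approximately 0.96 and 0.36).
        System.out.println(recordScore(0.8, true));
        System.out.println(recordScore(0.8, false));
    }
}
```

Because the penalty is quadratic, the expected score for a record is maximized by predicting the probability that the record is funded, rather than a hard 0 or 1.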

Implementation Details

Your program must implement 3 methods: init, train and test.

The init method will take data through 2007, one month at a time. You will be given a String[] records, each element of which will represent one record. The records will be given in comma-delimited format, in the order they are given on the data description page. You will also be given a String[] funded, each element of which will be formatted as "RECORD_ID,DATE", indicating that a particular record was funded on a specific date. This method will be called a total of 36 times, once for each month from 2005 through 2007.

The test method will take a String[] records with the same format as the init method. It should return a double[], each element of which is your prediction for the corresponding record.

The train method will take a String[] funded, each element of which will represent one funded record, in the same format as above. Note that each time this method is called, all the RECORD_IDs will correspond to records you have already seen, either in init or in test. Your train method should return 0 to indicate success.
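As an illustration of handling the funded parameter, here is a minimal sketch that splits each "RECORD_ID,DATE" element into a map from record ID to funding date. The class name FundedParser is hypothetical. The date is kept as a raw string because its exact format is not stated in this section, and the choice to skip elements without a comma is an assumption of the sketch, not specified behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses "RECORD_ID,DATE" elements into a lookup table.
final class FundedParser {
    static Map<String, String> parse(String[] funded) {
        Map<String, String> dateByRecordId = new HashMap<>();
        for (String entry : funded) {
            int comma = entry.indexOf(',');
            if (comma < 0) {
                continue; // assumption: silently skip elements that lack a comma
            }
            String recordId = entry.substring(0, comma).trim();
            String date = entry.substring(comma + 1).trim(); // kept as text; format not assumed
            dateByRecordId.put(recordId, date);
        }
        return dateByRecordId;
    }
}
```

Since every RECORD_ID passed to train refers to a record already seen, a map like this can be joined against whatever per-record state was stored during init and test.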

Note that each method will be called multiple times. The full sequence of calls is outlined below:

init is called with all records for January 2005, and all loans which were funded in January 2005 (loans for leads from before 2005 will naturally not be included)
init is called with all records for February 2005, and all loans which were funded in February 2005
...
init is called with all records for December 2007, and all loans which were funded in December 2007
test is called with all records in January 2008.
train is called to tell you which records were funded in January 2008 and give you their funding dates.
test is called with all records in February 2008.
train is called to tell you which records were funded in February 2008 and give you their funding dates.
...
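To make the call sequence concrete, the following is a minimal baseline sketch, not a competitive solution. It ignores the record contents entirely and predicts, for every record, the fraction of records seen so far that have been reported as funded. The field names are made up, and the sketch assumes each funded element refers to a distinct record.

```java
// Baseline sketch: predicts the running funded rate for every record.
// Declared without "public" here only so the snippet compiles under any file name.
class FundingPrediction {
    private long recordCount = 0; // records seen so far via init and test
    private long fundedCount = 0; // funded notifications seen so far via init and train

    // Called once per month of 2005 through 2007.
    public int init(String[] records, String[] funded) {
        recordCount += records.length;
        fundedCount += funded.length;
        return 0;
    }

    // Called once per month of 2008; returns one prediction per record.
    public double[] test(String[] records) {
        double rate = recordCount == 0 ? 0.0 : (double) fundedCount / recordCount;
        double[] predictions = new double[records.length];
        java.util.Arrays.fill(predictions, rate);
        recordCount += records.length; // these records may later appear in train
        return predictions;
    }

    // Called after each test call with the newly funded records.
    public int train(String[] funded) {
        fundedCount += funded.length;
        return 0; // 0 indicates success
    }
}
```

A real model would replace the constant rate with features parsed from each record, but the bookkeeping across init, test and train would follow the same pattern.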

To help you make predictions, the 10 year treasury rates are available for download. Your algorithm can retrieve these rates by calling Treasury.rates(). Each element will be formatted as "MM/DD/YYYY,rate", where rate is a percentage.
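One way to use the rates is sketched below. It parses elements of the form "MM/DD/YYYY,rate" into a sorted map and looks up the most recent rate on or before a given date. The class TreasuryRates and its methods are hypothetical, the parser takes a String[] rather than calling Treasury.rates() directly so that it can be tested offline, and it assumes the rate is a plain decimal number and that some dates may have no published rate.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for "MM/DD/YYYY,rate" strings.
final class TreasuryRates {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // Parses each element into (date, rate) and stores the pairs sorted by date.
    static TreeMap<LocalDate, Double> parse(String[] lines) {
        TreeMap<LocalDate, Double> rates = new TreeMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            LocalDate date = LocalDate.parse(line.substring(0, comma).trim(), FORMAT);
            double rate = Double.parseDouble(line.substring(comma + 1).trim());
            rates.put(date, rate);
        }
        return rates;
    }

    // Returns the most recent rate on or before the given date, or NaN if there is none.
    static double rateOn(TreeMap<LocalDate, Double> rates, LocalDate date) {
        Map.Entry<LocalDate, Double> entry = rates.floorEntry(date);
        return entry == null ? Double.NaN : entry.getValue();
    }
}
```

For example, with the made-up input {"01/02/2008,3.91", "01/04/2008,3.87"}, a lookup for January 3, 2008 falls back to the January 2 value.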

To facilitate the development of your model and allow you to test offline, you may download the training data through the end of 2007 here. Each line represents one record, in the same format as it will be given to the init and test methods.

Definition

Class:	FundingPrediction
Method:	init
Parameters:	String[], String[]
Returns:	int
Method signature:	int init(String[] records, String[] funded)

Method:	test
Parameters:	String[]
Returns:	double[]
Method signature:	double[] test(String[] records)

Method:	train
Parameters:	String[]
Returns:	int
Method signature:	int train(String[] funded)
(be sure your methods are public)

Notes

- Note that for a specific lead ID, at most one lender will fund that lead. Thus, you may find it advantageous to somehow group records by lead ID.

- The three loan products (Purchase, Refinance, and Home Equity) have quite different characteristics, so this may be an important feature.

- Your goal is to predict whether a lead will ever close, not whether it will close in a particular month.

- Predictions greater than 1 will be reduced to 1, while those less than 0 will be increased to 0.

- There are 1115151 records in 2005, 1220458 in 2006, and 1051354 in 2007. The provisional testing uses 217185 records from the first nine months of 2008, while the final testing will use 545237. All of these have been sampled from a larger dataset by randomly selecting a subset of the lead ids and pulling all related records.

- The time limit for all training and testing is 8 minutes.

- The memory limit is 1024M.

- The example testing will consist of the first two months of provisional testing (January and February of 2008).
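The first note, that at most one lender funds any given lead, suggests one possible use of grouping by lead ID: the funding events within a lead are mutually exclusive, so the predictions for that lead should not sum to more than 1. The sketch below is a heuristic illustration only. The class LeadGroupCap is hypothetical, the lead IDs are assumed to have been extracted from the records already (the column layout is on the data description page and is not assumed here), and predictions are assumed to be non-negative.

```java
import java.util.HashMap;
import java.util.Map;

// Heuristic sketch: rescale predictions so that each lead's total is at most 1.
final class LeadGroupCap {
    static double[] capPerLead(String[] leadIds, double[] predictions) {
        if (leadIds.length != predictions.length) {
            throw new IllegalArgumentException("leadIds and predictions differ in length");
        }
        // First pass: total predicted probability per lead.
        Map<String, Double> sumByLead = new HashMap<>();
        for (int i = 0; i < leadIds.length; i++) {
            sumByLead.merge(leadIds[i], predictions[i], Double::sum);
        }
        // Second pass: scale down only the leads whose total exceeds 1.
        double[] adjusted = new double[predictions.length];
        for (int i = 0; i < leadIds.length; i++) {
            double sum = sumByLead.get(leadIds[i]);
            adjusted[i] = sum > 1.0 ? predictions[i] / sum : predictions[i];
        }
        return adjusted;
    }
}
```

Whether this kind of adjustment improves the score depends on how well calibrated the underlying predictions are, so it is worth checking against held-out months before relying on it.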

Examples

"e"

Returns: ""

Example testing will be identical to provisional testing, except it will only contain two months of 2008 data, instead of 9.