Problem Statement
In the loan industry, a lead aggregator is a company which takes loan applications and sends them to various funding institutions (like banks). The process is as follows:
The lead aggregator's income comes primarily from funded loans -- ones where the consumer elects to take a loan at the end of the day. Thus, an important problem for a lead aggregator is to match leads to the lenders who will end up funding them. To do this, we need to know how likely each lender is to fund a particular lead. This is your task.

You will be given training data from a large number of past leads. Each record in the training data will correspond to one (lead, lender) pair. It will include information derived from the loan application, information about the lender, and the final outcome (funded or not). Using this data, your task is to construct a model which predicts whether or not a lead will end up being funded by a particular lender.

Your model will initially be trained on historical data from 2005 through the end of 2007. The data will represent the information available at the end of 2007, and thus leads which were submitted in 2007 but funded in 2008 will be marked as unfunded. Your model will then be asked to give predictions for leads which arrived in January 2008. After you have made these predictions, you will receive additional training data from January 2008. For any leads which were submitted before or during January 2008 and were funded in January 2008, you will be given the exact funding date. Additionally, you will be given training data for the new leads that arrived in January 2008. Once you have performed any additional training on your model based on this new information, you will be asked to make predictions about February 2008. After you make those predictions you will receive another update, and so on. The last month you will make predictions about is September 2008.

A detailed description of the data can be found at http://www.topcoder.com/contest/problem/FundingPrediction/data.html. Your model will be evaluated based on the predictions it makes for the January-September 2008 leads.
During the provisional testing phase of the contest, a smaller subset of the leads will be used as training and testing data. During the final testing phase, a larger, disjoint subset will be used. Your error for a single record will be the difference between your prediction and the true outcome (1 for funded, 0 for not funded). Your score for that record will be 1 - error^2. Thus, if you predict 0.8 for a record that is funded, your score for that record will be 1 - 0.2^2 = 0.96. On the other hand, a prediction of 0.8 for a record that is not funded will result in a score of 1 - 0.8^2 = 0.36. Your overall score for a set of records will simply be the sum of your scores on all the individual records.

Implementation Details

Your program must implement 3 methods: init, train and test.

The init method will take data through 2007, one month at a time. You will be given a String[] records, each element of which will represent one record. The records will be given in comma-delimited format, with fields in the order they are listed on the data description page. You will also be given a String[] funded, each element of which will be formatted as "RECORD_ID,DATE", indicating that a particular record was funded on a specific date. This method will be called a total of 36 times, once for each month from 2005 through 2007.

The test method will take a String[] records with the same format as the init method. It should return a double[], each element of which is your prediction for the corresponding record.

The train method will take a String[] funded, each element of which will represent one funded record, in the same format as above. Note that each time this method is called, all the RECORD_IDs will correspond to records you have already seen, either in init or in test. Your train method should return 0 to indicate success.

Note that each method will be called multiple times. The full sequence of calls is outlined below:
To facilitate the development of your model and allow you to test offline, you may download the training data through the end of 2007 here. Each line represents one record, in the same format as it will be given to the train method.
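As an illustration of the three-method interface described above, here is a minimal Java sketch. Several things in it are assumptions rather than part of the statement: the class name FundingPrediction is borrowed from the data URL, the int return type of init is a guess (only train is stated to return 0), and the constant base-rate prediction is merely a placeholder baseline, not a suggested model. It parses only the "RECORD_ID,DATE" strings, since the field layout of the records themselves is defined on the data description page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class name; the real one comes from the Definition section.
class FundingPrediction {
    // Record IDs reported as funded so far (parsed from "RECORD_ID,DATE").
    private final Set<String> fundedIds = new HashSet<>();
    // Number of (lead, lender) records seen across init and test calls.
    private long recordsSeen = 0;

    // Called once per historical month. The int return type is an assumption.
    public int init(String[] records, String[] funded) {
        recordsSeen += records.length;
        addFunded(funded);
        return 0;
    }

    // Called after each test month with the newly funded records.
    public int train(String[] funded) {
        addFunded(funded);
        return 0; // 0 indicates success, as the statement requires
    }

    // Returns one prediction per record: here, the overall funded fraction so far.
    public double[] test(String[] records) {
        double rate = recordsSeen == 0 ? 0.0 : (double) fundedIds.size() / recordsSeen;
        recordsSeen += records.length;
        double[] predictions = new double[records.length];
        Arrays.fill(predictions, rate);
        return predictions;
    }

    // Keeps only the RECORD_ID part of each "RECORD_ID,DATE" entry.
    private void addFunded(String[] funded) {
        for (String entry : funded) {
            int comma = entry.indexOf(',');
            fundedIds.add(comma < 0 ? entry : entry.substring(0, comma));
        }
    }
}
```

A constant prediction equal to the observed funded fraction is the best single number under a squared-error score, which makes it a reasonable sanity baseline to compare a real model against.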
Definition
Notes
- Note that for a specific lead ID, at most one lender will fund that lead. Thus, you may find it advantageous to somehow group records by lead ID.
- The three loan products (Purchase, Refinance, and Home Equity) have quite different characteristics, so this may be an important feature.
- Your goal is to predict whether a lead will ever close, not whether it will close in a particular month.
- Predictions greater than 1 will be reduced to 1, while those less than 0 will be increased to 0.
- There are 1115151 records in 2005, 1220458 in 2006, and 1051354 in 2007. The provisional testing uses 217185 records from the first nine months of 2008, while the final testing will use 545237. All of these have been sampled from a larger dataset by randomly selecting a subset of the lead ids and pulling all related records.
- The time limit for all training and testing is 8 minutes.
- The memory limit is 1024M.
- The example testing will consist of the first two months of provisional testing (January and February of 2008).
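For offline testing, the scoring rule from the problem statement, together with the clamping of out-of-range predictions mentioned in the notes above, can be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes clamping is applied before the error is computed, which is how the note reads but is not spelled out further.

```java
// Hypothetical helper for local evaluation; not part of the required interface.
class ScoreSketch {
    // Clamp a prediction into [0, 1], as described in the notes.
    static double clamp(double prediction) {
        return Math.max(0.0, Math.min(1.0, prediction));
    }

    // Score for one record: 1 - error^2, where error = prediction - outcome.
    static double recordScore(double prediction, boolean funded) {
        double error = clamp(prediction) - (funded ? 1.0 : 0.0);
        return 1.0 - error * error;
    }

    // Overall score: the sum of the per-record scores.
    static double totalScore(double[] predictions, boolean[] funded) {
        if (predictions.length != funded.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double total = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            total += recordScore(predictions[i], funded[i]);
        }
        return total;
    }
}
```

With this helper, recordScore(0.8, true) and recordScore(0.8, false) reproduce the 0.96 and 0.36 values worked out in the statement, up to floating-point rounding.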
Examples
0)