TopCoder problem "AgentMatching" used in AgentMatching (Division I Level One)

Problem Statement

In the real estate business, most purchases are made through real estate agents, who typically have specific regions in which they work. Naturally, different agents have different specialties, both geographically and otherwise.

The goal of this problem is to match an interested customer to the most appropriate agent -- the one most likely to provide a successful outcome to the customer. To this end, your task is to develop an algorithm which assigns a score to a number of customer-agent pairings. You should assign higher scores to pairings which are more likely to conclude successfully, with a property being purchased. To develop this algorithm, you will be given a corpus of data from past pairings, along with the outcomes for those pairings. After submitting your algorithm, you will be scored on an independent dataset, using the AUC metric. For this evaluation metric, the exact scores do not matter, only the ordering they impose. Ideally, the ordering will consist of all the negative instances, followed by all the positive ones. To make these predictions, you will be given three pieces of data. The first two will be constant, and contain information about the agents. For the third, you will be given half the data for training, with the remaining half held back for testing. The data is described below:

A partitioning of agents into brokerages. More info and download
The coverage areas of the agents, as a list of agent-zip code pairs. More info and download
The customer-agent pairings, along with information about the customer and property. More info and download

You will be given all three of the above as String[]'s with one element representing one line from the files for download (one record). You will also be given a String[], test representing a number of pairings with unknown outcomes. You are to return a double[] with elements corresponding to elements of test. These will be your scores, upon which you will be evaluated. The elements of test will be the same as the elements of train except that they will lack the final two fields (which give the outcome). Each element will be in CSV Format.

Scoring

If you ranked the test cases perfectly, you would assign all of the successful pairings (those which would have a one in their final field) higher scores than all of the unsuccessful pairings. In this case, your score would be 1.0. If you just returned a random ordering, you would expect to score about 0.5. Another way to think about the scoring is that it is a decreasing function in the number of inversions your ordering has compared to a correct ordering. In other words, every time you rank a bad test case higher than a good test case (give the bad one a higher score) that decreases your score.

Definition

Class:	AgentMatching
Method:	evaluate
Parameters:	String[], String[], String[], String[]
Returns:	double[]
Method signature:	double[] evaluate(String[] agents, String[] zips, String[] train, String[] test)
(be sure your method is public)

Notes

During the provisional testing, only 10% of the full dataset will be used for scoring. The remaining 40% (50% used for training, 10% for provisional testing) will be used for final testing.

As the data comes from a real website, it is noisy and may contain missing or incorrect values.

The time limit for the call to your method is 5 minutes. The memory limit is 1024MB.

The example test is the same as the provisional test.

The training data contains 39797 matchings, the example/provisional test set contains 8051, and the final test set contains 31902.

More information about the AUC metric and ROC curves can be found at Wikipedia.

Examples

"1"

Returns: "Provisional Test"