TopCoder problem "Deanonymize" used in MM 45 (Division I Level One)



Problem Statement

    A company has recently released a large database. Each record in the database corresponds to one user. There are N public fields which do not contain sensitive information, and M private fields which do. The company claims that it has added noise to anonymize the database and that there is no way to recover accurate private values from unperturbed public information. You're not so sure...



You will be given the perturbed database and unperturbed public information for a number of users (but not everyone). Your task is to return your best guess as to the unperturbed private values of the users whose public information you know.



The data will be generated by first picking N and M. Then, C gaussian distributions will be generated in N+M dimensional space. The mean and standard deviation of each dimension of each gaussian will be selected independently and uniformly from [-1,1] and [0,1]. To generate each of U users, one of the C gaussians will be selected at random, and a sample will be pulled from it. Finally, the data will be perturbed by adding gaussian noise to every value, sampled from a distribution with mean 0 and standard deviation SD. You will then be given the unperturbed public information about Q users (selected randomly from the U) and are to return your best guess as to their unperturbed private information.



Your raw score will be the mean squared error of your return. This will be compared to a baseline measure which, for every user whose unperturbed public information is given, takes the expected value of the private information (given the C distributions used to generate the data). Your score will be your improvement over this baseline (in mean squared error). Your overall score will simply be the sum of these scores, with any negative values increased to 0.



You must implement a method deanonymize. Your method will take N, M and SD as described above. The parameter database will give you U records, where field j of record i is given by database[i*(N+M)+j] and the first N fields are the public ones. users will give you the unperturbed users whose data we are interested in, where public field j of user i is given by users[i*N+j]. You should return an array with Q*M elements, grouped by user in the same way that the users array is.



An offline tester is also available.
 

Definition

    
Class:Deanonymize
Method:deanonymize
Parameters:int, int, double[], double[], double
Returns:double[]
Method signature:double[] deanonymize(int N, int M, double[] database, double[] users, double SD)
(be sure your method is public)
    
 

Notes

-The memory limit is 2048MB, however you may experience performance degradation due to swapping when using more than 1024MB.
-The time limit is 20 seconds.
-There are 100 provisional test cases.
 

Constraints

-In generating the parameters, rand() is a uniform random variable in [0,1], and floor is used for rounding to integers when necessary.
-N will be 100rand()
-M will be 100rand()
-C will be 1000rand()
-U will be 100+100000rand()
-Q will be U/10
-SD will be (rand()+0.5)*min(0.5,U-1/N)
 

Examples

0)
    
"1"
Returns: "Seed = 1
N = 15
M = 6
SD = 0.6794676326705618
U = 146
Q = 14
"
1)
    
"2"
Returns: "Seed = 2
N = 7
M = 1
SD = 0.335403841291766
U = 682
Q = 68
"
2)
    
"3"
Returns: "Seed = 3
N = 43
M = 1
SD = 0.6342912229585391
U = 653
Q = 65
"
3)
    
"4"
Returns: "Seed = 4
N = 55
M = 2
SD = 0.7483313334749946
U = 947
Q = 94
"
4)
    
"5"
Returns: "Seed = 5
N = 7
M = 1
SD = 0.34758487121050863
U = 9770
Q = 977
"
5)
    
"6"
Returns: "Seed = 6
N = 92
M = 17
SD = 0.3416127850569974
U = 146
Q = 14
"
6)
    
"7"
Returns: "Seed = 7
N = 5
M = 4
SD = 0.20131294912571285
U = 697
Q = 69
"
7)
    
"8"
Returns: "Seed = 8
N = 6
M = 10
SD = 0.44062796822464706
U = 111
Q = 11
"
8)
    
"9"
Returns: "Seed = 9
N = 22
M = 46
SD = 0.533073548894668
U = 27926
Q = 2792
"
9)
    
"10"
Returns: "Seed = 10
N = 3
M = 1
SD = 0.038548775274140114
U = 18476
Q = 1847
"

Problem url:

http://www.topcoder.com/stat?c=problem_statement&pm=10227

Problem stats url:

http://www.topcoder.com/tc?module=ProblemDetail&rd=13570&pm=10227

Writer:

Unknown

Testers:

Problem categories:

Advanced Math