
Implementation of KNN based on the Spark-ML-LSH  #3

@Victor0118

Hash function: h_i(x) = floor(r_i.dot(x) / bucketLength), where r_i is the random unit vector of the i-th hash table
threshold = 2000
W = bucketLength
NHT = number of hash tables
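
To make the hash function concrete, here is a minimal pure-Python sketch of h_i(x) = floor(r_i.dot(x) / W) evaluated for NHT projection vectors. The helper names (`random_unit_vector`, `make_hash_tables`, `lsh_hash`) and the Gaussian sampling of r_i are assumptions for illustration, not the code used in this issue or Spark's internals.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit L2 norm (assumed sampling scheme)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def make_hash_tables(dim, num_hash_tables, seed=0):
    """Return one random projection vector r_i per hash table (NHT vectors in total)."""
    rng = random.Random(seed)
    return [random_unit_vector(dim, rng) for _ in range(num_hash_tables)]


def lsh_hash(x, projections, bucket_length):
    """Compute h_i(x) = floor(r_i . x / W) for every projection vector r_i."""
    hashes = []
    for r in projections:
        dot = sum(r_j * x_j for r_j, x_j in zip(r, x))
        hashes.append(math.floor(dot / bucket_length))
    return hashes
```

A larger W merges neighbouring buckets, so more points share a hash value; a larger NHT gives each point more independent hash values.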

  • The number of buckets will be (max L2 norm of input vectors) / bucketLength.
  • If the input vectors are normalized, 1 to 10 times pow(numRecords, -1/inputDim) is a reasonable value for bucketLength.
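
The two rules of thumb above can be turned into a small calculation. This is only a sketch: the function names are hypothetical, and the lower and upper bounds are simply 1x and 10x the quantity pow(numRecords, -1/inputDim).

```python
import math


def suggested_bucket_length_range(num_records, input_dim):
    """Return (1x, 10x) of pow(numRecords, -1/inputDim) for normalized input vectors."""
    base = num_records ** (-1.0 / input_dim)
    return base, 10.0 * base


def approx_num_buckets(vectors, bucket_length):
    """Estimate the bucket count as (max L2 norm of the input vectors) / bucketLength."""
    max_norm = max(math.sqrt(sum(c * c for c in v)) for v in vectors)
    return max_norm / bucket_length
```

For high-dimensional input the exponent -1/inputDim is close to zero, so the base value is close to 1 and the suggested range is roughly 1 to 10.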
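
| k | NHT | W | Accuracy_train | Accuracy_test | T_index | T_query |
|---|-----|---|----------------|---------------|---------|---------|
| 1 | 3 | 2 | - | 0.9087 | 54 | 175848 |
| 5 | 3 | 2 | - | 0.893 | 54 | 174651 |
| 9 | 3 | 2 | - | 0.8808 | 54 | 155673 |
| 1 | 5 | 2 | - | 0.9291 | 29 | 251302 |
| 5 | 5 | 2 | - | 0.9137 | 29 | 275162 |
| 9 | 5 | 2 | - | 0.9036 | 29 | 367008 |
| 1 | 7 | 2 | - | 0.9372 | 34 | 523696 |
| 5 | 7 | 2 | - | 0.9238 | 34 | 460986 |
| 9 | 7 | 2 | - | 0.9145 | 34 | 485565 |
| 1 | 3 | 5 | - | 0.9357 | 30 | 367245 |
| 5 | 3 | 5 | - | 0.9263 | 30 | 340930 |
| 9 | 3 | 5 | - | 0.9171 | 30 | 341963 |
| 1 | 5 | 5 | - | 0.9459 | 41 | 596984 |
| 5 | 5 | 5 | - | 0.9401 | 41 | 559091 |
| 9 | 5 | 5 | - | 0.93 | 41 | 561646 |
| 1 | 7 | 5 | - | 0.9496 | 22 | 770659 |
| 5 | 7 | 5 | - | 0.9465 | 22 | 787571 |
| 9 | 7 | 5 | - | 0.9385 | 22 | 841044 |
| 1 | 3 | 8 | - | 0.9419 | 37 | 439672 |
| 5 | 3 | 8 | - | 0.9348 | 37 | 417642 |
| 9 | 3 | 8 | - | 0.9253 | 37 | 422822 |
| 1 | 5 | 8 | - | 0.9481 | 24 | 605899 |
| 5 | 5 | 8 | - | 0.9438 | 24 | 609686 |
| 9 | 5 | 8 | - | 0.9358 | 24 | 609061 |
| 1 | 7 | 8 | - | 0.9511 | 22 | 780209 |
| 5 | 7 | 8 | - | 0.9447 | 22 | 769710 |
| 9 | 7 | 8 | - | 0.9409 | 22 | 769710 |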
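
For readers unfamiliar with how k, NHT and W interact, the following is a simplified sketch of LSH-based KNN classification. It assumes a point becomes a candidate when it shares a bucket with the query in at least one hash table, followed by an exact distance ranking and a majority vote. This candidate rule and the name `knn_predict` are assumptions for illustration; they are not necessarily how Spark ML or the implementation measured above selects candidates.

```python
import math
from collections import Counter


def knn_predict(query, query_hashes, train, k):
    """Predict a label for `query` from `train`, a list of (vector, hashes, label).

    Assumed candidate rule: keep a training point if it collides with the
    query in at least one hash table.
    """
    candidates = [
        (vec, label)
        for vec, hashes, label in train
        if any(a == b for a, b in zip(hashes, query_hashes))
    ]
    # Rank the candidates by exact Euclidean distance to the query.
    candidates.sort(key=lambda c: math.dist(query, c[0]))
    top_labels = [label for _, label in candidates[:k]]
    if not top_labels:
        return None  # no collision in any hash table
    # Majority vote among the k nearest candidates.
    return Counter(top_labels).most_common(1)[0][0]
```

Under this rule, adding hash tables or widening the buckets enlarges the candidate set, which would be consistent with the table above, where Accuracy_test and T_query both tend to grow with NHT and W.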
