@NoahHenrikKleinschmidt

Hi There

Even though the repo looks fairly stale at this point, I would still like to submit a fix for a bug I found in how the t-SNE is computed in scripts/ClusterMSA.py. In line 200, the sequences are converted to lists and then concatenated. However, query_.sequence.tolist() is itself wrapped in square brackets, which nests it inside another list. As a consequence, the final element of the input to encode_seqs (the one for the query sequence) is a single-entry list rather than a string. The encoding therefore fails for the query: its row stays all zeros, so the query sequence ends up in the wrong location in the t-SNE plot.

[line 200 in scripts/ClusterMSA.py]
- ohe_vecs = encode_seqs(df.sequence.tolist()+[query_.sequence.tolist()], max_len=L)
+ ohe_vecs = encode_seqs(df.sequence.tolist()+query_.sequence.tolist(), max_len=L)
What the inputs actually look like
>>> inputs_original = df.sequence.tolist()+[query_.sequence.tolist()]
>>> print(inputs_original[-5:]) # (notice the list in the final element)
['--LVINDRNGRHCSMNVKLSDTIGNLKANI---PSIDPQNKELVFNDMVLDDTCILANLPIMADSILTLM------',
 '-KVKVKPLEGSVFELSINASETVDMVKHRICAREGVNSQVHALCFEGRELPPGSLMSRSG----------------',
 '-QIYVKSLLSKAFVVEMLTYDTVGMLKARIQKNFKLPIEKL-LTLEETPLEDNAKLEHTVISNDSAI---------',
 '----------QMITMAFNINQSVGKLKQYFASQLKVPQDVLQVVFQGRLIEDGESLMHIGVRPHGTIQ--------',
 ['MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG']]


>>> inputs_fixed = df.sequence.tolist()+query_.sequence.tolist()
>>> print(inputs_fixed[-5:])
['--LVINDRNGRHCSMNVKLSDTIGNLKANI---PSIDPQNKELVFNDMVLDDTCILANLPIMADSILTLM------',
 '-KVKVKPLEGSVFELSINASETVDMVKHRICAREGVNSQVHALCFEGRELPPGSLMSRSG----------------',
 '-QIYVKSLLSKAFVVEMLTYDTVGMLKARIQKNFKLPIEKL-LTLEETPLEDNAKLEHTVISNDSAI---------',
 '----------QMITMAFNINQSVGKLKQYFASQLKVPQDVLQVVFQGRLIEDGESLMHIGVRPHGTIQ--------',
 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG']

As a simple test to demonstrate this, run the following (with any a3m file you have lying around as input):

original_ohe = encode_seqs(df.sequence.tolist()+[query_.sequence.tolist()], max_len=L)
fixed_ohe = encode_seqs(df.sequence.tolist()+query_.sequence.tolist(), max_len=L)

# since we do one-hot encoding every entry should have at least one 1-valued entry
assert original_ohe.any(axis=1).all(), "Encoding in the original setup does not work since there are all-0 entries!"
# will raise AssertionError
assert fixed_ohe.any(axis=1).all(), "Encoding in the fixed setup does not work since there are all-0 entries!"
# will be fine
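
For anyone without an a3m at hand, here is a minimal, dependency-free sketch of why the nested list breaks the encoding. Note that `toy_encode_seqs` is a hypothetical stand-in written for this illustration, not the repo's `encode_seqs`; it only assumes that the encoder iterates over each input element character by character and looks each character up in a residue alphabet. The plain lists stand in for the results of `df.sequence.tolist()` and `query_.sequence.tolist()`.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed residue alphabet, for illustration only
INDEX = {res: k for k, res in enumerate(ALPHABET)}


def toy_encode_seqs(seqs, max_len):
    """Hypothetical stand-in encoder: one flat one-hot row per input element."""
    rows = []
    for seq in seqs:
        row = [0] * (max_len * len(ALPHABET))
        # If seq is a list like ['MQIF'], "char" is the whole string 'MQIF',
        # which is not a known residue, so nothing gets set in this row.
        for i, char in enumerate(seq):
            k = INDEX.get(char)
            if k is not None:
                row[i * len(ALPHABET) + k] = 1
        rows.append(row)
    return rows


msa_seqs = ["-KVK", "MQ-F"]  # stands in for df.sequence.tolist()
query_seqs = ["MQIF"]        # stands in for query_.sequence.tolist()

nested = msa_seqs + [query_seqs]  # original: last element is a list
flat = msa_seqs + query_seqs      # fixed: last element is a string

nested_has_hit = [any(row) for row in toy_encode_seqs(nested, max_len=4)]
flat_has_hit = [any(row) for row in toy_encode_seqs(flat, max_len=4)]

print(nested_has_hit)  # [True, True, False] -> query row is all zeros
print(flat_has_hit)    # [True, True, True]
```

In this toy setup the nested variant raises no error; it silently produces an all-zero row for the query, which is consistent with the failing assertion above.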

Also, I found that custom values passed on the command line for arguments like min_samples were kept as str, even though the defaults are numeric. I therefore added type declarations to the CLI for

  • min_eps
  • max_eps
  • eps_step
  • min_samples

to ensure that the values are properly converted.
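
To illustrate the difference, here is a small sketch that assumes the script builds its CLI with `argparse`. The flag spellings, the default values, and the int/float choice per argument are assumptions made for this example and may not match the script exactly.

```python
import argparse


def build_parser():
    """Hypothetical parser; defaults below are placeholders, not the script's."""
    parser = argparse.ArgumentParser(description="typed clustering arguments (sketch)")
    parser.add_argument("--min_eps", type=float, default=3.0)
    parser.add_argument("--max_eps", type=float, default=20.0)
    parser.add_argument("--eps_step", type=float, default=0.5)
    parser.add_argument("--min_samples", type=int, default=3)
    return parser


# With type=..., user-supplied values are converted on parsing.
args = build_parser().parse_args(["--min_samples", "5", "--eps_step", "0.25"])
print(type(args.min_samples).__name__, type(args.eps_step).__name__)  # int float

# Without type=..., a user-supplied value stays a string even if the default is numeric.
untyped = argparse.ArgumentParser()
untyped.add_argument("--min_samples", default=3)
raw = untyped.parse_args(["--min_samples", "5"]).min_samples
print(repr(raw))  # '5'
```

A string such as `'5'` only surfaces as a problem later, for example when it is compared with or passed on as a number, which makes the missing conversion easy to overlook.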

Cheers,
Noah ☀️

…LI for min_eps, max_eps, eps_step, and min_samples