🐞 fix: TSNE now encodes query sequence correctly +++ fixed types in C… #22

NoahHenrikKleinschmidt · 2025-04-28T08:59:51Z

Hi There

Even though the repo looks pretty stale at this point, I would still like to submit a fix for a bug I found in the TSNE computation in the script ClusterMSA.py. Specifically, in line 200 the sequences are converted to lists and then concatenated. However, query_.sequence.tolist() is itself wrapped in square brackets, which nests its result inside another list. As a consequence, the final element of the inputs to the encode_seqs call (i.e. the one for the query sequence) is a single-entry list rather than a string. The encoding therefore does not work properly for the query sequence, which in turn places it at a wrong location in the TSNE plot.

[line 200 in scripts/ClusterMSA.py]
- ohe_vecs = encode_seqs(df.sequence.tolist()+[query_.sequence.tolist()], max_len=L)
+ ohe_vecs = encode_seqs(df.sequence.tolist()+query_.sequence.tolist(), max_len=L)

What the inputs actually look like

>>> inputs_original = df.sequence.tolist()+[query_.sequence.tolist()]
>>> print(inputs_original[-5:]) # (notice the list in the final element)
['--LVINDRNGRHCSMNVKLSDTIGNLKANI---PSIDPQNKELVFNDMVLDDTCILANLPIMADSILTLM------',
 '-KVKVKPLEGSVFELSINASETVDMVKHRICAREGVNSQVHALCFEGRELPPGSLMSRSG----------------',
 '-QIYVKSLLSKAFVVEMLTYDTVGMLKARIQKNFKLPIEKL-LTLEETPLEDNAKLEHTVISNDSAI---------',
 '----------QMITMAFNINQSVGKLKQYFASQLKVPQDVLQVVFQGRLIEDGESLMHIGVRPHGTIQ--------',
 ['MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG']]


>>> inputs_fixed = df.sequence.tolist()+query_.sequence.tolist()
>>> print(inputs_fixed[-5:])
['--LVINDRNGRHCSMNVKLSDTIGNLKANI---PSIDPQNKELVFNDMVLDDTCILANLPIMADSILTLM------',
 '-KVKVKPLEGSVFELSINASETVDMVKHRICAREGVNSQVHALCFEGRELPPGSLMSRSG----------------',
 '-QIYVKSLLSKAFVVEMLTYDTVGMLKARIQKNFKLPIEKL-LTLEETPLEDNAKLEHTVISNDSAI---------',
 '----------QMITMAFNINQSVGKLKQYFASQLKVPQDVLQVVFQGRLIEDGESLMHIGVRPHGTIQ--------',
 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG']

As a simple test to demonstrate this, run the following (with any a3m file you have lying around as input):

original_ohe = encode_seqs(df.sequence.tolist()+[query_.sequence.tolist()], max_len=L)
fixed_ohe = encode_seqs(df.sequence.tolist()+query_.sequence.tolist(), max_len=L)

# since we do one-hot encoding every entry should have at least one 1-valued entry
assert original_ohe.any(axis=1).all(), "Encoding in the original setup does not work since there are all-0 entries!"
# will raise AssertionError
assert fixed_ohe.any(axis=1).all(), "Encoding in the fixed setup does not work since there are all-0 entries!"
# will be fine
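
For anyone without an a3m file at hand, the same effect can be reproduced with a small self-contained sketch. Note that toy_encode_seqs below is a hypothetical stand-in written for illustration, not the repo's encode_seqs; the alphabet, the toy sequences, and the pure-Python row layout are assumptions. It only mimics the idea of looping over the characters of each input and setting a 1 for every known residue.

```python
# Hypothetical stand-in for encode_seqs (NOT the repo's implementation).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # assumed alphabet, for illustration only
INDEX = {res: k for k, res in enumerate(ALPHABET)}


def toy_encode_seqs(seqs, max_len):
    """Return one flattened one-hot row (list of 0/1) per input."""
    rows = []
    for seq in seqs:
        row = [0] * (max_len * len(ALPHABET))
        # If seq is a str, this iterates over residues.
        # If seq is a list like ['MQIF'], it yields the whole string once.
        for i, char in enumerate(seq):
            k = INDEX.get(char)
            if k is not None:
                row[i * len(ALPHABET) + k] = 1
        rows.append(row)
    return rows


msa = ["-KVK", "MQI-"]   # toy stand-in for df.sequence.tolist()
query = ["MQIF"]         # toy stand-in for query_.sequence.tolist()

nested = msa + [query]   # original call: last element is ['MQIF']
flat = msa + query       # fixed call: last element is 'MQIF'

nested_has_signal = [any(row) for row in toy_encode_seqs(nested, max_len=4)]
flat_has_signal = [any(row) for row in toy_encode_seqs(flat, max_len=4)]

print(nested_has_signal)  # [True, True, False]
print(flat_has_signal)    # [True, True, True]
```

In this sketch the nested query is treated as one "character" that matches no residue, so its row stays all zeros, which is the same symptom the assertions above check for.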

Also, I found that custom values passed on the command line for arguments like min_samples were kept as str, even though the defaults are numeric. Therefore, I added type declarations to the CLI for

min_eps
max_eps
eps_step
min_samples

to ensure that the values are properly converted.
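
A minimal sketch of the difference, using argparse: the argument names come from this PR, but the default values and the float/int choices shown here are assumptions for illustration and may not match the script exactly.

```python
import argparse

# Without a type declaration, a value given on the command line stays a str,
# even though the default is numeric.
untyped = argparse.ArgumentParser()
untyped.add_argument("--min_samples", default=3)

# With type declarations, argparse converts the value before storing it.
# Defaults and float/int choices below are assumed, not taken from the script.
typed = argparse.ArgumentParser()
typed.add_argument("--min_eps", type=float, default=3.0)
typed.add_argument("--max_eps", type=float, default=20.0)
typed.add_argument("--eps_step", type=float, default=0.5)
typed.add_argument("--min_samples", type=int, default=3)

before = untyped.parse_args(["--min_samples", "5"])
after = typed.parse_args(["--min_samples", "5", "--eps_step", "0.25"])

print(type(before.min_samples).__name__)  # str
print(type(after.min_samples).__name__)   # int
print(type(after.eps_step).__name__)      # float
```

A str value such as "5" would then fail or misbehave wherever the script does arithmetic or comparisons with it, which is why the conversion matters.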

Cheers,
Noah ☀️

🐞 fix: TSNE now encodes query sequence correctly +++ fixed types in C…

469bdda

…LI for min_eps, max_eps, eps_step, and min_samples
