Jonas Almeida edited this page Nov 7, 2025 · 25 revisions

Shorter tensors

GitHub caps the size of each file at 100M (compressed). Let's find out how much the tensor tsv file can be shortened by adopting scientific notation. This tensor file is borrowed from the projection https://bit.ly/tcgaReps. The additional metadata projection json file can be found at this gist.

Make sure you have the cli module loaded:

cli = await import('https://epiverse.github.io/cli/cli.mjs')

Test Data

url="https://raw.githubusercontent.com/epiverse/tcgaReps/refs/heads/main/embeddings_9523.tsv"

Fetching it

tsv = await (await fetch(url)).text()
tsv.length
90430307 //text, as the TF projector requires 
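The cap applies to file size, while `tsv.length` counts characters, so it may be worth confirming that the two agree. A minimal sketch, assuming the cap is counted as 100 MiB (the page only says "100M") and using a short stand-in string rather than the fetched tsv; `byteLength` and `capFraction` are hypothetical helper names, not part of the cli module:

```javascript
// Assumption: the "100M" cap is taken as 100 MiB
const CAP_BYTES = 100 * 1024 * 1024

// Hypothetical helper: UTF-8 byte length of a string
function byteLength(text) {
  return new TextEncoder().encode(text).length
}

// Hypothetical helper: fraction of the cap that a text file would use
function capFraction(text, cap = CAP_BYTES) {
  return byteLength(text) / cap
}

// Short stand-in for the fetched tsv text
const sample = '0.016493121\t0.021351963\n-0.015399779\t-0.013994825\n'

console.log(byteLength(sample) === sample.length) // true: ASCII only, one byte per character
console.log(capFraction(sample) < 1)              // true: far below the cap
```

Under the same assumptions, the 90430307 characters reported above would come to roughly 86% of the cap if the file is ASCII only.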

Let's look at the first 100 characters:

tsv.slice(0,100)

'0.016493121\t0.021351963\t-0.015399779\t-0.013994825\t0.03226338\t0.034183376\t0.02694424\t-0.005479755\t-0.'

extract vectors

vec = cli.tsv2vec(tsv)
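The implementation of `cli.tsv2vec` is not shown on this page. As a rough sketch of what such a step could look like, assuming rows are separated by newlines and cells by tabs, with `tsv2vecSketch` being a hypothetical name used only for illustration:

```javascript
// Hypothetical stand-in for cli.tsv2vec: tsv text -> array of numeric rows
function tsv2vecSketch(tsv) {
  return tsv
    .split(/\r?\n/)                             // one row per line
    .filter(line => line.length > 0)            // skip blank lines, e.g. after a trailing newline
    .map(line => line.split('\t').map(Number))  // one number per tab-separated cell
}

const parsed = tsv2vecSketch('0.016493121\t0.021351963\n-0.015399779\t-0.013994825\n')
console.log(parsed.length, parsed[0].length) // 2 2
console.log(parsed[1][0])                    // -0.015399779
```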

see values

vec[0].slice(1,10)
[0.021351963, -0.015399779, -0.013994825, 0.03226338, 0.034183376, 0.02694424, -0.0054 ...

Convert decimal to scientific notation

For example:

vec[0][0].toExponential(3)

'1.649e-2'
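To see where the savings come from, the sketch below compares the length of the default decimal string with `toExponential` at a few resolutions, using the first value of the file. Only the standard `Number.prototype.toExponential` is used; the loop and variable names are illustrative.

```javascript
const x = 0.016493121 // first value in the tensor file

console.log(String(x), String(x).length) // '0.016493121' 11

for (const digits of [2, 3, 4]) {
  const s = x.toExponential(digits)
  console.log(digits, s, s.length)
}
// 2 '1.65e-2' 7
// 3 '1.649e-2' 8
// 4 '1.6493e-2' 9
```

Each digit of resolution costs one character per value, and the exponent suffix (`e-2` here) has a fixed cost.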

Map sci notation

Apply it to every value of every tensor, with a resolution of 3 digits:

vec3 = vec.map(row=>row.map(x=>x.toExponential(3)))

Let's wrap this in a function, vec2exp, so we don't have to remember that the conversion is mapped over the cells of an array of arrays. We can now generate the exponential-notation values at each resolution:

exp2 = cli.vec2exp(vec,2) // 2 digits
exp3 = cli.vec2exp(vec,3) // 3 digits
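The cli functions themselves are not listed on this page. The sketch below shows one way the mapping and the return to tsv text could be written, assuming `vec2exp` returns an array of arrays of strings and that rows are joined with tabs and newlines; `vec2expSketch` and `vec2tsvSketch` are hypothetical names, not the cli module's code.

```javascript
// Hypothetical stand-in for cli.vec2exp: numbers -> exponential-notation strings
function vec2expSketch(vec, digits = 3) {
  return vec.map(row => row.map(x => x.toExponential(digits)))
}

// Hypothetical stand-in for cli.vec2tsv: rows of cells -> tab-separated text
function vec2tsvSketch(rows) {
  return rows.map(row => row.join('\t')).join('\n')
}

// Four values taken from the start of the tensor file
const demo = [[0.016493121, 0.021351963], [-0.015399779, -0.013994825]]

const demoTsv = vec2tsvSketch(demo)                      // original decimal text
const demoTsv3 = vec2tsvSketch(vec2expSketch(demo, 3))   // 3-digit exponential text

console.log(demoTsv3) // '1.649e-2\t2.135e-2\n-1.540e-2\t-1.399e-2'
console.log(demoTsv3.length / demoTsv.length) // 37/49, about 0.76
```

On this four-value sample the ratio is already close to the one measured on the full file.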

and, after regenerating the tsv text with vec2tsv, find out how much we compressed the tensors:

tsv3 = cli.vec2tsv(exp3)
tsv3.length/tsv.length
0.7676

That is, the file shrinks to 69% of its original length with 2 digits (31% shorter) and to 77% with 3 digits (23% shorter).
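Shorter text comes at the cost of resolution. Rounding the mantissa to d digits bounds the relative error of each value by 0.5·10^-d, so at most 5e-3 for 2 digits and 5e-4 for 3 digits. The sketch below checks this on the first few values of the file; `maxRelativeError` is a hypothetical helper, not part of the cli module.

```javascript
// Hypothetical helper: largest relative error after a round trip through toExponential
function maxRelativeError(values, digits) {
  let worst = 0
  for (const x of values) {
    if (x === 0) continue // relative error is undefined for zero
    const rounded = Number(x.toExponential(digits))
    worst = Math.max(worst, Math.abs(rounded - x) / Math.abs(x))
  }
  return worst
}

// First five values of the tensor file
const firstValues = [0.016493121, 0.021351963, -0.015399779, -0.013994825, 0.03226338]

console.log(maxRelativeError(firstValues, 2) <= 5e-3) // true
console.log(maxRelativeError(firstValues, 3) <= 5e-4) // true
```

Whether that loss matters for the projection is a separate question that this check does not answer.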

DEV NOTES

identify data url

url="https://raw.githubusercontent.com/epiverse/tcgaReps/refs/heads/main/embeddings_9523.tsv"

load cli module

cli = await import('http://localhost:8000/cli/cli.mjs') // local copy, while developing
cli = await import('https://epiverse.github.io/cli/cli.mjs') // published module

load tsv file

tsv = await (await fetch(url)).text()

regenerate tensors (vec)

vec = cli.tsv2vec(tsv)

use exp notation to control size / resolution

exp3 = cli.vec2exp(vec,3)
exp2 = cli.vec2exp(vec,2)

Regenerate tsv text with exp resolution 2

tsv2 = cli.vec2tsv(exp2)

Regenerate tsv text with exp resolution 3

tsv3 = cli.vec2tsv(exp3)

tsv2.length/tsv.length = 69%

tsv3.length/tsv.length = 77%

save shortened embeddings

cli.saveFile(tsv2,'tsv2.tsv')
