Home
GitHub caps the size of each file at 100 MB (compressed). Let's find out how much the tensor TSV file can be shortened by adopting scientific notation. This tensor file is borrowed from the projection at https://bit.ly/tcgaReps. The additional files (metadata and projection JSON) can be found at this gist.
Make sure you have the cli module loaded:

    cli = await import('https://epiverse.github.io/cli/cli.mjs')
    url = "https://raw.githubusercontent.com/epiverse/tcgaReps/refs/heads/main/embeddings_9523.tsv"
    tsv = await (await fetch(url)).text()
    tsv.length
    90430307 // text, as the TF projector requires

Let's look at the first 100 characters:

    tsv.slice(0,100)
    '0.016493121\t0.021351963\t-0.015399779\t-0.013994825\t0.03226338\t0.034183376\t0.02694424\t-0.005479755\t-0.'

Regenerate the tensors (vec) from the text:

    vec = cli.tsv2vec(tsv)

See the values:

    vec[0].slice(1,10)
    [0.021351963, -0.015399779, -0.013994825, 0.03226338, 0.034183376, 0.02694424, -0.0054 ...

In scientific notation with 3 digits, i.e.

    vec[0][0].toExponential(3)
    '1.649e-2'
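
What the digits argument trades away can be checked directly. `toExponential(d)` keeps one digit before and `d` digits after the decimal point of the mantissa, so the rounding error relative to the value is at most 0.5 × 10^-d. The sketch below is only an illustration: `relativeError` is a hypothetical helper, not part of the cli module, and it assumes a finite, non-zero input.

```javascript
// Hypothetical helper (not part of cli.mjs): relative error introduced by
// keeping `digits` digits after the decimal point of the mantissa.
// Assumes x is a finite, non-zero number.
function relativeError(x, digits) {
  const rounded = Number(x.toExponential(digits)) // parse the string back
  return Math.abs(rounded - x) / Math.abs(x)
}

const x = 0.016493121 // vec[0][0] in the file above
console.log(x.toExponential(3))  // '1.649e-2'
console.log(relativeError(x, 3)) // at most 0.5e-3
console.log(relativeError(x, 2)) // at most 0.5e-2
```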

Applying toExponential to all tensors, with resolution 3:

    vec3 = vec.map(row=>row.map(x=>x.toExponential(3)))

Let's convert this into a function, vec2exp, so we don't have to remember that this is mapped to cells in an array of arrays. With it we can now generate the corresponding tsv, at 2 and 3 digits.
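
For reference, a helper of this kind could look like the sketch below. It is not the actual `cli.vec2exp` source, which is not reproduced on this page; `vec2expSketch` is a hypothetical name, and the default of 3 digits is an assumption.

```javascript
// Hypothetical sketch of a vec2exp-style helper (not the actual cli.vec2exp).
// Maps every cell of an array of arrays to scientific notation with `digits`
// digits after the decimal point. The cells of the result are strings.
function vec2expSketch(vec, digits = 3) {
  return vec.map(row => row.map(x => x.toExponential(digits)))
}

// Two short rows built from values shown earlier on this page.
const rows = [[0.016493121, 0.021351963], [-0.015399779, -0.005479755]]
console.log(vec2expSketch(rows, 3))
console.log(vec2expSketch(rows, 2))
```

The page itself relies on the library's own `cli.vec2exp`, used as follows.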

    exp2 = cli.vec2exp(vec,2) // 2 digits
    exp3 = cli.vec2exp(vec,3) // 3 digits
    tsv2 = cli.vec2tsv(exp2)
    tsv3 = cli.vec2tsv(exp3)

Now find out how much we compressed the tensors:

    tsv3.length/tsv.length
    0.7676

The text is shortened to 69% of its original length with 2 digits, and to 77% with 3 digits.
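
The 69% and 77% figures can be made plausible with a small worked example. The sketch below uses only the eight complete values visible in the first 100 characters of the file, so it illustrates the arithmetic rather than reproducing the whole-file measurement; `rowLength` and `ratio` are hypothetical helper names.

```javascript
// The eight complete values visible in the first 100 characters of the file.
const original = ['0.016493121', '0.021351963', '-0.015399779', '-0.013994825',
                  '0.03226338', '0.034183376', '0.02694424', '-0.005479755']

// Length of a tab-separated row built from an array of cell strings.
const rowLength = cells => cells.join('\t').length

// Length of the shortened row divided by the length of the original row.
const ratio = digits =>
  rowLength(original.map(s => Number(s).toExponential(digits))) / rowLength(original)

console.log(ratio(3).toFixed(2)) // 74 of 96 characters, '0.77'
console.log(ratio(2).toFixed(2)) // 66 of 96 characters, '0.69'
```

Each value shrinks from 10 to 12 characters down to 8 or 9 (3 digits) or 7 or 8 (2 digits), while the tab separators stay the same.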
DEV NOTES
identify data url

    url = "https://raw.githubusercontent.com/epiverse/tcgaReps/refs/heads/main/embeddings_9523.tsv"

load cli module

    cli = await import('http://localhost:8000/cli/cli.mjs') // local copy, or
    cli = await import('https://epiverse.github.io/cli/cli.mjs') // deployed copy

load tsv file

    tsv = await (await fetch(url)).text()

regenerate tensors (vec)

    vec = cli.tsv2vec(tsv)

use exp notation to control size / resolution

    exp3 = cli.vec2exp(vec,3)
    exp2 = cli.vec2exp(vec,2)

regenerate tsv text with exp resolution

    tsv2 = cli.vec2tsv(exp2)
    tsv3 = cli.vec2tsv(exp3)
    tsv2.length/tsv.length // 69%
    tsv3.length/tsv.length // 77%

save shortened embeddings

    cli.saveFile(tsv2,'tsv2.tsv')
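
The steps above can be tried without fetching the 90 MB file. The sketch below runs the same parse, shorten, and regenerate sequence on a two-row sample. The three functions are hypothetical stand-ins written for this illustration, not the code in cli.mjs, and they assume tab-separated numeric cells, one row per line, and no header.

```javascript
// Hypothetical stand-ins for cli.tsv2vec, cli.vec2exp and cli.vec2tsv;
// the real implementations in cli.mjs may differ.
const tsv2vecSketch = tsv =>
  tsv.trim().split(/\r?\n/).map(line => line.split('\t').map(Number))
const vec2expSketch = (vec, digits = 3) =>
  vec.map(row => row.map(x => x.toExponential(digits)))
const vec2tsvSketch = vec =>
  vec.map(row => row.join('\t')).join('\n')

// A two-row sample built from values shown on this page.
const tsvSample = '0.016493121\t0.021351963\n-0.015399779\t-0.013994825'

const vecSample = tsv2vecSketch(tsvSample)                      // numbers
const tsvSample3 = vec2tsvSketch(vec2expSketch(vecSample, 3))   // shortened text

console.log(tsvSample3)
console.log(tsvSample3.length / tsvSample.length) // 37 of 49 characters
```

Because the shortened text still parses as numbers, it can go through the same parsing step again, which is what makes it usable as a drop-in replacement for the original file.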