Posting relevant metrics for scheduling into InfluxDB for Galaxy Job Radar #1465
@TomasVondrak what do you mean by "release time"? I will see what I can do there. The job resource metrics are only available once a job finishes, so I will filter for the jobs that ended, and this will include both errored and successfully finished jobs.
By release time I mean the time when a job arrives at the destination to be computed. It is used to compute the wait time of a job, which is the time between when it entered the job pool and when it started being computed. I only need jobs that ended in the past, and I guess both errored and successfully finished ones.
We record job state change information to a certain degree, such as when the job was submitted (New state), when the job started to run (Running state), and when the job finished (Error, Ok states). You can use the state change from New to the first Running entry in the job state change table in Galaxy. I will write an SQL query to do that and put the results into InfluxDB in the coming days.
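As a sketch, the New-to-first-Running computation described above looks roughly like this in Python. The rows mimic Galaxy's job state change table (job id, state, timestamp), but the exact field names and states here are illustrative, not the real schema:

```python
from datetime import datetime

# Hypothetical rows mimicking Galaxy's job state change table:
# (job_id, state, create_time). Names are illustrative only.
history = [
    (101, "new",     datetime(2025, 5, 1, 10, 0, 0)),
    (101, "queued",  datetime(2025, 5, 1, 10, 0, 5)),
    (101, "running", datetime(2025, 5, 1, 10, 12, 0)),
    (101, "ok",      datetime(2025, 5, 1, 11, 0, 0)),
]

def wait_time_seconds(rows, job_id):
    """Seconds between the 'new' entry and the first 'running' entry."""
    times = {}
    for jid, state, ts in rows:
        # Keep only the first occurrence of each relevant state.
        if jid == job_id and state in ("new", "running") and state not in times:
            times[state] = ts
    return (times["running"] - times["new"]).total_seconds()

print(wait_time_seconds(history, 101))  # 720.0 (12 minutes of waiting)
```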
Great, thanks! Then the only other thing I need is the number of required CPUs.
@TomasVondrak Since you mentioned that you need the data to complete your master's thesis in the next two weeks, I dug into constructing the SQL query. However, the resulting query (with complex join operations) takes a long time to extract the data directly from the DB. As an alternative, I propose the following:
Here are the three SQL queries, just FYI, and for tracking.
Further, I will configure Telegraf to run this only once every four hours, with the above queries fetching the data for the last four hours. Let me know what you think, and I will build the rest around this and prepare a PR.
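The four-hour cadence could be wired up with a Telegraf `exec` input along these lines. This is only a sketch: the script path and measurement export are assumptions, not the final configuration from this thread:

```toml
# Sketch of a Telegraf exec input running the export every four hours.
# /usr/local/bin/galaxy_job_radar_export.sh is a placeholder for a script
# wrapping the three gxadmin queries; it must emit InfluxDB line protocol.
[[inputs.exec]]
  commands = ["/usr/local/bin/galaxy_job_radar_export.sh"]
  timeout = "10m"
  interval = "4h"        # per-plugin override of the agent interval
  data_format = "influx" # the awk pipelines already produce line protocol
```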
Thanks very much for this proposal; I think it is a good idea. I have been thinking about how it could be made more efficient. I am not sure if it helps, but I simplified the whole metric computation a bit, so I also need less data. I have decided that, for now, I will compute only mean slowdown, bounded slowdown, and response time, because utilization also needs the number of CPUs available on the compute nodes, which, as I understand it, cannot be obtained from InfluxDB; and even if it could, gathering that information in Galaxy is tricky, so the metric would not be reliable. For this I only need the job id, destination id, submit time, start time of computation, and finish time of each job, for e.g. the last four hours. Then I compute hourly values for these metrics and serve users just these values for the last several hours, or a day, or similar. Later I can make it more complex.

I can of course do some of the joining on my side; I am working with Python, so that won't be a problem either. Looking at how long Galaxy jobs typically are (some very quick, a few seconds, but some big, often more than four hours), maybe the export could run once a day, so that users run their tasks, the jobs finish, and the next day administrators can see how each metric evolved. I am not sure; it definitely needs more research, and this will be a starting point.
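For reference, the three metrics named above can be computed from (submit, start, finish) timestamps roughly as follows. The formulas are the usual ones from the scheduling literature, and the 10-second bounded-slowdown threshold is an illustrative choice, not something fixed in this thread:

```python
# Sketch: response time, slowdown, and bounded slowdown from
# (submit, start, finish) epoch seconds per job.
TAU = 10.0  # bounded-slowdown runtime threshold in seconds (illustrative)

def job_metrics(submit, start, finish, tau=TAU):
    response = finish - submit  # total time in the system
    runtime = finish - start    # actual compute time
    slowdown = response / runtime
    # Bounded slowdown caps the effect of very short jobs.
    bounded = max(response / max(runtime, tau), 1.0)
    return response, slowdown, bounded

# Hypothetical jobs: (submit, start, finish) epochs.
jobs = [
    (1746235804, 1746235836, 1746236036),  # waited 32 s, ran 200 s
    (1746236000, 1746236002, 1746236004),  # very short job
]

# Per-hour aggregation would group jobs by finish hour; here we just average.
means = [sum(vals) / len(vals) for vals in zip(*(job_metrics(*j) for j in jobs))]
print(means)  # [mean response, mean slowdown, mean bounded slowdown]
```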
Remember, you will have three new measurements, from which you must join everything in the post-processing stage. Below are the queries and sample output for a single job.

Query 1 (job metadata):

```shell
gxadmin query q "COPY (SELECT id AS job_id, EXTRACT(EPOCH FROM create_time)::BIGINT AS job_create_time, destination_id, tool_id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F',' '{job_id = $1; job_create_time = $2; destination_id = $3; tool_id = $4; gsub(/[ ,=]/, "_", tool_id); printf "galaxy_job_metadata,job_id=%s,tool_id=%s,destination_id=%s job_create_time=%ds\n", job_id, tool_id, destination_id, job_create_time;}'
```

Sample output:

```
galaxy_job_metadata,job_id=82962917,tool_id=toolshed.g2.bx.psu.edu/repos/iuc/fastp/fastp/0.24.0+galaxy4,destination_id=condor_tpv job_create_time=1746235804s
```

Query 2 (resource metrics):

```shell
gxadmin query q "COPY (SELECT job_id, metric_name, metric_value FROM job_metric_numeric WHERE job_id IN (SELECT id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') AND metric_name IN ('galaxy_slots', 'galaxy_memory_mb', 'memory.peak')) TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F, '{printf "galaxy_job_metrics,job_id=%s metric_name=\"%s\",metric_value=%s\n", $1, $2, $3}'
```

Sample output:

```
galaxy_job_metrics,job_id=82962917 metric_name="galaxy_slots",metric_value=4.0000000
galaxy_job_metrics,job_id=82962917 metric_name="galaxy_memory_mb",metric_value=49152.0000000
galaxy_job_metrics,job_id=82962917 metric_name="memory.peak",metric_value=23593701376.0000000
```

Query 3 (state timings):

```shell
gxadmin query q "COPY (SELECT job_id, EXTRACT(EPOCH FROM MIN(CASE WHEN state = 'running' THEN create_time END))::BIGINT AS running_start_time, EXTRACT(EPOCH FROM MAX(CASE WHEN state IN ('ok', 'error') THEN create_time END))::BIGINT AS final_state_time FROM job_state_history WHERE job_id IN (SELECT id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') AND state IN ('running', 'ok', 'error') GROUP BY job_id) TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F, '{printf "galaxy_job_state,job_id=%s running_start_time=%d final_state_time=%d\n", $1, $2, $3}'
```

Sample output:

```
galaxy_job_state,job_id=82962917 running_start_time=1746235836 final_state_time=1746276734
```
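The post-processing join of the three measurements could look roughly like this in Python. The dicts below stand in for points queried back from InfluxDB (field names taken from the sample output above); how the client actually queries InfluxDB is left out:

```python
# Sketch: join galaxy_job_metadata, galaxy_job_state, and galaxy_job_metrics
# on job_id. The dicts mimic points read back from InfluxDB.
metadata = {82962917: {"destination_id": "condor_tpv",
                       "job_create_time": 1746235804}}
state = {82962917: {"running_start_time": 1746235836,
                    "final_state_time": 1746276734}}
metrics = {82962917: {"galaxy_slots": 4.0, "galaxy_memory_mb": 49152.0}}

jobs = {}
# Only jobs present in both metadata and state can yield timing metrics;
# resource metrics are optional extras.
for job_id in metadata.keys() & state.keys():
    row = {**metadata[job_id], **state[job_id], **metrics.get(job_id, {})}
    row["wait_time"] = row["running_start_time"] - row["job_create_time"]
    jobs[job_id] = row

print(jobs[82962917]["wait_time"])  # 32
```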
I do not understand Query 2: what do I need it for, when Query 1 gives me the submit time and Query 3 the start and finish times? Or perhaps I do not fully understand it. I was also thinking that maybe it would be enough to collect the data only once a day. Would that help the infrastructure's efficiency? It might even make more sense from the perspective of the people using these metrics.
Oh, damn. I see now: those are the metrics for CPUs and memory. Sorry. As I said, I will not use them now, but maybe later I can, so I will collect them for later use.
Don't you need the job resource usage requirements, such as how much memory and how many CPU cores were allocated per job (you requested this initially)?
I see, the requirements have changed. However, feel free to use that data.
Galaxy Job Radar needs the metrics below to be accessible in InfluxDB so that it can evaluate how effective the scheduling on Pulsars, and the meta-scheduling of the whole system, have been over past time frames.
We need these metrics per destination, so that we can take all jobs from one destination in a given time frame and compute various graphs based on the metrics.
@sanjaysrikakulam