Posting relevant metrics for scheduling into InfluxDB for Galaxy Job Radar #1465
@TomasVondrak what do you mean by "release time"? I will see what I can do there. The job resource metrics are only available once a job finishes, so I will filter for the jobs that ended, and this will include both errored and successfully finished jobs.
By release time I mean the time when a job arrives at the destination to be computed. It is used to compute the wait time of a job, which is the time between when it entered the job pool and when it started being computed. I only need jobs that ended in the past, and I guess both errored and successfully finished ones.
We record job state change information to a certain degree, such as when the job was submitted (New state), when the job started to run (Running state), and when the job finished (Error, Ok states). You can use the state change from New to the first Running entry in the job state change table in Galaxy. I will write an SQL query to do that and put the results into InfluxDB in the coming days.
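As a sketch, the New-to-first-Running computation described above looks roughly like this in Python. The rows mimic Galaxy's job state change table (job id, state, timestamp), but the exact field names and states here are illustrative, not the real schema:

```python
from datetime import datetime

# Hypothetical rows mimicking Galaxy's job state change table:
# (job_id, state, create_time). Names are illustrative only.
history = [
    (101, "new",     datetime(2025, 5, 1, 10, 0, 0)),
    (101, "queued",  datetime(2025, 5, 1, 10, 0, 5)),
    (101, "running", datetime(2025, 5, 1, 10, 12, 0)),
    (101, "ok",      datetime(2025, 5, 1, 11, 0, 0)),
]

def wait_time_seconds(rows, job_id):
    """Seconds between the 'new' entry and the first 'running' entry."""
    times = {}
    for jid, state, ts in rows:
        # Keep only the first occurrence of each relevant state.
        if jid == job_id and state in ("new", "running") and state not in times:
            times[state] = ts
    return (times["running"] - times["new"]).total_seconds()

print(wait_time_seconds(history, 101))  # 720.0 (12 minutes of waiting)
```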
Great, thanks! Then the only other thing I need is the number of required CPUs.
@TomasVondrak Since you mentioned that you need the data to complete your master's thesis in the next two weeks, I dug into constructing the SQL query. However, the resulting query (with complex join operations) takes a long time to extract the data directly from the DB. As an alternative, I propose the following:
Here are the three SQL queries, just FYI, and for tracking.
Further, I will configure Telegraf to run this only once every four hours, with the above queries fetching the data for the last four hours. Let me know what you think, and I will build the rest around this and prepare a PR.
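The four-hour cadence could be wired up with a Telegraf `exec` input along these lines. This is only a sketch: the script path and measurement export are assumptions, not the final configuration from this thread:

```toml
# Sketch of a Telegraf exec input running the export every four hours.
# /usr/local/bin/galaxy_job_radar_export.sh is a placeholder for a script
# wrapping the three gxadmin queries; it must emit InfluxDB line protocol.
[[inputs.exec]]
  commands = ["/usr/local/bin/galaxy_job_radar_export.sh"]
  timeout = "10m"
  interval = "4h"        # per-plugin override of the agent interval
  data_format = "influx" # the awk pipelines already produce line protocol
```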
Thanks very much for this proposal; I think it is a good idea. I have been thinking about how it could be made more efficient. I am not sure if it helps, but I simplified the whole metric computation a bit, so I also need less data. I have decided that, for now, I will compute only mean slowdown, bounded slowdown, and response time, because utilization also needs the number of CPUs available on the compute nodes, which, as I understand it, cannot be obtained from InfluxDB; and even if it could, gathering that information in Galaxy is tricky, so the metric would not be reliable. For this I only need the job id, destination id, submit time, start time of computation, and finish time of each job, for e.g. the last four hours. Then I compute hourly values for these metrics and serve users just these values for the last several hours, or a day, or similar. Later I can make it more complex.

I can of course do some of the joining on my side; I am working with Python, so that won't be a problem either. Looking at how long Galaxy jobs typically are (some very quick, a few seconds, but some big, often more than four hours), maybe the export could run once a day, so that users run their tasks, the jobs finish, and the next day administrators can see how each metric evolved. I am not sure; it definitely needs more research, and this will be a starting point.
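For reference, the three metrics named above can be computed from (submit, start, finish) timestamps roughly as follows. The formulas are the usual ones from the scheduling literature, and the 10-second bounded-slowdown threshold is an illustrative choice, not something fixed in this thread:

```python
# Sketch: response time, slowdown, and bounded slowdown from
# (submit, start, finish) epoch seconds per job.
TAU = 10.0  # bounded-slowdown runtime threshold in seconds (illustrative)

def job_metrics(submit, start, finish, tau=TAU):
    response = finish - submit  # total time in the system
    runtime = finish - start    # actual compute time
    slowdown = response / runtime
    # Bounded slowdown caps the effect of very short jobs.
    bounded = max(response / max(runtime, tau), 1.0)
    return response, slowdown, bounded

# Hypothetical jobs: (submit, start, finish) epochs.
jobs = [
    (1746235804, 1746235836, 1746236036),  # waited 32 s, ran 200 s
    (1746236000, 1746236002, 1746236004),  # very short job
]

# Per-hour aggregation would group jobs by finish hour; here we just average.
means = [sum(vals) / len(vals) for vals in zip(*(job_metrics(*j) for j in jobs))]
print(means)  # [mean response, mean slowdown, mean bounded slowdown]
```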
Remember, you will have three new measurements, from which you must join everything in the post-processing stage. Below are the queries and sample output for a single job.

Query 1 (job metadata):

```shell
gxadmin query q "COPY (SELECT id AS job_id, EXTRACT(EPOCH FROM create_time)::BIGINT AS job_create_time, destination_id, tool_id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F',' '{job_id = $1; job_create_time = $2; destination_id = $3; tool_id = $4; gsub(/[ ,=]/, "_", tool_id); printf "galaxy_job_metadata,job_id=%s,tool_id=%s,destination_id=%s job_create_time=%ds\n", job_id, tool_id, destination_id, job_create_time;}'
```

Sample output:

```
galaxy_job_metadata,job_id=82962917,tool_id=toolshed.g2.bx.psu.edu/repos/iuc/fastp/fastp/0.24.0+galaxy4,destination_id=condor_tpv job_create_time=1746235804s
```

Query 2 (resource metrics):

```shell
gxadmin query q "COPY (SELECT job_id, metric_name, metric_value FROM job_metric_numeric WHERE job_id IN (SELECT id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') AND metric_name IN ('galaxy_slots', 'galaxy_memory_mb', 'memory.peak')) TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F, '{printf "galaxy_job_metrics,job_id=%s metric_name=\"%s\",metric_value=%s\n", $1, $2, $3}'
```

Sample output:

```
galaxy_job_metrics,job_id=82962917 metric_name="galaxy_slots",metric_value=4.0000000
galaxy_job_metrics,job_id=82962917 metric_name="galaxy_memory_mb",metric_value=49152.0000000
galaxy_job_metrics,job_id=82962917 metric_name="memory.peak",metric_value=23593701376.0000000
```

Query 3 (state timings):

```shell
gxadmin query q "COPY (SELECT job_id, EXTRACT(EPOCH FROM MIN(CASE WHEN state = 'running' THEN create_time END))::BIGINT AS running_start_time, EXTRACT(EPOCH FROM MAX(CASE WHEN state IN ('ok', 'error') THEN create_time END))::BIGINT AS final_state_time FROM job_state_history WHERE job_id IN (SELECT id FROM job WHERE state IN ('ok', 'error') AND update_time >= NOW() - INTERVAL '4 hour' AND tool_id != '__DATA_FETCH__') AND state IN ('running', 'ok', 'error') GROUP BY job_id) TO STDOUT WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');" |
  awk -F, '{printf "galaxy_job_state,job_id=%s running_start_time=%d final_state_time=%d\n", $1, $2, $3}'
```

Sample output:

```
galaxy_job_state,job_id=82962917 running_start_time=1746235836 final_state_time=1746276734
```
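The post-processing join of the three measurements could look roughly like this in Python. The dicts below stand in for points queried back from InfluxDB (field names taken from the sample output above); how the client actually queries InfluxDB is left out:

```python
# Sketch: join galaxy_job_metadata, galaxy_job_state, and galaxy_job_metrics
# on job_id. The dicts mimic points read back from InfluxDB.
metadata = {82962917: {"destination_id": "condor_tpv",
                       "job_create_time": 1746235804}}
state = {82962917: {"running_start_time": 1746235836,
                    "final_state_time": 1746276734}}
metrics = {82962917: {"galaxy_slots": 4.0, "galaxy_memory_mb": 49152.0}}

jobs = {}
# Only jobs present in both metadata and state can yield timing metrics;
# resource metrics are optional extras.
for job_id in metadata.keys() & state.keys():
    row = {**metadata[job_id], **state[job_id], **metrics.get(job_id, {})}
    row["wait_time"] = row["running_start_time"] - row["job_create_time"]
    jobs[job_id] = row

print(jobs[82962917]["wait_time"])  # 32
```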
I do not understand Query 2: what do I need it for, when Query 1 gives me the submit time and Query 3 the start and finish times? Or perhaps I do not fully understand it. I was also thinking that maybe it would be enough to collect the data only once a day. Would that help the infrastructure's efficiency? It might even make more sense from the perspective of the people using these metrics.
Oh, damn. I see now: those are the metrics for CPUs and memory. Sorry. As I said, I will not use them now, but maybe later I can, so I will collect them for later use.
Don't you need the job resource usage requirements, such as how much memory and how many CPU cores were allocated per job (you requested this initially)?
I see, the requirements have changed. However, feel free to use that data.
Galaxy Job Radar needs the metrics below to be accessible in InfluxDB so that it can evaluate how effective the scheduling on Pulsars, and the meta-scheduling of the whole system, have been over past time frames.
We need these metrics per destination, so that we can take all jobs from one destination in a given time frame and compute various graphs based on the metrics.
@sanjaysrikakulam