
Thor Queue Recommendation - Planning and Design Documentation#29

Draft
Copilot wants to merge 3 commits into master from copilot/optimize-thor-queue-usage-again

Conversation


Copilot AI commented Nov 13, 2025

Description

Adds comprehensive planning documentation for post-execution Thor queue recommendation analysis. When a workunit completes, the system should analyze its resource usage and estimate performance/cost across available queue configurations to identify the most efficient queue choice.

No code implemented - deliverable is planning documentation only, as requested in the problem statement.

Documentation Structure

Four planning documents created in common/wuanalysis/:

README_QUEUE_RECOMMENDATION.md (9KB)

Navigation hub with quick-start guides, algorithm summary, and document index. Target audiences: implementers, reviewers, project managers.

IMPLEMENTATION_SUMMARY.md (10KB)

Executive overview covering deliverables, design rationale, next steps, risk mitigation, and success metrics.

QUEUE_RECOMMENDATION_PLAN.md (18KB)

Complete technical specification:

  • Four class designs: QueueConfiguration, SubgraphResourceUsage, QueueEstimation, WorkunitQueueAnalyser
  • Time estimation algorithm with memory scaling and spill detection
  • Cost calculation using existing calcCostNs() infrastructure
  • Configuration format (XML), API design, integration strategy
  • Five implementation phases, comprehensive testing approach
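To make the four planned classes concrete, here is a minimal sketch of two of them. The class names come from the plan above, but the field names, types, and the assumption that sizes are in bytes are illustrative only (the "memory units convention" gap below is still open) and are not the final design.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of a candidate queue's constraints, mirroring the
// four constraints listed in the problem statement plus the VM cost needed
// for cost estimation. Field names/types are assumptions.
struct QueueConfiguration
{
    std::string name;            // queue/target name
    uint64_t maxRowMemory = 0;   // maximum memory available for rows (bytes assumed)
    uint64_t maxTempDisk = 0;    // maximum memory available for temp disk (bytes assumed)
    unsigned numCpus = 0;        // number of cpus
    unsigned numWorkers = 0;     // number of workers
    double vmCostPerHour = 0.0;  // input to the cost calculation
};

// Hypothetical per-subgraph statistics, mirroring the workunit statistics
// named in the problem statement. Nanosecond timings are an assumption.
struct SubgraphResourceUsage
{
    uint64_t peakRowMemory = 0;     // SizePeakRowMemory
    uint64_t peakTempDisk = 0;      // SizePeakTempDisk (0 => no activity spilled)
    uint64_t peakEphemeralDisk = 0; // SizePeakEphemeralDisk
    uint64_t timeUserNs = 0;        // TimeUser
    uint64_t timeSystemNs = 0;      // TimeSystem
    uint64_t timeElapsedNs = 0;     // TimeElapsed
};
```

The remaining two classes (QueueEstimation, WorkunitQueueAnalyser) would hold per-queue results and drive the per-subgraph loop, respectively.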

GITHUB_COPILOT_PROMPT.md (15KB)

Ready-to-use implementation guide with code templates, algorithm pseudocode, HPCC Platform style requirements, phased approach, and success criteria.

Core Algorithm

Per-subgraph estimation on each candidate queue:

// Scale memory by worker ratio (fewer workers => more memory per worker,
// per the 40-worker/100MB -> 20-worker/200MB example in the problem statement)
ScaledMemory = PeakRowMemory × (ActualWorkers / CandidateWorkers)

// Detect spilling
WillSpill = (ScaledMemory > QueueMaxMemory)

// Apply configurable penalty (default 10x) if newly spilling
if (WillSpill && !OriginallySpilled)
    EstimatedTime = ActualTime × SpillPenaltyFactor

// Scale CPU time, preserve I/O latency
CpuTime = (TimeUser + TimeSystem) × (CandidateWorkers / ActualWorkers)
BlockedTime = max(0, TimeElapsed - CpuTime)
FinalTime = CpuTime + BlockedTime

// Calculate cost
Cost = calcCostNs(QueueVMCostPerHour, FinalTime) × NumWorkers
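The per-subgraph estimation above can be sketched as a small, self-contained C++ function. The platform's existing calcCostNs() is referenced by the plan, but its exact signature is not confirmed here, so a simple stand-in is used; whether the spill penalty applies to the scaled or the actual elapsed time is also an open design point (the pseudocode and prompt differ slightly), so this sketch picks one interpretation and flags it.

```cpp
#include <cstdint>

struct SubgraphStats
{
    uint64_t peakRowMemory;   // bytes (assumption)
    uint64_t timeUserNs;
    uint64_t timeSystemNs;
    uint64_t timeElapsedNs;
    bool spilled;             // SizePeakTempDisk > 0
};

// Stand-in for the platform's calcCostNs(): cost of `ns` nanoseconds of
// runtime at a per-hour rate. Adapt to the real helper when implementing.
inline double calcCostNsStandIn(double costPerHour, uint64_t ns)
{
    return costPerHour * (double)ns / 3.6e12;  // 3600 s/hour * 1e9 ns/s
}

// Estimate how long one subgraph would take on a candidate queue.
uint64_t estimateSubgraphTimeNs(const SubgraphStats & sg,
                                unsigned actualWorkers, unsigned candidateWorkers,
                                uint64_t queueMaxMemory, double spillPenalty = 10.0)
{
    // Memory per worker grows when fewer workers share the data
    uint64_t scaledMemory = sg.peakRowMemory * actualWorkers / candidateWorkers;
    bool willSpill = scaledMemory > queueMaxMemory;

    // Scale CPU time by the worker ratio; preserve I/O (blocked) latency
    uint64_t cpuTime = sg.timeUserNs + sg.timeSystemNs;
    uint64_t estCpuTime = cpuTime * candidateWorkers / actualWorkers;
    uint64_t blockedTime = sg.timeElapsedNs > cpuTime ? sg.timeElapsedNs - cpuTime : 0;
    uint64_t estTime = estCpuTime + blockedTime;

    // Penalise subgraphs that would start spilling on the candidate queue.
    // (Applied to the scaled time here; applying it to the actual elapsed
    // time instead is the other reading of the plan.)
    if (willSpill && !sg.spilled)
        estTime = (uint64_t)(estTime * spillPenalty);
    return estTime;
}
```

For example, a subgraph that used 100MB on 40 workers needs ~200MB per worker on 20 workers; if the candidate queue only allows 150MB it is flagged as newly spilling and the 10x penalty applies.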

Critical Information Gaps

Seven gaps documented with resolution strategies:

| Priority | Gap | Resolution strategy |
|----------|-----|---------------------|
| HIGH | StatisticKind enums for Thor stats | Research jstats.h, workunit headers |
| HIGH | Queue configuration source | Design decision required |
| HIGH | Actual worker count extraction | Find in workunit statistics |
| MEDIUM | Memory units convention | Verify bytes assumption |
| MEDIUM | Spill detection mechanism | Confirm PeakTempDisk > 0 |
| MEDIUM | Integration trigger points | Decide when to run |
| LOW | Result storage approach | Workunit attrs vs separate |

Integration Points

Extends WorkunitAnalyserBase, follows patterns from WorkunitRuleAnalyser/WorkunitStatsAnalyser. New API function analyseQueueRecommendation() in anawu.hpp/cpp. Compatible with wutool CLI.
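The plan names analyseQueueRecommendation() as the new entry point; its signature is not specified, so the declaration below is a hypothetical stub only (the workunit and queue-configuration parameter types are elided, and QueueEstimation's fields are assumptions). The real design would follow the existing analyse*() helpers in anawu.hpp.

```cpp
#include <vector>

// Per-queue result record (fields are assumptions, not the final design)
struct QueueEstimation
{
    const char * queueName = nullptr;
    unsigned long long estimatedTimeNs = 0;
    double estimatedCost = 0.0;
};

// Hypothetical entry point: analyse a completed workunit against each
// candidate queue and return one estimation per queue. Real parameters
// (e.g. IConstWorkUnit *, queue configurations) are elided in this sketch.
std::vector<QueueEstimation> analyseQueueRecommendation(/* ... */)
{
    std::vector<QueueEstimation> results;
    // ... iterate candidate queues, accumulate per-subgraph estimates ...
    return results;
}
```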

Implementation Timeline

Estimated 4 weeks post-gap-resolution:

  • Week 1: Skeleton + statistics extraction
  • Week 2: Core algorithm
  • Week 3: Configuration + integration
  • Week 4: Reporting + testing

Type of change:

  • This change improves the code (refactor or other change that does not change the functionality)

Checklist:

  • My code follows the code style of this project.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
  • I have read the CONTRIBUTORS document.

Testing:

Planning documentation only - no code to test. Design validation against problem statement requirements completed.

Original prompt

When a workunit executes in Thor, the cost of running the job is closely linked to the resources that the underlying processes have available. The system allows multiple different Thor configurations (or queues), and ideally a job should target the queue that uses the minimum resources that can efficiently process it. Each queue has the following constraints:

  • maximum memory available for rows
  • maximum memory available for temp disk
  • number of cpus
  • number of workers.

The main constraint on whether a queue can execute a job efficiently is the amount of RAM it uses. A Thor job consists of multiple graphs, and each graph is split into multiple subgraphs. Each subgraph records the following statistics:

  • SizePeakRowMemory - how much memory was required to run that subgraph
  • SizePeakTempDisk - if activities within the subgraph ran out of memory, the amount of data that had to spill to disk; 0 if no activities had to spill
  • SizePeakEphemeralDisk - the maximum amount of local disk space consumed by temporary files, including spills from activities and files written to pass information between subgraphs
  • TimeUser - the amount of userspace time the subgraph consumed
  • TimeSystem - the amount of time the subgraph spent in the operating system kernel
  • TimeElapsed - the total time the subgraph took

The average cpu utilization can be computed as (TimeSystem+TimeUser)/TimeElapsed.

For each of these, the maximum values are the most interesting, because the maximum values determine the overall performance of the job.

The goal of this task is to work out after a workunit has completed, which of the queues it would have been most efficient to run on.

To do this it should estimate how long it would take a job to run on each of the different queue configurations. One suggested approach is to iterate through each of the subgraphs and calculate the expected time. If the maximum peak row memory required is less than the memory available then the estimated time should stay the same. If the subgraph spilled then the time should remain the same. Otherwise multiply the elapsed time for the subgraph by a configurable factor - defaulting to the value 10.
Scale the memory used and the estimated elapsed time by the number of workers in the candidate queue.
For instance if a job was run on a queue with 40 workers and used 100MB of memory. Then it would use 200MB of memory when running on a queue with 20 workers.

The estimated run time should be scaled in the following way:
Actual cpuTime = TimeUser + TimeSystem
BlockedTime = if (TimeElapsed > cpuTime) then TimeElapsed - cpuTime else 0
EstimatedCpuTime = cpuTime * candidate workers / actual workers.
Estimated run time = EstimatedCpuTime + BlockedTime

Run this analysis for each of the different queue configurations, and report estimated execution times for each queue. Also include an estimated cost of running the job on each of the different queues. This can be calculated from the elapsed time, the number of workers, and the VM costs. Functions already exist in the code base to perform the calculations. They may need adapting.

If you do not know how to obtain particular pieces of information to complete the task create a class to represent the information and clear comments describing what information is required.

The directory common/wuanalysis already contains C++ classes for post-analysis of a workunit. The aim would be to add new functionality to the code in that directory to perform this analysis.

Do not implement the code. Instead create a detailed plan of how you would implement this functionality, and generate a prompt as the output that would be suitable for supplying to GitHub Copilot to implement the functionality. Include clear placeholders where more information needs to be supplied.



Copilot AI and others added 2 commits November 13, 2025 12:25
Co-authored-by: ghalliday <902490+ghalliday@users.noreply.github.com>
Co-authored-by: ghalliday <902490+ghalliday@users.noreply.github.com>
Copilot AI changed the title [WIP] Optimize resource allocation for Thor job execution Thor Queue Recommendation - Planning and Design Documentation Nov 13, 2025
Copilot AI requested a review from ghalliday November 13, 2025 12:30
