
Thor Queue Recommendation - Planning and Design Documentation#29

Draft
Copilot wants to merge 3 commits into master from copilot/optimize-thor-queue-usage-again

Conversation


Copilot AI commented Nov 13, 2025

Description

Adds comprehensive planning documentation for post-execution Thor queue recommendation analysis. When a workunit completes, the system should analyze its resource usage and estimate performance/cost across available queue configurations to identify the most efficient queue choice.

No code implemented - deliverable is planning documentation only, as requested in the problem statement.

Documentation Structure

Four planning documents created in common/wuanalysis/:

README_QUEUE_RECOMMENDATION.md (9KB)

Navigation hub with quick-start guides, algorithm summary, and document index. Target audiences: implementers, reviewers, project managers.

IMPLEMENTATION_SUMMARY.md (10KB)

Executive overview covering deliverables, design rationale, next steps, risk mitigation, and success metrics.

QUEUE_RECOMMENDATION_PLAN.md (18KB)

Complete technical specification:

  • Four class designs: QueueConfiguration, SubgraphResourceUsage, QueueEstimation, WorkunitQueueAnalyser
  • Time estimation algorithm with memory scaling and spill detection
  • Cost calculation using existing calcCostNs() infrastructure
  • Configuration format (XML), API design, integration strategy
  • Five implementation phases, comprehensive testing approach
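To make the four planned classes concrete, here is a minimal sketch of two of them. The class names come from the plan above, but the field names, types, and the assumption that sizes are in bytes are illustrative only (the "memory units convention" gap below is still open) and are not the final design.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of a candidate queue's constraints, mirroring the
// four constraints listed in the problem statement plus the VM cost needed
// for cost estimation. Field names/types are assumptions.
struct QueueConfiguration
{
    std::string name;            // queue/target name
    uint64_t maxRowMemory = 0;   // maximum memory available for rows (bytes assumed)
    uint64_t maxTempDisk = 0;    // maximum memory available for temp disk (bytes assumed)
    unsigned numCpus = 0;        // number of cpus
    unsigned numWorkers = 0;     // number of workers
    double vmCostPerHour = 0.0;  // input to the cost calculation
};

// Hypothetical per-subgraph statistics, mirroring the workunit statistics
// named in the problem statement. Nanosecond timings are an assumption.
struct SubgraphResourceUsage
{
    uint64_t peakRowMemory = 0;     // SizePeakRowMemory
    uint64_t peakTempDisk = 0;      // SizePeakTempDisk (0 => no activity spilled)
    uint64_t peakEphemeralDisk = 0; // SizePeakEphemeralDisk
    uint64_t timeUserNs = 0;        // TimeUser
    uint64_t timeSystemNs = 0;      // TimeSystem
    uint64_t timeElapsedNs = 0;     // TimeElapsed
};
```

The remaining two classes (QueueEstimation, WorkunitQueueAnalyser) would hold per-queue results and drive the per-subgraph loop, respectively.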

GITHUB_COPILOT_PROMPT.md (15KB)

Ready-to-use implementation guide with code templates, algorithm pseudocode, HPCC Platform style requirements, phased approach, and success criteria.

Core Algorithm

Per-subgraph estimation on each candidate queue:

// Scale memory by worker ratio (fewer workers => more memory per worker,
// per the 40-worker/100MB -> 20-worker/200MB example in the problem statement)
ScaledMemory = PeakRowMemory × (ActualWorkers / CandidateWorkers)

// Detect spilling
WillSpill = (ScaledMemory > QueueMaxMemory)

// Apply configurable penalty (default 10x) if newly spilling
if (WillSpill && !OriginallySpilled)
    EstimatedTime = ActualTime × SpillPenaltyFactor

// Scale CPU time, preserve I/O latency
CpuTime = (TimeUser + TimeSystem) × (CandidateWorkers / ActualWorkers)
BlockedTime = max(0, TimeElapsed - CpuTime)
FinalTime = CpuTime + BlockedTime

// Calculate cost
Cost = calcCostNs(QueueVMCostPerHour, FinalTime) × NumWorkers
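The per-subgraph estimation above can be sketched as a small, self-contained C++ function. The platform's existing calcCostNs() is referenced by the plan, but its exact signature is not confirmed here, so a simple stand-in is used; whether the spill penalty applies to the scaled or the actual elapsed time is also an open design point (the pseudocode and prompt differ slightly), so this sketch picks one interpretation and flags it.

```cpp
#include <cstdint>

struct SubgraphStats
{
    uint64_t peakRowMemory;   // bytes (assumption)
    uint64_t timeUserNs;
    uint64_t timeSystemNs;
    uint64_t timeElapsedNs;
    bool spilled;             // SizePeakTempDisk > 0
};

// Stand-in for the platform's calcCostNs(): cost of `ns` nanoseconds of
// runtime at a per-hour rate. Adapt to the real helper when implementing.
inline double calcCostNsStandIn(double costPerHour, uint64_t ns)
{
    return costPerHour * (double)ns / 3.6e12;  // 3600 s/hour * 1e9 ns/s
}

// Estimate how long one subgraph would take on a candidate queue.
uint64_t estimateSubgraphTimeNs(const SubgraphStats & sg,
                                unsigned actualWorkers, unsigned candidateWorkers,
                                uint64_t queueMaxMemory, double spillPenalty = 10.0)
{
    // Memory per worker grows when fewer workers share the data
    uint64_t scaledMemory = sg.peakRowMemory * actualWorkers / candidateWorkers;
    bool willSpill = scaledMemory > queueMaxMemory;

    // Scale CPU time by the worker ratio; preserve I/O (blocked) latency
    uint64_t cpuTime = sg.timeUserNs + sg.timeSystemNs;
    uint64_t estCpuTime = cpuTime * candidateWorkers / actualWorkers;
    uint64_t blockedTime = sg.timeElapsedNs > cpuTime ? sg.timeElapsedNs - cpuTime : 0;
    uint64_t estTime = estCpuTime + blockedTime;

    // Penalise subgraphs that would start spilling on the candidate queue.
    // (Applied to the scaled time here; applying it to the actual elapsed
    // time instead is the other reading of the plan.)
    if (willSpill && !sg.spilled)
        estTime = (uint64_t)(estTime * spillPenalty);
    return estTime;
}
```

For example, a subgraph that used 100MB on 40 workers needs ~200MB per worker on 20 workers; if the candidate queue only allows 150MB it is flagged as newly spilling and the 10x penalty applies.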

Critical Information Gaps

Seven gaps documented with resolution strategies:

| Priority | Gap | Resolution strategy |
|----------|-----|---------------------|
| HIGH | StatisticKind enums for Thor stats | Research jstats.h, workunit headers |
| HIGH | Queue configuration source | Design decision required |
| HIGH | Actual worker count extraction | Find in workunit statistics |
| MEDIUM | Memory units convention | Verify bytes assumption |
| MEDIUM | Spill detection mechanism | Confirm PeakTempDisk > 0 |
| MEDIUM | Integration trigger points | Decide when to run |
| LOW | Result storage approach | Workunit attrs vs separate |

Integration Points

Extends WorkunitAnalyserBase, follows patterns from WorkunitRuleAnalyser/WorkunitStatsAnalyser. New API function analyseQueueRecommendation() in anawu.hpp/cpp. Compatible with wutool CLI.
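The plan names analyseQueueRecommendation() as the new entry point; its signature is not specified, so the declaration below is a hypothetical stub only (the workunit and queue-configuration parameter types are elided, and QueueEstimation's fields are assumptions). The real design would follow the existing analyse*() helpers in anawu.hpp.

```cpp
#include <vector>

// Per-queue result record (fields are assumptions, not the final design)
struct QueueEstimation
{
    const char * queueName = nullptr;
    unsigned long long estimatedTimeNs = 0;
    double estimatedCost = 0.0;
};

// Hypothetical entry point: analyse a completed workunit against each
// candidate queue and return one estimation per queue. Real parameters
// (e.g. IConstWorkUnit *, queue configurations) are elided in this sketch.
std::vector<QueueEstimation> analyseQueueRecommendation(/* ... */)
{
    std::vector<QueueEstimation> results;
    // ... iterate candidate queues, accumulate per-subgraph estimates ...
    return results;
}
```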

Implementation Timeline

Estimated 4 weeks post-gap-resolution:

  • Week 1: Skeleton + statistics extraction
  • Week 2: Core algorithm
  • Week 3: Configuration + integration
  • Week 4: Reporting + testing

Type of change:

  • This change improves the code (refactor or other change that does not change the functionality)

Checklist:

  • My code follows the code style of this project.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
  • I have read the CONTRIBUTORS document.

Testing:

Planning documentation only - no code to test. Design validation against problem statement requirements completed.

Original prompt

When a workunit executes in Thor, the cost of running the job is closely linked to the resources that the underlying processes have available. The system allows multiple different Thor configurations (or queues), and ideally a job should target the queue that uses the minimum resources that can efficiently process it. Each queue has the following constraints:

  • maximum memory available for rows
  • maximum memory available for temp disk
  • number of cpus
  • number of workers.

The main constraint on whether a queue can execute a job efficiently is the amount of RAM it uses. A Thor job consists of multiple graphs, and each graph is split into multiple subgraphs. Each subgraph records the following statistics:

  • SizePeakRowMemory - how much memory was required to run that subgraph
  • SizePeakTempDisk - if activities within the subgraph ran out of memory, the amount of data that had to spill to disk; 0 if no activities had to spill
  • SizePeakEphemeralDisk - the maximum amount of local disk space consumed by temporary files, including spills from activities and files written to pass information between subgraphs
  • TimeUser - the amount of userspace time the subgraph consumed
  • TimeSystem - the amount of time the subgraph spent in the operating system kernel
  • TimeElapsed - the total time the subgraph took

The average cpu utilization can be computed as (TimeSystem+TimeUser)/TimeElapsed.

For each of these, the maximum values are the most interesting, because the maximum values determine the overall performance of the job.

The goal of this task is to work out after a workunit has completed, which of the queues it would have been most efficient to run on.

To do this it should estimate how long it would take a job to run on each of the different queue configurations. One suggested approach is to iterate through each of the subgraphs and calculate the expected time. If the maximum peak row memory required is less than the memory available then the estimated time should stay the same. If the subgraph spilled then the time should remain the same. Otherwise multiply the elapsed time for the subgraph by a configurable factor - defaulting to the value 10.
Scale the memory used and the estimated elapsed time by the number of workers in the candidate queue.
For instance if a job was run on a queue with 40 workers and used 100MB of memory. Then it would use 200MB of memory when running on a queue with 20 workers.

The estimated run time should be scaled in the following way:
Actual cpuTime = TimeUser + TimeSystem
BlockedTime = if (TimeElapsed > cpuTime) then TimeElapsed - cpuTime else 0
EstimatedCpuTime = cpuTime * candidate workers / actual workers.
Estimated run time = EstimatedCpuTime + BlockedTime

Run this analysis for each of the different queue configurations, and report estimated execution times for each queue. Also include an estimated cost of running the job on each of the different queues. This can be calculated from the elapsed time, the number of workers, and the VM costs. Functions already exist in the code base to perform the calculations. They may need adapting.

If you do not know how to obtain particular pieces of information to complete the task create a class to represent the information and clear comments describing what information is required.

The directory common/wuanalysis already contains C++ classes for post-analysis of a workunit. The aim would be to add new functionality to the code in that directory to perform this analysis.

Do not implement the code. Instead create a detailed plan of how you would implement this functionality, and generate a prompt as the output that would be suitable for supplying to GitHub Copilot to implement the functionality. Include clear placeholders where more information needs to be supplied.



Copilot AI and others added 2 commits November 13, 2025 12:25
Co-authored-by: ghalliday <902490+ghalliday@users.noreply.github.com>
Co-authored-by: ghalliday <902490+ghalliday@users.noreply.github.com>
Copilot AI changed the title [WIP] Optimize resource allocation for Thor job execution Thor Queue Recommendation - Planning and Design Documentation Nov 13, 2025
Copilot AI requested a review from ghalliday November 13, 2025 12:30
