src/aiu_trace_analyzer/pipeline/rcu_utilization.py: average PT active…#83

tirtha-ibm wants to merge 1 commit into IBM:main
Conversation
… information with autopilot on in torch profiler

Signed-off-by: Tirtha Biswas <Tirtha.Biswas@ibm.com>
lasch left a comment:
Thanks for your contribution. A few comments to address please:
- currently it's failing the flake8 test with an unused global variable
- there are more flake8 issues reported:
  ```
  ./src/aiu_trace_analyzer/pipeline/flow_launch.py:71:9: E265 block comment should start with '# '
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:353:11: E271 multiple spaces after keyword
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:702:1: E305 expected 2 blank lines after class or function definition, found 1
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:704:1: C901 'compute_utilization' is too complex (18)
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:704:1: E302 expected 2 blank lines, found 0
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:710:121: E501 line too long (140 > 120 characters)
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:731:13: F824 `global jobhash_map` is unused: name is never assigned in scope
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:753:5: E303 too many blank lines (2)
  ./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:758:16: E111 indentation is not a multiple of 4
  ```
- additional comments below
This is a welcome feature addition. I'm happy to work this out with you directly.
```diff
 else:
     aiulog.log(aiulog.WARN, "FLOWS: Ignoring event with ts after schedule wait", event)
-    assert self.queues[qid]["last_ts"] <= sched_wait_end, f"{qid}: {event}, {self.queues[qid]}"
+    #assert self.queues[qid]["last_ts"] <= sched_wait_end, f"{qid}: {event}, {self.queues[qid]}"
```
Is there a reason for commenting out this check? Did you encounter traces where it failed?
```python
_autopilot_pattern = re.compile(r'DSM-AutoPilot BEGIN')
_iteration_mode_pattern = re.compile(r'^\s+(DECODING|PREFILL)\s+$')
_non_kernel_names = ["Total"]
_total = re.compile(r'Total ')
```
Please follow the naming pattern of the other variables and use `_total_pattern`.
```python
jobhash_map: dict[int, bool] = {}
processID: int = 0
```
Global variables should be avoided here. The context classes should be used to hold 'global' state for the functions of the pipeline.
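To illustrate the point above, here is a hedged sketch of holding that state in a context object instead of module globals. The class and method names (`UtilizationContext`, `first_occurrence`) are hypothetical and not taken from the codebase; the real pipeline context classes would carry this state alongside their existing fields.

```python
# Hypothetical sketch: move the module-level globals (jobhash_map, processID)
# into a per-run context object so pipeline state is not shared across runs.

class UtilizationContext:
    """Holds per-run state that was previously kept in module globals."""

    def __init__(self) -> None:
        # maps a job hash to whether its first aiuScheduleWait was seen
        self.jobhash_seen: dict[int, bool] = {}
        self.process_id: int = 0

    def first_occurrence(self, jobhash: int) -> bool:
        """Return True the first time a jobhash is observed, False afterwards."""
        if jobhash not in self.jobhash_seen:
            self.jobhash_seen[jobhash] = True
            return True
        return False
```

The event handler would then receive the context as a parameter and call `context.first_occurrence(event["args"]["jobhash"])` instead of declaring `global jobhash_map`, which also resolves the F824 flake8 finding.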
```diff
 if event["name"] != "aiuScheduleWait":
     return [event]
 try:
-    ideal_dur = float(context.get_ideal_dur(kernel_name, pid, job_fingerprint))
+    if autopilot and event["name"] == "aiuScheduleWait":
+        global jobhash_map
+        if event["args"]["jobhash"] not in jobhash_map:
+            ideal_dur = list(totals)[1]
+            jobhash_map[event["args"]["jobhash"]] = True
+        else:
+            ideal_dur = list(totals)[0]
+    else:
+        ideal_dur = float(context.get_ideal_dur(kernel_name, pid, job_fingerprint))
```
This functionality should be baked into the context. I suggest the following rough idea if autopilot is enabled:
- we add the 'total' cycles to the corresponding table structure as a 'mock-up' kernel name: e.g. 'supernode_kernel'
- `extract_kernel_from_event_name` returns the 'supernode_kernel' as the name to look up
- the rest should then fall into place automatically
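A rough sketch of that idea follows. The name `supernode_kernel` comes from the suggestion above, but the table structure and function signatures here are simplified stand-ins, not the pipeline's actual API:

```python
# Simplified sketch of the suggestion: register the autopilot 'total' cycles
# under a mock-up kernel name so the normal lookup path handles it with no
# special-casing in the event handler.

SUPERNODE_KERNEL = "supernode_kernel"  # mock-up kernel name from the comment

ideal_dur_table: dict[str, float] = {}  # stand-in for the context's table


def register_autopilot_totals(total_cycles: float) -> None:
    # store the totals as if they belonged to a regular kernel
    ideal_dur_table[SUPERNODE_KERNEL] = total_cycles


def extract_kernel_from_event_name(name: str, autopilot: bool) -> str:
    # with autopilot enabled, every lookup resolves to the mock-up kernel,
    # so the rest of the pipeline needs no autopilot branch
    if autopilot:
        return SUPERNODE_KERNEL
    return name.split("::")[-1]


def get_ideal_dur(event_name: str, autopilot: bool) -> float:
    kernel = extract_kernel_from_event_name(event_name, autopilot)
    return ideal_dur_table[kernel]
```

With this shape, the `if autopilot ... else ...` branch in the handler collapses into a single `get_ideal_dur` call.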