Skip to content

src/aiu_trace_analyzer/pipeline/rcu_utilization.py: average PT active…#83

Open
tirtha-ibm wants to merge 1 commit intoIBM:mainfrom
tirtha-ibm:autopilot
Open

src/aiu_trace_analyzer/pipeline/rcu_utilization.py: average PT active…#83
tirtha-ibm wants to merge 1 commit intoIBM:mainfrom
tirtha-ibm:autopilot

Conversation

@tirtha-ibm
Copy link
Contributor

… information with autopilot on in torch profiler

… information with autopilot on in torch profiler

Signed-off-by: Tirtha Biswas <Tirtha.Biswas@ibm.com>
Copy link
Member

@lasch lasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution. A few comments to address please:

  • currently it's failing the flake8 test with an unused global variable
  • there are more flake8 issues reported:
./src/aiu_trace_analyzer/pipeline/flow_launch.py:71:9: E265 block comment should start with '# '
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:353:11: E271 multiple spaces after keyword
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:702:1: E305 expected 2 blank lines after class or function definition, found 1
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:704:1: C901 'compute_utilization' is too complex (18)
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:704:1: E302 expected 2 blank lines, found 0
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:710:121: E501 line too long (140 > 120 characters)
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:731:13: F824 `global jobhash_map` is unused: name is never assigned in scope
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:753:5: E303 too many blank lines (2)
./src/aiu_trace_analyzer/pipeline/rcu_utilization.py:758:16: E111 indentation is not a multiple of 4
  • additional comments below

This is a welcome addition and feature. I'm happy to work this out directly.

else:
aiulog.log(aiulog.WARN, "FLOWS: Ignoring event with ts after schedule wait", event)
assert self.queues[qid]["last_ts"] <= sched_wait_end, f"{qid}: {event}, {self.queues[qid]}"
#assert self.queues[qid]["last_ts"] <= sched_wait_end, f"{qid}: {event}, {self.queues[qid]}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for commenting out this check? Did you encounter traces where it failed?

_autopilot_pattern = re.compile(r'DSM-AutoPilot BEGIN')
_iteration_mode_pattern = re.compile(r'^\s+(DECODING|PREFILL)\s+$')
_non_kernel_names = ["Total"]
_total = re.compile(r'Total ')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the pattern of the other variable names and use _total_pattern

Comment on lines +702 to +703
jobhash_map: dict[int, bool] = {}
processID: int = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Global variables should be avoided here. The context classes should be used to hold 'global' state for the functions of the pipeline.

Comment on lines +727 to +738
if event["name"] != "aiuScheduleWait":
return [event]
try:
ideal_dur = float(context.get_ideal_dur(kernel_name, pid, job_fingerprint))
if autopilot and event["name"] == "aiuScheduleWait":
global jobhash_map
if event["args"]["jobhash"] not in jobhash_map:
ideal_dur = list(totals)[1]
jobhash_map[event["args"]["jobhash"]] = True
else:
ideal_dur = list(totals)[0]
else:
ideal_dur = float(context.get_ideal_dur(kernel_name, pid, job_fingerprint))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality should be baked into the context. I suggest the following rough idea if autopilot is enabled:

  • we add the 'total' cycles to the corresponding table structure as a 'mock-up' kernel name: e.g. 'supernode_kernel'
  • extract_kernel_from_event_name returns the 'supernode_kernel' as the name to lookup
  • the rest should then fall into place automatically

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants