aterrel edited this page Jan 6, 2012 · 7 revisions

User stories are a method of gathering requirements for a project. By determining who the users are, the project can focus on their needs and, hopefully, avoid unnecessary features. Below are a few user stories guiding the tacc_stats project.

Roles

Here are the various user roles that tacc_stats tries to accommodate.

HPC Specialist

The HPC Specialist is an expert in building application codes and improving their performance. She may have extensive experience in a few domains but usually looks at general patterns of HPC codes. She wants a quick look at the data that indicates whether common problems exist in the codes running on the systems.

Systems Specialist

The Systems Specialist wants to know how the system responds to the codes running on it and what improvements she might need to make to the system.

System User

The System User is a domain expert running codes to answer research questions in her field of expertise. Ultimately she is interested in time-to-solution, but her ability to profile the code may vary. She wants a quick way to see what takes time in the code on the system and ways to mitigate the bottlenecks.

Current focus stories

Stories here are being actively developed.

Viewing data from system monitors

All users need the collected data to be organized and easy to view. See: Design of data viewers

Potential Stories

The stories here are not yet supported or planned. As a story becomes important enough, it will be specified further and moved to a milestone.

List of common bugs

The HPC Specialist knows a few common problems she encounters regularly. tacc_stats should provide a way to tell if the following problems are occurring:

  • Opening large numbers of files
  • Opening many files all at once
  • Long idle times
  • Unbalanced memory usage among running nodes
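
One way to picture this story is a set of simple threshold rules applied to a per-job summary. The sketch below is only an illustration: the field names (`total_file_opens`, `peak_opens_per_second`, `idle_fraction`, `node_mem_gb`) and all threshold values are hypothetical and do not come from tacc_stats itself.

```python
from statistics import mean

# Hypothetical thresholds; real values would need tuning per system.
MAX_TOTAL_OPENS = 100_000      # total file opens over the whole job
MAX_OPENS_PER_SECOND = 1_000   # peak burst of file opens
MAX_IDLE_FRACTION = 0.5        # share of wall time with CPUs idle
MAX_MEM_IMBALANCE = 2.0        # max node memory / mean node memory


def flag_common_problems(job):
    """Return the names of the common problems a job summary exhibits.

    `job` is an assumed dict layout, not a tacc_stats data structure.
    """
    problems = []
    if job["total_file_opens"] > MAX_TOTAL_OPENS:
        problems.append("large number of file opens")
    if job["peak_opens_per_second"] > MAX_OPENS_PER_SECOND:
        problems.append("many files opened at once")
    if job["idle_fraction"] > MAX_IDLE_FRACTION:
        problems.append("long idle times")
    node_mem = job["node_mem_gb"]
    avg = mean(node_mem)
    # Compare the most loaded node against the average node.
    if avg > 0 and max(node_mem) / avg > MAX_MEM_IMBALANCE:
        problems.append("unbalanced memory usage")
    return problems
```

A job that passes every check returns an empty list, so the result can be used directly as a filter when scanning many jobs.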

System user knowledge

The HPC Specialist would like to track the progress of system users to see if they are improving over time. By doing so she is able to snapshot the effectiveness of training programs and user support.

System diagnosis

The Systems Specialist knows that not all hardware is created equal. She wants to find slow nodes and fix them. tacc_stats should give a view into the data that makes this quick.
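
As a rough illustration of what "finding slow nodes" could mean, the sketch below compares each node against the median of its peers. It assumes a hypothetical mapping from node name to the time that node took on some comparable unit of work; neither that mapping nor the `factor` cutoff is defined by tacc_stats.

```python
from statistics import median


def find_slow_nodes(seconds_by_node, factor=1.5):
    """Return nodes whose time exceeds `factor` times the median time.

    `seconds_by_node` is an assumed {node_name: seconds} mapping for
    comparable work on each node; `factor` is an arbitrary cutoff.
    """
    if not seconds_by_node:
        return []
    typical = median(seconds_by_node.values())
    return sorted(
        node
        for node, seconds in seconds_by_node.items()
        if seconds > factor * typical
    )
```

The median is used instead of the mean so that a single very slow node does not raise the baseline it is being compared against.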

Bad Jobs Notification

The HPC Specialist would like to be notified nightly about the worst jobs from the previous day based on a set of established metrics. By highlighting under-performing or misconfigured jobs, the HPC Specialist will be able to contact the users and help them improve their code or process.
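
A nightly report of this kind amounts to ranking the previous day's jobs by a metric and keeping the worst few. The sketch below uses CPU efficiency as one example metric; the metric choice and the job fields (`wall_seconds`, `cores`, `cpu_seconds`) are assumptions for illustration, since the established metrics are not specified here.

```python
import heapq


def cpu_efficiency(job):
    """Fraction of the allocated core time the job actually spent on CPU."""
    allocated = job["wall_seconds"] * job["cores"]
    return job["cpu_seconds"] / allocated if allocated else 0.0


def worst_jobs(jobs, n=10):
    """Return the n jobs with the lowest CPU efficiency, worst first."""
    return heapq.nsmallest(n, jobs, key=cpu_efficiency)
```

Swapping in a different `key` function would rank jobs by another metric without changing the rest of the report.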

Who is using hardware performance counters?

If a user enables some CPU performance counters, TACC Stats gets out of the way and records zeros in its CPU counter data. We should be able to find these users quickly and take note of them. Possible uses of the data include:

  • Reporting fraction of jobs/users that have ever profiled their codes
  • Contacting these users to see how their profiling studies went
  • Using these users' experiences to create a feature story or use case for the web on how profiling can improve scientific codes, find bugs, etc.
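
Since the signal described above is "all zeros in the CPU counter data", a first pass at finding these users could look like the sketch below. The input layout (job id mapped to a flat list of CPU counter samples) is a hypothetical simplification, not the actual tacc_stats record format, and a job whose counters are zero for some other reason would be a false positive.

```python
def jobs_with_zeroed_cpu_counters(counters_by_job):
    """Return ids of jobs whose CPU counter samples are all zero.

    `counters_by_job` is an assumed {job_id: [sample, ...]} mapping.
    Jobs with no samples at all are skipped rather than flagged.
    """
    return sorted(
        job_id
        for job_id, samples in counters_by_job.items()
        if samples and all(value == 0 for value in samples)
    )


def profiled_fraction(counters_by_job):
    """Fraction of jobs that appear to have used the counters themselves."""
    if not counters_by_job:
        return 0.0
    flagged = jobs_with_zeroed_cpu_counters(counters_by_job)
    return len(flagged) / len(counters_by_job)
```

The second function corresponds to the first use listed above, reporting the fraction of jobs that have profiled their codes; grouping the flagged job ids by owner would give the per-user version.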

Success stories

Stories here have been implemented, tested, and put into production.

Monitor subsystems

Each system has a lot of monitorable data. To even get started on any stories, a set of system monitors needs to be created and set up to run regularly.