Architecture
[[TOC]]
EAR is formed by a set of components that together provide a full system software stack: it accounts the power and energy consumption of jobs and applications in a cluster, provides a runtime library for application performance monitoring and optimization that can be loaded dynamically during application execution, includes a global power-capping system, and offers a flexible reporting system that fits any storage requirements for saving all the collected data. All of this is designed to be as transparent as possible from the user point of view. This section introduces all of these components and how they are stacked to provide the different EAR services and features.
This is the most basic feature. EAR collects node power consumption and reports it periodically thanks to the EAR Node Manager (EARD), a Linux service that runs on each compute node. It is up to the sysadmin to decide how and where its periodic metrics are reported. The following figure shows this scheme.
The EAR Node Manager provides an API which can be used by a batch scheduler plug-in/hook to indicate the start/end of jobs/steps, so it can account for the power consumption of those entities. Currently, the EAR distribution comes with a SLURM SPANK plug-in supporting the accounting of jobs and steps on SLURM systems.
Along with the applications running on compute nodes, a runtime library can be loaded dynamically (again thanks to the batch scheduler support). The EAR Job Manager (EARL) runs within application/workflow processes, so it can collect performance metrics, which can be reported in the same (still configurable) way as with the Node Manager. Moreover, the Job Manager comes with optimization policies, which select the optimal CPU/IMC/GPU frequencies based on those performance metrics by contacting the Node Manager. The figure below shows the interaction between these two components.
The Node Manager (EARD) is a per-node Linux service that provides privileged metrics of each node as well as a periodic power monitoring service. These periodic power metrics can be sent to EAR's database directly, via the EAR Database Daemon (EARDBD), or by using one of the provided report plug-ins.
See the EARDBD section and the configuration page for more information about the EAR Database Manager and how to configure the EARD to send its collected data to it.
EARD is the component in charge of providing any service that requires privileged capabilities. The current version is conceived as an external process executed with root privileges.
It provides the following services, each one covered by one thread:
- Provides privileged metrics to EARL such as the average frequency, uncore integrated memory controller counters to compute the memory bandwidth, as well as energy metrics (DC node, DRAM and package energy).
- Implements a periodic power monitoring service. This service allows EAR package to control the total energy consumed in the system.
- Offers a remote API used by the EAR SLURM plug-in, EARGM and the EAR commands. This API accepts requests such as getting the system status, changing policy settings, or notifying new job/end job events.
If using the EAR Database as the storage target, EARD connects with the EARDBD service, which has to be up before starting the node daemon; otherwise, values reported by EARD to be stored in the database will be lost.
The EAR Daemon uses the $(EAR_ETC)/ear/ear.conf file to be configured.
It can be dynamically configured by reloading the service.
Please visit the EAR configuration file page for more information about the options of EARD and other components.
To execute this component, these systemctl command examples are provided:
- `sudo systemctl start eard` to start the EARD service.
- `sudo systemctl stop eard` to stop the EARD service.
- `sudo systemctl reload eard` to force reloading the configuration of the EARD service.
Log messages are generated during the execution. Use the journalctl command to see EARD messages:
sudo journalctl -u eard -f
After executing a systemctl reload eard command, not all EARD options will be dynamically updated. The list of updated variables is:
- `DefaultPstates`
- `NodeDaemonMinPstate`
- `NodeDaemonVerbose`
- `NodeDaemonPowermonFreq`
- `SupportedPolicies`
- `MinTimePerformanceAccuracy`
To reconfigure other options such as the EARD connection port, coefficients, etc., the service must be stopped and started again. Visit the EAR configuration file page for more information about the options of EARD and other components.
The EAR Database Daemon (EARDBD) acts as an intermediate layer between any EAR component that inserts data and the EAR's Database, in order to prevent the database server from collapsing due to getting overrun with connections and insert queries.
The Database Manager caches records generated by the EAR Library and the EARDs in the system and reports them to the centralized database. It is recommended to run several EARDBDs if the cluster is big enough, in order to reduce the number of inserts and connections to the database.
The EARDBD also accumulates data over a period of time to decrease the total number of insertions in the database, helping the performance of big queries. For now, only the energy metrics are accumulated, in a new metric called energy aggregation. EARDBD uses the periodic power metrics sent by EARD, the per-node daemon, including job identification details (Job Id and Step Id when executed in a SLURM system).
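To illustrate the idea, here is a minimal C sketch of the accumulation step, assuming a hypothetical record layout, flush function and reporting period; none of this is the actual EARDBD code or database schema:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical aggregation record; the real schema is defined by the EAR database. */
typedef struct {
    unsigned long energy_j;   /* accumulated DC energy (J) over the period  */
    time_t        start;      /* start of the aggregation period            */
    time_t        end;        /* end of the aggregation period              */
    unsigned int  samples;    /* number of per-node samples accumulated     */
} energy_aggregation_t;

/* One DB insert per period instead of one insert per node sample. */
static void flush_aggregation(energy_aggregation_t *agg)
{
    printf("insert aggregation: %lu J from %u samples\n", agg->energy_j, agg->samples);
    agg->energy_j = 0;
    agg->samples  = 0;
    agg->start    = agg->end;
}

static void on_node_report(energy_aggregation_t *agg, unsigned long node_energy_j,
                           time_t now, unsigned int period_s)
{
    agg->energy_j += node_energy_j;
    agg->samples++;
    agg->end = now;
    if ((unsigned int)(now - agg->start) >= period_s)
        flush_aggregation(agg);
}

int main(void)
{
    energy_aggregation_t agg = { 0, time(NULL), time(NULL), 0 };
    on_node_report(&agg, 350, time(NULL), 0);   /* period 0 forces an immediate flush */
    return 0;
}
```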
The EAR Database Daemon uses the $(EAR_ETC)/ear/ear.conf file to be configured. It can be dynamically configured by reloading the service.
Please visit the EAR configuration file page for more information about the options of EARDBD and other components.
To execute this component, these systemctl command examples are provided:
- `sudo systemctl start eardbd` to start the EARDBD service.
- `sudo systemctl stop eardbd` to stop the EARDBD service.
- `sudo systemctl reload eardbd` to force reloading the configuration of the EARDBD service.
The EAR Global Manager Daemon (EARGMD) is a cluster-wide component offering cluster energy monitoring and capping. EARGM can work in two modes: manual and automatic. When running in manual mode, EARGM monitors the total energy consumption, evaluates the percentage of energy consumption over the energy limit set by the admin, and reports the cluster status to the DB. When running in automatic mode, apart from evaluating the energy consumption percentage, it also sends the evaluation to the compute nodes. The EARDs pass these messages to EARL, which re-applies the energy policy with the new settings.
Apart from sending messages and reporting the energy consumption to the DB, EARGM offers additional ways to notify about the energy consumption: automatic execution of commands is supported, and mails can also be sent automatically. Both the command to be executed and the mail address can be defined in ear.conf, where the energy limits, the monitoring period, etc. can also be specified.
EARGM uses periodic aggregated power metrics to efficiently compute the cluster energy consumption. Aggregated metrics are computed by EARDBD based on power metrics reported by EARD, the per-node daemon.
Note: if you have multiple EARGMs running, only one should be used for energy management. To turn off energy management for a given EARGM, simply set its energy value to 0.
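As an illustration only, the following C sketch shows the kind of evaluation described above: the percentage of the admin-defined energy limit already consumed is computed and, in automatic mode, sent to the nodes. The warning threshold and the notify_nodes() helper are invented for the example:

```c
#include <stdio.h>
#include <stdbool.h>

static void notify_nodes(double perc)   /* stand-in for the real EARGM -> EARD message */
{
    printf("sending energy status (%.1f%% of the limit) to compute nodes\n", perc);
}

static void evaluate_energy(double consumed_kj, double limit_kj,
                            bool automatic_mode, double warning_perc)
{
    double perc = 100.0 * consumed_kj / limit_kj;

    printf("cluster energy: %.1f%% of the configured limit\n", perc);

    /* In automatic mode the evaluation is also sent to the nodes, where the
     * EARDs pass it to EARL so the policy is re-applied with new settings. */
    if (automatic_mode && perc >= warning_perc)
        notify_nodes(perc);
}

int main(void)
{
    evaluate_energy(8500.0, 10000.0, true, 85.0);   /* example values */
    return 0;
}
```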
EARGM also includes an optional power capping system. Power capping can work in two different ways:
- Cluster power cap (unlimited): each EARGM controls the power consumption of the nodes under it by ensuring the global power does not exceed a set value. While the global power is under a percentage of the global value, the nodes run without any cap. If it approaches said value, a message is sent to all nodes to set their powercap to a pre-set value (via max_powercap in the tags section of ear.conf). Should the power go back to a value under the cap, a message is sent again so the nodes run at their default value (unlimited power). A sketch of this mode is shown below, after this list.
- Fine-grained power cap control: each EARGM controls the power consumption of the nodes under it and redistributes a certain budget between the nodes, allocating more to nodes that need it. It guarantees that any node running an application gets its default powercap allocation (defined by the powercap field in the tags section of ear.conf).
Furthermore, when using fine grained power cap control it is possible to have multiple EARGMs, each controlling a part of the cluster, with (or without) meta-EARGMs redistributing the power allocation of each EARGM depending on the current needs of each part of the cluster. If no meta-EARGMs are specified, the power value each EARGM has will be static.
Meta-EARGMs are NOT compatible with the unlimited cluster powercap mode.
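The following C sketch illustrates the unlimited cluster power cap mode described in the list above; the threshold percentage, the POWERCAP_UNLIMITED value and the send_node_powercap() helper are assumptions for illustration, not the real EARGM interface:

```c
#include <stdio.h>
#include <stdbool.h>

#define POWERCAP_UNLIMITED 0UL   /* hypothetical "no cap" value */

static void send_node_powercap(unsigned long watts)   /* stand-in for the real message to the EARDs */
{
    printf("telling every node to set its powercap to %lu W\n", watts);
}

static void cluster_powercap_step(double cluster_power_w, double cluster_limit_w,
                                  unsigned long max_powercap_w,  /* max_powercap from the tag */
                                  double threshold_perc, bool *capped)
{
    double trigger = cluster_limit_w * threshold_perc / 100.0;

    if (!*capped && cluster_power_w >= trigger) {
        send_node_powercap(max_powercap_w);       /* approaching the limit: cap all nodes */
        *capped = true;
    } else if (*capped && cluster_power_w < trigger) {
        send_node_powercap(POWERCAP_UNLIMITED);   /* back under the limit: release the cap */
        *capped = false;
    }
}

int main(void)
{
    bool capped = false;
    cluster_powercap_step(9500.0, 10000.0, 500UL, 90.0, &capped);   /* example values */
    return 0;
}
```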
EARGM has a local version that can be run without privileges and that controls the power consumption of a list of nodes. This can be used as a rudimentary form of job powercap, where a job with N nodes is not allowed to consume more than a certain amount of power. In the current version, if the allocated power is exceeded, a powercap is applied to all nodes equally (that is, the same amount of power is allocated to each node, regardless of their actual consumption). Furthermore, custom scripts may be executed when the power reaches certain thresholds, so the user has more control over what to do.
See the execution section for how to run this mode.
The EAR Global Manager uses the $(EAR_ETC)/ear/ear.conf file to be configured. It can be dynamically configured by reloading the service.
Please visit the EAR configuration file page for more information about the options of EARGM and other components.
Additionally, two EARGMs can be used in the same host by declaring the environment variable EARGMID to specify which EARGM configuration each should use. If said variable is not declared, all EARGMs in the same host will read the first entry.
To execute this component, these systemctl command examples are provided:
- `sudo systemctl start eargmd` to start the EARGM service.
- `sudo systemctl stop eargmd` to stop the EARGM service.
- `sudo systemctl reload eargmd` to force reloading the configuration of the EARGM service.
To execute a local EARGM with powercap for certain nodes, one may run it as:
eargmd --powercap=2000 --nodes=node[0-4] --powercap-policy soft --suspend-perc 90 --suspend-action suspend_action.sh --powercap-period=10 --conf-path=$HOME/ear_install/etc/ear/ear.conf
This will execute an EARGM that controls nodes node[0-4] and applies a total powercap of 2000W with a soft powercap policy (that is, the application will run as normal unless the aggregated power of all 5 nodes reaches 2000W, at which point a power limit of 400W per node will be applied).
suspend-perc indicates the percentage of the powercap at which suspend-action is executed; in this case, when power reaches 1800W, suspend_action.sh will be called once. A reciprocal pair exists, resume-perc and resume-action, where the action is only called once power has fallen back to resume-perc AND suspend-action has already been called.
Finally, powercap-period sets the time between polls for power from the nodes (how often the EARGM checks the current power consumption), and conf-path specifies a custom ear.conf file.
For more information, one can run eargmd --help.
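The following C sketch illustrates this suspend/resume threshold logic; the helper function, the example resume-perc value and the resume script name are only for illustration and are not part of the real eargmd code:

```c
#include <stdlib.h>
#include <stdbool.h>

/* Run suspend_action once when power reaches suspend-perc of the powercap,
 * and resume_action only after the power falls back to resume-perc AND
 * suspend_action was previously triggered. */
static void check_thresholds(double power_w, double powercap_w,
                             double suspend_perc, double resume_perc,
                             const char *suspend_action, const char *resume_action,
                             bool *suspended)
{
    if (!*suspended && power_w >= powercap_w * suspend_perc / 100.0) {
        system(suspend_action);   /* e.g. suspend_action.sh, called once */
        *suspended = true;
    } else if (*suspended && power_w <= powercap_w * resume_perc / 100.0) {
        system(resume_action);    /* only runs after suspend_action has run */
        *suspended = false;
    }
}

int main(void)
{
    bool suspended = false;
    /* With a 2000 W cap and suspend-perc 90, the action fires at 1800 W. */
    check_thresholds(1850.0, 2000.0, 90.0, 70.0,
                     "./suspend_action.sh", "./resume_action.sh", &suspended);
    return 0;
}
```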
The EAR Library (EARL) is the core of the EAR package. The Library offers a lightweight and simple solution to select the optimal frequency for applications at runtime, with multiple power policies each with a different approach to find said frequency.
EARL uses the Daemon to read performance metrics and to send application data to the EAR Database.
EARL is dynamically loaded next to the running application by the EAR Loader. The Loader detects whether the application is MPI or not. If it is MPI, it also detects whether it is Intel MPI or OpenMPI, and intercepts the MPI symbols through the PMPI interface; the next symbols in the chain are saved in order to keep compatibility with MPI and other profiling tools. The Library is divided into several stages, summarized in the following picture:

- Automatic detection of application outer loops. This is done by intercepting MPI calls and invoking the Dynamic Application Iterative Structure detector algorithm. DynAIS is highly optimized for new Intel architectures, reporting low overhead. For non-MPI applications, EAR implements a time-guided approach.
- Computation of the application signature. Once DynAIS starts reporting iterations for the outer loop, EAR starts to compute the application signature. This signature includes: iteration time, DC power consumption, bandwidth, cycles, instructions, etc. Since the error of the DC power measurements highly depends on the hardware, EAR automatically detects the hardware characteristics and sets a minimum time to compute the signature in order to minimize the average error.

The loop signature is used to classify the application activity into different phases. The current EAR version supports the following phases: IO bound, CPU computation with GPU idle, CPU busy waiting with GPU computing, CPU-GPU computation, and CPU computation (for CPU-only nodes). For phases including CPU computation, the optimization policy is applied. For other phases, the EAR library applies some predefined CPU/memory/GPU frequency settings.
- Power and performance projection. EAR has its own performance and power models, which require the application and the system signatures as input. The system signature is a set of coefficients characterizing each node in the system; they are computed during the learning phase at the EAR configuration step. EAR projects the power and computing time (performance) of the running application for all the available frequencies in the system. These models are applied to CPU metrics and project CPU performance and power when varying the CPU frequency. Using these projections, the optimization policy can select the optimal CPU frequency.

- Apply the selected energy optimization policy. EAR includes two power policies that can be selected at runtime, if permitted by the system administrator: minimize time to solution and minimize energy to solution. At this point, EAR executes the power policy, using the projections computed in the previous phase, and selects the optimal frequency for the application and its particular run. An additional policy, monitoring only, can also be used; in this case no changes to the running frequency are made, only the computation and storage of the application signature and metrics. The short version of the names is used when submitting jobs (min_energy, min_time, monitoring). Current policies already include memory frequency selection, but in this case it is not based on models; it is a guided search. Check in your installation whether the memory frequency optimization is enabled by default. If the application is MPI, the policies also classify the processes as balanced or unbalanced. If they are unbalanced, a per-process CPU frequency is applied.
Some specific configurations are modified when jobs are executed sharing nodes with other jobs; for example, the memory frequency optimization is disabled. See the environment variables page for more information on how to tune the EAR library optimization using environment variables.
The Library uses the $(EAR_ETC)/ear.conf file to be configured.
Please visit the EAR configuration file page for more information about the options of EARL and other components.
EARL receives its specific settings through a shared memory region initialized by EARD.
For information on how to run applications with EARL, read the User guide. The next section contains more information regarding EAR's optimization policies.
In the context of the Library's pipeline, phase classification is the module that, given the last computed application signature, undertakes the task of identifying the type of activity of the application, thereby giving hints to optimize its execution. By approaching the application signature as a semantic expression of this activity, the classification allows for guiding (or even skipping, if possible) subsequent steps of the pipeline.
Taking this activity as what we call an execution phase, EAR currently accounts for the following types:
- Computational phases, which focus on the activity of the CPU in a way that is useful for the Job Manager's pipeline. This type includes execution phases such as:
  - CPU-bound: intensity of calculus-related operations, as measured by the Cycles per Instruction (`CPI`) and Floating-point operations (`GFLOPS`).
  - Memory-bound: intensity of accesses to (main) memory, as measured by `MEM_GBS` (in GB/s) and the memory Transactions per Instruction (`TPI`).
  - Mix: intensity distributed between both calculus-related operations and accesses to main memory.
- Non-computational phases, which focus on the activity of the CPU in a way that allows for applying pre-defined optimizations to the application. This type includes the following execution phases:
  - CPU busy wait: intensive usage of the CPU due to an active wait.
  - IO-bound: intensive usage of input/output channels.
  - MPI-bound: presence of lots of MPI calls.
Because correctly identifying computational phases can optimize the Library's pipeline, and given that making solid distinctions between execution phases is a complex task, the classification strategy becomes a key element of the main action loop. Currently, EAR incorporates three different strategies:
- Default strategy: EAR's default classification model is based on setting predefined ranges of values for the `CPI` and `MEM_GBS` metrics. These ranges, defined according to the architecture's characteristics via expert knowledge, allow identifying the different execution phases on a fundamental level, and are available since EAR's installation (see the sketch after this list).
- Roofline strategy: this approach is based on the roofline model, which conducts a bottleneck analysis of the architecture's peak floating-point performance and memory traffic to characterize the activity of any application. This strategy becomes available once the peaks for both resources have been computed, and allows for identifying execution phase types in a simple and quick way at runtime.
- K-medoids strategy: this approach is based on the classification offered by the k-medoids clustering method, originally derived from k-means. The strategy, which becomes available once EAR has enough data in the database, allows for a more flexible classification than that of previous strategies, while also allowing for regeneration over time, as needed.
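As an illustration of the default strategy, the following C sketch classifies a signature by comparing its CPI and MEM_GBS values against predefined ranges. The threshold values and helper names are invented for the example; the real ranges depend on each architecture and come from expert knowledge:

```c
#include <stdio.h>

typedef enum { PHASE_CPU_BOUND, PHASE_MEM_BOUND, PHASE_MIX } comp_phase_t;

/* Threshold-based classification: low CPI and low bandwidth -> CPU bound;
 * high bandwidth and high CPI -> memory bound; anything else -> mix. */
static comp_phase_t classify_default(double cpi, double mem_gbs,
                                     double cpi_cpu_max, double gbs_mem_min)
{
    if (mem_gbs >= gbs_mem_min && cpi > cpi_cpu_max)  return PHASE_MEM_BOUND;
    if (mem_gbs <  gbs_mem_min && cpi <= cpi_cpu_max) return PHASE_CPU_BOUND;
    return PHASE_MIX;
}

int main(void)
{
    /* Invented example thresholds: CPI <= 0.6 and MEM_GBS < 100 GB/s. */
    printf("phase = %d\n", classify_default(0.45, 20.0, 0.6, 100.0));
    return 0;
}
```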
EAR offers three energy policy plugins: min_energy, min_time and monitoring.
The last one is not a power policy; it is used just for application monitoring, where neither the CPU frequency nor the memory or GPU frequencies are modified.
For application analysis, monitoring can be used with specific CPU, memory and/or GPU frequencies.
The energy policy is selected by setting the --ear-policy=policy option when submitting a SLURM job.
A policy parameter, which is a particular value or threshold depending on the policy, can be set using the flag --ear-policy-th=value.
Its default value is defined in the configuration file; check the configuration page for more information.
The goal of this policy is to minimize the energy consumed, with a limit on the performance degradation. This limit is set with the SLURM --ear-policy-th option or in the configuration file. The min_energy policy selects the optimal frequency that minimizes energy while enforcing (performance degradation <= parameter). When executing with this policy, applications start at the default frequency (specified in ear.conf).
PerfDegr = (CurrTime - PrevTime) / (PrevTime)
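A minimal C sketch of this selection logic is shown below, assuming arrays of projected time and power per available frequency; it is an illustration of the policy idea, not the EARL implementation:

```c
#include <stddef.h>

/* Pick the frequency whose projected energy (projected power x projected time)
 * is minimal while the projected performance degradation with respect to the
 * current run stays within the policy threshold. If no candidate satisfies
 * the constraint, index 0 (the default frequency) is kept. */
size_t min_energy_select(const double *proj_time_s, const double *proj_power_w,
                         size_t nfreqs, double curr_time_s, double threshold)
{
    size_t best = 0;
    double best_energy = -1.0;

    for (size_t i = 0; i < nfreqs; i++) {
        double perf_degr = (proj_time_s[i] - curr_time_s) / curr_time_s;  /* PerfDegr */
        double energy    = proj_time_s[i] * proj_power_w[i];              /* projected energy */
        if (perf_degr <= threshold && (best_energy < 0.0 || energy < best_energy)) {
            best_energy = energy;
            best = i;
        }
    }
    return best;   /* index of the selected frequency */
}
```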
The goal of this policy is to improve the execution time while guaranteeing a minimum ratio between performance benefit and frequency increment that justifies the increased energy consumption from this frequency increment. The policy uses the SLURM parameter option mentioned above as a minimum efficiency threshold.
Example: if --ear-policy-th=0.75, EAR will prevent scaling to higher frequencies if the ratio between performance gain and frequency gain does not improve by at least 75% (PerfGain >= FreqGain * threshold).
PerfGain = (PrevTime - CurrTime) / PrevTime
FreqGain = (CurrFreq - PrevFreq) / PrevFreq
When launched with min_time policy, applications start at a default frequency (defined at ear.conf).
Check the configuration page for more information.
Example: given a system with a nominal frequency of 2.3GHz and the default P_STATE set to 3, an application executed with min_time will start with frequency F[i]=2.0GHz (3 P_STATEs below nominal). When application metrics are computed, the library will compute the performance projection for F[i+1] and the performance_gain as shown in Figure 1. If the performance gain is greater than or equal to the threshold, the policy will check the next performance projection, F[i+2]. If the performance gain computed is less than the threshold, the policy will select the last frequency where the performance gain was sufficient, preventing the waste of energy.

Figure 1: min_time uses the threshold value as the minimum value for the performance gain between F[i] and F[i+1].
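The following C sketch illustrates this scaling loop over the projected times, assuming frequencies ordered from the default upwards; it is an illustration of the policy idea, not the EARL implementation:

```c
#include <stddef.h>

/* Starting at the default frequency, keep moving to the next higher frequency
 * while the projected performance gain justifies the frequency increment
 * (PerfGain >= FreqGain * threshold); otherwise keep the last frequency that
 * provided enough gain. */
size_t min_time_select(const double *freq_hz, const double *proj_time_s,
                       size_t nfreqs, double threshold)
{
    size_t sel = 0;   /* index of the default frequency */

    for (size_t next = 1; next < nfreqs; next++) {
        double perf_gain = (proj_time_s[sel] - proj_time_s[next]) / proj_time_s[sel];
        double freq_gain = (freq_hz[next] - freq_hz[sel]) / freq_hz[sel];

        if (perf_gain >= freq_gain * threshold) {
            sel = next;   /* enough gain: accept the higher frequency */
        } else {
            break;        /* not enough gain: stop scaling up */
        }
    }
    return sel;
}
```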
The EAR Loader is responsible for loading the EAR Library.
It is a small and lightweight library loaded by the EAR SLURM Plugin (through the LD_PRELOAD environment variable) that identifies the user application and loads its corresponding EAR Library distribution.
The Loader detects the underlying application, identifying the MPI version (if used) and other minor details. With this information, the loader opens the suitable EAR Library version.
As can be read in the EARL page, the MPI types can differ depending on the MPI vendor, which prevents compatibility between distributions. For example, if the MPI distribution is OpenMPI, the EAR Loader will load the EAR Library compiled with the OpenMPI includes.
You can read the installation guide for more information about compiling and installing different EARL versions.
The EAR SLURM plug-in allows dynamically loading and configuring the EAR Library for SLURM jobs (and steps), if the flag --ear=on is set or if it is enabled by default.
Additionally, it reports any jobs that start or end to the nodes' EARDs for accounting and monitoring purposes.
Visit the SLURM SPANK plugin section on the configuration page to properly set up the SLURM /etc/slurm/plugstack.conf file.
You can find the complete list of parameters accepted by the EAR SLURM plugin in the user guide.
It is a new EAR service for Data Center monitoring. In particular, it targets elements other than the compute nodes, which are already monitored by the EARDs running on them. It has a dedicated section you can read for more information.
EAR offers a user API for applications. The current EAR version only offers two sets of functions:
- To measure the energy consumption.
- To set the CPU and GPU frequencies.

The API consists of the following functions:

- `int ear_connect()`
- `int ear_energy(unsigned long *energy_mj, unsigned long *time_ms)`
- `void ear_energy_diff(unsigned long ebegin, unsigned long eend, unsigned long *ediff, unsigned long tbegin, unsigned long tend, unsigned long *tdiff)`
- `int ear_set_cpufreq(cpu_set_t *mask, unsigned long cpufreq)`
- `int ear_set_gpufreq(int gpu_id, unsigned long gpufreq)`
- `int ear_set_gpufreq_list(int num_gpus, unsigned long *gpufreqlist)`
- `void ear_disconnect()`
EAR's header file and library can be found at $EAR_INSTALL_PATH/include/ear.h and $EAR_INSTALL_PATH/lib/libEAR_api.so respectively. The following example reports the energy, time, and average power during that time for a simple loop including a sleep(5).
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <ear.h>

int main(int argc, char *argv[])
{
    unsigned long e_mj = 0, t_ms = 0;
    unsigned long e_mj_init, t_ms_init, e_mj_end, t_ms_end;
    unsigned long ts, ts_e;
    int i = 0;
    struct tm *tstamp, *tstamp2;
    char s[128], s2[128];

    /* Connecting with the EARD */
    if (ear_connect() != EAR_SUCCESS) {
        printf("error connecting eard\n");
        exit(1);
    }
    /* Reading the initial energy */
    if (ear_energy(&e_mj_init, &t_ms_init) != EAR_SUCCESS) {
        printf("Error in ear_energy\n");
    }
    while (i < 5) {
        sleep(5);
        /* Reading the energy again */
        if (ear_energy(&e_mj_end, &t_ms_end) != EAR_SUCCESS) {
            printf("Error in ear_energy\n");
        } else {
            ts   = t_ms_init / 1000;
            ts_e = t_ms_end / 1000;
            tstamp = localtime((time_t *)&ts);
            strftime(s, sizeof(s), "%c", tstamp);
            tstamp2 = localtime((time_t *)&ts_e);
            strftime(s2, sizeof(s2), "%c", tstamp2);
            printf("Start time %s End time %s\n", s, s2);
            /* Energy and time consumed during the interval */
            ear_energy_diff(e_mj_init, e_mj_end, &e_mj, t_ms_init, t_ms_end, &t_ms);
            printf("Time consumed %lu (ms), energy consumed %lu (mJ), Avg power %lf (W)\n",
                   t_ms, e_mj, (double)e_mj / (double)t_ms);
            e_mj_init = e_mj_end;
            t_ms_init = t_ms_end;
        }
        i++;
    }
    ear_disconnect();
    return 0;
}
```
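As an illustration (the exact flags depend on your compiler and installation), a program like the one above could be built with something such as `gcc -I$EAR_INSTALL_PATH/include example.c -o example -L$EAR_INSTALL_PATH/lib -lEAR_api`, linking against the libEAR_api.so mentioned above.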