eargm daemon core dump on version 5.1.10 #2

@rajeshitshoulders

Hi, I've compiled and installed version 5.1.10 on an Ubuntu 20 server. I am able to start the eardbd and eard daemons,
but eargmd fails to start with a core dump. This is the configure command I used:

./configure --prefix=/opt/ear-1 --with-slurm=/usr/local/etc --with-freeipmi=/usr/bin/ipmitool --with-cuda=/usr/local/cuda MPICC=mpicc MPICC_FLAGS=-O2 -g MAKE_NAME=openmpi MPI_VERSION=ompi EAR_TMP=/opt/ear-1/tmp EAR_ETC=/opt/ear-1/etc

Here are the messages found in syslog:

Oct 16 13:26:15 smc-gpu-01 kernel: [622450.537306] eargmd[964817]: segfault at fffffffffffff9b0 ip 000055917204056a sp 00007ffce2f6ff30 error 5 in eargmd[559172032000+58000]
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Main process exited, code=dumped, status=11/SEGV
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Failed with result 'core-dump'.
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Scheduled restart job, restart counter is at 4.
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: Using /opt/ear-1/etc/ear/ear.conf as EARGM configuration file
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: --> EARGM configuration
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 eargm: verbosen 2 #011use_aggregation 1 #011t1 90 #011t2 259200 #011mode: 1 #011mail: nomail
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 eargm: defcon levels [85,90,95] grace period 3
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 policy 0 (0=MaxEnergy,other=error) units=K (-,K,M)
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 use_log 1 report plugins mysql.so
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 powercap_check_period 120 #011powercap_mode monitoring
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 power limit for action 90 and for lower 40
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 powercap_limit_action no_action and powercap_lower_action no_action
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 energycap_action no_action
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 EARGM definitions
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011#011->EARGM 1 node smc-gpu-01 (port 0) energy limit 0 power limit 0 (disabled)
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: mysql init
Oct 16 13:26:16 smc-gpu-01 kernel: [622451.035780] eargmd[964820]: segfault at fffffffffffff9b0 ip 000055779e2aa56a sp 00007ffd6b974920 error 5 in eargmd[55779e29c000+58000]
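
The two kernel segfault lines above can be reduced to an offset inside the eargmd binary, which is usually easier to compare across restarts than the raw instruction pointer. The following is a minimal sketch, not part of EAR: the helper name `parse_segfault` and the regular expression are illustrative assumptions, and the only inputs are the two kernel lines quoted in this report.

```python
import re

# Matches the kernel's "segfault at ADDR ip IP ... in NAME[BASE+SIZE]" message.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) .* in "
    r"(?P<name>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: return the binary name, the faulting offset
    (ip minus mapping base) and the fault address as a signed 64-bit value."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m["addr"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    # Reinterpret the unsigned 64-bit fault address as two's complement.
    signed = addr - (1 << 64) if addr >= (1 << 63) else addr
    return {"binary": m["name"], "offset": ip - base, "fault_addr_signed": signed}


# The two kernel lines from the syslog excerpt in this report.
LINES = [
    "eargmd[964817]: segfault at fffffffffffff9b0 ip 000055917204056a "
    "sp 00007ffce2f6ff30 error 5 in eargmd[559172032000+58000]",
    "eargmd[964820]: segfault at fffffffffffff9b0 ip 000055779e2aa56a "
    "sp 00007ffd6b974920 error 5 in eargmd[55779e29c000+58000]",
]

results = [parse_segfault(line) for line in LINES]
for r in results:
    print(f"{r['binary']}: offset {r['offset']:#x}, fault address {r['fault_addr_signed']}")
```

Both crashes resolve to offset 0xe56a with a fault address of -1616, so the same instruction appears to fault on every restart, and the address looks like a small negative offset rather than a random pointer. If the binary was built with `-g`, that offset could be passed to `addr2line -e` on the installed eargmd binary to look up the source line; whether it resolves cleanly depends on how the binary was linked.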

Logs from eargmd.log:

>>>>> Path specifications <<<<<

DB file_pathname:
--->EAR_TMP: /opt/ear-1/tmp
--->EAR_ETC: /opt/ear-1/etc

Plugins_path: /opt/ear-1/lib/plugins

Default plugins: Energy energy_cpu_gpu.so power_models avx512_model.so

Verbose: 2
--->Default_policy: min_energy (id 2)
--->Min_time_perf_acc: 10000000

>>>>> Policies configuration section <<<<<
---> policy monitoring id 0 p_state 0 def_freq 0.000 NormalUsersAuth 1[Thu Oct 16 13:26:16 2025]  tag:
---> policy min_time id 1 p_state 4 def_freq 0.000 NormalUsersAuth 1 setting0  0.70 [Thu Oct 16 13:26:16 2025]  tag:
---> policy min_energy id 2 p_state 0 def_freq 0.000 NormalUsersAuth 1 setting0  0.05 [Thu Oct 16 13:26:16 2025]  tag:
---> policy optimize id 3 p_state 0 def_freq 0.000 NormalUsersAuth 1 setting0  0.05 [Thu Oct 16 13:26:16 2025]  tag:

>>>>> Authorization section <<<<<

Users
--->user: root
--->user: idps

Groups
--->groups: root
--->groups: idps

Sccounts
--->acc: root
--->acc: idps

>>>>> Specific node configurations section <<<<<

>>>>> SQL DB server section <<<<<

--> DB configuration
---> IP: 127.0.0.1 sec_ip       User: ear_daemon        User commands ear_commands      Port:3306       DB:EAR
-->max_connections 20 report_node_details 1 report_sig_details 1 report_loops 1

>>>>> EARDBD: DB  manager section <<<<<
--> EARDBD configuration
---> Insertion time 30  Aggregation time: 60    TCP port: 50002 Sec. TCP port: 50003    Sync Port: 50004        CacheSize: 120
--> use_log 1i report plugins mysql.so

>>>>> EARD: Node manager section <<<<<
         eard: verbosen 1 period 60 max_pstate 1
         eard: turbo 0 port 50001 use_db 1 use_eardbd 1
         eard: force_frequencies 1
         eard: use_log 1 report plugin eardbd.so

>>>>> EARGM: System power manager section <<<<<
--> EARGM configuration
         eargm: verbosen 2      use_aggregation 1       t1 90   t2 259200       mode: 1         mail: nomail
         eargm: defcon levels [85,90,95] grace period 3
         policy 0 (0=MaxEnergy,other=error) units=K (-,K,M)
         use_log 1 report plugins mysql.so
         powercap_check_period 120      powercap_mode monitoring
         power limit for action 90 and for lower 40
         powercap_limit_action no_action and powercap_lower_action no_action
         energycap_action no_action
         EARGM definitions
                ->EARGM 1 node smc-gpu-01 (port 0) energy limit 0 power limit 0 (disabled)

>>>>> TAGS <<<<<<

>>>>> Computational nodes <<<<<
Island[0]
--->id: 0 (min_power 30, max_power 5000,power_cap -1.0 power_cap_type=node)
--->       (power>5500 or temp>150 are errors)
---->prefix: smc-gpu-01 start: 4294967295       end: 4294967295 eargm: 1

>>>>> Data Center Monitoring  section <<<<<<
[Thu Oct 16 13:26:16 2025]

>>>>> Energy tags section <<<<<
--> Tag: cpu-intensive   pstate: 1
---> user: all
--> Tag: turbo   pstate: 0
--> Tag: memory-intensive        pstate: 4
---> user: usr1
---> user: usr2
---> accounts: acc1
---> accounts: acc2
---> group: grp1
---> group: grp2

>>>>> EAR library section <<<<<
-->Coefficients path: /opt/ear-1/etc/ear/coeffs
-->DynAIS levels: 10
-->DynAIS window size: 200
-->dynais timeout 15 ear period 10 check every 1000
-->report plugins eard.so
[Thu Oct 16 13:26:16 2025] MAXENERGY policy configured with limit 0 Kilo Joules

The ear.conf config file was copied from the template; only DBIp and the node list were modified.

Here is the configuration:

# EAR Configuration File

#-------------------------------------------------------------------------------------------------
# DB confguration: This configuration conrrespondons with the DB server installation.
#-------------------------------------------------------------------------------------------------
DBIp=127.0.0.1

# Add a secondary IP for high availability
#DBSECIP=add_secondary_ip_for_ha

DBUser=ear_daemon
DBPassw=password

# User and password for usermode querys.
DBCommandsUser=ear_commands
DBCommandsPassw=password

DBDatabase=EAR
DBPort=3306
DBMaxConnections=20

# Extended node information saves also the average frequency and temperature.
DBReportNodeDetail=1

# Extended signature information saves also the hardware counters.
DBreportSIGDetail=1

# Report loop signatures.
DBReportLoops=1

#--------------------------------------------------------------------------------------------------
# EAR Daemon (EARD): Update this section to change EARD configuration.
#--------------------------------------------------------------------------------------------------
# Port is used for connections with the EAR plugin and commands.
NodeDaemonPort=50001
# Frequency at wich periodic metrics are reported, in seconds.
NodeDaemonPowermonFreq=60
# Max frequency used by eard. It's max frequency but min pstate.
NodeDaemonMinPstate=1
NodeDaemonTurbo=0
# Defines whether EARD uses the DB.
NodeUseDB=1
# Defines if EARD connects with EARDBD to report data or directly with the DB server.
# Only for testing.
NodeUseEARDBD=1

# When set to 1, this flag means EARD must set frequencies before job starts.
# If not, frequency is only changed in case job runs with EARL.
NodeDaemonForceFrequencies=1

# Verbosity.
NodeDaemonVerbose=1

# When set to 1, the output is saved in 'TmpDir'/eard.log (common configuration) as a log file.
NodeUseLog=1

# Set the report plug-in to be loaded.
EARDReportPlugins=eardbd.so
#-------------------------------------------------------------------------------------------------
# EAR Database Manager (EARDBD): Update this section to change EARDBD configuration.
#-------------------------------------------------------------------------------------------------
DBDaemonPortTCP=50002
DBDaemonPortSecTCP=50003
DBDaemonSyncPort=50004
# Aggregation time id frequency at which power metrics are aggregated in aggregated metrics, in seconds.
DBDaemonAggregationTime=60
# Frequency at which buffered data is sent to DB server.
DBDaemonInsertionTime=30
# Memory size expressed in MB per process (server and/or mirror) to cache the values.
DBDaemonMemorySize=120
#
# The percentage of the memory buffer used by the previous field, by each type.
# These types are: mpi, non-mpi and learning applications, loops, energy metrics and aggregations and events, in that order. If a type gets 0% of space, this metric is discarded and not saved into the database.
#
#DBDaemonMemorySizePerType=40,20,5,24,5,1,5

# When set to 1, the output is saved in 'TmpDir'/eardbd.log (common configuration) as a log file.
DBDaemonUseLog=1
EARDBDReportPlugins=mysql.so

#--------------------------------------------------------------------------------------------------
# EAR Library (EARL): These options modify internal EARL behaviour.
# Do not modify except you are an expert.
#--------------------------------------------------------------------------------------------------
CoefficientsDir=/opt/ear-1/etc/ear/coeffs

# Sets the minimum period (in microseconds) to do power readings.
# This value acts as a lower bound of the LibraryPeriod field.
# However, the energy plug-in is who sets the minimum possible value to do power readings.
MinTimePerformanceAccuracy=10000000

# DynAIS configuration
DynAISLevels=10
DynAISWindowSize=200
#
# Maximum time (in seconds) EAR will wait until a signature is computed.
# After 'DynaisTimeout' seconds, if no signature is computed, EAR will go to periodic mode.
#
DynaisTimeout=15

# When EAR goes to periodic mode, it will compute the application signature every 'LibraryPeriod' seconds.
LibraryPeriod=10
# EAR will check every N mpi calls whether it must go to periodic mode or not.
CheckEARModeEvery=1000
# EAR library default report plugin
EARLReportPlugins=eard.so

#--------------------------------------------------------------------------------------------------
# EAR Global Manager (EARGMD): Update that section to use EARGM.
#--------------------------------------------------------------------------------------------------
#
# Verbosity
EARGMVerbose=2
# When set to 1, the output is saved in 'TmpDir'/eargmd.log (common configuration) as a log file.
EARGMUseLog=1
# Email address to report the warning level (and the action taken in automatic mode).
EARGMMail=nomail
# Period T1 and T2 are specified in seconds (ex. T1 must be less than T2, ex. 10min and 1 month).
EARGMEnergyPeriodT1=90
EARGMEnergyPeriodT2=259200
# '-' are Joules, 'K' KiloJoules and 'M' MegaJoules.
EARGMEnergyUnits=K
# EARGM events will be registered through the selected report plugins
EARGMReportPlugins=mysql.so
# Use aggregated periodic metrics or periodic power metrics.
# Aggregated metrics are only available when EARDBD is running.
EARGMEnergyUseAggregated=1
# Two modes are supported '0=manual' and '1=automatic'.
EARGMEnergyMode=1
# Percentage of accumulated energy to start the warning DEFCON level L4, L3 and L2.
EARGMEnergyWarningsPerc=85,90,95
# T1 "grace" periods between DEFCON before re-evaluate.
EARGMEnergyGracePeriods=3
# Format for action is: command_name energy_T1 energy_T2  energy_limit T2 T1  units "
# This action is automatically executed at each warning level (only once per grace periods).
EARGMEnergyAction=no_action

#### POWERCAP definition for EARGM: Powercap is still under development. Do not activate.
# Period at which the powercap thread is activated. Meta-EARGM checks the EARGMs it controls every 2*EARGMPowerPeriod
EARGMPowerPeriod=120
# Powercap mode: 0 is monitoring, 1 is hard powercap, 2 is soft powercap.
EARGMPowerCapMode=0
# Admins can specify to automatically execute a command in EARGMPowerCapSuspendAction when total_power >= EARGMPowerLimit*EARGMPowerCapResumeLimit/100
EARGMPowerCapSuspendLimit=90
# Format for action is: command_name current_power current_limit total_idle_nodes total_idle_power
EARGMPowerCapSuspendAction=no_action
#
# Admins can specify to automatically execute a command in EARGMPowerCapResumeAction to undo EARGMPowerCapSuspendAction
# when total_power >= EARGMPowerLimit*EARGMPowerCapResumeLimit/100.
# Note that this will only be executed if a suspend action was executed previously.
#
EARGMPowerCapResumeLimit=40
# Format for action is: command_name current_power current_limit total_idle_nodes total_idle_power
EARGMPowerCapResumeAction=no_action
# Sets the report plugins to use for EARGM warning and events accounting
EARGMReportPlugins=mysql.so

# EARGMs must be specified with a unique id, their node and the port that receives remote
# connections. An EARGM can also act as meta-eargm if the meta field is filled, and it will
# control the EARGMs whose ids are in said field. If two EARGMs are in the same node,
# setting the EARGMID environment variable overrides the node field and chooses the characteristics
# of the EARGM with the correspoding id.
#
# Only one EARGM can currently control the energy caps, so setting the rest to 0 is recommended and
# the limit applies to EARGMPeriodT2, using EARGMEnergyUnits to define the units.
# energy = 0 -> energy_cap disabled
# power = 0  -> powercap disabled
# power = N  -> powercap budget for that EARGM (and the nodes it controls) is N
# power = -1 -> powercap budget is calculated by adding up the powercap set to each of the nodes under its control.
#                               This is incompatible with nodes that have their powercap unlimited (powercap = 1)
EARGMId=1 node=smc-gpu-01
#EARGMId=2 energy=0 power=500 node=node2 port=50100
#EARGMId=3 energy=0 power=500 node=node3 port=50100

#--------------------------------------------------------------------------------------------------
# Common configuration
#--------------------------------------------------------------------------------------------------
TmpDir=/opt/ear-1/tmp
EtcDir=/opt/ear-1/etc
InstDir=/opt/ear-1
Verbose=2
# Network extension (using another network instead of the local one).
# If compute nodes must be accessed from login nodes with a network different than default,
# and can be accesed using a expension, uncommmet next line and define 'netext' accordingly.
#NetworkExtension=netext


#---------------------------------------------------------------------------------------------------
# Authorized Users
#---------------------------------------------------------------------------------------------------
#
# Authorized users,accounts and groups are allowed to change policies, thresholds, frequencies, etc.
# They are supposed to be admins, all special name is supported.
#
AuthorizedUsers=root,idps
AuthorizedAccounts=root,idps
AuthorizedGroups=root,idps

#---------------------------------------------------------------------------------------------------
# Tags
#---------------------------------------------------------------------------------------------------
# Tags are used for architectural descriptions. Max. AVX frequencies are used in predictor models
# and are SKU-specific. Max. and min. power are used for warning and error tracking.
# Powercap specifies the maximum power a node is allowed to use by default. If an EARGM is
# controlling the cluster with mode UNLIMITED (powercap=1) max_powercap is the set power that
# a node will receive if the cluster needs to be power capped (otherwise it runs
# with unlimited power). A different than the default powercap plugin can be specified for nodes
# using the tag. POWERCAP=0 --> disabled, POWERCAP=1 -->unlimited, POWERCAP=N (> 1) limits node to N watts
# At least a default tag is mandatory to be included in this file for a cluster to work properly.
#
#
# List of accepted options is: max_avx512(GHz), max_avx2(GHz), max_power(W), min_power(W), error_power(W), coeffs(filename),
# powercap(W), powercap_plugin(filename), energy_plugin(filename), gpu_powercap_plugin(filename), max_powercap(W), gpu_def_freq(KHz),
# cpu_max_pstate(0..max_pstate), imc_max_pstate(0..max_imc_pstate), energy_model(filename)
# imc_max_freq, imc_min_freq, idle_governor(def default), idle_pstate
# If gpu_def_freq is 0, GPU frequency is not set at job init/end.
# If gpu_def_freq is greater than 0 and, the EAR Library is being used OR
# NodeDaemonForceFrequencies is set to 1, then the GPU frequency is set at job init to the
# minimum between the maximum permitted GPU frequency and gpu_def_freq.
#
#
#Tag=CPU default=yes max_avx512=2.2 max_avx2=2.6 max_power=5000 min_power=50 error_power=5000 coeffs=coeffs.default powercap=1 powercap_plugin=dvfs.so energy_plugin=energy_nm.so
#Tag=GPU max_avx512=2.2 max_avx2=2.6 max_power=5000 min_power=150 error_power=5000 coeffs=coeffs.default powercap=1 powercap_plugin=dvfs.so energy_plugin=energy_nm.so  gpu_powercap_plugin=gpu.so idle_governor=ondemand idle_pstate=1000 policy=optimize

#---------------------------------------------------------------------------------------------------
## Power policies
## ---------------------------------------------------------------------------------------------------
#
## Policy names must be exactly file names for policies installeled in the system.
DefaultPowerPolicy=min_energy
Policy=monitoring Settings=0 DefaultPstate=0 Privileged=0
Policy=min_time Settings=0.7 DefaultPstate=4 Privileged=0
Policy=min_energy Settings=0.05 DefaultPstate=0 Privileged=0
Policy=optimize Settings=0.05 DefaultPstate=0 Privileged=0
#
# For homogeneous systems, default frequencies can be easily specified using freqs.
# For heterogeneous systems it is preferred to use pstates or use tags
#

# Example with freqs (lower pstates corresponds with higher frequencies). Pstate=1 is nominal and 0 is turbo
#Policy=monitoring Settings=0 DefaultFreq=2.4 Privileged=0
#Policy=min_time Settings=0.7 DefaultFreq=2.0 Privileged=0
#Policy=min_energy Settings=0.05 DefaultFreq=2.4 Privileged=1


#Example with tags
#Policy=monitoring Settings=0 DefaultFreq=2.6 Privileged=0 tag=6126
#Policy=min_time Settings=0.7 DefaultFreq=2.1 Privileged=0 tag=6126
#Policy=min_energy Settings=0.05 DefaultFreq=2.6 Privileged=1 tag=6126
#Policy=monitoring Settings=0 DefaultFreq=2.4 Privileged=0 tag=6148
#Policy=min_time Settings=0.7 DefaultFreq=2.0 Privileged=0 tag=6148
#Policy=min_energy Settings=0.05 DefaultFreq=2.4 Privileged=1 tag=6148

#---------------------------------------------------------------------------------------------------
# Energy Tags
#---------------------------------------------------------------------------------------------------
#
# Privileged users, accounts and groups are allowed to use EnergyTags.
# The "allowed" TAGs are defined by row together with the priviledged user/group/account.
#
EnergyTag=cpu-intensive pstate=1 users=all
EnergyTag=turbo pstate=0
EnergyTag=memory-intensive pstate=4 users=usr1,usr2 groups=grp1,grp2 accounts=acc1,acc2

#--------------------------------------------------------------------------------------------------
# Node Isles
#--------------------------------------------------------------------------------------------------
# It is mandatory to specify all the nodes in the cluster, grouped by islands. More than one line
# per island must be supported to hold nodes with different names or for pointing to different
# EARDBDs through its IPs or hostnames.
# EARGMID is the field that specifies which EARGM controls the nodes in that line. If no EARGMID
# is specified, it will pick the first EARGMID value that is found (ie, the previous line's EARGMID).
#

#
# In the following example the nodes are clustered in two different islands, but the Island 1 have
# two types of EARDBDs configurations.
#

Island=0 Nodes=smc-gpu-01  DBIP=127.0.0.1 EARGMId=1

# These nodes are in island0 using different DB connections and with a different architecture.
# These nodes will use the same EARGM as the previous nodes.
#Island=0 Nodes=node11[01-80] DBIP=node1084 DBSECIP=node1085 tag=CPU
# These nodes are is island0 and will use default values for DB connection (line 0 for island0) and default tag.
#Island=0 Nodes=node12[01-80] tag=GPU


# Will use default tag.
#Island=1 Nodes=node11[01-80] DBIP=node1181 DBSECIP=node1182


# --------------------------------------------------------------------------------------------------
# Data Center Monitor configuration: Uncomment if needed
# --------------------------------------------------------------------------------------------------

# A edcmon.template.conf is available at /opt/ear-1/etc/ear/. Change the name if needed
#include=/opt/ear-1/etc/ear/edcmon.conf
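
One difference between the active `EARGMId` line and the commented template examples in the configuration above is the set of fields: the examples give `energy`, `power`, `node` and `port`, while the active line gives only `node`. The sketch below lists which of those fields an active line omits. It is an illustration only: the function names and the field list are assumptions taken from the commented examples, and nothing in this report confirms that the missing fields are what triggers the segfault (the log shows them reported as `port 0`, `energy limit 0` and `power limit 0`).

```python
def parse_fields(line):
    """Split a 'key=value key=value' config line into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def missing_eargm_fields(conf_text, template_fields=("energy", "power", "node", "port")):
    """Hypothetical check: for every uncommented EARGMId line, list the
    fields used in the template examples that the line does not set."""
    report = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or not line.startswith("EARGMId="):
            continue
        fields = parse_fields(line)
        report[fields["EARGMId"]] = [f for f in template_fields if f not in fields]
    return report


# The EARGM definition lines copied from the ear.conf in this report.
CONF = """\
EARGMId=1 node=smc-gpu-01
#EARGMId=2 energy=0 power=500 node=node2 port=50100
#EARGMId=3 energy=0 power=500 node=node3 port=50100
"""

report = missing_eargm_fields(CONF)
print(report)
```

For this configuration the check reports that EARGM 1 has no `energy`, `power` or `port` field. Filling them in the same way as the commented examples would be one variation to try while debugging.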

I tried updating the Island and EARGMId values, but neither helped.

Could you please help with this issue?

Thanks
