Hi, I've compiled and installed version 5.1.10 on an Ubuntu 20 server. I am able to start the eardbd and eard daemons, but eargmd fails to start and dumps core.
The build was configured with:
./configure --prefix=/opt/ear-1 --with-slurm=/usr/local/etc --with-freeipmi=/usr/bin/ipmitool --with-cuda=/usr/local/cuda MPICC=mpicc MPICC_FLAGS=-O2 -g MAKE_NAME=openmpi MPI_VERSION=ompi EAR_TMP=/opt/ear-1/tmp EAR_ETC=/opt/ear-1/etc
Here are the messages found in syslog:
Oct 16 13:26:15 smc-gpu-01 kernel: [622450.537306] eargmd[964817]: segfault at fffffffffffff9b0 ip 000055917204056a sp 00007ffce2f6ff30 error 5 in eargmd[559172032000+58000]
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Main process exited, code=dumped, status=11/SEGV
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Failed with result 'core-dump'.
Oct 16 13:26:16 smc-gpu-01 systemd[1]: eargmd.service: Scheduled restart job, restart counter is at 4.
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: Using /opt/ear-1/etc/ear/ear.conf as EARGM configuration file
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: --> EARGM configuration
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 eargm: verbosen 2 #011use_aggregation 1 #011t1 90 #011t2 259200 #011mode: 1 #011mail: nomail
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 eargm: defcon levels [85,90,95] grace period 3
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 policy 0 (0=MaxEnergy,other=error) units=K (-,K,M)
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 use_log 1 report plugins mysql.so
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 powercap_check_period 120 #011powercap_mode monitoring
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 power limit for action 90 and for lower 40
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 powercap_limit_action no_action and powercap_lower_action no_action
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 energycap_action no_action
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011 EARGM definitions
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: #011#011->EARGM 1 node smc-gpu-01 (port 0) energy limit 0 power limit 0 (disabled)
Oct 16 13:26:16 smc-gpu-01 eargmd[964820]: mysql init
Oct 16 13:26:16 smc-gpu-01 kernel: [622451.035780] eargmd[964820]: segfault at fffffffffffff9b0 ip 000055779e2aa56a sp 00007ffd6b974920 error 5 in eargmd[55779e29c000+58000]
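Both kernel lines above report the same fault address and the same mapping size, so the crash may be hitting the same instruction on every restart. The following is a minimal sketch in Python (the helper names `fault_offset` and `as_signed64` are hypothetical and not part of EAR) that extracts the instruction pointer and the mapping base from each kernel line and computes the offset of the faulting instruction. The offset is relative to the mapping the kernel printed, which is not necessarily the start of the ELF file, so treat it as a hint when feeding it to a symbolizer such as addr2line.

```python
import re

# Matches the kernel "segfault at ... ip ... sp ... error N in image[base+size]" format
# seen in the syslog lines above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<image>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (image, ip - mapping base), or None if the line is not a segfault report."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return m.group("image"), int(m.group("ip"), 16) - int(m.group("base"), 16)


def as_signed64(hex_text):
    """Interpret a 64-bit hexadecimal address as a signed integer."""
    value = int(hex_text, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


lines = [
    "eargmd[964817]: segfault at fffffffffffff9b0 ip 000055917204056a "
    "sp 00007ffce2f6ff30 error 5 in eargmd[559172032000+58000]",
    "eargmd[964820]: segfault at fffffffffffff9b0 ip 000055779e2aa56a "
    "sp 00007ffd6b974920 error 5 in eargmd[55779e29c000+58000]",
]

for line in lines:
    image, offset = fault_offset(line)
    print(image, hex(offset))              # eargmd 0xe56a for both lines

print(as_signed64("fffffffffffff9b0"))     # -1616
```

Both crashes resolve to the same offset (0xe56a), and the fault address is a small negative number when read as a signed 64-bit value. That pattern is consistent with an access through a null or negative base or index, but this is only an inference from the log, not something confirmed against the EAR source.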
Logs from eargmd.log:
>>>>> Path specifications <<<<<
DB file_pathname:
--->EAR_TMP: /opt/ear-1/tmp
--->EAR_ETC: /opt/ear-1/etc
Plugins_path: /opt/ear-1/lib/plugins
Default plugins: Energy energy_cpu_gpu.so power_models avx512_model.so
Verbose: 2
--->Default_policy: min_energy (id 2)
--->Min_time_perf_acc: 10000000
>>>>> Policies configuration section <<<<<
---> policy monitoring id 0 p_state 0 def_freq 0.000 NormalUsersAuth 1[Thu Oct 16 13:26:16 2025] tag:
---> policy min_time id 1 p_state 4 def_freq 0.000 NormalUsersAuth 1 setting0 0.70 [Thu Oct 16 13:26:16 2025] tag:
---> policy min_energy id 2 p_state 0 def_freq 0.000 NormalUsersAuth 1 setting0 0.05 [Thu Oct 16 13:26:16 2025] tag:
---> policy optimize id 3 p_state 0 def_freq 0.000 NormalUsersAuth 1 setting0 0.05 [Thu Oct 16 13:26:16 2025] tag:
>>>>> Authorization section <<<<<
Users
--->user: root
--->user: idps
Groups
--->groups: root
--->groups: idps
Sccounts
--->acc: root
--->acc: idps
>>>>> Specific node configurations section <<<<<
>>>>> SQL DB server section <<<<<
--> DB configuration
---> IP: 127.0.0.1 sec_ip User: ear_daemon User commands ear_commands Port:3306 DB:EAR
-->max_connections 20 report_node_details 1 report_sig_details 1 report_loops 1
>>>>> EARDBD: DB manager section <<<<<
--> EARDBD configuration
---> Insertion time 30 Aggregation time: 60 TCP port: 50002 Sec. TCP port: 50003 Sync Port: 50004 CacheSize: 120
--> use_log 1i report plugins mysql.so
>>>>> EARD: Node manager section <<<<<
eard: verbosen 1 period 60 max_pstate 1
eard: turbo 0 port 50001 use_db 1 use_eardbd 1
eard: force_frequencies 1
eard: use_log 1 report plugin eardbd.so
>>>>> EARGM: System power manager section <<<<<
--> EARGM configuration
eargm: verbosen 2 use_aggregation 1 t1 90 t2 259200 mode: 1 mail: nomail
eargm: defcon levels [85,90,95] grace period 3
policy 0 (0=MaxEnergy,other=error) units=K (-,K,M)
use_log 1 report plugins mysql.so
powercap_check_period 120 powercap_mode monitoring
power limit for action 90 and for lower 40
powercap_limit_action no_action and powercap_lower_action no_action
energycap_action no_action
EARGM definitions
->EARGM 1 node smc-gpu-01 (port 0) energy limit 0 power limit 0 (disabled)
>>>>> TAGS <<<<<<
>>>>> Computational nodes <<<<<
Island[0]
--->id: 0 (min_power 30, max_power 5000,power_cap -1.0 power_cap_type=node)
---> (power>5500 or temp>150 are errors)
---->prefix: smc-gpu-01 start: 4294967295 end: 4294967295 eargm: 1
>>>>> Data Center Monitoring section <<<<<<
[Thu Oct 16 13:26:16 2025]
>>>>> Energy tags section <<<<<
--> Tag: cpu-intensive pstate: 1
---> user: all
--> Tag: turbo pstate: 0
--> Tag: memory-intensive pstate: 4
---> user: usr1
---> user: usr2
---> accounts: acc1
---> accounts: acc2
---> group: grp1
---> group: grp2
>>>>> EAR library section <<<<<
-->Coefficients path: /opt/ear-1/etc/ear/coeffs
-->DynAIS levels: 10
-->DynAIS window size: 200
-->dynais timeout 15 ear period 10 check every 1000
-->report plugins eard.so
[Thu Oct 16 13:26:16 2025] MAXENERGY policy configured with limit 0 Kilo Joules
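One value in the log above that stands out is the island line, which prints start: 4294967295 end: 4294967295. A small sketch, assuming the field is a 32-bit integer (an assumption, since the EAR source was not checked), shows that this is simply -1 printed as unsigned. It may be a "no numeric range" marker for a node name like smc-gpu-01 rather than corrupted data. The helper name `as_int32` is hypothetical.

```python
import struct


def as_int32(unsigned_value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", unsigned_value))[0]


island_start = 4294967295  # value printed as "start:" and "end:" in eargmd.log
print(island_start == 2**32 - 1)   # True
print(as_int32(island_start))      # -1
```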
The ear.conf file was copied from the template; only DBIp and the node list were modified.
Here is the configuration:
# EAR Configuration File
#-------------------------------------------------------------------------------------------------
# DB confguration: This configuration conrrespondons with the DB server installation.
#-------------------------------------------------------------------------------------------------
DBIp=127.0.0.1
# Add a secondary IP for high availability
#DBSECIP=add_secondary_ip_for_ha
DBUser=ear_daemon
DBPassw=password
# User and password for usermode querys.
DBCommandsUser=ear_commands
DBCommandsPassw=password
DBDatabase=EAR
DBPort=3306
DBMaxConnections=20
# Extended node information saves also the average frequency and temperature.
DBReportNodeDetail=1
# Extended signature information saves also the hardware counters.
DBreportSIGDetail=1
# Report loop signatures.
DBReportLoops=1
#--------------------------------------------------------------------------------------------------
# EAR Daemon (EARD): Update this section to change EARD configuration.
#--------------------------------------------------------------------------------------------------
# Port is used for connections with the EAR plugin and commands.
NodeDaemonPort=50001
# Frequency at wich periodic metrics are reported, in seconds.
NodeDaemonPowermonFreq=60
# Max frequency used by eard. It's max frequency but min pstate.
NodeDaemonMinPstate=1
NodeDaemonTurbo=0
# Defines whether EARD uses the DB.
NodeUseDB=1
# Defines if EARD connects with EARDBD to report data or directly with the DB server.
# Only for testing.
NodeUseEARDBD=1
# When set to 1, this flag means EARD must set frequencies before job starts.
# If not, frequency is only changed in case job runs with EARL.
NodeDaemonForceFrequencies=1
# Verbosity.
NodeDaemonVerbose=1
# When set to 1, the output is saved in 'TmpDir'/eard.log (common configuration) as a log file.
NodeUseLog=1
# Set the report plug-in to be loaded.
EARDReportPlugins=eardbd.so
#-------------------------------------------------------------------------------------------------
# EAR Database Manager (EARDBD): Update this section to change EARDBD configuration.
#-------------------------------------------------------------------------------------------------
DBDaemonPortTCP=50002
DBDaemonPortSecTCP=50003
DBDaemonSyncPort=50004
# Aggregation time id frequency at which power metrics are aggregated in aggregated metrics, in seconds.
DBDaemonAggregationTime=60
# Frequency at which buffered data is sent to DB server.
DBDaemonInsertionTime=30
# Memory size expressed in MB per process (server and/or mirror) to cache the values.
DBDaemonMemorySize=120
#
# The percentage of the memory buffer used by the previous field, by each type.
# These types are: mpi, non-mpi and learning applications, loops, energy metrics and aggregations and events, in that order. If a type gets 0% of space, this metric is discarded and not saved into the database.
#
#DBDaemonMemorySizePerType=40,20,5,24,5,1,5
# When set to 1, the output is saved in 'TmpDir'/eardbd.log (common configuration) as a log file.
DBDaemonUseLog=1
EARDBDReportPlugins=mysql.so
#--------------------------------------------------------------------------------------------------
# EAR Library (EARL): These options modify internal EARL behaviour.
# Do not modify except you are an expert.
#--------------------------------------------------------------------------------------------------
CoefficientsDir=/opt/ear-1/etc/ear/coeffs
# Sets the minimum period (in microseconds) to do power readings.
# This value acts as a lower bound of the LibraryPeriod field.
# However, the energy plug-in is who sets the minimum possible value to do power readings.
MinTimePerformanceAccuracy=10000000
# DynAIS configuration
DynAISLevels=10
DynAISWindowSize=200
#
# Maximum time (in seconds) EAR will wait until a signature is computed.
# After 'DynaisTimeout' seconds, if no signature is computed, EAR will go to periodic mode.
#
DynaisTimeout=15
# When EAR goes to periodic mode, it will compute the application signature every 'LibraryPeriod' seconds.
LibraryPeriod=10
# EAR will check every N mpi calls whether it must go to periodic mode or not.
CheckEARModeEvery=1000
# EAR library default report plugin
EARLReportPlugins=eard.so
#--------------------------------------------------------------------------------------------------
# EAR Global Manager (EARGMD): Update that section to use EARGM.
#--------------------------------------------------------------------------------------------------
#
# Verbosity
EARGMVerbose=2
# When set to 1, the output is saved in 'TmpDir'/eargmd.log (common configuration) as a log file.
EARGMUseLog=1
# Email address to report the warning level (and the action taken in automatic mode).
EARGMMail=nomail
# Period T1 and T2 are specified in seconds (ex. T1 must be less than T2, ex. 10min and 1 month).
EARGMEnergyPeriodT1=90
EARGMEnergyPeriodT2=259200
# '-' are Joules, 'K' KiloJoules and 'M' MegaJoules.
EARGMEnergyUnits=K
# EARGM events will be registered through the selected report plugins
EARGMReportPlugins=mysql.so
# Use aggregated periodic metrics or periodic power metrics.
# Aggregated metrics are only available when EARDBD is running.
EARGMEnergyUseAggregated=1
# Two modes are supported '0=manual' and '1=automatic'.
EARGMEnergyMode=1
# Percentage of accumulated energy to start the warning DEFCON level L4, L3 and L2.
EARGMEnergyWarningsPerc=85,90,95
# T1 "grace" periods between DEFCON before re-evaluate.
EARGMEnergyGracePeriods=3
# Format for action is: command_name energy_T1 energy_T2 energy_limit T2 T1 units "
# This action is automatically executed at each warning level (only once per grace periods).
EARGMEnergyAction=no_action
#### POWERCAP definition for EARGM: Powercap is still under development. Do not activate.
# Period at which the powercap thread is activated. Meta-EARGM checks the EARGMs it controls every 2*EARGMPowerPeriod
EARGMPowerPeriod=120
# Powercap mode: 0 is monitoring, 1 is hard powercap, 2 is soft powercap.
EARGMPowerCapMode=0
# Admins can specify to automatically execute a command in EARGMPowerCapSuspendAction when total_power >= EARGMPowerLimit*EARGMPowerCapResumeLimit/100
EARGMPowerCapSuspendLimit=90
# Format for action is: command_name current_power current_limit total_idle_nodes total_idle_power
EARGMPowerCapSuspendAction=no_action
#
# Admins can specify to automatically execute a command in EARGMPowerCapResumeAction to undo EARGMPowerCapSuspendAction
# when total_power >= EARGMPowerLimit*EARGMPowerCapResumeLimit/100.
# Note that this will only be executed if a suspend action was executed previously.
#
EARGMPowerCapResumeLimit=40
# Format for action is: command_name current_power current_limit total_idle_nodes total_idle_power
EARGMPowerCapResumeAction=no_action
# Sets the report plugins to use for EARGM warning and events accounting
EARGMReportPlugins=mysql.so
# EARGMs must be specified with a unique id, their node and the port that receives remote
# connections. An EARGM can also act as meta-eargm if the meta field is filled, and it will
# control the EARGMs whose ids are in said field. If two EARGMs are in the same node,
# setting the EARGMID environment variable overrides the node field and chooses the characteristics
# of the EARGM with the correspoding id.
#
# Only one EARGM can currently control the energy caps, so setting the rest to 0 is recommended and
# the limit applies to EARGMPeriodT2, using EARGMEnergyUnits to define the units.
# energy = 0 -> energy_cap disabled
# power = 0 -> powercap disabled
# power = N -> powercap budget for that EARGM (and the nodes it controls) is N
# power = -1 -> powercap budget is calculated by adding up the powercap set to each of the nodes under its control.
# This is incompatible with nodes that have their powercap unlimited (powercap = 1)
EARGMId=1 node=smc-gpu-01
#EARGMId=2 energy=0 power=500 node=node2 port=50100
#EARGMId=3 energy=0 power=500 node=node3 port=50100
#--------------------------------------------------------------------------------------------------
# Common configuration
#--------------------------------------------------------------------------------------------------
TmpDir=/opt/ear-1/tmp
EtcDir=/opt/ear-1/etc
InstDir=/opt/ear-1
Verbose=2
# Network extension (using another network instead of the local one).
# If compute nodes must be accessed from login nodes with a network different than default,
# and can be accesed using a expension, uncommmet next line and define 'netext' accordingly.
#NetworkExtension=netext
#---------------------------------------------------------------------------------------------------
# Authorized Users
#---------------------------------------------------------------------------------------------------
#
# Authorized users,accounts and groups are allowed to change policies, thresholds, frequencies, etc.
# They are supposed to be admins, all special name is supported.
#
AuthorizedUsers=root,idps
AuthorizedAccounts=root,idps
AuthorizedGroups=root,idps
#---------------------------------------------------------------------------------------------------
# Tags
#---------------------------------------------------------------------------------------------------
# Tags are used for architectural descriptions. Max. AVX frequencies are used in predictor models
# and are SKU-specific. Max. and min. power are used for warning and error tracking.
# Powercap specifies the maximum power a node is allowed to use by default. If an EARGM is
# controlling the cluster with mode UNLIMITED (powercap=1) max_powercap is the set power that
# a node will receive if the cluster needs to be power capped (otherwise it runs
# with unlimited power). A different than the default powercap plugin can be specified for nodes
# using the tag. POWERCAP=0 --> disabled, POWERCAP=1 -->unlimited, POWERCAP=N (> 1) limits node to N watts
# At least a default tag is mandatory to be included in this file for a cluster to work properly.
#
#
# List of accepted options is: max_avx512(GHz), max_avx2(GHz), max_power(W), min_power(W), error_power(W), coeffs(filename),
# powercap(W), powercap_plugin(filename), energy_plugin(filename), gpu_powercap_plugin(filename), max_powercap(W), gpu_def_freq(KHz),
# cpu_max_pstate(0..max_pstate), imc_max_pstate(0..max_imc_pstate), energy_model(filename)
# imc_max_freq, imc_min_freq, idle_governor(def default), idle_pstate
# If gpu_def_freq is 0, GPU frequency is not set at job init/end.
# If gpu_def_freq is greater than 0 and, the EAR Library is being used OR
# NodeDaemonForceFrequencies is set to 1, then the GPU frequency is set at job init to the
# minimum between the maximum permitted GPU frequency and gpu_def_freq.
#
#
#Tag=CPU default=yes max_avx512=2.2 max_avx2=2.6 max_power=5000 min_power=50 error_power=5000 coeffs=coeffs.default powercap=1 powercap_plugin=dvfs.so energy_plugin=energy_nm.so
#Tag=GPU max_avx512=2.2 max_avx2=2.6 max_power=5000 min_power=150 error_power=5000 coeffs=coeffs.default powercap=1 powercap_plugin=dvfs.so energy_plugin=energy_nm.so gpu_powercap_plugin=gpu.so idle_governor=ondemand idle_pstate=1000 policy=optimize
#---------------------------------------------------------------------------------------------------
## Power policies
## ---------------------------------------------------------------------------------------------------
#
## Policy names must be exactly file names for policies installeled in the system.
DefaultPowerPolicy=min_energy
Policy=monitoring Settings=0 DefaultPstate=0 Privileged=0
Policy=min_time Settings=0.7 DefaultPstate=4 Privileged=0
Policy=min_energy Settings=0.05 DefaultPstate=0 Privileged=0
Policy=optimize Settings=0.05 DefaultPstate=0 Privileged=0
#
# For homogeneous systems, default frequencies can be easily specified using freqs.
# For heterogeneous systems it is preferred to use pstates or use tags
#
# Example with freqs (lower pstates corresponds with higher frequencies). Pstate=1 is nominal and 0 is turbo
#Policy=monitoring Settings=0 DefaultFreq=2.4 Privileged=0
#Policy=min_time Settings=0.7 DefaultFreq=2.0 Privileged=0
#Policy=min_energy Settings=0.05 DefaultFreq=2.4 Privileged=1
#Example with tags
#Policy=monitoring Settings=0 DefaultFreq=2.6 Privileged=0 tag=6126
#Policy=min_time Settings=0.7 DefaultFreq=2.1 Privileged=0 tag=6126
#Policy=min_energy Settings=0.05 DefaultFreq=2.6 Privileged=1 tag=6126
#Policy=monitoring Settings=0 DefaultFreq=2.4 Privileged=0 tag=6148
#Policy=min_time Settings=0.7 DefaultFreq=2.0 Privileged=0 tag=6148
#Policy=min_energy Settings=0.05 DefaultFreq=2.4 Privileged=1 tag=6148
#---------------------------------------------------------------------------------------------------
# Energy Tags
#---------------------------------------------------------------------------------------------------
#
# Privileged users, accounts and groups are allowed to use EnergyTags.
# The "allowed" TAGs are defined by row together with the priviledged user/group/account.
#
EnergyTag=cpu-intensive pstate=1 users=all
EnergyTag=turbo pstate=0
EnergyTag=memory-intensive pstate=4 users=usr1,usr2 groups=grp1,grp2 accounts=acc1,acc2
#--------------------------------------------------------------------------------------------------
# Node Isles
#--------------------------------------------------------------------------------------------------
# It is mandatory to specify all the nodes in the cluster, grouped by islands. More than one line
# per island must be supported to hold nodes with different names or for pointing to different
# EARDBDs through its IPs or hostnames.
# EARGMID is the field that specifies which EARGM controls the nodes in that line. If no EARGMID
# is specified, it will pick the first EARGMID value that is found (ie, the previous line's EARGMID).
#
#
# In the following example the nodes are clustered in two different islands, but the Island 1 have
# two types of EARDBDs configurations.
#
Island=0 Nodes=smc-gpu-01 DBIP=127.0.0.1 EARGMId=1
# These nodes are in island0 using different DB connections and with a different architecture.
# These nodes will use the same EARGM as the previous nodes.
#Island=0 Nodes=node11[01-80] DBIP=node1084 DBSECIP=node1085 tag=CPU
# These nodes are is island0 and will use default values for DB connection (line 0 for island0) and default tag.
#Island=0 Nodes=node12[01-80] tag=GPU
# Will use default tag.
#Island=1 Nodes=node11[01-80] DBIP=node1181 DBSECIP=node1182
# --------------------------------------------------------------------------------------------------
# Data Center Monitor configuration: Uncomment if needed
# --------------------------------------------------------------------------------------------------
# A edcmon.template.conf is available at /opt/ear-1/etc/ear/. Change the name if needed
#include=/opt/ear-1/etc/ear/edcmon.conf
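Two things in the configuration above may be worth checking. The template comments state that at least a default tag is mandatory, yet both Tag= lines are commented out (and the TAGS section of eargmd.log is empty). The active EARGMId line also carries only the node field, while the commented examples also set energy, power and port. The sketch below is a hypothetical checker, not an EAR tool: the function names are invented, keys are lower-cased for convenience, and whether either finding is related to the segfault is unconfirmed.

```python
def active_lines(conf_text):
    """Yield the non-empty, non-comment lines of an ear.conf-style text."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            yield line


def parse_fields(line):
    """Split 'Key=val key2=val2 ...' into a dict with lower-cased keys."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key.lower()] = value
    return fields


def check_conf(conf_text):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    lines = list(active_lines(conf_text))

    # The template says a default tag is mandatory.
    tags = [parse_fields(l) for l in lines if l.lower().startswith("tag=")]
    if not any(t.get("default") == "yes" for t in tags):
        findings.append("no active Tag= line with default=yes")

    # Assumption: the fields used by the commented template examples are expected.
    for l in lines:
        if l.lower().startswith("eargmid="):
            fields = parse_fields(l)
            missing = [k for k in ("energy", "power", "node", "port") if k not in fields]
            if missing:
                findings.append(
                    "EARGMId=%s is missing: %s" % (fields["eargmid"], ", ".join(missing))
                )
    return findings


# Reduced copy of the relevant lines from the configuration above.
sample = """\
#Tag=CPU default=yes max_power=5000 min_power=50
EARGMId=1 node=smc-gpu-01
Island=0 Nodes=smc-gpu-01 DBIP=127.0.0.1 EARGMId=1
"""

for finding in check_conf(sample):
    print(finding)
```

To run it against the real file, replace `sample` with the contents of /opt/ear-1/etc/ear/ear.conf read from disk.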
I tried updating the Island and EARGMId values, but nothing helped.
Could you please help with this issue?
Thanks