bug(conda?): some models fail with sleef errors #85

@HaoZeke

As a reproducer, in a conda-forge controlled environment:

mkdir blah; cd blah
# the best recipe..
wget https://atomistic-cookbook.org/_downloads/72b9bec1c6e219fe3a0fb83fa52b668b/eon-pet-neb.zip
unzip eon-pet-neb.zip
conda env create --file environment.yml -p $(pwd)/.tmp
conda activate $(pwd)/.tmp

So far so good. This also works with the PET-MAD models from upet, as seen in lab-cosmo/atomistic-cookbook#212.

However, the OMAD models fail badly. To reproduce, generate the inputs and export the model:

python eon-pet-neb.py # takes a minute or less
# use a newer metatrain
uvx --from metatrain mtt export https://huggingface.co/lab-cosmo/upet/resolve/main/models/pet-omad-xs-v1.0.0.ckpt

Point config.ini at the exported model, i.e.

[Metatomic]
model_path = pet-omad-xs-v1.0.0.pt

Now a fresh run..

rm -rf neb* *.log
eonclient
Floating point exception: Overflow
Aborted (core dumped)

Which can be expanded to:

GDB trace
[New Thread 0x7fffa2ffb240 (LWP 778554)]

Thread 1 "eonclient" received signal SIGFPE, Arithmetic exception.
0x00007fffdca9f3a5 in Sleef_finz_expf8_u10avx2 () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/./libsleef.so.3
(gdb) bt
#0  0x00007fffdca9f3a5 in Sleef_finz_expf8_u10avx2 () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/./libsleef.so.3
#1  0x00007fffebe5062d in void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<at::native::AVX2::VectorizedLoop2d<at::native::(anonymous namespace)::silu_kernel(at::TensorIteratorBase&)::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda(float)#1}, at::native::(anonymous namespace)::silu_kernel(at::TensorIteratorBase&)::{lambda()#2}::operator()() const::{lambda()#2}::operator()() const::{lambda(at::vec::AVX2::Vectorized<float>)#1}> >(long, char**, long const*, long, long) () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#2  0x00007fffe571d369 in at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#3  0x00007fffe571da80 in at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#4  0x00007fffebe954ce in at::native::(anonymous namespace)::silu_kernel(at::TensorIteratorBase&) () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#5  0x00007fffe6ce9789 in at::(anonymous namespace)::wrapper_CPU_silu(at::Tensor const&) () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#6  0x00007fffe6ce995e in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::wrapper_CPU_silu>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#7  0x00007fffe6997b86 in at::_ops::silu::redispatch(c10::DispatchKeySet, at::Tensor const&) () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#8  0x00007fffe94a7233 in torch::autograd::VariableType::(anonymous namespace)::silu(c10::DispatchKeySet, at::Tensor const&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#9  0x00007fffe94a781d in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::silu>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#10 0x00007fffea6d9566 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0] ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#11 0x00007fffea241d28 in bool torch::jit::InterpreterStateImpl::runTemplate<false>(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#12 0x00007fffea247ce5 in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#13 0x00007fffea222e76 in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#14 0x00007fffe9e96433 in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&) const () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libtorch_cpu.so
#15 0x00007ffff67edf8d in MetatomicPotential::force(long, double const*, int const*, double*, double*, double*, double const*) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libmetatomic_pot.so
#16 0x00007ffff7f6f103 in Potential::get_ef(Eigen::Matrix<double, -1, 3, 1, -1, 3>, Eigen::Matrix<int, -1, 1, 0, -1, 1>, Eigen::Matrix<double, 3, 3, 1, 3, 3>) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#17 0x00007ffff7ee15f8 in Matter::computePotential() () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#18 0x00007ffff7ee28da in Matter::getPotentialEnergy() () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#19 0x00007ffff7f1c31a in NudgedElasticBand::NudgedElasticBand(std::vector<Matter, std::allocator<Matter> >, std::shared_ptr<Parameters>, std::shared_ptr<Potential>) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#20 0x00007ffff7f1ca09 in NudgedElasticBand::NudgedElasticBand(std::shared_ptr<Matter>, std::shared_ptr<Matter>, std::shared_ptr<Parameters>, std::shared_ptr<Potential>) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#21 0x00007ffff7f32e99 in std::__detail::_MakeUniq<NudgedElasticBand>::__single_object std::make_unique<NudgedElasticBand, std::shared_ptr<Matter>&, std::shared_ptr<Matter>&, std::shared_ptr<Parameters>&, std::shared_ptr<Potential>&>(std::shared_ptr<Matter>&, std::shared_ptr<Matter>&, std::shared_ptr<Parameters>&, std::shared_ptr<Potential>&) ()
   from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#22 0x00007ffff7f362a6 in NudgedElasticBandJob::run[abi:cxx11]() () from /home/goswami/Git/Github/epfl/lab-cosmo/pixi_envs/atomistic-cookbook/atomistic-cookbook/.nox/eon-pet-neb/bin/../lib/libeonclib.so
#23 0x0000555555564409 in main ()
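
For readers of the trace: frame #0 is a SLEEF single-precision exp routine and frame #4 is the CPU silu_kernel, so the overflow is raised while evaluating exp inside SiLU. The sketch below is only an illustration of which inputs could do that. It assumes SiLU is evaluated as x / (1 + exp(-x)) in float32 and that eonclient traps floating point overflow; neither is confirmed in this issue, and `silu_exp_overflows` and `silu_reference` are hypothetical helper names.

```python
import math

# Largest finite float32 value: (2 - 2**-23) * 2**127.
FLT_MAX = (2.0 - 2.0**-23) * 2.0**127

# exp(t) exceeds FLT_MAX once t is larger than ln(FLT_MAX), about 88.72.
EXPF_OVERFLOW_THRESHOLD = math.log(FLT_MAX)


def silu_exp_overflows(x: float) -> bool:
    """Return True if exp(-x) would overflow in float32.

    Assumption: SiLU is computed as x / (1 + exp(-x)). This is an
    illustration, not a copy of the PyTorch kernel.
    """
    return -x > EXPF_OVERFLOW_THRESHOLD


def silu_reference(x: float) -> float:
    """Double-precision SiLU arranged so that exp is never called on a positive argument."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e lies in (0, 1) or underflows to 0.0
    return x * e / (1.0 + e)


if __name__ == "__main__":
    for value in (-100.0, -88.0, 0.0, 50.0):
        print(value, silu_exp_overflows(value), silu_reference(value))
```

Under these assumptions, any activation below roughly -88.72 would make the intermediate exp overflow to infinity. The final SiLU value is still a harmless zero, so the crash would only appear when overflow is trapped rather than ignored.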

However, @Luthaf pointed out that the sleef symbols are only linked in the conda variant. Indeed, a separate environment where everything is managed with pip and eonclient is installed from source works just fine.
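
One quick way to compare the two environments is to look for a standalone libsleef shared object, like the libsleef.so.3 that appears in the trace. The helper below is a minimal sketch; `find_sleef_libs` is a hypothetical name, and it only checks file names under the environment's lib directory, not what libtorch_cpu.so actually links against.

```python
from pathlib import Path
import sys


def find_sleef_libs(prefix):
    """List files named libsleef* anywhere under <prefix>/lib."""
    lib_dir = Path(prefix) / "lib"
    if not lib_dir.is_dir():
        return []
    # Keep regular files and symlinks (e.g. libsleef.so -> libsleef.so.3).
    return sorted(
        p for p in lib_dir.rglob("libsleef*") if p.is_file() or p.is_symlink()
    )


if __name__ == "__main__":
    # Run with the environment's own interpreter so sys.prefix points at it.
    for path in find_sleef_libs(sys.prefix):
        print(path)
```

Running it with the Python interpreter of each environment in turn shows whether that environment ships its own libsleef.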

SPICE and PET-MAD models work fine though. Table incoming.

Labels: bug
