* [COMET](https://github.com/Unbabel/COMET) - COMET is an open-source framework for machine translation evaluation (usage sketch below).
* [Deepchecks](https://github.com/deepchecks/deepchecks) - Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production (usage sketch below).
* [DeepEval](https://github.com/confident-ai/deepeval) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications (usage sketch below).
* [DomainBed](https://github.com/facebookresearch/DomainBed) - DomainBed is a test suite containing benchmark datasets and algorithms for domain generalization.
* [EvalAI](https://github.com/Cloud-CV/EvalAI) - EvalAI is an open-source platform for evaluating and comparing AI algorithms at scale.
* [EvalPlus](https://github.com/evalplus/evalplus) - EvalPlus is a robust evaluation framework for LLM4Code, featuring expanded HumanEval+ and MBPP+ benchmarks, efficiency assessment (EvalPerf), and a secure, extensible evaluation toolkit.
* [Evals](https://github.com/openai/evals) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
* [LLMonitor](https://github.com/lunary-ai/lunary) - LLMonitor is an observability & analytics platform for AI apps and agents.
* [LLMPerf](https://github.com/ray-project/llmperf) - LLMPerf is a tool for evaluating the performance of LLM APIs.
* [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) - lmms-eval is an evaluation framework meticulously crafted for consistent and efficient evaluation of large multimodal models (LMMs).
* [Melting Pot](https://github.com/google-deepmind/meltingpot) - Melting Pot is a suite of test scenarios for multi-agent reinforcement learning.
* [Meta-World](https://github.com/Farama-Foundation/Metaworld) - Meta-World is an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks.
* [mir_eval](https://github.com/mir-evaluation/mir_eval) - mir_eval is a Python library which provides a transparent, standardized, and straightforward way to evaluate Music Information Retrieval systems (usage sketch below).
* [MLPerf Inference](https://github.com/mlcommons/inference) - MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios.
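
A minimal sketch of scoring a translation with [COMET](https://github.com/Unbabel/COMET), following the upstream README; the checkpoint name `Unbabel/wmt22-comet-da` and the toy segment are assumptions for illustration, and the model is downloaded on first use:

```python
# Hedged sketch: score one source/translation/reference triple with COMET.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # reference-based checkpoint (assumed name)
model = load_from_checkpoint(model_path)

data = [{
    "src": "Dem Feuer konnte Einhalt geboten werden",
    "mt": "The fire could be stopped",
    "ref": "They were able to control the fire.",
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 keeps it on CPU
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level score
```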
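A minimal sketch of running a Deepchecks train/test validation suite on tabular data; the toy DataFrames and the `target` column are assumptions, while `Dataset` and `train_test_validation` follow the library's documented API:

```python
# Hedged sketch: run a Deepchecks suite comparing a train split against a test split.
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import train_test_validation

# Toy data purely for illustration (assumption).
train_df = pd.DataFrame({"feature": [1, 2, 3, 4], "target": [0, 1, 0, 1]})
test_df = pd.DataFrame({"feature": [2, 3, 4, 5], "target": [1, 0, 1, 0]})

train_ds = Dataset(train_df, label="target", cat_features=[])
test_ds = Dataset(test_df, label="target", cat_features=[])

result = train_test_validation().run(train_dataset=train_ds, test_dataset=test_ds)
result.save_as_html("validation_report.html")  # writes an interactive HTML report
```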
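A minimal sketch of a DeepEval test case, modeled on the project's README; the prompt/answer pair is an assumption, and the metric needs an LLM backend configured (for example an OpenAI API key) to actually run:

```python
# Hedged sketch: a pytest-style DeepEval test using an answer-relevancy metric.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",                    # user query (assumed)
        actual_output="We offer a 30-day full refund at no cost.", # application output (assumed)
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

The file can be run with plain `pytest` or with the `deepeval test run` CLI.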
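A minimal sketch of evaluating onset detection with mir_eval; the annotation file names are placeholders (assumptions), while `mir_eval.io.load_events` and `mir_eval.onset.evaluate` are the library's documented entry points:

```python
# Hedged sketch: compare estimated onset times against reference annotations.
import mir_eval

reference_onsets = mir_eval.io.load_events("reference_onsets.txt")  # placeholder path
estimated_onsets = mir_eval.io.load_events("estimated_onsets.txt")  # placeholder path

scores = mir_eval.onset.evaluate(reference_onsets, estimated_onsets)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")  # e.g. F-measure, Precision, Recall
```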