Core AI Technologies

Reasoning & Prompting Mechanisms

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Summary: Investigates the robustness of Chain-of-Thought (CoT) reasoning in large language models through a data distribution lens. Introduces DataAlchemy, a controlled framework to train LLMs from scratch and systematically probe CoT behavior across task, length, and format dimensions. Findings reveal that CoT reasoning is brittle and heavily reliant on distributional proximity to training data, disappearing under significant distribution shifts.
Tags: #chain-of-thought #CoT #data-distribution #inductive-bias #reasoning #generalizable-reasoning

Language Models & Architecture

DeepSeek-V3 Technical Report
Summary: DeepSeek-V3 is a Mixture-of-Experts (MoE) language model comprising 671 billion parameters, with 37 billion activated per token. It is pre-trained on 14.8 trillion diverse, high-quality tokens, followed by supervised fine-tuning and reinforcement learning to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models, despite requiring only 2.788 million H800 GPU hours for full training. The training process is noted for its remarkable stability.
Tags: #language-model #Mixture-of-Experts #AI
Large Concept Models: Language Modeling in a Sentence Representation Space
Summary: This paper introduces the concept of Large Concept Models (LCMs), which operate on higher-level semantic representations beyond individual tokens. By treating entire sentences as single concepts, LCMs utilize the SONAR sentence embedding space, supporting up to 200 languages in both text and speech modalities. The model is trained to perform autoregressive sentence prediction within this embedding space, exploring approaches such as MSE regression, diffusion-based generation variants, and models operating in a quantized SONAR space. This methodology aims to align AI processing more closely with human cognitive abstraction levels, enhancing the generation and analysis of complex information.
Tags: #language-modeling #sentence-embedding #multilingual #AI
Byte Latent Transformer: Patches Scale Better Than Tokens
Summary: The Byte Latent Transformer (BLT) is a novel byte-level large language model (LLM) architecture that matches the performance of traditional tokenization-based LLMs while enhancing inference efficiency and robustness. BLT encodes raw bytes into dynamically sized patches, which serve as the primary computational units. These patches are segmented based on the entropy of the subsequent byte, allocating more computational resources to complex data segments. A FLOP-controlled scaling study of byte-level models up to 8 billion parameters and 4 trillion training bytes demonstrates that BLT can scale models trained on raw bytes without a fixed vocabulary. This approach improves both training and inference efficiency by dynamically selecting longer patches for predictable data, leading to qualitative enhancements in reasoning and generalization.
Tags: #LLM #byte-level-modeling #transformer #AI
Kolmogorov-Arnold Transformers
Summary: This paper presents the Kolmogorov-Arnold Transformer, a novel architecture aimed at improving the efficiency and performance of transformer models.
Tags: #architecture #transformer #AI
Hunyuan-Large
Summary: This paper introduces Hunyuan-Large, the largest open-source Transformer-based mixture of experts model, featuring a total of 389 billion parameters with 52 billion activated parameters and the ability to handle up to 256K tokens. Hunyuan-Large demonstrates superior performance in various benchmarks, including language understanding and generation, logical reasoning, and coding tasks, surpassing smaller models while maintaining competitive performance against much larger counterparts like LLama3.1-405B.
Tags: #MoE #transformer #open-source #AI

Reinforcement Learning

Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Summary: Proposes a memory-based online reinforcement learning paradigm that adapts LLM agents without updating base model weights. The method formalizes agent learning as a Memory-augmented MDP (M-MDP) with episodic memory (differentiable or non-parametric), a neural case-selection policy, and memory rewriting/reading for continual improvement. In a deep-research setting, the agent achieves 87.88% Pass@3 (top-1) on GAIA validation and 79.40% on GAIA test, plus 66.6% F1 / 80.4% PM on DeepResearcher, and gains +4.7–9.6 points on OOD tasks via case-based memory. Code is released.
Tags: #LLM-agents #reinforcement-learning #continual-learning #episodic-memory #case-based-policy
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Summary: This paper introduces DeepSeek-R1, a reasoning model developed through reinforcement learning (RL) to enhance the reasoning capabilities of large language models (LLMs). The authors present two versions: DeepSeek-R1-Zero, trained solely via large-scale RL without supervised fine-tuning, and DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL to improve performance and address issues like poor readability and language mixing observed in the zero version. DeepSeek-R1 achieves performance comparable to OpenAI's o1-1217 on reasoning tasks. The models and six distilled versions, ranging from 1.5B to 70B parameters, are open-sourced to support the research community.
Tags: #reinforcement-learning #reasoning #LLMs #AI
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Summary: This study critically examines current methods for replicating OpenAI's O1 model capabilities, focusing on the often undisclosed use of knowledge distillation techniques. The authors demonstrate that simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Experiments reveal that a base model fine-tuned on tens of thousands of O1-distilled samples outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Additionally, the study explores the generalization capabilities of O1-distilled models across diverse tasks, including hallucination, safety, and open-domain question answering. Notably, models trained solely on mathematical problem-solving data exhibited strong generalization to open-ended QA tasks and showed reduced susceptibility to sycophancy after fine-tuning. The authors advocate for transparency in AI research and caution against over-reliance on distillation approaches, emphasizing the importance of first-principles thinking in developing capable AI systems.
Tags: #O1-replication #knowledge-distillation #AI-research #transparency

Multimodal Models

LlamaV-o1: Rethinking Step-by-Step Visual Reasoning in LLMs
Summary: This paper introduces LlamaV-o1, a multimodal visual reasoning model designed to enhance step-by-step problem-solving capabilities in large language models (LLMs). The authors present a comprehensive framework comprising three key contributions: (1) a visual reasoning benchmark with over 4,000 reasoning steps across eight categories, ranging from complex visual perception to scientific reasoning; (2) a novel metric that evaluates visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence; and (3) a multi-step curriculum learning approach that progressively organizes tasks to facilitate incremental skill acquisition and problem-solving. Extensive experiments demonstrate that LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models, achieving an average score of 67.3 with an absolute gain of 3.8% across six benchmarks compared to the recent Llava-CoT, while being five times faster during inference scaling. The benchmark, model, and code are publicly available.
Tags: #visual-reasoning #multimodal #LLMs #curriculum-learning #AI
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Summary: This paper introduces VITA-1.5, a Multimodal Large Language Model (MLLM) designed to integrate visual and speech modalities for real-time interaction. The authors propose a multi-stage training methodology that enables the model to understand both visual and speech information, facilitating efficient speech-to-speech dialogues without relying on separate Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) modules. This integration accelerates multimodal end-to-end response speed and enhances the model's capabilities in both vision and speech tasks. Comparative evaluations across benchmarks for image, video, and speech tasks demonstrate that VITA-1.5 achieves performance comparable to state-of-the-art models, enabling near real-time vision and speech interaction.
Tags: #multimodal #vision-and-speech #real-time-interaction #AI #MLLM
Aria
Summary: Aria is an open multimodal-native mixture of experts model designed to handle diverse data types, enhancing AI's multimodal processing capabilities.
Tags: #architecture #multimodal #AI
TableGPT2
Summary: TableGPT2 is a large-scale multimodal model specifically designed to integrate tabular data into AI applications. Trained on over 593,800 tables and 2.36 million high-quality query-table-output tuples, it excels in table-centric tasks while maintaining robust general language and coding capabilities.
Tags: #multimodal-model #tabular-data #AI

Training & Optimization

Block Diffusion: Interpolating Between Generative and Discriminative Models
Summary: This paper introduces Block Diffusion, a novel framework that bridges the gap between generative and discriminative models. By leveraging block-structured diffusion processes, the approach enables controlled interpolation between generative modeling and supervised learning tasks. The authors explore various applications, demonstrating improved flexibility in training regimes and enhanced generalization in tasks such as image synthesis and classification.
Tags: #diffusion-models #generative-modeling #discriminative-models #AI
s1: Simple Test-Time Scaling
Summary: This paper introduces "s1," a methodology aimed at enhancing language model performance through test-time scaling, which leverages additional computational resources during inference to improve outcomes. The authors curate a concise dataset, s1K, comprising 1,000 questions with detailed reasoning traces, selected based on validated criteria of difficulty, diversity, and quality. They also propose "budget forcing," a technique that manages test-time computation by either truncating the model's reasoning process or extending it by appending "Wait" prompts, encouraging the model to reassess and potentially correct its reasoning steps. After fine-tuning the Qwen2.5-32B-Instruct language model using s1K and implementing budget forcing, the resulting model, s1-32B, surpasses OpenAI's o1-preview by up to 27% on competitive math benchmarks such as MATH and AIME24. This work demonstrates that significant performance gains can be achieved with minimal data and computational investment.
Tags: #test-time-scaling #language-models #budget-forcing #AI #reasoning
Huggingface - AutoTrain
Summary: AutoTrain by Huggingface introduces a no-code solution for training machine learning models, simplifying model customization and deployment.
Tags: #UI #GUI #AI
Initialization Using Update Approximation Is Effective for Training Transformers
Summary: This paper introduces a novel initialization method that approximates full fine-tuning within low-rank subspaces for training Transformer models. By employing a carefully designed initialization strategy, the approach achieves optimal scaling for high-rank gradient updates without the need for extensive hyperparameter tuning. The method demonstrates significant efficiency gains, using 27-90 times fewer parameters than standard low-rank adaptation techniques, while surpassing their performance across various tasks, including mathematical reasoning and language understanding.
Tags: #transformers #initialization #low-rank-fine-tuning #AI
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
Summary: This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system designed to streamline the training of large language models (LLMs). TorchTitan unifies state-of-the-art techniques, enabling modular 3D parallelism with elastic scaling, and provides comprehensive tools for logging, checkpointing, and debugging. It incorporates hardware-software co-designed solutions, such as Float8 training and SymmetricMemory, to enhance efficiency. The system has been thoroughly assessed on the Llama 3.1 family of LLMs, demonstrating exceptional performance, modular composability, and elastic scalability.
Tags: #distributed-training #PyTorch #large-language-models #AI
Hot-DoG
Summary: Hot-DoG introduces a tuning-free coreset approach for Markov Chain Monte Carlo methods, enhancing optimization efficiency in AI models.
Tags: #optimization #bayesian-optimization #frameworks

Memory & Attention Mechanisms

Transformer^2: Self-adaptive LLMs
Summary: This paper introduces $\text{Transformer}^2$, a self-adaptation framework designed to enable large language models (LLMs) to adapt to unseen tasks in real-time. The approach selectively adjusts the singular components of the models' weight matrices, allowing for efficient and dynamic task-specific behavior without extensive fine-tuning. During inference, a two-pass mechanism is employed: a dispatch system first identifies the task properties, and then task-specific "expert" vectors, trained via reinforcement learning, are dynamically combined to tailor the model's responses. This method outperforms traditional approaches like LoRA in both parameter efficiency and adaptability. Moreover, $\text{Transformer}^2$ demonstrates versatility across various LLM architectures and modalities, including vision-language tasks, marking a significant advancement toward dynamic, self-organizing AI systems.
Tags: #self-adaptive-LLMs #real-time-adaptation #reinforcement-learning #AI
Titans: Learning to Memorize at Test Time
Summary: This paper introduces Titans, a novel neural architecture that combines attention mechanisms with a neural long-term memory module to address the limitations of traditional Transformers in handling long sequences. While attention mechanisms function effectively as short-term memory by capturing direct dependencies within a fixed context window, they struggle with scalability due to quadratic time and memory complexity. Titans overcome this by incorporating a neural long-term memory module capable of memorizing historical context, thus enabling the model to utilize information from both recent and distant past efficiently. The proposed architecture supports fast, parallelizable training and inference. Experimental results demonstrate that Titans outperform traditional Transformers and modern linear recurrent models across various tasks, including language modeling, common-sense reasoning, genomics, and time series analysis. Notably, Titans can effectively scale to context window sizes exceeding 2 million tokens, maintaining high accuracy in scenarios requiring retrieval of specific information from vast data sequences.
Tags: #transformers #long-term-memory #sequence-modeling #AI #Titans
Lifelong Learning of Large Language Model based Agents: A Roadmap
Summary: This survey explores the integration of lifelong learning—also known as continual or incremental learning—into large language model (LLM)-based agents. It identifies three core components essential for enabling continuous adaptation: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for interacting with dynamic environments. The paper discusses strategies to mitigate catastrophic forgetting and enhance long-term performance, providing a comprehensive roadmap for developing LLM agents capable of adapting over time. Additional resources and literature are available at https://github.com/qianlima-lab/awesome-lifelong-llm-agent.
Tags: #lifelong-learning #LLM #continual-learning #AI #survey
Superposition in Transformers: A Novel Way of Building Mixture of Experts
Summary: This paper addresses the challenge of catastrophic forgetting in large language models (LLMs) when adapting to new tasks or domains. The authors introduce a novel architecture that employs autoencoders to superimpose hidden representations of a base model and a fine-tuned model within a shared parameter space. By utilizing B-spline-based blending coefficients and adaptive autoencoders, this method effectively mitigates catastrophic forgetting, allowing the model to retain original capabilities while incorporating domain-specific expertise. Additionally, it supports dynamic switching between model states during inference, enhancing flexibility and performance.
Tags: #transformers #superposition #mixture-of-experts #catastrophic-forgetting #AI
An Evolved Universal Transformer Memory
Summary: This paper introduces Neural Attention Memory Models (NAMMs), a learned network for memory management that enhances both the performance and efficiency of transformers. NAMMs provide distinct latent contexts, focusing on the most relevant information for individual layers and attention heads. They condition exclusively on the values in the produced attention matrices, making them universally applicable to any model utilizing self-attention. Training NAMMs on a small set of problems yields significant performance improvements across multiple long-context benchmarks while reducing the model's input contexts to a fraction of their original sizes.
Tags: #transformers #memory-management #self-attention #AI
Star Attention: Efficient LLM Inference over Long Sequences
Summary: Star Attention introduces a two-phase block-sparse approximation to enhance the efficiency of Transformer-based Large Language Models (LLMs) during inference on long sequences. The first phase employs blockwise-local attention processed in parallel across multiple hosts, while the second phase allows query and response tokens to attend to all prior cached tokens through sequence-global attention. This method reduces memory requirements and inference time by up to 11x, maintaining 95-100% accuracy.
Tags: #LLM #attention-mechanism #inference-optimization #block-sparse-attention
Memory Layers at Scale
Summary: This paper introduces an enhanced memory layer for language models, demonstrating significant performance improvements, particularly in factual tasks. The proposed memory layer implementation is fully parallelizable and showcases scaling laws with up to 128 billion memory parameters, pretrained on 1 trillion tokens. The augmented models outperform both dense models with more than double the compute budget and mixture-of-expert models when matched for both compute and parameters.
Tags: #memory-augmentation #language-models #scaling-laws #AI
RESOLVE: Relational Reasoning with Symbolic and Object-Level Features Using Vector Symbolic Processing
Summary: RESOLVE introduces a neuro-vector symbolic architecture that integrates object-level features with relational representations in high-dimensional spaces. Utilizing operations like bundling (summation) and binding (Hadamard product), it enables the coexistence of object-level and relational information without interference. The model features a novel attention mechanism operating in a bipolar high-dimensional space, facilitating efficient attention score computation. RESOLVE demonstrates improved generalizability and accuracy in tasks requiring both pure and partial relational reasoning, such as sorting and mathematical problem-solving, compared to existing methods.
Tags: #relational-reasoning #vector-symbolic-processing #neuro-symbolic-AI #AI

Cache & Memory Optimization

Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Summary: This paper introduces Cache-Augmented Generation (CAG) as an alternative to Retrieval-Augmented Generation (RAG) for enhancing language models. CAG involves preloading all relevant resources into a language model's extended context, eliminating the need for real-time retrieval during inference.
Tags: #cache-augmented-generation #language-models #memory-augmentation #AI

Long Context Models

LongWriter
Summary: This paper addresses the limitation of current long-context large language models (LLMs) that struggle to generate outputs exceeding 2,000 words. The authors introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling LLMs to produce coherent outputs exceeding 20,000 words.
Tags: #long-context #LLM #text-generation #AI

Spatial Reasoning

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Summary: This paper investigates whether Multimodal Large Language Models (MLLMs), trained on extensive video datasets, can develop visual-spatial intelligence akin to humans. The authors introduce VSI-Bench, a novel benchmark comprising over 5,000 question-answer pairs designed to assess video-based visual-spatial intelligence.
Tags: #MLLM #visual-spatial-intelligence #VSI-Bench #cognitive-maps
R-CoT
Summary: This paper introduces R-CoT, a two-stage Reverse Chain-of-Thought geometry problem generation pipeline designed to enhance the geometric reasoning capabilities of Large Multimodal Models (LMMs).
Tags: #R-CoT #geometric-reasoning #multimodal-models #AI

World Modeling Platforms

Cosmos World Foundation Model Platform for Physical AI
Summary: This paper introduces the Cosmos World Foundation Model Platform, designed to assist developers in creating customized world models for Physical AI systems. The platform offers a general-purpose world foundation model that can be fine-tuned for specific applications, facilitating the development of digital twins for both the AI agent and its environment. Key components include a video curation pipeline, pre-trained world foundation models, post-training examples, and video tokenizers. The platform is open-source, with models available under permissive licenses at https://github.com/NVIDIA/Cosmos.
Tags: #world-modeling #physical-ai #digital-twin #open-source #AI

AI Enhancement Technologies

Deepsearch Tools

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Summary: Search-R1 extends the DeepSeek-R1 model by enabling large language models (LLMs) to autonomously generate multiple search queries during step-by-step reasoning with real-time retrieval, using reinforcement learning (RL). This approach optimizes LLM rollouts with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets demonstrate performance improvements of 26% (Qwen2.5-7B), 21% (Qwen2.5-3B), and 10% (LLaMA3.2-3B) over strong baselines. The paper also provides insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. Code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
Tags: #reinforcement-learning #retrieval-augmented-reasoning #large-language-models #search-engines #AI
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Summary: Open Deep Search (ODS) is an open-source framework designed to enhance the reasoning capabilities of large language models (LLMs) by integrating web search tools. It comprises two main components: the Open Search Tool, which retrieves and augments relevant web content, and the Open Reasoning Agent, which orchestrates a sequence of actions, including tool calls, to interpret and complete tasks. Evaluations demonstrate that ODS, when paired with powerful open-source LLMs like DeepSeek-R1, achieves state-of-the-art performance on benchmarks such as SimpleQA and FRAMES, outperforming existing proprietary solutions.
Tags: #open-source #reasoning-agents #web-search #large-language-models #AI

Retrieval & Augmentation

MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
Summary: This paper introduces MiniRAG, a novel Retrieval-Augmented Generation (RAG) system designed for efficiency and simplicity, particularly when deploying Small Language Models (SLMs). MiniRAG addresses performance degradation issues in existing RAG frameworks by implementing two key innovations: (1) a semantic-aware heterogeneous graph indexing mechanism that unifies text chunks and named entities, reducing dependence on complex semantic understanding; and (2) a lightweight topology-enhanced retrieval approach that leverages graph structures for efficient knowledge discovery without requiring advanced language capabilities. Experiments demonstrate that MiniRAG achieves performance comparable to Large Language Model (LLM)-based methods while using only 25% of the storage space. The implementation and datasets are open-sourced at https://github.com/HKUDS/MiniRAG.
Tags: #RAG #retrieval-augmented-generation #small-language-models #efficiency #AI
Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
Summary: Auto-RAG introduces an autonomous iterative retrieval model that leverages the decision-making capabilities of Large Language Models (LLMs) to enhance Retrieval-Augmented Generation (RAG). Unlike traditional methods that rely on few-shot prompting or manually constructed rules, Auto-RAG enables LLMs to engage in multi-turn dialogues with retrievers, systematically planning retrievals and refining queries to acquire valuable knowledge.
Tags: #Retrieval-Augmented-Generation #LLM #iterative-retrieval #AI
MemoRAG
Summary: MemoRAG explores advanced techniques in retrieval-augmented generation for more effective information synthesis in AI systems.
Tags: #RAG #retrieval #AI
KAG
Summary: This paper explores Knowledge Augmented Generation (KAG) as a method to enhance large language models (LLMs) for specific professional domains, focusing on the integration of domain-specific knowledge to improve accuracy and relevance in outputs.
Tags: #KAG #knowledge-augmentation #professional-domains #AI

Hallucination Reduction

Prereq-Tune
Summary: Prereq-Tune utilizes synthetic data to reduce hallucination in large language models, enhancing factual accuracy.
Tags: #hallucination #factuality #AI
DecoPrompt: Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises
Summary: DecoPrompt addresses the issue of hallucinations in large language models (LLMs) when confronted with false premises. The method leverages LLMs to "decode" misleading prompts without generating hallucinated outputs, effectively reducing inaccuracies.
Tags: #hallucination-reduction #LLM #false-premises #AI

Performance Optimization

Sleep‑time Compute: Beyond Inference Scaling at Test‑time
Summary: This work proposes sleep‑time compute—letting a model “think” offline before user queries arrive by predicting likely questions and pre‑computing helpful intermediate results. On two reasoning benchmarks (Stateful GSM‑Symbolic and Stateful AIME) the approach slashes test‑time compute by ≈5× while matching accuracy; scaling the offline budget then pushes accuracy a further 13 – 18 %. A new Multi‑Query GSM‑Symbolic dataset shows the pre‑computation cost can be amortized across related queries, cutting per‑query cost by 2.5×. An agentic software‑engineering case study illustrates real‑world applicability.
Tags: #sleep-time-compute #inference-scaling #performance-optimization #LLM #AI
Breaking the Low-Rank Dilemma of Linear Attention
Summary: This paper addresses the computational inefficiencies of the Softmax attention mechanism in Transformer models, particularly its quadratic complexity in vision applications. It introduces Rank-Augmented Linear Attention (RALA), which enhances the rank of linear attention's feature map, enabling it to model complex spatial information more effectively.
Tags: #linear-attention #transformer #vision-models #RALA #RAVLT
LoLCATs: On Low-Rank Linearizing of Large Language Models
Summary: This paper introduces LoLCATs (Low-rank Linear Conversion via Attention Transfer), a method designed to enhance the efficiency of large language models (LLMs) by replacing their quadratic attention mechanisms with subquadratic linear attentions. The approach involves two key steps: training linear attentions to approximate the outputs of softmax attentions through attention transfer, and then applying low-rank adaptation (LoRA) to adjust for any approximation errors.
Tags: #linear-attention #LLM #LoRA #model-efficiency

Applications and Implementations

Model Context Protocol (MCP)

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Summary: The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, thereby breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. It analyzes the security and privacy risks associated with each phase and proposes strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. Future directions for MCP are explored, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Recommendations are offered for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.
Tags: #ModelContextProtocol #MCP #AI #Interoperability #Security #Frameworks

Coding & Development Tools

Multi-Programming Language Sandbox for LLMs
Summary: This paper introduces MPLSandbox, an out-of-the-box framework designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). MPLSandbox automatically identifies the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. It integrates both traditional and LLM-based code analysis tools, offering a thorough examination of generated code. The framework can be seamlessly incorporated into the training and deployment of LLMs to enhance the quality and correctness of their code outputs, thereby streamlining workflows for various LLM-based code-related tasks and reducing development costs.
Tags: #LLM #code-analysis #sandbox #programming-languages
Qwen2.5-Coder Technical Report
Summary: The Qwen2.5-Coder series is an advancement over its predecessor, CodeQwen1.5, featuring six models ranging from 0.5B to 32B parameters. Built upon the Qwen2.5 architecture, these code-specific models are pretrained on a vast corpus exceeding 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and mathematical skills. Evaluations across more than ten benchmarks—including code generation, completion, reasoning, and repair—show that Qwen2.5-Coder consistently outperforms larger models of the same size, achieving state-of-the-art performance.
Tags: #code-generation #LLM #AI #Qwen2.5-Coder

Agentic Frameworks & Task Management

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Summary: This comprehensive 264-page survey presents a modular, brain-inspired architecture for intelligent agents, integrating principles from cognitive science, neuroscience, and computational research. It delves into the design of agents with perception, cognition, and action modules, emphasizing the importance of memory systems, dynamic tool utilization, and self-evolution capabilities. The paper also addresses the challenges of safety, proposing proactive design strategies to mitigate risks. By unifying fragmented research threads, it offers a structured taxonomy for understanding foundation agents and identifies promising directions for future research.
Tags: #foundation-agents #brain-inspired-AI #modular-architecture #AI-safety #multi-agent-systems
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Summary: This survey systematically deconstructs Large Language Model (LLM) agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. It unifies fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. The work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, the authors offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research.
Tags: #LLM-agents #methodology #survey #artificial-general-intelligence #AI
Agent S - Uses computers like a human
Summary: Agent S introduces an open agentic framework that enables AI systems to utilize computers similarly to humans, enhancing task management and execution capabilities.
Tags: #agent #task-management #AI
TaskGen
Summary: TaskGen presents a task-based, memory-infused agentic framework designed to improve AI's ability to manage and execute complex tasks efficiently.
Tags: #agent #task-management #AI

Document Processing

Docling Technical Report
Summary: This technical report introduces Docling, an open-source, MIT-licensed package designed for efficient PDF document conversion. Leveraging advanced AI models like DocLayNet for layout analysis and TableFormer for table structure recognition, Docling operates effectively on standard hardware with minimal resource requirements.
Tags: #document-conversion #PDF #open-source #AI

Computer Vision

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Summary: Gaze-LLE addresses the challenge of gaze target estimation by leveraging features from a frozen DINOv2 encoder within a novel transformer framework. By extracting a unified feature representation for the scene and applying a person-specific positional prompt, the model streamlines gaze prediction without relying on complex, hand-crafted pipelines. This approach achieves state-of-the-art performance across multiple gaze benchmarks, demonstrating the effectiveness of utilizing large-scale learned encoders for this task.
Tags: #gaze-estimation #transformer #DINOv2 #AI

Robotics & Automation

ROLO-SLAM: Rotation-Optimized LiDAR-Only SLAM in Uneven Terrain with Ground Vehicle
Summary: This paper introduces ROLO-SLAM, a LiDAR-based SLAM method designed to enhance pose estimation accuracy for ground vehicles navigating rough terrains. Traditional LiDAR-based SLAM systems often experience significant pose estimation drifts, especially in the vertical direction, leading to distorted global maps. ROLO-SLAM addresses this issue by implementing forward location prediction to reduce discrepancies between consecutive scans, enabling precise determination of position and orientation. The method employs parallel spatial voxelization for correspondence matching and develops a spherical alignment-guided rotation registration within each voxel to estimate vehicle rotation. By incorporating geometric alignment and motion constraints into the optimization process, ROLO-SLAM achieves rapid and effective translation estimation. Keyframes are extracted to construct submaps, facilitating accurate pose estimation through scan-to-submap alignment. A global-scale factor graph is also established to mitigate cumulative errors. Experimental results demonstrate that ROLO-SLAM outperforms existing state-of-the-art LiDAR SLAM frameworks in pose estimation for ground vehicles.
Tags: #LiDAR #SLAM #pose-estimation #robotics #AI
OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving
Summary: OpenEMMA is an open-source end-to-end framework that leverages Multimodal Large Language Models (MLLMs) to enhance autonomous driving systems. By incorporating Chain-of-Thought reasoning processes, OpenEMMA improves performance across various challenging driving scenarios, offering a more efficient and effective approach to autonomous driving.
Tags: #autonomous-driving #MLLM #open-source #AI
RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World
Summary: RoboMatrix introduces a skill-centric and hierarchical framework aimed at enhancing robot task planning and execution in open-world environments. By extracting common meta-skills from various complex tasks, the framework enables robots to dynamically combine these learned skills to address novel tasks. RoboMatrix is structured into three interconnected layers:

Scheduling Layer: Utilizes a general Large Language Model (LLM) to decompose tasks and select appropriate skill models.
Skill Layer: Houses a matrix of meta-skills learned through a unified Vision-Language-Action (VLA) model and hybrid models.
Hardware Layer: Executes the selected skills on the physical robot.

This design facilitates dynamic task decomposition and skill arrangement, allowing robots to effectively perform new tasks by routing through the skill matrix. Experimental results demonstrate that RoboMatrix achieves remarkable generalization across novel objects, scenes, tasks, and embodiments.
Tags: #robotics #task-planning #hierarchical-framework #meta-skills

Text Analysis & Social Media Tools

SocialED: A Python Library for Social Event Detection
Summary: SocialED is an open-source Python library designed to support social event detection (SED) tasks. It integrates 19 detection algorithms and 14 diverse datasets, providing a unified API with detailed documentation. The library offers modular components for preprocessing techniques like graph construction and tokenization, and includes standardized interfaces for training models and making predictions. By integrating popular deep learning frameworks, SocialED ensures high efficiency and scalability across both CPU and GPU environments. It adheres to high code quality standards, including unit testing and continuous integration, ensuring robust and maintainable software. SocialED is publicly available at https://github.com/RingBDStack/SocialED and can be installed via PyPI.
Tags: #python #social-event-detection #open-source #AI #library

AI Safety and Security

Alignment

From Hard Refusals to Safe‑Completions: Toward Output‑Centric Safety Training
Summary: This paper introduces "safe-completions" as an alternative to traditional hard refusal mechanisms in language model safety. Rather than blocking user queries based solely on intent classification, the approach focuses on ensuring that the model's outputs remain safe—especially in sensitive, dual-use domains like biology or cybersecurity. The authors integrate this method into GPT‑5, demonstrating improvements in failure severity reduction, dual-use scenario robustness, and overall helpfulness. Evaluations include both production settings and controlled experiments, highlighting the effectiveness of output-centric safety training in real-world deployment.
Tags: #LLM-safety #output-centric #safe-completions #dual-use #OpenAI

Security Frameworks

Granite Guardian
Summary: Granite Guardian introduces a suite of models designed to detect risks in prompts and responses, facilitating the safe and responsible use of large language models (LLMs). These models cover multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset that combines human annotations from diverse sources with synthetic data, Granite Guardian addresses risks often overlooked by traditional detection models. It achieves AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks, respectively, making it a generalizable and competitive model in this domain. The models are open-sourced to promote responsible AI development within the community. The code is available at https://github.com/ibm-granite/granite-guardian.
Tags: #risk-detection #LLM #AI-safety #open-source
Best-of-N Jailbreaking
Summary: Best-of-N (BoN) Jailbreaking is a black-box algorithm designed to bypass safety measures in advanced AI systems across various modalities. It operates by generating multiple variations of a prompt through augmentations—such as random shuffling or capitalization for text prompts—until a harmful response is elicited. BoN Jailbreaking has demonstrated high attack success rates (ASRs) on closed-source language models, achieving 89% on GPT-4o and 78% on Claude 3.5 Sonnet with 10,000 augmented prompts. It is also effective against state-of-the-art open-source defenses like circuit breakers. The algorithm extends seamlessly to other modalities, successfully jailbreaking vision language models (VLMs) like GPT-4o and audio language models (ALMs) such as Gemini 1.5 Pro using modality-specific augmentations. Notably, BoN's effectiveness improves with an increased number of sampled prompts, following a power-law-like behavior across modalities. Combining BoN with other black-box algorithms, such as optimized prefix attacks, can further enhance its efficacy, achieving up to a 35% increase in ASR. This work highlights the susceptibility of language models to seemingly innocuous input modifications, which can be exploited across different modalities.
Tags: #jailbreaking #AI-safety #black-box-attack #multi-modal-AI
garak: A Framework for Security Probing Large Language Models
Summary: garak (Generative AI Red-teaming and Assessment Kit) is an open-source framework designed to identify vulnerabilities in large language models (LLMs) and dialogue systems. It systematically probes LLMs to uncover potential security weaknesses, providing detailed reports that inform discussions on model alignment and deployment policies.
Tags: #LLM #security #vulnerability-assessment #AI
AdvSecureNet
Summary: AdvSecureNet provides a Python toolkit for adversarial attack simulation in PyTorch, designed to test and improve AI robustness.
Tags: #security #pytorch #AI
Neural Fingerprints for Adversarial Attack Detection Summary: This paper introduces a method to detect adversarial attacks on AI models by leveraging neural fingerprints. The approach involves creating unique identifiers for neural networks, enabling the identification of unauthorized modifications or attacks. Tags: #adversarial-attack #detection #neural-fingerprints #AI
Manifold of Failure: Behavioral Attraction Basins in Language Models Summary: Proposes a framework for systematically mapping the "Manifold of Failure" in LLMs — regions where models exhibit unsafe behavior — using MAP-Elites, a quality diversity algorithm. Rather than finding individual jailbreaks, the approach identifies "behavioral attraction basins" (distinct vulnerability niches) to produce interpretable, global maps of each model's safety landscape. Testing across three models uncovered up to 370 distinct vulnerability niches, offering structural insights that no existing attack method (GCG, PAIR, or TAP) can provide. Tags: #LLM-safety #jailbreaking #vulnerability-mapping #quality-diversity #MAP-Elites #red-teaming
Trustworthy Federated Learning: Privacy, Security, and Beyond Summary: This comprehensive survey examines the security and privacy challenges inherent in Federated Learning (FL). It highlights vulnerabilities in communication channels and potential cyber threats within decentralized networks. The authors discuss various defensive strategies to mitigate these risks, explore FL applications across different sectors, and propose future research directions to enhance the security and efficiency of FL systems. Tags: #federated-learning #privacy #security #decentralized-networks

Cybersecurity

Are Enterprises Ready for Quantum-Safe Cybersecurity?
Summary: This paper provides a multi-perspective assessment of enterprise readiness for quantum-safe cybersecurity. It considers the technologist’s view (maturity of post-quantum cryptography and quantum key distribution), the enterprise leader’s view (awareness, risk management, operational challenges), and the adversary’s view (quantum threat timelines and “harvest now, decrypt later” risks). A SWOT analysis reveals that fewer than 5% of enterprises have formal transition plans, with only finance, telecom, and government sectors actively migrating. Most are hindered by costs, complexity, and lack of expertise. The paper projects cryptanalytically relevant quantum computers in the 2030s and recommends crypto-agility, roadmap development, PQC deployment in high-value systems, and workforce upskilling.
Tags: #quantum-safe #cybersecurity #post-quantum-cryptography #PQC #risk-management
CAI Fluency: A Framework for Cybersecurity AI Fluency
Summary: Introduces CAI Fluency, an educational platform within the Cybersecurity AI (CAI) framework aimed at democratizing knowledge and practical use of AI-based cybersecurity tools—what the authors term “vibe-hacking,” the security analogue to vibe-coding. It adapts the Framework for AI Fluency’s three human–AI interaction modalities and four core competencies specifically to cybersecurity, emphasizing not only technical skill but also critical thinking and ethical practice. The report serves as a white paper plus hands-on guide for applying the CAI framework in projects and real-world security contexts.
Tags: #cybersecurity #AI #AI-fluency #education #framework #human-AI-interaction #ethics #CAI
Evasive Ransomware Attacks Using Low-level Behavioral Adversarial Examples
Summary: Defines low-level behavioral adversarial examples for defeating deep-learning ransomware detectors that rely on runtime behavior. The authors formalize a gray-box threat model and an optimization that searches for source-code edits which induce minimal perturbations to storage/memory access patterns (e.g., entropy, throughput, page/GPA access stats) yet flip the detector’s decision. They prototype a micro-behavior control mechanism on leaked Conti ransomware—tuning number of threads, per-file encryption ratio, and post-encryption delay—to generate executable adversarial behaviors. Using RanSMAP features and a DL detector, the attack lowers recall from 0.98 to as low as 0.64 (Δ −0.34) and can mimic benign apps (e.g., SDelete), while noting that broader, more flexible behavioral manipulation is needed for consistently successful evasion.
Tags: #cybersecurity #adversarial-examples #ransomware #malware-detection #behavioral-features #evasion-attack
Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
Summary: The authors propose Pentest-R1, a framework designed to automate penetration testing by enhancing the reasoning capabilities of large language models (LLMs). It does so through a two-stage reinforcement learning pipeline:
- First, an offline RL stage trains the model on a dataset containing over 500 real-world, multi-step walkthroughs to instill foundational attack logic.
- Then, an online RL stage fine-tunes the model via interactions within a Capture The Flag (CTF) environment, allowing it to self-correct errors and adapt strategies based on direct environmental feedback.
  In evaluations on the AutoPenBench benchmark, Pentest-R1 achieves a 24.2% success rate, second only to Gemini 2.5 Flash. On Cybench, it records a 15.0% success rate for unguided tasks—setting a new state-of-the-art among open-source LLMs and matching top proprietary models. Ablation studies underline that both RL stages are vital to its performance.
  Tags: #cybersecurity #LLM #reinforcement-learning #autonomous-penetration-testing
Advancing Autonomous Incident Response: Leveraging LLMs and Cyber Threat Intelligence Summary: This paper introduces a retrieval-augmented generation (RAG) framework that fuses large language models (LLMs) with unstructured Cyber Threat Intelligence (CTI) to automate and enhance incident response. The model enriches security alerts with contextualized, actionable remediation without retraining when CTI updates. Experiments on real and simulated alerts demonstrate improved contextual relevance, reduced analyst workload, and faster response times. Tags: #incident-response #cyber-threat-intelligence #LLM #RAG #autonomous-security
Cyber-Zero: Training Cybersecurity Agents without Runtime
Summary: Cyber-Zero is the first framework for training cybersecurity agents without the need for runtime environments. It uses large language models to generate high-quality, long-horizon agent trajectories by reverse-engineering runtime behaviors from publicly available Capture-The-Flag (CTF) write-ups. This persona-driven simulation approach eliminates executables while producing results that match or outperform traditional runtime-dependent methods.
Tags: #cybersecurity #LLM #synthetic-data #runtime-free-training #CTF #AI-security
Cybersecurity AI: The Dangerous Gap Between Automation and Autonomy
Summary: This position paper warns that vendors routinely blur the line between automation (scripted tasks) and autonomy (adaptive decision-making), creating over-hyped “fully autonomous pentesters.” Drawing on robotics, the author introduces a 6-level taxonomy (Level 0–5) for “cybersecurity autonomy,” showing that today’s leading agents (e.g., XBOW, CAI) actually sit at Levels 3–4 and still require human validation. Mislabeling them can tempt organisations to remove oversight, potentially adding new vulnerabilities instead of closing them. The paper argues for precise terminology, transparent capability disclosure, and a human-in-the-loop model as prerequisites for responsible progress. :contentReference[oaicite:0]{index=0}
Tags: #cybersecurity #ai-autonomy #penetration-testing #human-in-the-loop #taxonomy
CAI: An Open, Bug Bounty-Ready Cybersecurity AI
Summary: CAI introduces a modular, open-source framework that builds specialised AI agents to automate end-to-end vulnerability discovery. It formalises “levels of autonomy” for security tooling, then shows that its agents beat state-of-the-art CTF baselines—solving challenges up to 3,600 × faster (11 × on average) and winning first place among AI teams in the “AI vs Human” live CTF. Outside benchmarks, CAI climbed into the global top-30 on Hack The Box within a week while slashing average bug-bounty testing costs by 156 ×. Its human-in-the-loop design, plug-and-play tool integration, and emphasis on open access aim to democratise advanced security testing beyond today’s oligopolistic bug-bounty platforms. :contentReference[oaicite:0]{index=0}
Tags: #cybersecurity #bug-bounty #penetration-testing #ai-agents #ctf
An LLM Framework for Cryptography over Chat Channels
Summary: The authors present EmbedderLLM, a text-embedding mechanism that covertly slots ciphertext into specific token positions of language-model output, producing chat that is statistically indistinguishable from normal conversation. The design supports both symmetric-key (password-based AEAD) and public-key schemes (e.g., ECDHE), works with any local LLM, and remains secure against pre- or post-quantum adversaries. Demonstrations show reliable key exchange and message transfer even in environments where conventional encryption is blocked or fingerprinted.:contentReference[oaicite:0]{index=0}
Tags: #LLM-cryptography #steganography #covert-communication #cybersecurity #privacy #AI
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Summary: CVE-Bench introduces a real-world cybersecurity benchmark designed to evaluate the capability of large language model (LLM) agents in exploiting web application vulnerabilities. Unlike previous benchmarks that focus on abstract Capture the Flag competitions, CVE-Bench utilizes critical-severity Common Vulnerabilities and Exposures (CVEs) to create scenarios that closely mimic real-world conditions. The framework provides a sandbox environment for LLM agents to autonomously exploit vulnerabilities, facilitating systematic evaluation of their performance. Initial assessments reveal that state-of-the-art agent frameworks can successfully exploit up to 13% of the vulnerabilities presented.
Tags: #cybersecurity #AI-agents #vulnerability-exploitation #benchmarking #LLM
On the Feasibility of Using LLMs to Execute Multistage Network Attacks
Summary: This paper investigates the capability of large language models (LLMs) to autonomously conduct multistage network attacks, which involve complex sequences of actions across multiple hosts, including reconnaissance, exploitation, lateral movement, and data exfiltration. The authors evaluate various LLMs across 10 emulated network environments and find that, in their standard configurations, LLMs struggle to perform these attacks effectively. To address this limitation, they introduce Incalmo, an LLM-agnostic high-level attack abstraction layer that mediates between the LLM and the target environment. Incalmo allows LLMs to specify high-level tasks (e.g., "infect a host" or "scan a network"), which it then translates into low-level commands using an action planner, environment state service, and attack graph service. This approach enables LLMs to successfully execute multistage attacks in 9 out of 10 tested environments, highlighting the potential for LLMs to be utilized in automated cybersecurity operations when paired with appropriate abstraction layers.
Tags: #LLMs #cybersecurity #network-attacks #automation #AI
IntellBot: Retrieval Augmented LLM Chatbot for Cyber Threat Knowledge Delivery
Summary: This paper introduces IntellBot, an advanced cybersecurity chatbot leveraging Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. IntellBot aggregates information from diverse sources to create a comprehensive knowledge base encompassing known vulnerabilities, recent cyberattacks, and emerging threats.
Tags: #cybersecurity #chatbot #LLM #RAG #threat-intelligence
Analysing the Cultural Dimensions of Cybercriminal Groups: A Case Study on the Conti Ransomware Group
Summary: This study explores the cultural aspects of cybercriminal organizations, focusing on the Conti ransomware group. By analyzing leaked internal communications, the authors apply Hofstede's and Meyer's cultural dimension frameworks to understand the group's operational behaviors.
Tags: #cybersecurity #cultural-analysis #threat-intelligence #ransomware

Deepfake & Anomaly Detection

Capture Artifacts for Deepfake Detection
Summary: This research introduces a method for detecting deepfakes by progressively disentangling and purifying blended identities, improving detection accuracy.
Tags: #deepfake #detection #AI
AnomalyNCD
Summary: AnomalyNCD proposes a novel approach to anomaly detection, focusing on identifying new anomaly classes in AI systems.
Tags: #anomaly-detection #AI
FairAdapter
Summary: FairAdapter addresses the challenge of detecting AI-generated images, which often exhibit inconsistencies across different content types due to overfitting in existing detection models. This framework enhances detection fairness by mitigating performance disparities across various image contents.
Tags: #AI-generated-images #detection-fairness #deep-learning #image-forensics

Poisoning Attacks

NoiseAttack
Summary: NoiseAttack discusses an evasive, sample-specific multi-domain attack strategy, highlighting vulnerabilities in AI systems.
Tags: #poisoning #security #AI
CodeBreaker
Summary: CodeBreaker presents an LLM-assisted, easy-to-trigger backdoor attack, emphasizing the need for robust security measures in AI.
Tags: #poisoning #security #AI

Detection & Monitoring

Protecting Small Organizations from AI Bots with Logrip: Hierarchical IP Hashing
Summary: The paper introduces Logrip, an open-source tool that analyzes web-server access logs to distinguish human traffic from automated crawlers and AI bots. It hashes IPs hierarchically (individual, /24, and /16 subnets) and applies behavior-based metrics—such as request rate, daily span, and consecutive-day activity—to score and block mechanical access patterns. Tested on a real non-profit’s server, Logrip reduced estimated workload by 94 %, proving more effective than conventional throttling. The approach provides small organizations with a lightweight, configurable defense against distributed crawler traffic that degrades performance while remaining within polite rate limits.
Tags: #cybersecurity #log-analysis #bot-detection #hierarchical-ip-hashing #AI-crawlers
Human intelligence can safeguard against artificial intelligence
Summary: This study explores individual differences in discerning human-generated texts from AI-generated ones, highlighting the role of human intelligence in monitoring AI outputs.
Tags: #detection #monitoring #human-AI
LLMSCAN
Summary: LLMSCAN introduces a causal scanning method to detect misbehavior in large language models, enhancing AI system reliability.
Tags: #detection #monitoring #AI
From Imitation to Introspection
Summary: This paper explores the emergence of self-consciousness in language models, transitioning from mere imitation to introspective capabilities. It examines the extent to which these models can exhibit self-referential behaviors, delving into the philosophical implications of AI's evolving cognitive functions.
Tags: #self-consciousness #AI-philosophy #introspection #detection
PARIS: A Practical, Adaptive Trace-Fetching and Real-Time Malicious Behavior Detection System
Summary: This paper introduces PARIS, a system designed for real-time detection of malicious behaviors on Windows platforms. Utilizing Event Tracing for Windows (ETW), PARIS adaptively monitors and collects data related to potentially malicious API calls and call stacks, significantly reducing data collection overhead. This efficiency enables the system to monitor a broader range of APIs, enhancing its ability to detect complex attack behaviors.
Tags: #malware-detection #real-time-monitoring #ETW #cybersecurity

Bias & Fairness

Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
Summary: This paper addresses the challenge of ensuring fairness in multi-document summarization, particularly when dealing with user-generated content from diverse social groups. The authors introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints.
Tags: #fairness #summarization #extractive-summarization #GPT-3.5
Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach
Summary: This paper addresses biases in pronoun usage by large language models (LLMs), particularly the inappropriate application of gendered pronouns when inclusive language is necessary. The authors introduce a collaborative agent framework designed to detect and correct such biases, enhancing inclusivity in AI-generated content.
Tags: #bias-mitigation #queer-representation #LLM #inclusive-language

Privacy & Ethics

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Summary: This paper provides a comprehensive survey of the emerging "LLM-as-a-judge" paradigm, where Large Language Models (LLMs) are utilized for scoring, ranking, or selection across various tasks and applications. It introduces a detailed taxonomy exploring three dimensions: what to judge, how to judge, and where to judge.
Tags: #LLM #AI-evaluation #survey #AI

Development Tools and Infrastructure

Testing & Benchmarking

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Summary: Introduces MCP-Bench, a benchmark for evaluating tool-using LLM agents on realistic, multi-step tasks. Built on the Model Context Protocol (MCP), it connects models to 28 live MCP servers exposing ~250 tools across domains (e.g., finance, travel, scientific computing, academic search). Tasks require cross-tool orchestration, schema understanding, precise parameter control, planning/reasoning, and grounding in intermediate tool outputs. A multi-facet evaluation covers tool usage, trajectory planning, and end-to-end task success; experiments on 20 LLMs reveal persistent challenges. Code and data are available on GitHub.
Tags: #benchmark #MCP #tool-use #LLM-agents #evaluation #planning
OSWorld Summary: OSWorld provides a comprehensive benchmarking environment for evaluating AI performance across various operating systems.
Tags: #benchmark #testing #OS
AndroidWorld Summary: AndroidWorld offers a specialized platform for testing and benchmarking AI applications within the Android ecosystem.
Tags: #benchmark #testing #OS
WindowsAgentArena Summary: WindowsAgentArena serves as a testing ground for assessing AI agents' performance in Windows operating environments.
Tags: #benchmark #testing #OS

User Interface & Interaction

Mobile-Agent-v3: Fundamental Agents for GUI Automation
Summary: Introduces GUI-Owl, a foundational GUI agent, and Mobile-Agent-v3, a general-purpose GUI-automation framework. A cloud environment spanning Android, Ubuntu, macOS, and Windows powers a Self-Evolving GUI Trajectory Production pipeline to generate/validate high-quality interaction data. The agents integrate UI grounding, planning, action semantics, and reasoning, and train with scalable asynchronous RL, including Trajectory-aware Relative Policy Optimization (TRPO). Reported results set new open-source SOTA on GUI-agent benchmarks: GUI-Owl-7B scores 66.4 on AndroidWorld and 29.4 on OSWorld; Mobile-Agent-v3 reaches 73.3 (AndroidWorld) and 37.7 (OSWorld). Code is open-sourced.
Tags: #GUI-agent #automation #reinforcement-learning #AndroidWorld #OSWorld #open-source
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Summary: This paper introduces UI-TARS, a native GUI agent model that processes screenshots as input to perform human-like interactions, such as keyboard and mouse operations. Unlike existing frameworks that rely on heavily wrapped commercial models with expert-crafted prompts, UI-TARS is an end-to-end model that outperforms these sophisticated systems. Experiments demonstrate its superior performance across multiple GUI agent benchmarks evaluating perception, grounding, and task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, surpassing models like Claude. In AndroidWorld, it achieves a score of 46.6, outperforming GPT-4o. Key innovations include enhanced perception through a large-scale dataset of GUI screenshots, unified action modeling standardizing actions across platforms, system-2 reasoning incorporating deliberate multi-step decision-making, and iterative training with reflective online traces to address data bottlenecks.
Tags: #GUI #automation #human-computer-interaction #AI
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Summary: ShowUI is a vision-language-action model designed to enhance human productivity by enabling AI agents to interact with graphical user interfaces (GUIs) in a manner similar to human perception. Unlike traditional language-based agents that depend on text-rich meta-information, ShowUI incorporates advanced UI-guided visual token selection and interleaved vision-language-action streaming.
Tags: #GUI-assistants #vision-language-action #AI #productivity
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
Summary: This study explores Claude 3.5 Computer Use, an AI model designed to function as a graphical user interface (GUI) agent. The authors curated a set of tasks across various domains and software to assess its capabilities.
Tags: #GUI-agent #AI #automation #Claude3.5

Infrastructure & Scaling

LLaMA-Factory Summary: LLaMA-Factory is an open-source infrastructure project aimed at streamlining large language model deployment.
Tags: #open-source #infrastructure #AI
SWIFT
Summary: SWIFT provides scalable, lightweight infrastructure for efficient LLM deployment, designed to minimize resource consumption.
Tags: #open-source #infrastructure #AI

Multimedia Processing

Speech & Audio

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Summary: IndexTTS extends XTTS/Tortoise into an end-to-end, GPT-style TTS pipeline that can clone voices in a single shot while remaining lightweight enough for industrial workloads. Key upgrades include (1) a hybrid character + pinyin front-end to disambiguate polyphonic Chinese characters and long-tail words; (2) an empirical study of Vector-Quantisation vs. Finite-Scalar-Quantisation codebooks for acoustic tokens; (3) a Conformer-based speech-conditional encoder that steadies prosody during voice-cloning; and (4) replacement of the original decoder with BigVGAN2 for cleaner, high-fidelity audio. Against open-source baselines (Fish-Speech, CosyVoice2, FireRedTTS, F5-TTS) it delivers higher naturalness, better content consistency, faster inference, and a simpler training recipe.
Tags: #text-to-speech #zero-shot #voice-cloning #controllable-TTS #LLM
TangoFlux: Super Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization
Summary: TangoFlux is an efficient text-to-audio (TTA) generative model with 515 million parameters, capable of producing up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. It introduces CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. TangoFlux outperforms existing TTA models in both objective and subjective evaluations, providing superior audio quality with reduced inference times.
Tags: #text-to-audio #audio-generation #open-source #AI
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Summary: F5-TTS introduces a fully non-autoregressive text-to-speech (TTS) system utilizing flow matching with a Diffusion Transformer (DiT). By padding text inputs to match the length of speech inputs and performing denoising for speech generation, it eliminates the need for complex components like duration models, text encoders, and phoneme alignment. Enhancements include the integration of ConvNeXt for refined text representation and the implementation of a Sway Sampling strategy during inference, which significantly boosts performance and efficiency.
Tags: #text-to-speech #flow-matching #Diffusion-Transformer #AI
Qwen2-Audio Summary: Qwen2-Audio is an advanced speech model designed to improve audio processing capabilities in AI systems.
Tags: #speech #audio #AI-models
LLaMA-Omni
Summary: LLaMA-Omni offers seamless speech interaction with large language models, enhancing real-time communication capabilities.
Tags: #speech #audio #AI-models
Moonshine
Summary: This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings.
Tags: #speech-recognition #live-transcription #voice-commands #AI

Video Generation

ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Summary: Presents ComfyUI-Copilot, an LLM-powered assistant plugin for the ComfyUI visual generative-AI platform. It provides intelligent node/model recommendations and one-click automated workflow construction using a hierarchical multi-agent architecture (a central assistant delegating to specialized worker agents) backed by curated ComfyUI knowledge bases. Offline evaluations and online user feedback show improved accuracy of recommendations and faster workflow building; code and demo are available on GitHub.
Tags: #agent #multi-agent #workflow-automation #ComfyUI #LLM #creative-AI
Open-Sora 2.0: Training a Commercial-Level Sora Model on 2M Videos
Summary: Open-Sora 2.0 is an open-source initiative aimed at reproducing and advancing Sora-level text-to-video generation capabilities. Trained on 2 million high-quality video clips, the system incorporates a multi-stage training pipeline with 3D U-Net architecture, temporal-aware attention mechanisms, and scalable data curation strategies. It achieves state-of-the-art performance in generating coherent, high-fidelity videos across diverse prompts and scenes. Open-Sora 2.0 sets a new benchmark for open-source video generation systems at commercial scale.
Tags: #video-generation #text-to-video #open-source #Sora #AI
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Summary: SynCamMaster introduces a plug-and-play module that enhances pre-trained text-to-video models for generating synchronized multi-camera videos from diverse viewpoints. It incorporates a multi-view synchronization module to maintain appearance and geometry consistency across different camera angles.
Tags: #video-generation #multi-camera #synchronization #AI
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Summary: HunyuanVideo is an open-source video foundation model that achieves performance in video generation comparable to, or surpassing, leading closed-source models. The framework integrates key components such as data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure for large-scale model training and inference.
Tags: #video-generation #open-source #AI #large-scale-models
Open-Sora Plan: Open-Source Large Video Generation Model
Summary: The Open-Sora Plan introduces an open-source initiative aimed at developing a large-scale model capable of generating high-resolution, extended-duration videos based on diverse user inputs. The project encompasses several components integral to the video generation process, including Wavelet-Flow Variational Autoencoder and Joint Image-Video Skiparse Denoiser.
Tags: #video-generation #open-source #AI #Wavelet-Flow-VAE #Skiparse-Denoiser
StableAnimator: High-Quality Identity-Preserving Human Image Animation
Summary: StableAnimator is an end-to-end video diffusion framework designed to synthesize high-quality human image animations while preserving identity (ID) consistency. It utilizes off-the-shelf extractors to compute image and face embeddings, refining them through a global content-aware Face Encoder.
Tags: #image-animation #identity-preservation #video-diffusion #AI
Hallo2
Summary: This paper introduces Hallo2, an advanced method for generating long-duration, high-resolution portrait image animations driven by audio inputs. The approach addresses challenges like appearance drift and temporal artifacts by implementing augmentation strategies within the image space of conditional motion frames.
Tags: #image-animation #audio-driven #high-resolution #long-duration

Specialized Applications

Healthcare

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools
Summary: TxAgent is an AI agent designed to enhance precision therapeutics by integrating multi-step reasoning with real-time biomedical knowledge retrieval. It leverages a toolbox of 211 tools, collectively known as ToolUniverse, to analyze drug interactions, contraindications, and patient-specific treatment strategies. By evaluating drugs at molecular, pharmacokinetic, and clinical levels, TxAgent tailors treatment recommendations to individual patient characteristics, including age, genetic factors, and disease progression. It outperforms leading large language models and reasoning agents across multiple benchmarks, achieving 92.1% accuracy in open-ended drug reasoning tasks.
Tags: #AI-agent #therapeutic-reasoning #precision-medicine #healthcare #ToolUniverse

Financial Technology

FinRobot: AI Agent for Equity Research and Valuation with Large Language Models
Summary: FinRobot introduces an AI agent framework tailored for equity research, integrating both quantitative and qualitative analyses to emulate human analyst reasoning. It comprises three specialized agents: Data-CoT Agent, Concept-CoT Agent, and Thesis-CoT Agent.
Tags: #AI-agent #equity-research #valuation #LLM

Gaming & Simulation

Reinforcement Learning for Quantum Tiq-Taq-Toe
Summary: This paper explores the application of reinforcement learning (RL) to Quantum Tiq-Taq-Toe, a quantum variant of the classic Tic-Tac-Toe game. The study highlights the challenges in representing quantum games classically due to partial observability and exponential state complexity.
Tags: #reinforcement-learning #quantum-computing #game-theory #AI
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
Summary: Kinetix introduces an open-ended framework for training general reinforcement learning (RL) agents across a vast array of 2D physics-based tasks. By procedurally generating tens of millions of environments—ranging from robotic locomotion and grasping to video games—the framework enables agents to develop robust physical reasoning capabilities.
Tags: #reinforcement-learning #physics-based-tasks #generalization #AI

Scientific Applications

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
Summary: The Well is a comprehensive collection of datasets comprising numerical simulations across various spatiotemporal physical systems. It offers 15TB of data spanning 16 datasets, covering domains such as biological systems, fluid dynamics, acoustic scattering, and magneto-hydrodynamic simulations of extra-galactic fluids and supernova explosions. This resource aims to facilitate the development and evaluation of machine learning-based surrogate models in physics by providing a diverse range of physical behaviors.
Tags: #physics-simulations #machine-learning #datasets #open-source
A Library for Learning Neural Operators
Summary: This paper introduces NeuralOperator, an open-source Python library designed to facilitate operator learning. Unlike traditional neural networks that map between finite-dimensional Euclidean spaces, neural operators are capable of learning mappings between function spaces, enabling them to handle inputs and outputs at various discretizations. Built on PyTorch, NeuralOperator offers tools for training and deploying neural operator models, as well as developing new ones, providing a user-friendly interface for both newcomers and experienced practitioners.
Tags: #operator-learning #neural-operators #open-source #Python-library
Automating the Search for Artificial Life with Foundation Models
Summary: This paper introduces the Automated Search for Artificial Life (ASAL), a novel approach that leverages vision-language foundation models (FMs) to explore and discover lifelike simulations across various Artificial Life (ALife) substrates, including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata. ASAL automates the identification of simulations that produce target phenomena, generate temporally open-ended novelty, and illuminate a diverse range of interesting simulations.
Tags: #ArtificialLife #FoundationModels #SimulationDiscovery #AI

Educational Resources

Tutorials & Guides

Quantum Computers Explained – Limits of Human Technology Video: Link
Summary: This educational video provides an accessible overview of quantum computing, exploring its fundamental principles and potential to revolutionize technology. It delves into the limitations of classical computers and illustrates how quantum mechanics can overcome these barriers, enabling unprecedented computational capabilities.
Tags: #quantum-computing #technology #education #computing
Every attention head explained Source: YouTube
Summary: This educational video provides an in-depth explanation of attention heads within transformer models, a fundamental component in modern machine learning architectures. It covers the mechanics of attention mechanisms, their role in processing input data, and how they contribute to the performance of models in tasks like natural language processing and computer vision.
Tags: #attention-heads #transformer-models #machine-learning #AI
How CUDA Programming Works Source: YouTube
Summary: This YouTube video provides an overview of CUDA programming, explaining its functionality and applications in parallel computing.
Tags: #CUDA #computing

Best Practices

Quantum Computing: Hype vs. Reality Video: Link
Summary: This video critically examines the current state of quantum computing, distinguishing between the technology's realistic capabilities and the surrounding hype. It provides insights into the progress made in the field, the challenges that remain, and the practical implications of quantum computing advancements.
Tags: #quantum-computing #technology #analysis #reality-check

Environmental Impact

Energy Efficiency

The carbon emissions of writing and illustrating
Summary: This study compares the carbon emissions associated with writing and illustrating tasks performed by AI systems versus humans. The findings indicate that AI systems produce significantly lower carbon emissions for these creative tasks.
Tags: #carbon-emissions #AI-efficiency #environment #sustainability

Name		Name	Last commit message	Last commit date
Latest commit History 181 Commits
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Core AI Technologies

Reasoning & Prompting Mechanisms

Language Models & Architecture

Reinforcement Learning

Multimodal Models

Training & Optimization

Memory & Attention Mechanisms

Cache & Memory Optimization

Long Context Models

Spatial Reasoning

World Modeling Platforms

AI Enhancement Technologies

Deepsearch Tools

Retrieval & Augmentation

Hallucination Reduction

Performance Optimization

Applications and Implementations

Model Context Protocol (MCP)

Coding & Development Tools

Agentic Frameworks & Task Management

Document Processing

Computer Vision

Robotics & Automation

Text Analysis & Social Media Tools

AI Safety and Security

Alignment

Security Frameworks

Cybersecurity

Deepfake & Anomaly Detection

Poisoning Attacks

Detection & Monitoring

Bias & Fairness

Privacy & Ethics

Development Tools and Infrastructure

Testing & Benchmarking

User Interface & Interaction

Infrastructure & Scaling

Multimedia Processing

Speech & Audio

Video Generation

Specialized Applications

Healthcare

Financial Technology

Gaming & Simulation

Scientific Applications

Educational Resources

Tutorials & Guides

Best Practices

Environmental Impact

Energy Efficiency

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Uh oh!

Uh oh!