"Arabic is not a sequence of letters; it is a mathematical matrix of meaning."
ARAI is a Research Repository dedicated to exploring the hypothesis that Classical Arabic can be treated as a computational system. This project investigates Morphological Algebra, a method of decomposing and synthesizing Semitic languages into their constituent structural DNA using Semitic Root Encoding (SRE).
Warning
Status: Experimental Research. This is not a production-ready model. The current implementations are exploratory and designed to validate the theoretical framework of SRE-based architectures.
Standard Large Language Models (LLMs) treat Arabic as a sequence of opaque tokens (BPE), identical to how they treat English or French. However, Classical Arabic is inherently nonlinear. It operates on a multi-dimensional grid where semantic intent and grammatical function are distinct, orthogonal layers.
ARAI investigates treating Arabic as "Code" rather than "Text."
In the SRE framework, we move away from 1D tokenization. Instead, every "word" is viewed as a result of a mathematical operation:
- The Root (The Semantic Core): Usually a three-letter consonant cluster (e.g., ك.ت.ب, K-T-B). This is the "constant" that carries the abstract concept of Writing.
- The Pattern (The Functional Template): A specific template (e.g., م123ة, Ma123a) that carries the concept of Space/Location.
When you apply the Pattern of Location to the Root of Writing, the language "calculates" the word Maktaba (Library/Office). This project explores whether a Transformer architecture can learn this "Morphological Algebra" directly, allowing it to generalize to new root-pattern combinations that it has never seen before.
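The root-times-pattern "calculation" described above can be sketched in a few lines. This is an illustrative toy, not the repository's actual API; the function name, the transliterated root `ktb`, and the slot notation `ma12a3a` (digits mark where the root's radicals are inserted) are assumptions for the example.

```python
# Toy sketch of "Morphological Algebra": interleave a root's radicals
# into a pattern template. Illustrative only, not ARAI's real API.

def apply_pattern(root: str, pattern: str) -> str:
    """Substitute numbered slots 1..3 in `pattern` with the root's
    consonants, e.g. root 'ktb' + pattern 'ma12a3a' -> 'maktaba'."""
    out = []
    for ch in pattern:
        if ch.isdigit():                 # slot: pull the i-th radical
            out.append(root[int(ch) - 1])
        else:                            # fixed template material
            out.append(ch)
    return "".join(out)

# Root K-T-B ("writing") + pattern of location -> Maktaba (library)
print(apply_pattern("ktb", "ma12a3a"))  # -> maktaba
# The same pattern generalizes to an unseen root, D-R-S ("studying"):
print(apply_pattern("drs", "ma12a3a"))  # -> madrasa (school)
```

The second call is the point of the hypothesis: once the pattern is learned as an operator, it composes with roots the model has never seen paired with it.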
Because the language is so highly structured, it mirrors formal logic. A single root can generate hundreds of words across different patterns, yet the semantic "DNA" of the root remains constant. Standard LLMs must learn these relationships statistically; ARAI explores encoding them architecturally.
SRE is our experimental approach to sparse embeddings. By feeding the model two distinct input streams, one for the Root and one for the Pattern, we allow the attention mechanism to track semantic flow and grammatical consistency on separate but synchronized channels.
This repository contains the following experimental modules used to audit the SRE hypothesis:
- Experimental SRE Transformer: A dual-input architecture designed to investigate the structural interaction between semantic and functional embeddings.
- Morphological Algebra Benchmarks: Vector-space tools for testing semantic analogies, such as `Root(Justice) + (Pattern(Agent) - Pattern(Abstract))`, to see if the resulting vector clusters near the expected morphological state.
- Corpus Ingestion Pipeline: Tools for extracting morphological primitives from classical linguistic datasets, preserving the register and precision of the classical language.
- Edge Feasibility (WIP): A prototype TensorFlow.js implementation exploring whether morphological logic can enable "Titan-class" reasoning on lightweight hardware by reducing dependency on massive parameter counts.
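The analogy benchmark above reduces to vector arithmetic plus a similarity check. Here is a hedged sketch with synthetic stand-in vectors (the variable names mirror the `Root(Justice)`/`Pattern(Agent)` example; none of this is the repository's benchmark code):

```python
# Toy version of a Morphological Algebra analogy test: in an ideal
# SRE space, word vectors compose additively from root + pattern,
# so swapping the pattern component should land near the target word.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
root_justice = rng.normal(size=8)       # semantic core
pattern_agent = rng.normal(size=8)      # "doer" template
pattern_abstract = rng.normal(size=8)   # abstract-noun template

word_agent_of_justice = root_justice + pattern_agent     # "the just one"
word_abstract_justice = root_justice + pattern_abstract  # "justice"

# Analogy: start from the abstract word and swap the pattern.
predicted = word_abstract_justice - pattern_abstract + pattern_agent
print(cosine(predicted, word_agent_of_justice))  # -> ~1.0 by construction
```

With learned embeddings the similarity will be well below 1.0; the benchmark's question is how close a trained SRE model gets to this additive ideal.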
This project is currently in the Discovery Phase. Our objective is to audit the efficiency of SRE as a data structure and to understand the limitations of current Transformer architectures in capturing nonlinear morphological relationships.
```
├── src/
│   ├── python/arai/         # Core Research Engine (PyTorch)
│   └── javascript/          # Experimental Edge Implementation
├── scripts/
│   ├── training/            # Exploratory training pipelines
│   ├── evaluation/          # Logical audits and SRE benchmarks
│   └── preprocessing/       # Morphological extraction tools
├── research/                # Historical logs and experimental audits
└── docs/                    # Technical specifications and research notes
```

Note: This repository requires an environment capable of running PyTorch and Camel-Tools.
```bash
# Install research dependencies
pip install -e .
npm install

# Run the SRE ingestion pipeline
python3 scripts/preprocessing/preprocess.py
```

Advanced Agentic Coding Project | Exploring the frontiers of Morphological AGI.