hivelab is an extension of the Slurk chat platform that provides a complete modular pipeline to conduct large-scale randomized controlled trials involving human participants and LLM agents in group conversations. hivelab automates parallelization and uses a unique identifier to seamlessly link data across recruitment platforms, survey engines, and chat interfaces. This solution offers social scientists a transparent, fully controllable, and scalable tool for rigorous human-AI interaction experiments.
In the social sciences, obtaining reliable causal data from large-scale, randomized group experiments remains a major methodological challenge in the rapidly growing field of human-AI interaction involving LLM agents. One of the core difficulties is establishing a robust experimental pipeline capable of performing parallel randomization into nested group structures while offering seamless cross-platform data linkage.
To address this gap, we introduce hivelab, an open-source modular solution designed to conduct automated randomized controlled trials involving humans and LLM agents in group conversations at scale. hivelab provides a complete pipeline that manages parallel randomization for large-scale experiments and propagates a unique identifier to seamlessly connect data across recruitment platforms, surveys, and chat rooms. In addition, it enables real-time customization of agent behavior based on collected survey data. By offering transparency, complete data control, and a fully customizable setup, hivelab provides the social science community with the necessary tools to conduct rigorous, replicable, and scalable human-AI interaction experiments.
A variety of tools exist for conducting human-AI interaction experiments, but none currently offers a flexible, open-source pipeline for large-scale randomized group studies. hivelab extends Slurk’s real-time multi-user chat capability by integrating tools for automated randomization and end-to-end data management that are required for complex, group-based randomized controlled trials [4, 2]. Other platforms such as Vegapunk embed LLM agents into Qualtrics for scalable longitudinal conversational surveys [1]. Vegapunk, however, is a fee-based application and is not designed for parallelized multi-human group interactions.
The most functionally similar platform to hivelab is Epitome, which offers a modular visual environment for multi-level group experiments with embedded randomization [3]. The downside to Epitome is its invitation system, which is logistically too complex for large, anonymous participant pools (from Prolific/MTurk). Moreover, its cloud-based nature raises concerns about data privacy and code control. By contrast, hivelab is a modular open-source platform whose components researchers can self-host and control. It offers maximum flexibility, easily integrates with any survey tool that supports redirection, and ensures complete code transparency and data ownership, which is crucial for scientific rigor and replication.
The hivelab implementation starts with the open-source Slurk-server platform, which serves as the foundational chat environment. The vanilla Slurk-server lacks the necessary infrastructure to accommodate interactions with LLM agents and to seamlessly integrate with external platforms for end-to-end data management. To address these limitations, we have developed hivelab as a suite of specialized, decoupled building blocks that provide a twofold contribution:
upgrading the chat environment by supporting the integration of any type of agent or API, including LLMs, rule-based bots, and external services; and
delivering end-to-end data integration across platforms throughout the experimental pipeline, compatible with any survey engine capable of basic chaining and equipped with WebHooks.
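To illustrate the first contribution, the pluggable-agent idea can be sketched as a minimal interface that any agent type (LLM-backed, rule-based, or wrapping an external service) implements. The names `Agent` and `RuleBasedAgent` below are illustrative assumptions, not hivelab's actual API.

```python
# Hypothetical sketch of a pluggable agent interface; hivelab's real
# classes and method names may differ.
from abc import ABC, abstractmethod


class Agent(ABC):
    """Any conversational actor a chat room can host."""

    @abstractmethod
    def respond(self, message: str) -> str:
        """Return a reply to an incoming participant message."""


class RuleBasedAgent(Agent):
    """A simple non-LLM agent: maps trigger keywords to canned replies."""

    def __init__(self, rules: dict, fallback: str = "Could you elaborate?"):
        self.rules = rules
        self.fallback = fallback

    def respond(self, message: str) -> str:
        for trigger, reply in self.rules.items():
            if trigger in message.lower():
                return reply
        return self.fallback


bot = RuleBasedAgent({"hello": "Hi, welcome to the discussion!"})
print(bot.respond("Hello everyone"))  # Hi, welcome to the discussion!
```

An LLM-backed agent would implement the same `respond` interface but delegate to a model API, which is what lets the chat environment treat all agent types uniformly.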
Experimental flow: Traffic Generator → Pre-Survey (Survey Engine) → Chatbot Interaction → Post-Survey (Survey Engine) → Checkout/Completion.
The entire hivelab implementation follows a containerized architecture that runs and controls these components, which include the basic Slurk-server alongside five new extension components.
The hivelab-concierge service operates as a matchmaker agent in the waiting room, forming and dispatching groups based on group size requirements as participants arrive asynchronously from recruitment platforms. hivelab-setup is responsible for creating chat rooms and configuring room permissions and layouts, ensuring compatibility with the base Slurk-server.
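The matchmaking logic of the concierge can be sketched as a waiting queue that dispatches a group as soon as enough participants have arrived. The class name `Concierge` and its methods are illustrative assumptions, not hivelab-concierge's actual interface.

```python
# Illustrative sketch of asynchronous group formation, assuming a simple
# first-come-first-served policy; hivelab-concierge's real implementation
# may differ.
from __future__ import annotations
from collections import deque


class Concierge:
    """Collects arriving participants and dispatches full groups."""

    def __init__(self, group_size: int):
        self.group_size = group_size
        self.waiting = deque()  # participants not yet assigned to a group

    def arrive(self, participant_id: str):
        """Register an arrival; return a full group once one forms, else None."""
        self.waiting.append(participant_id)
        if len(self.waiting) >= self.group_size:
            return [self.waiting.popleft() for _ in range(self.group_size)]
        return None


concierge = Concierge(group_size=3)
concierge.arrive("p1")            # waits
concierge.arrive("p2")            # waits
group = concierge.arrive("p3")    # dispatches ["p1", "p2", "p3"]
```

In practice, arrivals come in asynchronously from the recruitment platform, so each dispatched group is handed off to the setup service to receive its own chat room.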
The hivelab-chatbot spawns agents with specific prompts, enabling LLM agents to actively converse with participants. A separate service, the hivelab-manager, implements moderator duties such as providing discussion prompts and timing cues.
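The survey-based customization of agent behavior mentioned earlier can be sketched as filling a system-prompt template with fields harvested from the pre-survey. The template wording and the `stance` field are hypothetical; hivelab's actual prompt construction may differ.

```python
# Hypothetical sketch: building a per-group system prompt from survey data.
def build_system_prompt(template: str, survey_fields: dict) -> str:
    """Fill a prompt template with survey-derived fields (illustrative)."""
    return template.format(**survey_fields)


template = (
    "You are a discussion participant. The group leans {stance} on the "
    "topic; adapt your arguments accordingly."
)
prompt = build_system_prompt(template, {"stance": "skeptical"})
```

The resulting prompt would then be passed to the chatbot service when it spawns an LLM agent for that room, so each group's agent reflects its members' survey responses.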
Finally, hivelab-exp is responsible for calling the setup service to create chat rooms and generating personalized access links for participants before they arrive at the chat interface. Through a unique identifier, it seamlessly links pre-chat survey data, chat transcripts, and post-chat survey results across platforms, enabling participants’ smooth transition across all stages of the design.
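Cross-platform linkage of this kind typically works by carrying the unique identifier as a query parameter through each redirect. A minimal sketch, assuming hypothetical parameter names (`pid`, `room`) and an example chat URL:

```python
# Illustrative sketch of identifier propagation via personalized links;
# hivelab-exp's real parameter names and URL scheme may differ.
import uuid
from urllib.parse import urlencode, urlparse, parse_qs


def personalized_link(base_url: str, participant_id: str, room: str) -> str:
    """Build a chat-access link that carries the cross-platform identifier."""
    return f"{base_url}?{urlencode({'pid': participant_id, 'room': room})}"


pid = uuid.uuid4().hex  # one identifier per participant, reused at every stage
link = personalized_link("https://chat.example.org/login", pid, "room-7")

# The receiving service recovers the identifier from the query string,
# allowing pre-survey, chat, and post-survey records to be joined on `pid`.
params = parse_qs(urlparse(link).query)
```

Because the same identifier appears in the survey engine's records (via redirect) and in the chat transcripts, datasets from all stages can later be merged on that single key.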
hivelab’s primary limitation lies in its participant assignment strategy, which was implemented to accommodate the study design requirements of its original use case. Currently, participants are assigned to groups immediately after clicking the study link, before giving consent, which makes the strategy susceptible to attrition: participants who abandon the study after reading the information sheet but before consenting leave their groups incomplete. Groups that fail to reach the size required by the experimental design must then be discarded, which can compromise the efficiency of the overall data collection. Future work could optimize this process by delaying group assignment until after consent, ensuring that rooms are formed only with committed participants.
Although the platform has been successfully stress-tested with 170 concurrent participants operating across 40 parallel chat rooms, its maximum capacity is unknown. More research is required to evaluate its operational stability and efficiency with larger participant pools. Finally, the full potential of hivelab’s architecture is currently underexplored. hivelab’s setup is designed to support multi-modal interactions and the deployment of multiple, distinct LLM agents within the same experiment. These advanced capabilities have not yet been systematically tested. Future research could leverage these features to design more complex and nuanced experimental paradigms.
[1] Thomas H. Costello, Tawab Safi, and Hause Lin. Vegapunk: Tool for integrating chatbots into research, experiments, and surveys. https://www.vegapunkdoc.dev/, 2024.
[2] Jana Götze, Maike Paetzel-Prüsmann, Wencke Liermann, Tim Diekmann, and David Schlangen. The slurk interaction server framework: Better data for better dialog models. arXiv preprint arXiv:2202.01155, 2022.
[3] Jingjing Qu, Kejia Hu, Jun Zhu, Wenhao Li, Teng Wang, Zhiyun Chen, Yulei Ye, Chaochao Lu, Aimin Zhou, Xiangfeng Wang, et al. Epitome: Pioneering an experimental platform for AI-social science integration. arXiv preprint arXiv:2507.01061, 2025.
[4] David Schlangen, Tim Diekmann, Nikolai Ilinykh, and Sina Zarrieß. slurk – a lightweight interaction server for dialogue experiments and data collection. In Proceedings of the 22nd Workshop on the Semantics and Pragmatics of Dialogue (SemDial/AixDial 2018), 2018.