SHARECHAT is a large-scale corpus of authentic user-LLM conversations sourced directly from publicly shared URLs across five major chatbot platforms. Unlike existing datasets that homogenize interactions through uniform interfaces, SHARECHAT preserves native platform affordances and captures real-world usage patterns (hence "in the wild"). More details can be found in our paper: ShareChat: A Dataset of Chatbot Conversations in the Wild. The dataset is available on Hugging Face: ShareChat.
While much existing research treats Large Language Models (LLMs) as generic text generators, they are often deployed as distinct commercial chatbots whose unique interfaces and capabilities fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic human-chatbot interactions.
SHARECHAT addresses these limitations by:
- Preserving Native Affordances: Captures platform-specific features like citations, thinking traces, and code artifacts
- Multi-Platform Coverage: Spans five major platforms with distinct design philosophies
- Authentic Usage: Sourced from voluntarily shared conversations, reducing observer bias
- Extended Interactions: Substantially longer conversations than prior datasets (avg. 4.62 turns vs. 2.02 in LMSYS-Chat-1M)
- Linguistic Diversity: Covers 101 distinct languages
| Metric | Value |
|---|---|
| Total Conversations | 142,808 |
| Total Turns | 660,293 |
| Average Turns per Conversation | 4.62 |
| Languages Covered | 101 |
| Collection Period | April 2023 – October 2025 |
| Avg. User Tokens | 135.04 ± 1,820.88 |
| Avg. Chatbot Tokens | 1,115.30 ± 1,764.81 |
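As a quick sanity check, the average-turns statistic above can be recomputed from the released per-message records. The sketch below uses pandas on a toy frame that mimics the shared schema (column names are from the schema description; the data itself is made up, and each message is counted as one turn, which may differ from the paper's exact turn definition).

```python
import pandas as pd

# Toy stand-in for the released per-message CSV: real column names, fake data.
df = pd.DataFrame({
    "platform": ["ChatGPT", "ChatGPT", "ChatGPT", "ChatGPT", "Grok", "Grok"],
    "url": ["c1", "c1", "c1", "c1", "c2", "c2"],
    "role": ["user", "assistant", "user", "assistant", "user", "assistant"],
    "message_index": [0, 1, 2, 3, 0, 1],
})

# Messages per conversation, then the mean across conversations.
turns_per_conv = df.groupby("url").size()
avg_turns = turns_per_conv.mean()
print(avg_turns)  # (4 + 2) / 2 = 3.0
```

On the real data, the same group-and-mean over `url` should reproduce the per-platform averages in the table above.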
| Platform | Conversations | Turns | Avg. Turns | Languages |
|---|---|---|---|---|
| ChatGPT | 102,740 | 542,148 | 5.28 | 101 |
| Perplexity | 17,305 | 24,378 | 1.41 | 45 |
| Grok | 14,415 | 53,094 | 3.69 | 60 |
| Gemini | 7,402 | 36,422 | 4.92 | 47 |
| Claude | 946 | 4,251 | 4.49 | 19 |
Token statistics are computed using the Llama-2 tokenizer for consistent cross-platform comparison.
Conversations were collected from publicly shared URLs discovered via Internet archival services (Wayback Machine).
| Platform | Share URL Format | Collection Period |
|---|---|---|
| ChatGPT | chatgpt.com/share/* | May 2023 – Aug 2025 |
| Perplexity | perplexity.ai/search/* | Apr 2023 – Oct 2025 |
| Grok | grok.com/share/* | Dec 2024 – Oct 2025 |
| Gemini | gemini.google.com/share/* | Apr 2024 – Sep 2025 |
| Claude | claude.ai/share/* | — |
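The share-URL formats above can be matched with simple prefix patterns. The helper below is an illustrative sketch using Python's `re` module, not part of the released tooling; the patterns follow the Share URL Format column.

```python
import re

# One pattern per platform, following the Share URL Format column.
PLATFORM_PATTERNS = {
    "ChatGPT": re.compile(r"^https?://chatgpt\.com/share/"),
    "Perplexity": re.compile(r"^https?://(www\.)?perplexity\.ai/search/"),
    "Grok": re.compile(r"^https?://grok\.com/share/"),
    "Gemini": re.compile(r"^https?://gemini\.google\.com/share/"),
    "Claude": re.compile(r"^https?://claude\.ai/share/"),
}

def classify_share_url(url: str):
    """Return the platform name for a share URL, or None if unrecognized."""
    for platform, pattern in PLATFORM_PATTERNS.items():
        if pattern.match(url):
            return platform
    return None

print(classify_share_url("https://chatgpt.com/share/abc123"))  # ChatGPT
```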
Different platforms capture distinct metadata and structural elements:
| Feature | ChatGPT | Perplexity | Grok | Gemini | Claude |
|---|---|---|---|---|---|
| Textual Content | ✓ | ✓ | ✓ | ✓ | ✓ |
| Source Citations | – | ✓ | ✓ | – | – |
| Thinking Blocks | – | – | ✓ | – | ✓ |
| Code Artifacts | – | – | – | – | ✓ |
| Analysis Blocks | – | – | – | – | ✓ |
| Turn Timestamps | ✓ | – | ✓ | – | – |
| Model Version | ✓ | – | ✓ | ✓ | – |
| View/Share Counts | – | ✓ | – | – | – |
IRB Approval: Data collection conducted under IRB approval (#28569).
We prioritize user privacy through a rigorous de-identification pipeline. First, we employed Microsoft's Presidio as the core framework to identify and remove personally identifiable information across multiple data types:
- Names and personal identifiers
- Phone numbers
- Email addresses
- Credit card numbers
- URLs and web addresses
- Other sensitive identifiers
PII detection covers conversations in:
- English, Spanish, German, French, Italian, Portuguese, Dutch, Chinese, Japanese, Russian, and Hebrew.
Note: For the released dataset, we retain only conversations in the supported languages listed here and provide a separate URL list for conversations in other languages.
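To illustrate the shape of the de-identification step, here is a minimal regex-based sketch of entity scrubbing. Note that the actual pipeline uses Microsoft's Presidio, not these hand-rolled patterns, which cover only emails and simple phone numbers.

```python
import re

# Simplified stand-in for Presidio's analyzer + anonymizer: replace a few
# easy entity types with placeholder tags. Real PII detection is far broader.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub(text: str) -> str:
    """Replace each detected entity with a <TYPE> placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(scrub("Contact me at jane@example.com or +1 555-123-4567."))
# Contact me at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```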
We then used GPT-OSS-120B to assess the accuracy of PII identification by verifying that PII had been successfully removed from each message. Removal success rates by platform are:
| Platform | Success Rate | Records with PII | Total Records |
|---|---|---|---|
| ChatGPT | 95.20% | 51,041 | 1,062,949 |
| Claude | 97.01% | 252 | 8,504 |
| Gemini | 95.43% | 3,302 | 72,746 |
| Grok | 94.15% | 6,010 | 106,168 |
| Perplexity | 94.42% | 2,899 | 54,355 |
Lastly, to validate detection accuracy, we manually coded 50 randomly selected conversations (288 turns) that were flagged as containing PII. We observe that Presidio is rather conservative.
- Original platform-specific user IDs and usernames are not stored or released
- Analyses are conducted on aggregated statistics only
The dataset is released in CSV format for ease of use and accessibility.
Note: Raw HTML/MHTML archives are not available in the current release.
Each conversation record contains:
- Complete sequence of user and assistant turns
- Platform-specific metadata:
- Timestamps (ChatGPT, Grok)
- Model version information (ChatGPT, Grok, Gemini)
- Source citations (Perplexity, Grok)
- Thinking traces (Claude, Grok)
The final released DataFrames provide turn-level conversation records from five platforms with a shared core schema, where each row is one message. All datasets include platform, url, turns_count, message_index, role, plain_text, and detected_language_final, enabling consistent cross-platform analysis of conversation structure, content, and language. Platform-specific metadata is kept in additional columns:
- Claude: thinking, code, analysis, and version
- Gemini: model, plus two timestamps, created_at and published_at
- Grok: per-message timing and provenance via message_create_time, links, source, model, and last_updated, as well as thinking
- Perplexity: citation and engagement context via source_bar, source, last_updated, views, shares, and other_info
- GPT (ChatGPT): model, along with both a per-message timestamp (message_create_time) and a conversation-level timestamp (create_time)
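Under the shared core schema, reconstructing a full conversation from the per-message rows is a group-and-sort. A sketch with pandas (toy data, real column names):

```python
import pandas as pd

# Toy per-message records using the shared core schema columns.
df = pd.DataFrame({
    "platform": ["Grok", "Grok", "Grok"],
    "url": ["https://grok.com/share/abc"] * 3,
    "turns_count": [3, 3, 3],
    "message_index": [1, 0, 2],  # deliberately out of order
    "role": ["assistant", "user", "user"],
    "plain_text": ["Hi! How can I help?", "Hello", "What is ShareChat?"],
    "detected_language_final": ["en"] * 3,
})

# Group rows by conversation URL and order messages by message_index.
conversations = {
    url: grp.sort_values("message_index")[["role", "plain_text"]]
            .to_dict("records")
    for url, grp in df.groupby("url")
}

print(conversations["https://grok.com/share/abc"][0])
# {'role': 'user', 'plain_text': 'Hello'}
```

The same pattern applies to all five platform CSVs, since they share the core columns; platform-specific columns can simply be carried along in the column selection.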
- You must not attempt to identify the identities of individuals or infer any sensitive personal data encompassed in this dataset.
- When leveraging direct outputs of a specific model, users must adhere to its corresponding terms of use.
- The views and opinions depicted in this dataset do not reflect the perspectives of the researchers or affiliated institutions engaged in the data collection process.
If you use SHARECHAT in your research, please cite our paper:
@misc{yan2026sharechatdatasetchatbotconversations,
title={ShareChat: A Dataset of Chatbot Conversations in the Wild},
author={Yueru Yan and Tuc Nguyen and Bo Su and Melissa Lieffers and Thai Le},
year={2026},
eprint={2512.17843},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.17843},
}

For technical details about the data extraction process and field definitions for each platform, see the platform-specific documentation: