
SHARECHAT: A Dataset of Chatbot Conversations in the Wild


SHARECHAT is a large-scale corpus of authentic user-LLM conversations sourced directly from publicly shared URLs across five major chatbot platforms. Unlike existing datasets that homogenize interactions through uniform interfaces, SHARECHAT preserves native platform affordances and captures real-world usage patterns (hence "in the wild"). More details can be found in our paper: ShareChat: A Dataset of Chatbot Conversations in the Wild. The dataset is available on Hugging Face: ShareChat.

Overview

While much existing research treats Large Language Models (LLMs) as generic text generators, in practice they are deployed as distinct commercial chatbots with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces, failing to capture authentic human-chatbot interactions.

SHARECHAT addresses these limitations by:

  • Preserving Native Affordances: Captures platform-specific features like citations, thinking traces, and code artifacts
  • Multi-Platform Coverage: Spans five major platforms with distinct design philosophies
  • Authentic Usage: Sourced from voluntarily shared conversations, reducing observer bias
  • Extended Interactions: Substantially longer conversations than prior datasets (avg. 4.62 turns vs. 2.02 in LMSYS-Chat-1M)
  • Linguistic Diversity: Covers 101 distinct languages

Dataset Statistics

Metric Value
Total Conversations 142,808
Total Turns 660,293
Average Turns per Conversation 4.62
Languages Covered 101
Collection Period April 2023 – October 2025
Avg. User Tokens 135.04 ± 1,820.88
Avg. Chatbot Tokens 1,115.30 ± 1,764.81

Per-Platform Breakdown

Platform Conversations Turns Avg. Turns Languages
ChatGPT 102,740 542,148 5.28 101
Perplexity 17,305 24,378 1.41 45
Grok 14,415 53,094 3.69 60
Gemini 7,402 36,422 4.92 47
Claude 946 4,251 4.49 19
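As a quick consistency check, the per-platform counts above sum to the overall totals in the statistics table (the language counts do not sum, since languages overlap across platforms):

```python
# Per-platform (conversations, turns), taken from the table above.
platforms = {
    "ChatGPT": (102_740, 542_148),
    "Perplexity": (17_305, 24_378),
    "Grok": (14_415, 53_094),
    "Gemini": (7_402, 36_422),
    "Claude": (946, 4_251),
}

total_convs = sum(c for c, _ in platforms.values())
total_turns = sum(t for _, t in platforms.values())

print(total_convs)                           # 142808 conversations
print(total_turns)                           # 660293 turns
print(round(total_turns / total_convs, 2))   # 4.62 average turns
```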

Token statistics computed using the Llama-2 tokenizer for consistent cross-platform comparison.

Data Collection

Conversations were collected from publicly shared URLs discovered via Internet archival services (Wayback Machine).

Platform Share URL Format Collection Period
ChatGPT chatgpt.com/share/* May 2023 – Aug 2025
Perplexity perplexity.ai/search/* Apr 2023 – Oct 2025
Grok grok.com/share/* Dec 2024 – Oct 2025
Gemini gemini.google.com/share/* Apr 2024 – Sep 2025
Claude claude.ai/share/*

Different platforms capture distinct subsets of metadata and structural elements, including:

  • Textual content
  • Source citations
  • Thinking blocks
  • Code artifacts
  • Analysis blocks
  • Turn timestamps
  • Model version
  • View/share counts

IRB Approval: Data collection conducted under IRB approval (#28569).

Privacy and PII Removal

We prioritize user privacy through a rigorous de-identification pipeline. We first employed Microsoft's Presidio as the core framework to identify and remove personally identifiable information across multiple data types:

  • Names and personal identifiers
  • Phone numbers
  • Email addresses
  • Credit card numbers
  • URLs and web addresses
  • Other sensitive identifiers
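The pipeline itself relies on Presidio; purely as an illustration of the redaction idea (not the actual implementation), a minimal regex-based pass replacing matches with typed placeholders might look like this. The patterns and placeholder names here are hypothetical and far less robust than Presidio's recognizers:

```python
import re

# Hypothetical minimal patterns; the real pipeline uses Microsoft's
# Presidio, which handles many more entity types and languages.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "URL": re.compile(r"https?://\S+"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder, e.g. <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Contact jane.doe@example.com or +1 555-123-4567"))
# Contact <EMAIL> or <PHONE>
```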

PII detection covers conversations in:

  • English, Spanish, German, French, Italian, Portuguese, Dutch, Chinese, Japanese, Russian, and Hebrew.

Note: For the released dataset, we retain only conversations in the supported languages listed here and provide a separate URL list for conversations in other languages.

We then used GPT-OSS-120B to assess the accuracy of PII identification by verifying that PII had been successfully removed from each message. Removal success rates by platform:

Platform Success Rate Records with PII Total Records
ChatGPT 95.20% 51,041 1,062,949
Claude 97.01% 252 8,504
Gemini 95.43% 3,302 72,746
Grok 94.15% 6,010 106,168
Perplexity 94.42% 2,899 54,355

Lastly, to validate detection accuracy, we manually coded 50 randomly selected conversations (288 turns) that were flagged as containing PII. We observed that Presidio is rather conservative.

Additional Privacy Measures

  • Original platform-specific user IDs and usernames are not stored or released
  • Analyses are conducted on aggregated statistics only

Data Format

Available Files

The dataset is released in CSV format for ease of use and accessibility.

Note: Raw HTML/MHTML archives are not available in the current release.

CSV Structure

Each conversation record contains:

  • Complete sequence of user and assistant turns
  • Platform-specific metadata:
    • Timestamps (ChatGPT, Grok)
    • Model version information (ChatGPT, Grok, Gemini)
    • Source citations (Perplexity, Grok)
    • Thinking traces (Claude, Grok)

The final released DataFrames provide turn-level conversation records from five platforms with a shared core schema, where each row is one message. All datasets include platform, url, turns_count, message_index, role, plain_text, and detected_language_final, enabling consistent cross-platform analysis of conversation structure, content, and language. Platform-specific metadata is kept in additional columns:

  • Claude: thinking, code, analysis, and version
  • Gemini: model, plus two timestamps (created_at, published_at)
  • Grok: per-message timing and provenance via message_create_time, links, source, model, and last_updated, as well as thinking
  • Perplexity: citation and engagement context via source_bar, source, last_updated, views, shares, and other_info
  • GPT: model, along with both a per-message timestamp (message_create_time) and a conversation-level timestamp (create_time)

Caution

  • You must not attempt to identify individuals or infer any sensitive personal data contained in this dataset.
  • When using direct outputs of a specific model, you must adhere to that model's corresponding terms of use.
  • The views and opinions expressed in this dataset do not reflect those of the researchers or affiliated institutions involved in the data collection process.

Citation

If you use SHARECHAT in your research, please cite our paper:

@misc{yan2026sharechatdatasetchatbotconversations,
      title={ShareChat: A Dataset of Chatbot Conversations in the Wild}, 
      author={Yueru Yan and Tuc Nguyen and Bo Su and Melissa Lieffers and Thai Le},
      year={2026},
      eprint={2512.17843},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.17843}, 
}

Appendix: Detailed Platform Documentation

For technical details about the data extraction process and field definitions for each platform, see the platform-specific documentation.
