Skip to content

Add ShareChat Dataset#18

Open
raye22 wants to merge 2 commits intomlabonne:mainfrom
raye22:main
Open

Add ShareChat Dataset#18
raye22 wants to merge 2 commits intomlabonne:mainfrom
raye22:main

Conversation

@raye22
Copy link
Copy Markdown

@raye22 raye22 commented Feb 4, 2026

A new published human-LLM conversation dataset, collected from five different platforms: ChatGPT, Claude, Gemini, Grok, and Perplexity.

@mlabonne
Copy link
Copy Markdown
Owner

mlabonne commented Feb 4, 2026

Thanks for the addition! I don't see how to access the conversations, however. I only see a "plain_text" column (but it's only partial) and the links in "url" don't work for me (e.g., https://chatgpt.com/share/67770737-9244-8006-9592-8e08ab2df3ec returns a 404 error).

Do you know how to get the conversations?

@raye22
Copy link
Copy Markdown
Author

raye22 commented Feb 4, 2026

Thanks so much for the quick response!

To access the full conversations, you need to group the rows by the url column. Within each URL group, you can use the message index and role columns to sequence the interaction between the LLM and the user. And regarding the data quality points you raised:

  1. I checked the empty rows for the plain_text column, and they currently account for less than 1% of the data for most platforms (exception for 2% for grok).
  2. And for the broken urls, the scraping was performed a couple of months ago. Some links (like the one you tested) may now return a 404 error if the original user deleted or made the chat private in the meantime. However, I manually checked a random sample of the ChatGPT URLs, and about 70% are still active.

Given the unique multi-platform and longer interaction data included here, I believe this would be a valuable resource for the community and a great addition to your repository!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants