Welcome to LexiconPrime, a project that aims to provide free, high-quality embeddings!

| Dataset | Description | Link |
|---|---|---|
| 20News | Collection of newsgroup documents classified into different topics | 20News |
| BBC News Dataset | News articles categorized into different topics by BBC | BBC News Dataset |
| Kpris Dataset | Large Korean text dataset covering various domains | Kpris Dataset |

| Dataset | Description | Link |
|---|---|---|
| Common Crawl | Web dataset with a vast collection of web pages | Common Crawl |
| Wikipedia | Dump of all Wikipedia articles | Wikipedia |
| OpenWebText | Large dataset of web pages from diverse domains | OpenWebText |
| BookCorpus | Dataset with text from over 11,000 books | BookCorpus |
| Google News Dataset | Collection of news articles from various sources | Google News Dataset |
| PubMed | Repository of biomedical literature | PubMed |
| Twitter Datasets | Datasets capturing tweets from Twitter | Twitter Datasets |
| Reddit Datasets | Datasets derived from the Reddit social media platform | Reddit Datasets |
| Stack Exchange | Network of question-and-answer websites | Stack Exchange Data Dump |
| Yelp Dataset | Dataset with reviews and ratings for businesses | Yelp Dataset |
| GPT-3 Generated Text | Text generated by the GPT-3 language model | GPT-3 Generated Text |
| Amazon Reviews | Dataset containing product reviews from Amazon | Amazon Reviews |
| IMDB Movie Reviews | Dataset of movie reviews from IMDB | IMDB Movie Reviews |
| Reuters News Dataset | Collection of news articles from Reuters | Reuters News Dataset |
| AG's News Topic Classification | News articles classified into different topics | AG's News Dataset |
| WikiText-103 | Large-scale dataset of Wikipedia articles for language modeling | WikiText-103 |
| ArXiv Dataset | Research papers from various disciplines on arXiv.org | ArXiv Dataset |
| EuroParl Corpus | Parallel corpus of European Parliament proceedings | EuroParl Corpus |
| Wikidata | Collaboratively edited structured knowledge base used by Wikipedia and other Wikimedia projects | Wikidata |
| GitHub Repositories | Collection of code and README files from GitHub repositories | GitHub Archive |