Large language models (LLMs) are a type of artificial intelligence (AI) model trained on massive amounts of text data, such as books, articles, code, and websites. From this data, LLMs learn the patterns and structures of language, which allows them to perform a variety of tasks, including:
- Generating text
- Translating languages
- Answering questions
- Summarizing text
LLMs are still under active development, but they have the potential to change the way we interact with computers. For example, LLMs could power chatbots that hold more natural and engaging conversations with humans. They could also help produce creative content, such as poems, stories, and code.
There are a number of general open access datasets that can be used to train and evaluate LLMs. These datasets include:
- CommonCrawl: A massive dataset of web pages
- Wikipedia: A free online encyclopedia
- LibriSpeech: A corpus of read English speech derived from audiobooks, with text transcripts
- OpenSubtitles: A corpus of movie and TV subtitles
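The following type tags can be used to classify LLMs. The tags are not mutually exclusive; for example, an encoder-decoder model can also be a transformer: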
- Generative: LLMs that can generate text, translate languages, and answer questions.
- Discriminative: LLMs that can classify text and identify patterns in text.
- Encoder-decoder: LLMs that use an encoder to convert text into a hidden representation, and a decoder to convert the hidden representation back into text.
- Transformer: LLMs that use the transformer architecture, which is a type of neural network that is well-suited for natural language processing tasks.

Each dataset can be described by the following fields, shown here with example values:
- Number of examples: 100 million
- Number of tokens: 1 trillion
- Dataset splits:
  - Train: 80% 🚂
  - Validation: 10% 🧪
  - Test: 10% 🏁
- Data formats:
  - Text: Plain text files, one example per line. 📄
  - JSONL: JSON Lines format, with each example represented as a JSON object. 🏷️
  - TFRecord: TensorFlow Record format. ⚙️
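
The 80/10/10 split and the JSONL format described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the function names `split_examples` and `to_jsonl`, the `"text"` field name, and the fixed shuffle seed are all assumptions made for the example.

```python
import json
import random


def split_examples(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle examples and split them into train/validation/test lists.

    Whatever is left after the train and validation slices becomes the test set.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line.

    The "text" key is a hypothetical field name chosen for this sketch.
    """
    return "\n".join(json.dumps({"text": text}, ensure_ascii=False) for text in examples)


# Toy corpus standing in for a real dataset.
examples = [f"example sentence {i}" for i in range(100)]
splits = split_examples(examples)

for name, rows in splits.items():
    print(name, len(rows))

# First two validation examples in JSONL form.
print(to_jsonl(splits["validation"][:2]))
```

Shuffling before slicing keeps the three splits drawn from the same distribution; for very large corpora a hash-based assignment per example is a common alternative, since it avoids holding everything in memory.
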
Dataset | Description | Number of examples | Number of tokens | Dataset splits | Data formats |
---|---|---|---|---|---|
CommonCrawl | A massive dataset of web pages. | 200 billion | 400 trillion | Train, validation, test | Text, JSONL, TFRecord |
Wikipedia | A free online encyclopedia. | 3 billion | 6 trillion | Train, validation, test | Text, JSONL, TFRecord |
LibriSpeech | A corpus of read English speech derived from audiobooks, with text transcripts. | 1 million | 10 billion | Train, validation, test | Text, JSONL, TFRecord |
OpenSubtitles | A corpus of movie and TV subtitles. | 2 billion | 20 trillion | Train, validation, test | Text, JSONL, TFRecord |
CodeSearchNet | A corpus of code snippets. | 100 million | 1 trillion | Train, validation, test | Text, JSONL, TFRecord |
Pile | A dataset of text and code from a variety of sources, including books, articles, code, and websites. | 800 billion | 80 trillion | Train, validation, test | Text, JSONL, TFRecord |
LAION-2B-en | A dataset of text and images. | 2 billion text-image pairs | 40 trillion | Train, validation, test | Text, image |
P3 | A collection of prompted English datasets covering a diverse set of NLP tasks. | 1 million prompts | 10 billion | Train, validation, test | Text, JSONL |
xP3 | A dataset of prompts and datasets across 46 languages & 16 NLP tasks. | 100 million prompts | 1 trillion | Train, validation, test | Text, JSONL |
OpenAssistant Conversations Dataset | A dataset of conversations between users and open assistant systems. | 10 million conversations | 1 billion | Train, validation, test | Text, JSONL |
RedPajama | A dataset of text and code, created by replicating the LLaMA training dataset. | 1 trillion | 100 trillion | Train, validation, test | Text, code |
ROOTS | A multilingual dataset of text from 59 languages. | 100 billion | 1 trillion | Train, validation, test | Text, JSONL, TFRecord |
AI21 Stories | A dataset of human-written stories, used for the AI21 Stories competition. | 10,000 | 10 million | Train, validation, test | Text |
Natural Questions | A dataset of real user questions paired with answers, used to train and evaluate question-answering models. | 1 million | 10 billion | Train, validation, test | Text |
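
One quick sanity check on a table like this is the average number of tokens per example, obtained by dividing the token count by the example count. The sketch below does this for three rows, taking the figures exactly as listed in the table without independent verification; the helper names `parse_count` and `tokens_per_example` are hypothetical.

```python
# Multipliers for the scale words used in the table.
UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9, "trillion": 10**12}


def parse_count(text):
    """Parse a table cell such as '200 billion', '10,000', or
    '2 billion text-image pairs' into an integer."""
    parts = text.replace(",", "").split()
    value = float(parts[0])
    if len(parts) > 1 and parts[1] in UNITS:
        value *= UNITS[parts[1]]
    return int(value)


def tokens_per_example(examples, tokens):
    """Average tokens per example, given the two table cells as strings."""
    return parse_count(tokens) / parse_count(examples)


# (dataset, number of examples, number of tokens), copied from the table.
rows = [
    ("CommonCrawl", "200 billion", "400 trillion"),
    ("Pile", "800 billion", "80 trillion"),
    ("AI21 Stories", "10,000", "10 million"),
]

for name, n_examples, n_tokens in rows:
    print(f"{name}: {tokens_per_example(n_examples, n_tokens):.0f} tokens per example")
```

Ratios that differ by orders of magnitude between similar kinds of data are a hint that the definition of an "example" varies between rows (document, sentence, or pair) or that a figure needs rechecking against the dataset's own documentation.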