
LAB Paper #75

Open · manisnesan opened this issue Apr 23, 2024 · 2 comments

@manisnesan (Owner):
https://huggingface.co/ibm/labradorite-13b

  • Summary of the Article:

    • The article is a model card for Labradorite-13b, a language model developed by IBM Research.
    • Labradorite-13b is based on the LLaMA-2-13b model and has been trained using the LAB (Large-scale Alignment for chatBots) methodology.
    • The model uses synthetic data for alignment tuning and was instructed by the Mixtral-8x7b-Instruct model (a hedged usage sketch follows the citations below).
  • Contributions:

    • Introduces a novel synthetic data-based alignment tuning method for large language models (LLMs).
    • Demonstrates the application of the LAB methodology in training a derivative of the LLaMA-2-13b model.
    • Utilizes a teacher model, Mixtral-8x7b-Instruct, to guide the training process of Labradorite-13b.
  • Implications:

    • The model has not been aligned to human preferences, which may result in problematic outputs[1].
    • It inherits the limitations and constraints of its base model, LLaMA-2, and other models in the Llama 2 family[1].
    • The approach taken for training Labradorite-13b could influence future methods for aligning LLMs with human preferences and reducing biases.
  • Suggested Related Papers:

    • "LLaMA: Open and Efficient Foundation Language Models" which is related to the base model LLaMA-13B[3].
    • "Metharme 13B" which is another instruct model based on Meta's LLaMA-13B, biased towards fiction writing and conversation[2].

Citations:
[1] https://huggingface.co/ibm/labradorite-13b
[2] https://huggingface.co/PygmalionAI/metharme-13b
[3] https://huggingface.co/dfurman/LLaMA-13B
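For reference, here is a minimal usage sketch for the checkpoint linked above, assuming the standard Hugging Face transformers causal-LM interface; the prompt and generation settings are illustrative assumptions, not taken from the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm/labradorite-13b"  # checkpoint from the model card above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to fit the 13B model on a single large GPU
    device_map="auto",
)

prompt = "Summarize the following news article in 2 lines: {News article}"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```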


Originally posted by @manisnesan in #47 (comment)

@manisnesan (Owner, Author):

  • LLMs are typically trained in phases: a self-supervised pre-training phase, followed by supervised alignment tuning phases.
  • Alignment tuning typically happens in two stages: instruction tuning, followed by preference tuning (minimal sketches of both stages follow this list).
  • Instruction tuning is more akin to the traditional model training approach in machine learning, where the model is trained directly on tasks of interest. In this stage, the model is given a task description in the form of a natural language instruction (e.g. Summarize the following news article in 2 lines: {News article}), and the model is trained to maximize the likelihood of the provided ground-truth summary.
  • Preference tuning, on the other hand, is done using techniques such as RLHF and DPO, where a response from the instruction-tuned model is rated as preferred or dispreferred using human feedback.
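To make the two stages concrete, here is a minimal sketch of the instruction-tuning objective described above, assuming the standard Hugging Face/PyTorch convention of masking ignored tokens with -100; `model` and `tokenizer` are assumed to be the causal LM and tokenizer from the usage sketch in the first comment, and the example strings are placeholders:

```python
import torch

# Assumed: `model` and `tokenizer` are the Hugging Face causal LM and
# tokenizer loaded in the earlier usage sketch.
instruction = "Summarize the following news article in 2 lines: {News article}\n"
summary = "{Ground-truth 2-line summary}"

inst_ids = tokenizer(instruction, return_tensors="pt").input_ids
resp_ids = tokenizer(summary, return_tensors="pt", add_special_tokens=False).input_ids

# Concatenate instruction + response; compute the loss only on response tokens.
input_ids = torch.cat([inst_ids, resp_ids], dim=1)
labels = input_ids.clone()
labels[:, : inst_ids.shape[1]] = -100  # -100 = ignored by the loss

# The causal-LM loss is then the negative log-likelihood of the
# ground-truth summary given the instruction.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one supervised gradient step
```

And a companion sketch for the preference-tuning stage: the pairwise DPO loss, written directly from its published formula. The four log-probabilities (summed over response tokens, under the policy and a frozen reference model) are assumed to be precomputed tensors:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin between how much the policy, relative to the reference
    # (instruction-tuned) model, prefers the chosen response over the
    # rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```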

@manisnesan (Owner, Author) commented Apr 23, 2024:

Related

  • Self-Instruct

    • Uses a larger teacher model to generate synthetic data to train a smaller student model, and incorporates principles in the generation prompt to promote diversity in the generated instruction data (see the sketch after this list).
  • Refer to the supplemental material in Self-Instruct.
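As noted above, here is an illustrative sketch of a Self-Instruct-style generation loop: a teacher model is prompted with a few seed tasks plus explicit principles to promote diversity. `teacher_generate` is a hypothetical stand-in for a call to the teacher (e.g. Mixtral-8x7b-Instruct), and the principles and prompt template are assumptions, not the paper's actual wording:

```python
import random

# Hypothetical principles encouraging diverse synthetic instructions.
PRINCIPLES = [
    "Vary the task type (classification, generation, rewriting, QA).",
    "Vary the domain (news, code, science, everyday life).",
    "Do not reuse phrasing from the seed examples.",
]

def build_generation_prompt(seed_tasks, n_new=5):
    # Sample a few seed tasks to serve as in-context examples.
    seeds = "\n".join(f"- {t}" for t in random.sample(seed_tasks, k=min(3, len(seed_tasks))))
    rules = "\n".join(f"* {p}" for p in PRINCIPLES)
    return (
        f"Here are some example tasks:\n{seeds}\n\n"
        f"Following these principles:\n{rules}\n\n"
        f"Write {n_new} new, diverse instructions, each with an answer."
    )

seed_tasks = ["Summarize the following news article in 2 lines: {News article}"]
# `teacher_generate` is hypothetical: any call that sends the prompt to the
# teacher model and returns its text completion.
synthetic_batch = teacher_generate(build_generation_prompt(seed_tasks))
# The returned instruction/response pairs would then be filtered and used
# to instruction-tune the smaller student model.
```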
