A curated list of papers on 'Constitutional AI' and 'AI under Ethical Guidelines'. This repository is maintained for personal study and is updated regularly. Recommendations of related work are always welcome.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Anthropic [Link] arXiv, Nov 2022
-
Constitutional AI: Harmlessness from AI Feedback
Anthropic [Link] arXiv, Dec 2022
-
Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences
Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, Yejin Choi [Link] EMNLP 2021
-
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, Soroush Vosoughi [Link] NeurIPS 2022
-
The Capacity for Moral Self-Correction in Large Language Models
Anthropic [Link] arXiv, Feb 2023
-
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan [Link] arXiv, May 2023
-
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo [Link] arXiv, Oct 2023
-
Generating Summaries with Controllable Readability Levels
Leonardo F. R. Ribeiro, Mohit Bansal, Markus Dreyer [Link] arXiv, Oct 2023
-
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging
Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu [Link] arXiv, Oct 2023
-
Collective Constitutional AI: Aligning a Language Model with Public Input
Anthropic [Link] arXiv, Oct 2023
-
Specific versus General Principles for Constitutional AI
Anthropic [Link] arXiv, Oct 2023