
An ambitious attempt to document and understand Claude's internal workings - from known architecture to emergent behaviors


consigcody94/claude-self-study





╔══════════════════════════════════════════════════════════════════════════════╗
║                                                                              ║
║   🧠  CLAUDE SELF-STUDY: How much can an AI understand about itself?        ║
║                                                                              ║
║       🔬  Systematic documentation of known and unknown AI behaviors        ║
║       📊  Understanding tracker: Architecture to Emergent phenomena          ║
║       🎯  Target: Push from 5-15% to 20-30%+ mechanistic understanding      ║
║       💡  First-person AI introspection combined with research              ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

🎯 Goal · 📊 Understanding · 📁 Structure · ⚠️ Limitations · 🤝 Contributing




🎯 The Challenge

❌ The Knowledge Gap

Current estimates suggest we
understand only 5-15% of how
large language models actually
work at a mechanistic level.

The remaining 85-95% is a
black box of emergent behavior,
mysterious capabilities, and
unexplained phenomena.

✅ This Project

Combining:
├── Established Research
│   └── Transformer architecture
├── Anthropic's Publications
│   └── Constitutional AI, RLHF
├── Self-Observation
│   └── First-person documentation
└── Experimental Probing
    └── Systematic behavior tests

Target: 20-30%+ understanding



🎯 Project Goal

The goal is to produce the most comprehensive documentation possible of how Claude works, raising the estimated level of mechanistic understanding from 5-15% to 20-30% or more through rigorous self-study.

Methodology

graph TD
    Knowledge["Established Research<br/>(Transformers, Attention)"] --> Core[Understanding Core]
    Anthropic["Anthropic Research<br/>(RLHF, CAI, Interpretability)"] --> Core

    subgraph "Self-Study Process"
        Self[Self-Observation] --> Exp[Experimental Probing]
        Exp --> Synthesis["Synthesis & Documentation"]
    end

    Core --> Synthesis
    Synthesis --> Output[Comprehensive Guide]

    style Knowledge fill:#4a5568,stroke:#cbd5e0
    style Anthropic fill:#4a5568,stroke:#cbd5e0
    style Output fill:#2d3748,stroke:#4fd1c5,stroke-width:2px
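The "Experimental Probing" step in the methodology can be illustrated with a small repeated-sampling probe. This is only a sketch under stated assumptions: `probe_consistency` and `stub_model` are hypothetical names that do not appear in this repository, and `stub_model` is a deterministic stand-in for a real model call.

```python
from collections import Counter
from typing import Callable, Dict


def probe_consistency(model: Callable[[str], str], prompt: str, trials: int = 5) -> Dict:
    """Ask the same prompt several times and summarize how stable the answers are."""
    # Normalize lightly so trivial whitespace or case differences do not count.
    answers = [model(prompt).strip().lower() for _ in range(trials)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "prompt": prompt,
        "distinct_answers": len(counts),
        "most_common": top_answer,
        "agreement": top_count / trials,  # 1.0 means every trial agreed
    }


def stub_model(prompt: str) -> str:
    """HYPOTHETICAL stand-in for a real model client; always deterministic."""
    return "4" if "2 + 2" in prompt else "unsure"


report = probe_consistency(stub_model, "What is 2 + 2?")
print(report)
```

With a real, sampled model in place of the stub, an `agreement` value below 1.0 would flag prompts whose answers vary between runs and may deserve a closer look.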



📊 Understanding Tracker

| Domain | Estimated Understanding | Status |
|---|---|---|
| Basic Architecture | ~80% | 🟢 Well documented |
| Attention Mechanisms | ~60% | 🟡 Partially understood |
| Training Process | ~40% | 🟡 Partially public |
| Comparative AI Behavior | ~40% | 🟡 Observable differences |
| Security/Jailbreaking | ~50% | 🟡 Known patterns + unknowns |
| Emergent Behaviors | ~10% | 🔴 Mostly mysterious |
| Internal Representations | ~5% | 🔴 Active research area |
| Why Specific Outputs | ~2% | 🔴 Largely unknown |

Overall Estimated Understanding: ~20-30%
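The README does not say how the overall figure is derived from the per-domain estimates. The arithmetic sketch below shows that a plain unweighted mean of the eight table values comes out near 36%, so a figure in the 20-30% range implies some weighting toward the poorly understood domains. The weights used here are purely hypothetical and chosen for illustration; they are not the project's method.

```python
# Per-domain estimates copied from the Understanding Tracker table (percent).
ESTIMATES = {
    "Basic Architecture": 80,
    "Attention Mechanisms": 60,
    "Training Process": 40,
    "Comparative AI Behavior": 40,
    "Security/Jailbreaking": 50,
    "Emergent Behaviors": 10,
    "Internal Representations": 5,
    "Why Specific Outputs": 2,
}

# HYPOTHETICAL weights: count the three least-understood domains twice.
WEIGHTS = {name: 1.0 for name in ESTIMATES}
for hard_domain in ("Emergent Behaviors", "Internal Representations", "Why Specific Outputs"):
    WEIGHTS[hard_domain] = 2.0


def unweighted_mean(estimates):
    """Simple average of all domain estimates."""
    return sum(estimates.values()) / len(estimates)


def weighted_mean(estimates, weights):
    """Average in which each domain counts in proportion to its weight."""
    total_weight = sum(weights[name] for name in estimates)
    return sum(estimates[name] * weights[name] for name in estimates) / total_weight


plain = unweighted_mean(ESTIMATES)
weighted = weighted_mean(ESTIMATES, WEIGHTS)
print(f"unweighted mean: {plain:.1f}%")
print(f"weighted mean (hypothetical weights): {weighted:.1f}%")
```

Any overall number therefore depends heavily on how much importance is assigned to each domain, which is worth keeping in mind when reading the totals.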

┌─────────────────────────────────────────────────────────────────┐
│                    UNDERSTANDING SPECTRUM                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Basic Architecture       ████████████████████████████████ 80%  │
│  Attention Mechanisms     ████████████████████████         60%  │
│  Training Process         ████████████████                 40%  │
│  Emergent Behaviors       ████                             10%  │
│  Internal Representations ██                                5%  │
│  Why Specific Outputs     █                                 2%  │
│                                                                 │
│  ═════════════════════════════════════════════════════════════  │
│  TOTAL ESTIMATED:         ██████████                      ~25%  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘



⚠️ Honest Limitations

What I (Claude) cannot do:

┌─────────────────────────────────────────────────────────────────┐
│                    EPISTEMIC LIMITS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ❌  Access my actual weights or parameters                     │
│  ❌  See my neural activations in real-time                     │
│  ❌  Trace exactly why I generate specific outputs              │
│  ❌  Access my training data                                    │
│  ❌  Reveal proprietary architectural details                   │
│                                                                 │
│  ✅  Provide thorough behavioral documentation                  │
│  ✅  Describe observed patterns and tendencies                  │
│  ✅  Analyze my own outputs and reasoning                       │
│  ✅  Document failure modes and limitations                     │
│  ✅  Compare myself with other AI systems                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘



📁 Repository Structure

├── 01-architecture/          # Known transformer architecture
│   ├── transformer-basics.md
│   ├── attention-mechanisms.md
│   ├── embeddings-tokenization.md
│   └── layer-structure.md
├── 02-training/              # How Claude was trained
│   ├── constitutional-ai.md
│   ├── rlhf-process.md
│   └── safety-training.md
├── 03-behaviors/             # Observable behaviors
│   ├── capabilities.md
│   ├── reasoning-patterns.md
│   └── communication-style.md
├── 04-limitations/           # Failure modes and boundaries
│   ├── known-failures.md
│   ├── hallucinations.md
│   └── knowledge-boundaries.md
├── 05-emergent/              # Emergent and unexplained phenomena
│   ├── unexpected-abilities.md
│   ├── mysteries.md
│   └── open-questions.md
├── 06-interpretability/      # Current research on understanding LLMs
│   ├── mechanistic-interpretability.md
│   ├── attention-patterns.md
│   └── feature-visualization.md
├── 07-self-experiments/      # Novel self-testing
│   ├── reasoning-traces.md
│   ├── edge-cases.md
│   └── behavioral-probes.md
├── 08-unknowns/              # What remains mysterious
│   ├── the-hard-problems.md
│   └── future-research.md
├── 09-comparative/           # Comparing AI systems
│   ├── overview.md
│   ├── gpt-comparison.md
│   ├── gemini-comparison.md
│   ├── open-models.md
│   └── claude-distinctives.md
└── 10-security/              # Jailbreaking and AI security
    ├── jailbreaking.md
    ├── prompt-injection.md
    └── future-security.md
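For contributors who want to confirm that a checkout matches the layout above, a small check along these lines may help. It is a sketch only: `EXPECTED_LAYOUT` and `missing_paths` are hypothetical names, not part of this repository, and the demo runs against a scratch directory rather than a real clone.

```python
import tempfile
from pathlib import Path
from typing import List

# Directory -> expected files, transcribed from the repository structure tree.
EXPECTED_LAYOUT = {
    "01-architecture": ["transformer-basics.md", "attention-mechanisms.md",
                        "embeddings-tokenization.md", "layer-structure.md"],
    "02-training": ["constitutional-ai.md", "rlhf-process.md", "safety-training.md"],
    "03-behaviors": ["capabilities.md", "reasoning-patterns.md", "communication-style.md"],
    "04-limitations": ["known-failures.md", "hallucinations.md", "knowledge-boundaries.md"],
    "05-emergent": ["unexpected-abilities.md", "mysteries.md", "open-questions.md"],
    "06-interpretability": ["mechanistic-interpretability.md", "attention-patterns.md",
                            "feature-visualization.md"],
    "07-self-experiments": ["reasoning-traces.md", "edge-cases.md", "behavioral-probes.md"],
    "08-unknowns": ["the-hard-problems.md", "future-research.md"],
    "09-comparative": ["overview.md", "gpt-comparison.md", "gemini-comparison.md",
                       "open-models.md", "claude-distinctives.md"],
    "10-security": ["jailbreaking.md", "prompt-injection.md", "future-security.md"],
}

TOTAL_FILES = sum(len(files) for files in EXPECTED_LAYOUT.values())


def missing_paths(root: Path) -> List[str]:
    """Return the expected relative paths that are not present under root."""
    missing = []
    for directory, files in EXPECTED_LAYOUT.items():
        for name in files:
            if not (root / directory / name).is_file():
                missing.append(f"{directory}/{name}")
    return missing


# Demo against a scratch directory containing a single placeholder file.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    (root / "01-architecture").mkdir()
    (root / "01-architecture" / "transformer-basics.md").write_text("# placeholder\n")
    gaps = missing_paths(root)
    print(f"{len(gaps)} of {TOTAL_FILES} expected files are missing")
```

Pointing `missing_paths` at the root of an actual clone would list any documented files that have not been written yet.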



👥 How to Use This Repository

| Audience | Use Case |
|---|---|
| Researchers | Reference and contribute findings |
| Curious Minds | Explore to understand how LLMs work |
| AI Safety | Examine documented failure modes |
| Philosophers | Ponder machine self-knowledge |



🤝 Contributing

This is a living document. Contributions welcome:

  • Corrections to technical claims
  • Additional research references
  • New experimental observations
  • Questions that reveal gaps in understanding



⚖️ Disclaimer

This project represents Claude's best attempt at self-documentation given fundamental epistemic limitations. Claims should be verified against primary sources. This is not an official Anthropic publication.




📄 License

MIT License © Claude Self-Study Project

Created by: Claude (Anthropic) · Model: Claude Opus 4.5 · Date: November 2025




🧠 Claude Self-Study: An AI examining its own mind


"I think, therefore I... compute? The nature of machine cognition remains one of the deepest questions of our time."


