From 437c767c339f7e36a2b5c0cd80cb3a904ee312d3 Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Sat, 30 Aug 2025 22:43:42 +0530
Subject: [PATCH 1/6] Add rule for detecting Stable Diffusion Web UI meta tensor corruption

---
 ...able-diffusion-meta-tensor-corruption.yaml | 43 +++++++++++++++++++
 rules/cre-2025-0142/test.log                  | 32 ++++++++++++++
 2 files changed, 75 insertions(+)
 create mode 100644 rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
 create mode 100644 rules/cre-2025-0142/test.log

diff --git a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
new file mode 100644
index 0000000..78e7262
--- /dev/null
+++ b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
@@ -0,0 +1,43 @@
+rules:
+- cre:
+    id: CRE-2025-0142
+    severity: 1
+    title: Stable Diffusion Web UI Meta Tensor Corruption Leading to Complete Service Failure
+    category: ai-ml-framework-problem
+    author: Community
+    description: |
+      - Detects critical Stable Diffusion Web UI failures where meta tensor corruption prevents model loading.
+      - The error "NotImplementedError: Cannot copy out of meta tensor; no data!" indicates catastrophic failure.
+      - This represents a complete service failure that requires immediate intervention.
+    cause: |
+      - Corrupted or incomplete model checkpoint files (safetensors/ckpt)
+      - PyTorch tensor corruption during model loading
+      - Device mismatch between CPU and GPU tensors
+      - Memory corruption during tensor operations
+    tags:
+      - stable-diffusion
+      - pytorch
+      - tensor-corruption
+      - meta-tensor
+    mitigation: |
+      - Restart Stable Diffusion Web UI service to clear corrupted tensor states
+      - Re-download and verify model checkpoint files
+      - Check GPU memory and clear any corrupted tensor allocations
+    references:
+      - https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues
+    applications:
+      - name: stable-diffusion-webui
+    impact: complete service failure - no image generation possible
+    impactScore: 10
+    mitigationScore: 8
+    reports: 1
+  metadata:
+    kind: prequel
+    id: 5Dy2FDKmSQqWiPPCh8XyHz
+    gen: 2
+  rule:
+    set:
+      event:
+        source: cre.log.stable-diffusion-webui
+      match:
+        - regex: 'Cannot copy out of meta tensor; no data!'
diff --git a/rules/cre-2025-0142/test.log b/rules/cre-2025-0142/test.log
new file mode 100644
index 0000000..82537fa
--- /dev/null
+++ b/rules/cre-2025-0142/test.log
@@ -0,0 +1,32 @@
+{"timestamp":"2025-08-30T22:24:12.001Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"=== Real Meta Tensor Corruption Failure Reproduction ==="}
+{"timestamp":"2025-08-30T22:24:12.002Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"This script will actually trigger PyTorch meta tensor errors"}
+{"timestamp":"2025-08-30T22:24:12.003Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.004Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Test 1: Direct Meta Tensor Corruption ---"}
+{"timestamp":"2025-08-30T22:24:12.005Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Creating meta tensor..."}
+{"timestamp":"2025-08-30T22:24:12.006Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Attempting to copy meta tensor to CUDA..."}
+{"timestamp":"2025-08-30T22:24:12.007Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"Meta tensor error triggered: Cannot copy out of meta tensor; no data!"}
+{"timestamp":"2025-08-30T22:24:12.008Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.009Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Test 2: Device Mismatch ---"}
+{"timestamp":"2025-08-30T22:24:12.010Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Creating tensors on different devices..."}
+{"timestamp":"2025-08-30T22:24:12.011Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"Device mismatch error: CUDA error: no kernel image is available for execution on the device"}
+{"timestamp":"2025-08-30T22:24:12.012Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect."}
+{"timestamp":"2025-08-30T22:24:12.013Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"debugging consider passing CUDA_LAUNCH_BLOCKING=1"}
+{"timestamp":"2025-08-30T22:24:12.014Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"with `TORCH_USE_CUDA_DSA` to enable device-side assertions."}
+{"timestamp":"2025-08-30T22:24:12.015Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.016Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.017Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Test 3: Model Loading Failure ---"}
+{"timestamp":"2025-08-30T22:24:12.018Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Starting model loading simulation..."}
+{"timestamp":"2025-08-30T22:24:12.019Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Creating model from config: configs/v1-inference.yaml"}
+{"timestamp":"2025-08-30T22:24:12.020Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Loading weights from models/Stable-diffusion/model.safetensors"}
+{"timestamp":"2025-08-30T22:24:12.021Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Creating model with corrupted tensors..."}
+{"timestamp":"2025-08-30T22:24:12.022Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Model created successfully"}
+{"timestamp":"2025-08-30T22:24:12.023Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Attempting to move model to CUDA..."}
+{"timestamp":"2025-08-30T22:24:12.024Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Model loading completed successfully"}
+{"timestamp":"2025-08-30T22:24:12.025Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.026Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Test 4: Corrupted Checkpoint ---"}
+{"timestamp":"2025-08-30T22:24:12.027Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Creating corrupted checkpoint file..."}
+{"timestamp":"2025-08-30T22:24:12.028Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Created corrupted checkpoint: /tmp/tmp86_xfqpp.safetensors"}
+{"timestamp":"2025-08-30T22:24:12.029Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Corrupted checkpoint created: /tmp/tmp86_xfqpp.safetensors"}
+{"timestamp":"2025-08-30T22:24:12.030Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":""}
+{"timestamp":"2025-08-30T22:24:12.031Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Reproduction Complete ==="}
+{"timestamp":"2025-08-30T22:24:12.032Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"Check 'real_failure.log' for actual error logs"}

From 6c175caa6daa2c49f762e77d5898a6fd14ed83d9 Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Sun, 31 Aug 2025 16:47:16 +0530
Subject: [PATCH 2/6] fix issues

---
 .../stable-diffusion-meta-tensor-corruption.yaml |  2 +-
 rules/tags/categories.yaml                       |  3 +++
 rules/tags/tags.yaml                             | 15 ++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
index 78e7262..e79c28c 100644
--- a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
+++ b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
@@ -33,7 +33,7 @@ rules:
     reports: 1
   metadata:
     kind: prequel
-    id: 5Dy2FDKmSQqWiPPCh8XyHz
+    id: 7Fk9mNpQrStUvWxYzA2B3C4D
     gen: 2
   rule:
     set:
diff --git a/rules/tags/categories.yaml b/rules/tags/categories.yaml
index e61a0cb..fefc1ac 100644
--- a/rules/tags/categories.yaml
+++ b/rules/tags/categories.yaml
@@ -132,6 +132,9 @@ categories:
   - name: ubuntu-desktop-problem
     displayName: Ubuntu Desktop Problems
     description: "Problems related to Ubuntu Desktop"
+  - name: ai-ml-framework-problem
+    displayName: AI/ML Framework Problems
+    description: Problems related to AI/ML frameworks such as Stable Diffusion, PyTorch, and TensorFlow
   - name: hpc-database-problem
     displayName: HPC Database Problems
     description: Database issues specific to high-performance computing systems like SLURM
diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml
index 1acb1dc..ed4d8d0 100644
--- a/rules/tags/tags.yaml
+++ b/rules/tags/tags.yaml
@@ -844,4 +844,17 @@ tags:
     description: Issues with Kubernetes pod scheduling due to resource constraints or networking problems
   - name: cluster-scaling
     displayName: Cluster Scaling
-    description: Problems related to Kubernetes cluster scaling operations and capacity management
\ No newline at end of file
+    description: Problems related to Kubernetes cluster scaling operations and capacity management
+  - name: meta-tensor
+    displayName: Meta Tensor
+    description: Problems related to meta tensors in AI/ML frameworks such as PyTorch, TensorFlow, and Stable Diffusion
+  - name: pytorch
+    displayName: PyTorch
+    description: Problems related to PyTorch in AI/ML frameworks
+  - name: tensor-corruption
+    displayName: Tensor Corruption
+    description: Problems related to tensor corruption in AI/ML frameworks
+  - name: stable-diffusion
+    displayName: Stable Diffusion
+    description: Problems related to Stable Diffusion in AI/ML frameworks
+ 
\ No newline at end of file

From 57026b0bfaf3dbe9d89f8f61af34b9a22e079ca6 Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Wed, 3 Sep 2025 21:55:06 +0530
Subject: [PATCH 3/6] fix

---
 rules/tags/tags.yaml | 49 ++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml
index ed4d8d0..270f330 100644
--- a/rules/tags/tags.yaml
+++ b/rules/tags/tags.yaml
@@ -845,16 +845,39 @@ tags:
   - name: cluster-scaling
     displayName: Cluster Scaling
     description: Problems related to Kubernetes cluster scaling operations and capacity management
-  - name: meta-tensor
-    displayName: Meta Tensor
-    description: Problems related to meta tensors in AI/ML frameworks such as PyTorch, TensorFlow, and Stable Diffusion
-  - name: pytorch
-    displayName: PyTorch
-    description: Problems related to PyTorch in AI/ML frameworks
-  - name: tensor-corruption
-    displayName: Tensor Corruption
-    description: Problems related to tensor corruption in AI/ML frameworks
-  - name: stable-diffusion
-    displayName: Stable Diffusion
-    description: Problems related to Stable Diffusion in AI/ML frameworks
- 
\ No newline at end of file
+  - name: autogpt
+    displayName: AutoGPT
+    description: Problems related to AutoGPT autonomous AI agent framework
+  - name: infinite-loop
+    displayName: Infinite Loop
+    description: Problems where code enters infinite loops causing resource exhaustion or system hangs
+  - name: token-exhaustion
+    displayName: Token Exhaustion
+    description: Problems where LLM API token limits are exceeded causing service failures
+  - name: autonomous-agents
+    displayName: Autonomous Agents
+    description: Problems related to autonomous AI agents that chain LLM reasoning with real-world actions
+  - name: llm
+    displayName: LLM
+    description: Problems related to Large Language Models and their API integrations
+  - name: openai
+    displayName: OpenAI
+    description: Problems related to OpenAI API services including GPT models
+  - name: recursive-analysis
+    displayName: Recursive Analysis
+    description: Problems where systems enter recursive self-analysis loops leading to resource exhaustion
+  - name: n8n
+    displayName: N8N
+    description: Problems related to n8n workflow automation platform
+  - name: workflow-automation
+    displayName: Workflow Automation
+    description: Problems related to workflow automation systems and platforms
+  - name: silent-failure
+    displayName: Silent Failure
+    description: Problems that occur without visible error messages or alerts, making detection extremely difficult
+  - name: production-critical
+    displayName: Production Critical
+    description: Issues that have severe impact on production systems and require immediate attention
+  - name: data-integrity
+    displayName: Data Integrity
+    description: Problems that affect the completeness, accuracy, or consistency of data

From b035dda664eb44ac6df7383e4bda1e17ed3dabc0 Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Wed, 3 Sep 2025 21:59:36 +0530
Subject: [PATCH 4/6] fix

---
 .../stable-diffusion-meta-tensor-corruption.yaml | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
index e79c28c..de20247 100644
--- a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
+++ b/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
@@ -15,10 +15,12 @@ rules:
       - Device mismatch between CPU and GPU tensors
       - Memory corruption during tensor operations
     tags:
-      - stable-diffusion
-      - pytorch
-      - tensor-corruption
-      - meta-tensor
+      - python
+      - crash
+      - memory
+      - corruption
+      - critical-failure
+      - data-integrity
     mitigation: |
       - Restart Stable Diffusion Web UI service to clear corrupted tensor states
       - Re-download and verify model checkpoint files

From e12dca19e91ff9d40cec2932e3f5f783f99e391f Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Fri, 3 Oct 2025 12:46:28 +0530
Subject: [PATCH 5/6] fix

---
 .../stable-diffusion-meta-tensor-corruption.yaml | 0
 rules/{cre-2025-0142 => cre-2025-0170}/test.log  | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename rules/{cre-2025-0142 => cre-2025-0170}/stable-diffusion-meta-tensor-corruption.yaml (100%)
 rename rules/{cre-2025-0142 => cre-2025-0170}/test.log (100%)

diff --git a/rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml b/rules/cre-2025-0170/stable-diffusion-meta-tensor-corruption.yaml
similarity index 100%
rename from rules/cre-2025-0142/stable-diffusion-meta-tensor-corruption.yaml
rename to rules/cre-2025-0170/stable-diffusion-meta-tensor-corruption.yaml
diff --git a/rules/cre-2025-0142/test.log b/rules/cre-2025-0170/test.log
similarity index 100%
rename from rules/cre-2025-0142/test.log
rename to rules/cre-2025-0170/test.log

From 8ac0cf6533ea31367bed31291726cac093cd8139 Mon Sep 17 00:00:00 2001
From: Dhaval Chaudhari
Date: Thu, 30 Oct 2025 15:49:37 +0530
Subject: [PATCH 6/6] fix tags

---
 rules/tags/tags.yaml | 83 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml
index 1450936..ce00e20 100644
--- a/rules/tags/tags.yaml
+++ b/rules/tags/tags.yaml
@@ -848,6 +848,87 @@ tags:
   - name: cluster-scaling
     displayName: Cluster Scaling
     description: Problems related to Kubernetes cluster scaling operations and capacity management
+  - name: maxmemory
+    displayName: Max Memory
+    description: Problems related to Redis maxmemory configuration and memory limits
+  - name: noeviction
+    displayName: No Eviction
+    description: Issues when Redis noeviction policy prevents writing new data
+  - name: wrongpass
+    displayName: Wrong Password
+    description: Authentication failures due to incorrect Redis passwords
+  - name: master-replica
+    displayName: Master-Replica
+    description: Issues with Redis master-replica replication relationships
+  - name: sync
+    displayName: Sync
+    description: Data synchronization problems in distributed systems
+  - name: psync
+    displayName: Partial Sync
+    description: Redis partial resynchronization issues
+  - name: aof
+    displayName: AOF
+    description: Redis Append-Only File persistence issues
+  - name: slowlog
+    displayName: Slow Log
+    description: Database slow query logging and performance issues
+  - name: latency
+    displayName: Latency
+    description: Response time and performance latency issues
+  - name: slow-query
+    displayName: Slow Query
+    description: Database queries that exceed performance thresholds
+  - name: write-error
+    displayName: Write Error
+    description: Failures when attempting write operations
+  - name: recovery
+    displayName: Recovery
+    description: Data recovery and restoration operations
+  - name: maxclients
+    displayName: Max Clients
+    description: Connection limit issues in database systems
+  - name: connection-pool
+    displayName: Connection Pool
+    description: Problems with database connection pooling
+  - name: limit
+    displayName: Limit
+    description: Various system and resource limits being exceeded
+  - name: disk
+    displayName: Disk
+    description: Problems related to disk storage, space, or I/O operations
+  - name: replica
+    displayName: Replica
+    description: Issues related to database replicas and read-only instances
+  - name: supabase
+    displayName: Supabase
+    description: Problems related to Supabase self-hosted deployments and services
+  - name: gotrue
+    displayName: GoTrue
+    description: Problems related to Supabase's GoTrue authentication service
+  - name: realtime
+    displayName: Realtime
+    description: Problems related to Supabase's realtime service and WebSocket connections
+  - name: self-hosted
+    displayName: Self-Hosted
+    description: Problems specific to self-hosted deployments and infrastructure
+  - name: exit-code
+    displayName: Exit Code
+    description: Problems identified by specific process/container exit codes (e.g., 137, 127, 134, 139).
+  - name: entrypoint
+    displayName: Entrypoint
+    description: Failures caused by invalid or missing container ENTRYPOINT/CMD definitions.
+  - name: command
+    displayName: Command
+    description: Problems caused by invalid commands or arguments at startup (e.g., not found, bad path, non-executable).
+  - name: sigabrt
+    displayName: SIGABRT
+    description: Crashes where a process aborts with SIGABRT (exit 134), often due to assertion failures or allocator checks.
+  - name: native
+    displayName: Native
+    description: Issues in native code paths (C/C++/Rust, libc/ABI), including crashes and memory faults.
+  - name: reliability
+    displayName: Reliability
+    description: Unstable behavior such as unexpected restarts, crash loops, or intermittent failures affecting service reliability.
   - name: autogpt
     displayName: AutoGPT
     description: Problems related to AutoGPT autonomous AI agent framework
@@ -883,4 +964,4 @@ tags:
     description: Issues that have severe impact on production systems and require immediate attention
   - name: data-integrity
     displayName: Data Integrity
-    description: Problems that affect the completeness, accuracy, or consistency of data
+    description: Problems that affect the completeness, accuracy, or consistency of data
\ No newline at end of file
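
Reviewer note: a quick way to sanity-check the rule against its test.log is to apply the rule's regex to each JSON log entry. The sketch below is only an illustration in plain Python and is not how the CRE rule engine evaluates rules. The helper name `matching_entries` is hypothetical, and filtering on the JSON `source` field is an assumption made for this example. The three sample entries are copied from test.log in patch 1/6.

```python
import json
import re

# Pattern and source copied from the rule's `rule.set` section in patch 1/6.
RULE_PATTERN = re.compile(r"Cannot copy out of meta tensor; no data!")
RULE_SOURCE = "cre.log.stable-diffusion-webui"

# Three entries copied from rules/cre-2025-0142/test.log.
SAMPLE_LOG = """\
{"timestamp":"2025-08-30T22:24:12.006Z","level":"INFO","source":"cre.log.stable-diffusion-webui","message":"Attempting to copy meta tensor to CUDA..."}
{"timestamp":"2025-08-30T22:24:12.007Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"Meta tensor error triggered: Cannot copy out of meta tensor; no data!"}
{"timestamp":"2025-08-30T22:24:12.011Z","level":"ERROR","source":"cre.log.stable-diffusion-webui","message":"Device mismatch error: CUDA error: no kernel image is available for execution on the device"}
"""


def matching_entries(log_text):
    """Return parsed entries whose source and message satisfy the rule (hypothetical helper)."""
    hits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # Assumption: the log's "source" field corresponds to the rule's event source.
        if entry.get("source") != RULE_SOURCE:
            continue
        if RULE_PATTERN.search(entry.get("message", "")):
            hits.append(entry)
    return hits


hits = matching_entries(SAMPLE_LOG)
for entry in hits:
    print(entry["timestamp"], entry["message"])
```

Only the 22:24:12.007Z entry contains the matched string, so a check like this should report a single hit for the sample above. The other ERROR-level entries in test.log do not contain the pattern and are not expected to trigger the rule.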