# Bayesian Redesign
*github-actions[bot] edited this page Oct 15, 2025 · 1 revision*
The current system is failing because of a mathematical design flaw, not parameter tuning issues.
- JIRA_XRAY format has unique indicators: `testExecutions` (UNIQUE, weight 5), `xrayInfo` (STRONG, weight 3)
- Other formats have only generic indicators: ZEPHYR has `executionId`, `version` (STRONG, weight 3 each)
- Yet the likelihood ratios are only 1.69:1, far below the required 2.0:1
This indicates that the Bayesian scoring function is not properly weighting unique evidence.
```
likelihood = w₁·completeness^p₁ + w₂·quality^p₂ + w₃·uniqueness^p₃ + interactions
```

Problems:
- Power functions compress high values toward 1: 0.8^0.45 ≈ 0.90, 0.8^0.5 ≈ 0.89, so distinct inputs become nearly indistinguishable
- Linear combination dilutes discriminative power: 65% uniqueness still gets diluted by 35% other factors
- No proper probabilistic interpretation: This is an ad-hoc scoring function, not a likelihood
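The compression and dilution effects above can be checked numerically. This is a minimal sketch of the ad-hoc scorer; the weights and exponents (0.20/0.15/0.65 and 0.45/0.5/0.45) are illustrative assumptions, not the system's actual parameters:

```python
def adhoc_likelihood(completeness, quality, uniqueness,
                     w=(0.20, 0.15, 0.65), p=(0.45, 0.5, 0.45)):
    """Ad-hoc weighted power combination (illustrative parameters only)."""
    return (w[0] * completeness ** p[0]
            + w[1] * quality ** p[1]
            + w[2] * uniqueness ** p[2])

# Power functions compress the input range toward 1:
print(0.8 ** 0.45)  # ≈ 0.90

# A format with near-perfect uniqueness vs. a generic one differs far
# less than 2:1 once the other additive terms dilute the signal:
unique = adhoc_likelihood(0.8, 0.8, 1.0)
generic = adhoc_likelihood(0.8, 0.8, 0.3)
print(unique / generic)  # ≈ 1.39, below the 2.0:1 threshold
```

Even giving uniqueness 65% of the weight, the additive form cannot push the ratio past 2:1 here, which matches the observed 1.69:1 failure.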
```
P(is_test_data|evidence) = Bernoulli(θ)
θ ~ Beta(α, β)  # Prior on test data prevalence

P(evidence|is_test_data=True)  = Multinomial(evidence_counts | λ_test)
P(evidence|is_test_data=False) = Multinomial(evidence_counts | λ_not_test)

Posterior: P(is_test_data|evidence) = f(evidence_counts, λ_test, λ_not_test, α, β)
```
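A simplified sketch of this posterior, replacing the full multinomial with independent per-evidence-type Poisson rates; the rates `lam_test`/`lam_not` and the 0.5 prior are illustrative placeholders, not fitted values:

```python
import math

def posterior_is_test_data(evidence_counts, lam_test, lam_not, prior=0.5):
    """P(is_test_data | evidence) via Bayes' rule, assuming independent
    Poisson-distributed counts per evidence type (a simplification of
    the multinomial model above)."""
    log_lr = 0.0
    for etype, count in evidence_counts.items():
        lt, ln = lam_test[etype], lam_not[etype]
        # Poisson log-likelihood ratio: count·log(λ_t/λ_n) − (λ_t − λ_n)
        log_lr += count * math.log(lt / ln) - (lt - ln)
    log_odds = math.log(prior / (1 - prior)) + log_lr
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative rates: unique indicators are ~15x more common in test data
lam_test = {"unique": 1.5, "strong": 2.0}
lam_not = {"unique": 0.1, "strong": 0.8}
p = posterior_is_test_data({"unique": 2, "strong": 3}, lam_test, lam_not)
```

With two unique and three strong indicators observed, the posterior is decisively high; with zero evidence it falls below the prior, as expected.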
```
P(format|evidence, is_test_data=True) = Categorical(π)
π ~ Dirichlet(α₁, α₂, ..., αₖ)  # Format priors

P(evidence|format=i) = Product over evidence_types:
  - Unique indicators:   Bernoulli(p_unique_i)
  - Strong indicators:   Bernoulli(p_strong_i)
  - Moderate indicators: Bernoulli(p_moderate_i)

Posterior: P(format=i|evidence) ∝ P(evidence|format=i) * P(format=i)
```
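The format posterior can be sketched directly from this spec. All indicator probabilities below are illustrative placeholders; in practice they would be estimated from labeled samples per format:

```python
import math

def format_posterior(present, formats, prior=None):
    """P(format|evidence) ∝ P(evidence|format)·P(format), with each
    indicator modeled as an independent Bernoulli.
    `formats` maps format name -> {indicator: P(present|format)}."""
    names = list(formats)
    prior = prior or {n: 1.0 / len(names) for n in names}
    log_post = {}
    for name, probs in formats.items():
        lp = math.log(prior[name])
        for ind, p in probs.items():
            lp += math.log(p if ind in present else 1.0 - p)
        log_post[name] = lp
    # Normalize with log-sum-exp for numerical stability
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {n: math.exp(v - m) / z for n, v in log_post.items()}

formats = {  # placeholder probabilities, not measured values
    "JIRA_XRAY": {"testExecutions": 0.9, "xrayInfo": 0.8, "executionId": 0.3},
    "ZEPHYR":    {"testExecutions": 0.02, "xrayInfo": 0.02, "executionId": 0.85},
}
post = format_posterior({"testExecutions", "xrayInfo"}, formats)
```

Because the unique indicators multiply rather than average in, seeing `testExecutions` and `xrayInfo` drives the JIRA_XRAY posterior far past the 2:1 threshold the additive scorer could not reach.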
```
I(format; evidence) = Σ P(format, evidence) · log( P(format, evidence) / (P(format)·P(evidence)) )

Discriminative power = KL( P(evidence|format) || P(evidence|other_formats) )
```
The posterior probability is then derived from this information gain: indicators with high KL divergence from the pooled distribution of other formats carry the most discriminative weight.
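A minimal sketch of the discriminative-power measure; the evidence distributions over (unique, strong, moderate) indicator types are illustrative, not measured:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared discrete support. Used here to measure
    how far one format's evidence distribution sits from the pooled
    distribution of the other formats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over (unique, strong, moderate) evidence:
p_xray = [0.5, 0.3, 0.2]     # a format rich in unique indicators
p_others = [0.05, 0.45, 0.5]  # pooled generic-indicator formats
print(kl_divergence(p_xray, p_others))
```

A large KL value (here ≈ 0.85 nats) signals a format that unique indicators can separate cleanly; KL of a distribution against itself is zero, i.e. no discriminative power.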
```
P(format=i|features) = softmax(Wᵢ · features + bᵢ)

features = [
    unique_indicator_count,
    strong_indicator_count,
    moderate_indicator_count,
    field_specificity_score,
    structural_complexity,
    evidence_quality_score,
]
```
Regularization: L1 for feature selection, L2 for stability
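The softmax head above can be sketched in a few lines. The weight rows and biases here are hand-picked illustrations; in practice they would come from L1/L2-regularized training, not be set by hand:

```python
import math

FEATURES = ["unique_indicator_count", "strong_indicator_count",
            "moderate_indicator_count", "field_specificity_score",
            "structural_complexity", "evidence_quality_score"]

def softmax_scores(feature_vec, weights, biases):
    """P(format=i|features) = softmax(W_i·features + b_i)."""
    logits = [sum(w * x for w, x in zip(row, feature_vec)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

weights = [  # illustrative, untrained rows (one per format)
    [1.2, 0.4, 0.1, 0.8, 0.2, 0.5],  # weights a unique-indicator-heavy format
    [0.1, 0.9, 0.4, 0.3, 0.6, 0.5],  # weights a generic-indicator format
]
biases = [0.0, 0.0]
probs = softmax_scores([2, 3, 1, 0.7, 0.4, 0.8], weights, biases)
```

Note the high weight on `unique_indicator_count` in the first row: L1 regularization would tend to discover exactly this kind of sparse emphasis on discriminative features.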
Immediate fix using proper probability theory:
- Evidence Independence Assumption

```
# Current: Arbitrary weighted combination
likelihood = w₁·C^p₁ + w₂·Q^p₂ + w₃·U^p₃

# Proper: Independent evidence multiplication
likelihood = P(completeness|evidence) * P(quality|evidence) * P(uniqueness|evidence)
```

- Log-Likelihood for Numerical Stability

```
log_likelihood = log P(completeness|evidence) + log P(quality|evidence) + log P(uniqueness|evidence)
```

- Proper P(evidence|format) Models

```
# For unique indicators (presence/absence)
P(unique_present|format) = Beta(α_unique_present, β_unique_present)

# For count-based evidence
P(count|format) = Poisson(λ_count) or NegativeBinomial

# For quality scores
P(quality|format) = Beta(α_quality, β_quality)
```

Stage 1: Test data detection using:
- Evidence count models (Poisson/NegativeBinomial)
- Structural quality assessment
- Field specificity scoring
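The Stage 1 components above combine naturally in log space. This sketch pairs a Poisson model for the evidence count with Beta-shaped log-densities for quality and specificity; all shape parameters (`lam`, `q_a`, `q_b`) are illustrative assumptions:

```python
import math

def stage1_log_likelihood(evidence_count, quality, specificity,
                          lam=2.0, q_a=4.0, q_b=2.0):
    """Stage-1 test-data log-likelihood: Poisson evidence-count term
    plus unnormalized Beta log-densities for quality and field
    specificity, summed in log space for numerical stability."""
    # Poisson log-pmf: k·log(λ) − λ − log(k!)
    ll = evidence_count * math.log(lam) - lam - math.lgamma(evidence_count + 1)
    # Unnormalized Beta(q_a, q_b) log-density for quality in (0, 1)
    ll += (q_a - 1) * math.log(quality) + (q_b - 1) * math.log(1 - quality)
    # Field specificity modeled with the same shape for brevity
    ll += (q_a - 1) * math.log(specificity) + (q_b - 1) * math.log(1 - specificity)
    return ll
```

Summing logs instead of multiplying raw densities avoids the underflow that a product of many small probabilities would cause.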
Stage 2: Format discrimination with:
- Dirichlet priors on format frequencies
- Bernoulli/Beta models for indicator presence
- Logarithmic scaling for rare indicators
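The Dirichlet priors for Stage 2 reduce to simple smoothed frequency estimates. The observed counts below are made-up examples; the point is that a format with zero training examples still keeps non-zero prior mass:

```python
def dirichlet_format_priors(observed_counts, alpha=1.0):
    """Posterior-mean format frequencies under a symmetric
    Dirichlet(alpha) prior. alpha=1 is add-one smoothing, so formats
    never seen in training are not ruled out a priori."""
    total = sum(observed_counts.values())
    k = len(observed_counts)
    return {f: (c + alpha) / (total + k * alpha)
            for f, c in observed_counts.items()}

# Illustrative training counts, including an unseen format:
priors = dirichlet_format_priors({"JIRA_XRAY": 40, "ZEPHYR": 25, "TESTRAIL": 0})
```

Raising `alpha` pulls the priors toward uniform, which is the standard mitigation when per-format training data is sparse.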
- Hierarchical Bayesian modeling with PyMC or NumPyro for full posterior inference.
- Mutual information feature selection to identify the most discriminative indicators.
- Hybrid Bayesian + logistic regression approach where Bayesian priors inform logistic regression features.
- Data sparsity: Some formats may lack sufficient training data. Mitigate with informative priors and synthetic augmentation.
- Model interpretability: Keep explanations grounded in evidence metrics and publish derived parameters.
- Computational overhead: Pre-compute format-specific parameters and cache likelihood components.
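One way to realize the caching mitigation is to memoize per-(format, indicator) log-likelihood terms. The probability table here is a hypothetical stand-in for fitted parameters, and `log_indicator_likelihood` is an illustrative helper, not an existing function:

```python
from functools import lru_cache
import math

@lru_cache(maxsize=None)
def log_indicator_likelihood(fmt, indicator, present):
    """Cached log P(indicator state | format) term, so repeated
    detections reuse the computation instead of redoing it.
    The table is an illustrative stand-in for fitted parameters."""
    table = {
        ("JIRA_XRAY", "testExecutions"): 0.9,
        ("ZEPHYR", "executionId"): 0.85,
    }
    p = table.get((fmt, indicator), 0.05)  # small default for unseen pairs
    return math.log(p if present else 1.0 - p)
```

Because the arguments are hashable, `lru_cache` turns every repeated lookup into a dictionary hit, keeping the per-document scoring cost proportional to the number of distinct (format, indicator) pairs.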
- Implement evidence independence in the current scorer.
- Replace ad-hoc likelihoods with Beta/Bernoulli models per evidence type.
- Establish a benchmarking suite aligned with `wiki/benchmarks/format_detection_benchmark.json`.