Skip to content

Conversation

@solresol
Copy link
Collaborator

@solresol solresol commented Nov 4, 2025

Summary

  • replace the Theil-Sen fit on the journal and year scatter charts with an ordinary least squares regression
  • compute and display the regression formula, R², and p-value beneath each scatter plot
  • add numerical helpers for the t-distribution so p-values can be derived directly in the dashboard

Testing

  • python -m compileall extractor/generate_dashboard.py

https://chatgpt.com/codex/tasks/task_e_69095661587483258cee0fada0a71fcd

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review: Switch scatter regression to OLS with statistics

Overall, this is a well-implemented change that significantly enhances the dashboard's analytical capabilities. The switch from Theil-Sen to OLS regression with statistical measures is appropriate for this use case. Below are my detailed findings:


✅ Strengths

  1. Mathematical Correctness: The OLS implementation is mathematically sound with proper calculation of slope, intercept, R², and p-values.

  2. Robust Statistical Functions: The implementation of log-gamma, beta continued fraction, and Student's t-distribution CDF shows solid numerical computing practices with appropriate:

    • Convergence criteria (EPS = 1e-12)
    • Iteration limits (MAX_ITER = 200)
    • Handling of edge cases and numerical instability
  3. Comprehensive Error Handling: Excellent validation throughout:

    • Filtering invalid data points (NaN, non-numeric values)
    • Checking for degenerate cases (n < 2, sxx = 0)
    • Bounds checking on calculated values
    • Defensive checks for division by zero
  4. User Experience: The new regression stats display provides valuable information (formula, R², p-value) in a clear, readable format.

  5. Code Organization: Well-structured with clear separation of concerns (calculation, formatting, display update).


⚠️ Issues & Recommendations

1. Critical: Potential Stack Overflow in logGamma (Line 904)

The recursive call in logGamma can cause stack overflow for negative values:

if (z < 0.5) {
    return Math.log(Math.PI) - Math.log(Math.sin(Math.PI * z)) - logGamma(1 - z);
}

Issue: For values like z = -0.5, this recursively calls logGamma(1.5), then logGamma(0.5), then logGamma(0.5) again, creating infinite recursion.

Recommendation: Add iteration limit or validate input range. For this use case (Student's t-distribution with df/2 where df ≥ 0), you'll always have positive arguments, but defensive validation would be prudent:

const logGamma = z => {
    if (z <= 0 && Number.isInteger(z)) {
        return Number.POSITIVE_INFINITY; // Gamma undefined at non-positive integers
    }
    // ... rest of implementation
};

2. Performance: Duplicate Loop in ordinaryLeastSquares (Lines 1013-1018, 1027-1033)

You iterate through validPoints twice to calculate sums. This could be combined into a single loop:

for (const point of validPoints) {
    const dx = point.x - meanX;
    const dy = point.y - meanY;
    sxx += dx * dx;
    sxy += dx * dy;
    sst += dy * dy; // Add this here
}

// After calculating slope/intercept, second loop only for SSE
for (const point of validPoints) {
    const predicted = slope * point.x + intercept;
    const residual = point.y - predicted;
    sse += residual * residual;
}

This is a minor optimization but improves efficiency.

3. Code Quality: betacf Convergence Warning (Line 948-950)

The function breaks on convergence but doesn't warn if MAX_ITER is reached without convergence. Consider:

if (Math.abs(del - 1) < EPS) {
    return h;
}
if (m === MAX_ITER) {
    console.warn('betacf failed to converge after', MAX_ITER, 'iterations');
}

4. Numerical Stability: Potential Underflow in regularizedIncompleteBeta (Line 961)

The calculation Math.exp(a * Math.log(x) + b * Math.log(1 - x) - lnBeta) could underflow to 0 for extreme values. While your edge case handling is good, you might want to add:

const logFront = a * Math.log(x) + b * Math.log(1 - x) - lnBeta;
if (logFront < -700) { // exp(-700) ≈ 1e-304, near min double
    return x < (a + 1) / (a + b + 2) ? 0 : 1;
}
const front = Math.exp(logFront);

5. UI/UX: HTML Injection Risk (Line 1116)

You're using innerHTML to display regression statistics:

element.innerHTML = `<strong>Formula:</strong> ${formula} ...`;

While formula, rSquaredText, and pValueText are all generated internally (not user input), using textContent for dynamic parts would be safer:

element.innerHTML = '<strong>Formula:</strong> ';
const formulaSpan = document.createElement('span');
formulaSpan.textContent = formula;
element.appendChild(formulaSpan);
// ... etc

Or use textContent for the entire thing if HTML formatting isn't critical.

6. Testing Gap

The PR mentions only syntax testing with python -m compileall. Consider adding:

  • Unit tests for statistical functions (especially edge cases)
  • Validation of OLS results against known datasets
  • Visual regression tests for the dashboard

🔍 Minor Observations

  1. Line 1035: The clamping Math.max(0, Math.min(1, 1 - sse / sst)) is good defensive programming, though mathematically R² should always be in [0,1] for OLS.

  2. Line 1050: The two-tailed p-value calculation 2 * (1 - cdf) is correct for testing if slope ≠ 0.

  3. Formatting: The exponential notation threshold (line 1072) is reasonable, though you might consider scientific notation for very small p-values too.


📊 Statistical Considerations

Why this change matters: The switch from Theil-Sen to OLS is significant:

  • Theil-Sen: Robust to outliers, non-parametric, but doesn't provide R² or p-values directly
  • OLS: Assumes normally distributed errors, sensitive to outliers, but provides standard statistical measures

Recommendation: Consider mentioning in documentation or dashboard UI that OLS assumes linear relationship and normally distributed residuals. For publications using this data, users should validate these assumptions.


✅ Security & Performance

  • Security: No major concerns. All data is internally generated.
  • Performance: Computational complexity is reasonable for dashboard use. The statistical functions are well-optimized.
  • Browser Compatibility: Using modern JS features (arrow functions, const/let, template literals). Should work in all modern browsers.

📝 Summary

Approve with minor recommendations. The core implementation is solid and ready to merge. The suggested improvements are mostly defensive programming enhancements rather than critical bugs (except the potential stack overflow in logGamma, which is unlikely to occur with valid inputs but worth addressing).

Priority fixes:

  1. Add input validation to logGamma to prevent stack overflow
  2. Consider adding convergence warnings to numerical functions
  3. Add unit tests for statistical functions

Nice-to-haves:

  • Optimize the double loop in OLS calculation
  • Improve numerical stability handling
  • Document statistical assumptions for users

Great work on this enhancement! The addition of regression statistics will make the dashboard much more valuable for research purposes.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1088 to +1097
const formatRegressionFormula = regression => {
if (!regression || !Number.isFinite(regression.slope) || !Number.isFinite(regression.intercept)) {
return 'Formula unavailable';
}
const slopeText = formatCoefficient(regression.slope);
const intercept = regression.intercept;
const interceptAbs = Math.abs(intercept);
const interceptText = formatCoefficient(interceptAbs);
const sign = intercept >= 0 ? ' + ' : ' - ';
return `y = ${slopeText}x${sign}${interceptText}`;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Escape JS template literals inside Python f-string

The new regression helpers embed JavaScript template literals like return `y = ${slopeText}x${sign}${interceptText}`; inside a Python f-string. When the dashboard runs, Python tries to evaluate slopeText, sign, and interceptText (and the similar placeholders in the stats string below) as Python expressions and immediately raises NameError, preventing any HTML from being generated. These ${…} sequences need to be escaped (e.g., ${{…}}) or the block should be converted to a regular triple-quoted string so the JavaScript interpolation is left untouched.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants