Switch scatter regression to OLS with statistics #14

solresol · 2025-11-04T05:16:10Z

Summary

replace the Theil-Sen fit on the journal and year scatter charts with an ordinary least squares regression
compute and display the regression formula, R², and p-value beneath each scatter plot
add numerical helpers for the t-distribution so p-values can be derived directly in the dashboard

Testing

python -m compileall extractor/generate_dashboard.py

https://chatgpt.com/codex/tasks/task_e_69095661587483258cee0fada0a71fcd

claude · 2025-11-04T05:17:45Z

Code Review: Switch scatter regression to OLS with statistics

Overall, this is a well-implemented change that significantly enhances the dashboard's analytical capabilities. The switch from Theil-Sen to OLS regression with statistical measures is appropriate for this use case. Below are my detailed findings:

✅ Strengths

Mathematical Correctness: The OLS implementation is mathematically sound with proper calculation of slope, intercept, R², and p-values.
Robust Statistical Functions: The implementation of log-gamma, beta continued fraction, and Student's t-distribution CDF shows solid numerical computing practices with appropriate:
- Convergence criteria (EPS = 1e-12)
- Iteration limits (MAX_ITER = 200)
- Handling of edge cases and numerical instability
Comprehensive Error Handling: Excellent validation throughout:
- Filtering invalid data points (NaN, non-numeric values)
- Checking for degenerate cases (n < 2, sxx = 0)
- Bounds checking on calculated values
- Defensive checks for division by zero
User Experience: The new regression stats display provides valuable information (formula, R², p-value) in a clear, readable format.
Code Organization: Well-structured with clear separation of concerns (calculation, formatting, display update).

⚠️ Issues & Recommendations

1. Critical: Potential Stack Overflow in `logGamma` (Line 904)

The recursive call in logGamma can cause stack overflow for negative values:

if (z < 0.5) {
    return Math.log(Math.PI) - Math.log(Math.sin(Math.PI * z)) - logGamma(1 - z);
}

Issue: For values like z = -0.5, this recursively calls logGamma(1.5), then logGamma(0.5), then logGamma(0.5) again, creating infinite recursion.

Recommendation: Add iteration limit or validate input range. For this use case (Student's t-distribution with df/2 where df ≥ 0), you'll always have positive arguments, but defensive validation would be prudent:

const logGamma = z => {
    if (z <= 0 && Number.isInteger(z)) {
        return Number.POSITIVE_INFINITY; // Gamma undefined at non-positive integers
    }
    // ... rest of implementation
};

2. Performance: Duplicate Loop in `ordinaryLeastSquares` (Lines 1013-1018, 1027-1033)

You iterate through validPoints twice to calculate sums. This could be combined into a single loop:

for (const point of validPoints) {
    const dx = point.x - meanX;
    const dy = point.y - meanY;
    sxx += dx * dx;
    sxy += dx * dy;
    sst += dy * dy; // Add this here
}

// After calculating slope/intercept, second loop only for SSE
for (const point of validPoints) {
    const predicted = slope * point.x + intercept;
    const residual = point.y - predicted;
    sse += residual * residual;
}

This is a minor optimization but improves efficiency.

3. Code Quality: `betacf` Convergence Warning (Line 948-950)

The function breaks on convergence but doesn't warn if MAX_ITER is reached without convergence. Consider:

if (Math.abs(del - 1) < EPS) {
    return h;
}
if (m === MAX_ITER) {
    console.warn('betacf failed to converge after', MAX_ITER, 'iterations');
}

4. Numerical Stability: Potential Underflow in `regularizedIncompleteBeta` (Line 961)

The calculation Math.exp(a * Math.log(x) + b * Math.log(1 - x) - lnBeta) could underflow to 0 for extreme values. While your edge case handling is good, you might want to add:

const logFront = a * Math.log(x) + b * Math.log(1 - x) - lnBeta;
if (logFront < -700) { // exp(-700) ≈ 1e-304, near min double
    return x < (a + 1) / (a + b + 2) ? 0 : 1;
}
const front = Math.exp(logFront);

5. UI/UX: HTML Injection Risk (Line 1116)

You're using innerHTML to display regression statistics:

element.innerHTML = `<strong>Formula:</strong> ${formula} ...`;

While formula, rSquaredText, and pValueText are all generated internally (not user input), using textContent for dynamic parts would be safer:

element.innerHTML = '<strong>Formula:</strong> ';
const formulaSpan = document.createElement('span');
formulaSpan.textContent = formula;
element.appendChild(formulaSpan);
// ... etc

Or use textContent for the entire thing if HTML formatting isn't critical.

6. Testing Gap

The PR mentions only syntax testing with python -m compileall. Consider adding:

Unit tests for statistical functions (especially edge cases)
Validation of OLS results against known datasets
Visual regression tests for the dashboard

🔍 Minor Observations

Line 1035: The clamping Math.max(0, Math.min(1, 1 - sse / sst)) is good defensive programming, though mathematically R² should always be in [0,1] for OLS.
Line 1050: The two-tailed p-value calculation 2 * (1 - cdf) is correct for testing if slope ≠ 0.
Formatting: The exponential notation threshold (line 1072) is reasonable, though you might consider scientific notation for very small p-values too.

📊 Statistical Considerations

Why this change matters: The switch from Theil-Sen to OLS is significant:

Theil-Sen: Robust to outliers, non-parametric, but doesn't provide R² or p-values directly
OLS: Assumes normally distributed errors, sensitive to outliers, but provides standard statistical measures

Recommendation: Consider mentioning in documentation or dashboard UI that OLS assumes linear relationship and normally distributed residuals. For publications using this data, users should validate these assumptions.

✅ Security & Performance

Security: No major concerns. All data is internally generated.
Performance: Computational complexity is reasonable for dashboard use. The statistical functions are well-optimized.
Browser Compatibility: Using modern JS features (arrow functions, const/let, template literals). Should work in all modern browsers.

📝 Summary

Approve with minor recommendations. The core implementation is solid and ready to merge. The suggested improvements are mostly defensive programming enhancements rather than critical bugs (except the potential stack overflow in logGamma, which is unlikely to occur with valid inputs but worth addressing).

Priority fixes:

Add input validation to logGamma to prevent stack overflow
Consider adding convergence warnings to numerical functions
Add unit tests for statistical functions

Nice-to-haves:

Optimize the double loop in OLS calculation
Improve numerical stability handling
Document statistical assumptions for users

Great work on this enhancement! The addition of regression statistics will make the dashboard much more valuable for research purposes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-11-04T05:19:58Z

extractor/generate_dashboard.py

+        const formatRegressionFormula = regression => {
+            if (!regression || !Number.isFinite(regression.slope) || !Number.isFinite(regression.intercept)) {
+                return 'Formula unavailable';
+            }
+            const slopeText = formatCoefficient(regression.slope);
+            const intercept = regression.intercept;
+            const interceptAbs = Math.abs(intercept);
+            const interceptText = formatCoefficient(interceptAbs);
+            const sign = intercept >= 0 ? ' + ' : ' - ';
+            return `y = ${slopeText}x${sign}${interceptText}`;


Escape JS template literals inside Python f-string

The new regression helpers embed JavaScript template literals like return `y = ${slopeText}x${sign}${interceptText}`; inside a Python f-string. When the dashboard runs, Python tries to evaluate slopeText, sign, and interceptText (and the similar placeholders in the stats string below) as Python expressions and immediately raises NameError, preventing any HTML from being generated. These ${…} sequences need to be escaped (e.g., ${{…}}) or the block should be converted to a regular triple-quoted string so the JavaScript interpolation is left untouched.

Useful? React with 👍 / 👎.

Switch scatter regression to OLS with statistics

c1a1561

chatgpt-codex-connector bot reviewed Nov 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Switch scatter regression to OLS with statistics #14

Switch scatter regression to OLS with statistics #14

Uh oh!

solresol commented Nov 4, 2025

Uh oh!

claude bot commented Nov 4, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Switch scatter regression to OLS with statistics #14

Are you sure you want to change the base?

Switch scatter regression to OLS with statistics #14

Uh oh!

Conversation

solresol commented Nov 4, 2025

Summary

Testing

Uh oh!

claude bot commented Nov 4, 2025

Code Review: Switch scatter regression to OLS with statistics

✅ Strengths

⚠️ Issues & Recommendations

1. Critical: Potential Stack Overflow in logGamma (Line 904)

2. Performance: Duplicate Loop in ordinaryLeastSquares (Lines 1013-1018, 1027-1033)

3. Code Quality: betacf Convergence Warning (Line 948-950)

4. Numerical Stability: Potential Underflow in regularizedIncompleteBeta (Line 961)

5. UI/UX: HTML Injection Risk (Line 1116)

6. Testing Gap

🔍 Minor Observations

📊 Statistical Considerations

✅ Security & Performance

📝 Summary

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Critical: Potential Stack Overflow in `logGamma` (Line 904)

2. Performance: Duplicate Loop in `ordinaryLeastSquares` (Lines 1013-1018, 1027-1033)

3. Code Quality: `betacf` Convergence Warning (Line 948-950)

4. Numerical Stability: Potential Underflow in `regularizedIncompleteBeta` (Line 961)