fix(sql): nil pointer panics, race conditions, and resource leaks in SQL outputs #775

Open
erkattak wants to merge 5 commits into warpstreamlabs:main from erkattak:fix/impl-sql-bugs

Conversation

@erkattak

This fixes a collection of bugs in the SQL output, processor, and cache components. Most are latent race conditions or resource leaks that show up under load or during shutdown.

We hit a bug in production during a database upgrade that these changes would have fixed.

Fixes #770

Changes

  • output_sql_insert, processor_sql_insert: Replaced scattered tx.Rollback() calls with defer tx.Rollback() so rollback always fires on early return. Added defer stmt.Close() to ensure prepared statements are released.
  • output_sql_raw, processor_sql_raw, processor_sql_select, input_sql_select: Set db = nil after db.Close() in shutdown goroutines. Added a nil-guard in writeBatch; previously, a write after shutdown would panic rather than return ErrNotConnected.
  • processor_sql_raw, processor_sql_select: Added rows.Close() after iterating result sets. Without this, DB connections were held open longer than needed.
  • cache_sql: Added a sync.RWMutex around the db field. The shutdown goroutine and all cache methods (Get, Set, Add, Delete) were accessing it without synchronization.
  • conn_fields: Fixed an inverted condition in reworkDSN for ClickHouse legacy TCP DSNs - the password == "" branch was backwards, so username-only connections got the wrong user info.
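
To illustrate the kind of inversion described in the last bullet, here is a minimal sketch. `userInfoFor` is a hypothetical helper, not the PR's `reworkDSN`, and the host and credentials are made up. It only shows how `net/url` distinguishes a username-only user info from a username plus password.

```go
package main

import (
	"fmt"
	"net/url"
)

// userInfoFor is a hypothetical helper (not the actual reworkDSN code).
// It picks url.User when no password is given and url.UserPassword otherwise.
func userInfoFor(username, password string) *url.Userinfo {
	if password == "" {
		// Username only: renders as "user", with no trailing colon.
		return url.User(username)
	}
	return url.UserPassword(username, password)
}

func main() {
	// Assumed example DSN parts; the real DSN layout is defined by the PR.
	u := url.URL{Scheme: "clickhouse", Host: "localhost:9000"}

	u.User = userInfoFor("default", "")
	fmt.Println(u.String())

	u.User = userInfoFor("default", "secret")
	fmt.Println(u.String())
}
```

If the two branches were swapped, the username-only case would go through `url.UserPassword` with an empty password and render as `default:`. That is one way such an inversion can show up in the resulting DSN.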

Explanation

The shutdown race could cause a nil dereference mid-write. Prepared statements leaked on partial failures. Rows from raw/select processors were never explicitly closed, holding connections open. The DSN bug silently misconfigured ClickHouse connections that had a username but no password.

Tests cover the reworkDSN logic (including the fixed inversion) and nil-connection behavior for sql_insert and sql_raw outputs.

s.dbMut.Lock()
_ = s.db.Close()
s.db = nil
s.dbMut.Unlock()
Collaborator

Q: should we be checking if s.db is already nil here before proceeding with the close?

Author

Yep, probably. I'll adjust


s.dbMut.Lock()
_ = s.db.Close()
s.db = nil
Collaborator

q: same as above -- should we be doing a nil check here?

@erkattak
Author

I think I've addressed feedback

@gregfurman
Collaborator

@erkattak OK great! I'll give this a look over the weekend or on Monday.

I'm going to do a full SQL integration test run with these changes in the meantime 👍

}

func (s *sqlCache) Get(ctx context.Context, key string) (value []byte, err error) {
s.dbMut.RLock()
Collaborator

Do you know what happens if the s.db we pass in is actually nil in these calls? Wonder if we should be checking that
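
For reference on the question above, a minimal sketch, assuming `s.db` is a plain `*sql.DB`. As far as I understand the standard library, methods such as `QueryRowContext` dereference the handle internally, so calling them on a nil `*sql.DB` panics rather than returning an error. `queryPanics` is a hypothetical helper that just observes this with `recover`.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// queryPanics reports whether running a query on db panics.
// It is a hypothetical probe, not part of the PR.
func queryPanics(db *sql.DB) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	// No driver is needed: with a nil handle the call never reaches one.
	_ = db.QueryRowContext(context.Background(), "SELECT 1")
	return false
}

func main() {
	var db *sql.DB // nil handle, as it would be after shutdown sets it to nil
	fmt.Println("panicked:", queryPanics(db))
}
```

If that holds, an explicit nil check under the read lock in Get, Set, Add, and Delete would turn the panic into an ordinary error.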

return err
}
defer func() {
_ = tx.Rollback()
Collaborator

Why would we want to always be rolling back? Surely we want to only do this on error?

require.NoError(t, insertOutput.Close(context.Background()))
}

func TestSQLInsertOutputWriteBatchWithNilDB(t *testing.T) {
Collaborator

Wdyt of a test where we run Connect(), perform these concurrent writes, and then at some point trigger a Close() of the output (while write operations are busy executing / still scheduled to execute) which should close the DB and set the attribute to nil.

Then we can get a more e2e illustration of this new behaviour/test

Collaborator

@gregfurman Mar 30, 2026

Also, I think if we could replicate the original scenario you were encountering in an integration test, that'd be even better than these unit tests, e.g.:

func TestIntegrationCheckReconnectLogic(t *testing.T) {

Development

Successfully merging this pull request may close these issues.

Potential nil pointer panic and race conditions in sql_insert and sql_raw outputs

2 participants