fix(sql): nil pointer panics, race conditions, and resource leaks in SQL outputs#775
fix(sql): nil pointer panics, race conditions, and resource leaks in SQL outputs#775erkattak wants to merge 5 commits intowarpstreamlabs:mainfrom
Conversation
| s.dbMut.Lock() | ||
| _ = s.db.Close() | ||
| s.db = nil | ||
| s.dbMut.Unlock() |
There was a problem hiding this comment.
Q: should we be checking if s.db is already nil here before proceeding with the close?
|
|
||
| s.dbMut.Lock() | ||
| _ = s.db.Close() | ||
| s.db = nil |
There was a problem hiding this comment.
q: same as above -- should we be doing a nil check here?
|
I think I've addressed feedback |
|
@erkattak OK great! I'll give this a look over the weekend or on Monday. I'm going to do a full SQL integration test run with these changes in the meanwhile 👍 |
| } | ||
|
|
||
| func (s *sqlCache) Get(ctx context.Context, key string) (value []byte, err error) { | ||
| s.dbMut.RLock() |
There was a problem hiding this comment.
Do you know what happens if the s.db we pass in is actually nil in these calls? Wonder if we should be checking that
| return err | ||
| } | ||
| defer func() { | ||
| _ = tx.Rollback() |
There was a problem hiding this comment.
Why would we want to always be rolling back? Surely we want to only do this on error?
| require.NoError(t, insertOutput.Close(context.Background())) | ||
| } | ||
|
|
||
| func TestSQLInsertOutputWriteBatchWithNilDB(t *testing.T) { |
There was a problem hiding this comment.
Wdyt of a test where we run Connect(), perform these concurrent writes, and then at some point trigger a Close() of the output (while write operations are busy executing / still scheduled to execute) which should close the DB and set the attribute to nil.
Then we can get a more e2e illustration of this new behaviour/test
There was a problem hiding this comment.
Also, think if we could replicate the original scenario you were encountering into an integration test that'd be even better than these unit tests i.e
bento/internal/impl/sql/integration_test.go
Line 1551 in d3d5889
This fixes a collection of bugs in the SQL output, processor, and cache components. Most are latent race conditions or resource leaks that show up under load or during shutdown.
We experienced a bug that would be fixed by these changes in production during a database upgrade.
Fixes #770
Changes
output_sql_insert,processor_sql_insert: Replaced scatteredtx.Rollback()calls withdefer tx.Rollback()so rollback always fires on early return. Addeddefer stmt.Close()to ensure prepared statements are released.output_sql_raw,processor_sql_raw,processor_sql_select,input_sql_select: Setdb = nilafterdb.Close()in shutdown goroutines. Added a nil-guard inwriteBatch- previously a write after shutdown would panic rather than returnErrNotConnected.processor_sql_raw,processor_sql_select: Addedrows.Close()after iterating result sets. Without this, DB connections were held open longer than needed.cache_sql: Added async.RWMutexaround thedbfield. The shutdown goroutine and all cache methods (Get,Set,Add,Delete) were accessing it without synchronization.conn_fields: Fixed an inverted condition inreworkDSNfor ClickHouse legacy TCP DSNs - thepassword == ""branch was backwards, so username-only connections got the wrong user info.Explanation
The shutdown race could cause a nil dereference mid-write. Prepared statements leaked on partial failures. Rows from raw/select processors were never explicitly closed, holding connections open. The DSN bug silently misconfigured ClickHouse connections that had a username but no password.
Tests cover the
reworkDSNlogic (including the fixed inversion) and nil-connection behavior forsql_insertandsql_rawoutputs.