SimpleRA is a high-performance relational algebra engine optimized for large datasets that exceed available main memory. It implements SQL-like operations using advanced algorithms and data structures, featuring block-based storage, B+ tree indexing, and memory-conscious design.
- Block-based Storage: Matrices partitioned into 15×15 blocks with row-major storage
- Matrix Operations:
- LOAD MATRIX: Loads a matrix from CSV into the block structure
- ROTATE: Efficient clockwise rotation via in-place diagonal transposition and off-diagonal swapping
- CROSSTRANSPOSE: Atomic in-place transposition of two matrices with name swapping
- CHECKANTISYM: Validates anti-symmetry (A = -Bᵀ) with transposition reversion
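The rotation idea above can be sketched on a plain in-memory square matrix (SimpleRA applies it block-wise to its 15×15 blocks): a 90° clockwise rotation is a transpose across the main diagonal followed by reversing each row. `rotateClockwise` is a hypothetical name for illustration, not the engine's actual function.

```cpp
#include <algorithm>
#include <vector>

// Rotate an N x N matrix 90 degrees clockwise in place:
// first transpose across the main diagonal, then reverse each row.
void rotateClockwise(std::vector<std::vector<int>>& m) {
    const size_t n = m.size();
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            std::swap(m[i][j], m[j][i]);       // diagonal transposition
    for (auto& row : m)
        std::reverse(row.begin(), row.end());  // off-diagonal (column) swap
}
```

Both passes touch each element a constant number of times, so the rotation needs no auxiliary matrix.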
- Cache Management:
- The `clearPool` function ensures cache consistency by invalidating stale pages
- External Sorting:
- Sorting Phase: (Nb-1) blocks sorted in-memory using std::sort
- Merging Phase: Priority queue-based k-way merge with O(log Nr) complexity
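The merging phase can be sketched with an in-memory min-heap over sorted runs (on disk, each run entry would be the front of a block rather than a whole vector); each pop/push costs O(log k) for k open runs. `kWayMerge` is an illustrative name, not the engine's API.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// K-way merge of sorted runs using a min-heap keyed on the next
// unconsumed value of each run.
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::pair<size_t, size_t>>;  // value, (run, offset)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], {r, 0}});
    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, pos] = heap.top();
        heap.pop();
        out.push_back(val);
        auto [r, i] = pos;
        if (i + 1 < runs[r].size())           // refill from the same run
            heap.push({runs[r][i + 1], {r, i + 1}});
    }
    return out;
}
```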
- ORDER BY: Creates sorted tables with ASC/DESC support
- GROUP BY:
- Temporary table sorting by grouping attribute
- Aggregation functions (SUM, AVG, MIN, MAX, COUNT)
- HAVING clause for grouped result filtering
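The sort-based GROUP BY pipeline above can be sketched for a single SUM aggregate: sort by the grouping attribute, scan each run of equal keys, and filter aggregated groups with the HAVING predicate. The names `Rec` and `groupBySum` are illustrative, not SimpleRA's own.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct Rec { int group; int value; };

// GROUP BY via sorting: sort on the grouping attribute, aggregate each
// run of equal keys (SUM here), and keep groups passing the HAVING test.
std::vector<std::pair<int, int>> groupBySum(std::vector<Rec> rows, int havingMin) {
    std::sort(rows.begin(), rows.end(),
              [](const Rec& a, const Rec& b) { return a.group < b.group; });
    std::vector<std::pair<int, int>> out;  // (group, SUM(value))
    for (size_t i = 0; i < rows.size(); ) {
        int g = rows[i].group, sum = 0;
        while (i < rows.size() && rows[i].group == g) sum += rows[i++].value;
        if (sum >= havingMin)              // HAVING SUM(value) >= havingMin
            out.push_back({g, sum});
    }
    return out;
}
```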
- Partitioned Hash JOIN:
- Hashes smaller table for efficient probing
- Preserves column order (table1 cols + table2 cols)
- B+ Tree Indexing:
- On-demand creation at first SEARCH
- Disk-based nodes (1 node = 1 block)
- Composite keys (value, count) for duplicate handling
- Leaf node linking for efficient range queries
- Operator-Specific Search:
- `<`/`<=`: Leftmost leaf traversal
- `>`: Smallest key search + rightward scan
- `==`/`>=`: Lower bound search
- `!=`: Dual search combination
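The operator dispatch can be sketched over an ordered index, with `std::map` standing in for the B+ tree (`lower_bound`/`upper_bound` mirror descending to the leaf level, and in-order iteration mirrors the linked leaf scan). `indexSearch` is an illustrative name.

```cpp
#include <map>
#include <string>
#include <vector>

// Operator-specific search over a sorted index: key -> row id.
std::vector<int> indexSearch(const std::map<int, int>& idx,
                             const std::string& op, int v) {
    std::vector<int> out;
    auto collect = [&](auto lo, auto hi) {
        for (auto it = lo; it != hi; ++it) out.push_back(it->second);
    };
    if (op == "<")       collect(idx.begin(), idx.lower_bound(v));
    else if (op == "<=") collect(idx.begin(), idx.upper_bound(v));
    else if (op == "==") collect(idx.lower_bound(v), idx.upper_bound(v));
    else if (op == ">=") collect(idx.lower_bound(v), idx.end());
    else if (op == ">")  collect(idx.upper_bound(v), idx.end());
    else if (op == "!=") {                        // dual search: below + above v
        collect(idx.begin(), idx.lower_bound(v));
        collect(idx.upper_bound(v), idx.end());
    }
    return out;
}
```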
- Atomic Operations:
- INSERT: Appends to last page + index updates
- UPDATE: DELETE old record + INSERT new value
- DELETE: Lazy deletion with `del_marker` and a 25% threshold rebuild
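The lazy-deletion policy can be sketched in a few lines: flag rows instead of rewriting pages, and rebuild physically only once more than 25% of rows are flagged. `LazyTable` and `rebuild` are illustrative stand-ins (the source names the rebuild routine `loadagain()`).

```cpp
#include <algorithm>
#include <vector>

struct TRow { int key; bool del_marker = false; };

// Lazy deletion: logical flagging, physical rebuild past a 25% threshold.
struct LazyTable {
    std::vector<TRow> rows;
    size_t deleted = 0;

    void lazyDelete(int key) {
        for (TRow& r : rows)
            if (!r.del_marker && r.key == key) { r.del_marker = true; ++deleted; }
        if (deleted * 4 > rows.size())   // strictly more than 25% flagged
            rebuild();
    }
    void rebuild() {                     // drop flagged rows in one pass
        rows.erase(std::remove_if(rows.begin(), rows.end(),
                   [](const TRow& r) { return r.del_marker; }), rows.end());
        deleted = 0;
    }
};
```

The flag write is a single cheap update per deleted row; the expensive full rewrite is amortized across many deletions.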
- 10-Block Limit: Strict adherence to memory constraints
- B+ Tree Optimization: Only 3 blocks loaded simultaneously (current node, parent, child)
- File-Based Intermediate Storage:
- `.srch` files for SEARCH results (prevents memory overflow)
- `.dlt` files for batched DELETE operations
- `.upd` files for batched UPDATE operations
- Efficient Result Handling: SEARCH results written directly to disk files
- Batched Block Operations: Read/write full blocks instead of row-wise access
- Lazy Deletion Strategy:
- Logical deletion via `del_marker` (minimizes disk writes)
- Physical rebuild only when >25% of rows are deleted (reduces costly operations)
- Sequential Access Patterns: Leaf node linking enables efficient range scans
- O(log n) Complexity: All index operations (search/insert/delete/update)
- External Sorting: Minimizes disk accesses through:
- Run generation with (Nb-1) blocks
- Priority queue-based merging
- Composite Key Handling:
- `cnt_search()` dynamically resolves duplicates
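The composite-key idea can be sketched with `std::map` standing in for the B+ tree: pairing each value with a running per-value counter makes duplicate values distinct keys with a stable order. The `nextCount` map below is a hypothetical stand-in for the counter lookup that the source attributes to `cnt_search()`.

```cpp
#include <map>
#include <utility>

// Composite keys (value, count): duplicates of the same value get
// successive counts, so the ordered index can store all of them.
struct CompositeIndex {
    std::map<std::pair<int, int>, int> tree;  // (value, count) -> row id
    std::map<int, int> nextCount;             // next free count per value

    void insert(int value, int rowId) {
        tree[{value, nextCount[value]++}] = rowId;
    }
};
```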
| Operation | Without Index | With B+ Tree Index | Improvement Factor |
|---|---|---|---|
| Search | O(n) scan | O(log n) | 1000x (1M rows) |
| Range Query | O(n) | O(log n + k) | 500x (1M rows) |
| Insert | O(1) | O(log n) | Negligible overhead |
| Delete | O(n) | O(log n) | 800x (1M rows) |
| Update | O(n) | O(log n) | 800x (1M rows) |
```shell
cd SimpleRA/src
make clean
make
./server
```
- Node Structure: Internal/leaf nodes sized to match disk blocks
- Storage: `.idx` files organized in `/data/temp/<table>/<column>/`
- Duplicate Handling: Composite keys `(value, count)` maintain stable ordering
- Range Queries: Leaf node linking enables efficient sequential access
- Logical Deletion:
- Set `del_marker = 1`
- Update indexes by removing references
- Physical Rebuild (when >25% rows deleted):
- `loadagain()` reconstructs the table without deleted rows
- All indexes rebuilt for consistency
- Batched Processing:
- `.dlt` files group deletions by page
- Search: Locate target rows using B+ Tree
- Modify: Update values in memory
- Delete: Remove original rows via `executeDELETE()`
- Insert: Add modified rows as new records via `executeINSERT()`
- File Handling: `.upd` files batch modifications between steps
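The delete-then-insert decomposition above can be sketched on a single-column table: matching values are removed, then the modified copies are appended as new records, mirroring the `executeDELETE()` / `executeINSERT()` steps. `updateValue` is an illustrative name.

```cpp
#include <algorithm>
#include <vector>

// UPDATE as DELETE + INSERT: remove rows matching oldVal, then append
// the same number of rows holding newVal.
void updateValue(std::vector<int>& table, int oldVal, int newVal) {
    auto n = std::count(table.begin(), table.end(), oldVal);
    table.erase(std::remove(table.begin(), table.end(), oldVal),
                table.end());              // DELETE step
    table.insert(table.end(), n, newVal);  // INSERT step (appends to the end)
}
```

Appending instead of rewriting in place keeps the write pattern identical to a plain INSERT, so the same index-maintenance path handles both.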