# CogDB Performance Roadmap

> Weekly release cadence - tracking performance issues and optimizations

## 🔴 High Priority

### Star Graph / High-Degree Vertex Performance Degradation
**Discovered:** 2024-12-10 via benchmark
**Issue:** When many edges are inserted from or to the same vertex, insert throughput degrades severely as the vertex's degree grows.

| Edges | Speed | Degradation |
|-------|-------|-------------|
| 100 | 569 edges/s | baseline |
| 500 | 633 edges/s | - |
| 1,000 | 338 edges/s | 47% slower |
| 5,000 | 83 edges/s | **87% slower** |

**Root Cause:** `put_set()` in `database.py` traverses a linked list to check for duplicates. That check is O(n) per insert, so inserting n edges on one high-degree vertex is O(n²) overall.

**Location:** `database.py:241-277` (`put_set` method)

**Potential Fix:**
1. Use a hash-based set for duplicate checking instead of linked-list traversal (sketched below)
2. Consider a Bloom filter for fast "definitely not present" checks
3. Or maintain an in-memory index of vertex adjacencies

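A minimal sketch of options 1 and 3, assuming an in-memory structure sits in front of the existing linked-list storage; `AdjacencyIndex`, its method names, and the `append_to_set` hook are hypothetical, not CogDB's actual API:

```python
# Hypothetical sketch: replace the O(n) linked-list duplicate scan with an
# O(1) membership check against an in-memory set per (vertex, predicate).
from collections import defaultdict


class AdjacencyIndex:
    """Tracks which out-vertices have already been stored for each (vertex, predicate)."""

    def __init__(self):
        self._seen = defaultdict(set)  # (vertex, predicate) -> {out_vertex, ...}

    def add_if_new(self, vertex, predicate, out_vertex):
        """Return True if the edge is new and should be appended to the on-disk set."""
        bucket = self._seen[(vertex, predicate)]
        if out_vertex in bucket:       # O(1) average, vs. walking the linked list
            return False
        bucket.add(out_vertex)
        return True


# Inside put_set(), the duplicate check would then become roughly:
#   if index.add_if_new(vertex, predicate, out_vertex):
#       append_to_set(...)   # existing linked-list append, unchanged
```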
---

## 🟡 Medium Priority

### Unbounded Cache Growth
**Issue:** The cache in `cache.py` grows without bound; there is no eviction policy.
**Fix:** Implement an LRU cache with `collections.OrderedDict` (see the sketch below)
**Effort:** Low

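A minimal sketch of the proposed LRU policy, using `collections.OrderedDict` as suggested above; the class name, default capacity, and the way it would plug into `cache.py` are assumptions:

```python
# LRU cache sketch: most recently used entries are kept at the end of the
# OrderedDict, and the oldest entry is evicted once capacity is exceeded.
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:  # evict least recently used entry
            self._data.popitem(last=False)
```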
### Redundant Table Switches in put_node
**Issue:** `put_node()` calls `use_table()` 5 times per edge insert.
**Fix:** Cache the table references within the method (see the sketch below)
**Effort:** Low

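A toy sketch of the refactor; `put_node()`'s real body is not shown in this roadmap, so the table names and call pattern below are assumptions that only illustrate resolving each table handle once per call:

```python
# Toy model of the refactor: resolve each table handle once per put_node()
# call instead of on every access.
NODE_TABLE, EDGE_TABLE = "nodes", "edges"


class GraphStore:
    def __init__(self):
        self.tables = {NODE_TABLE: [], EDGE_TABLE: []}
        self.switches = 0

    def use_table(self, name):
        self.switches += 1          # each call is a table switch we want to avoid
        return self.tables[name]

    def put_node(self, from_v, predicate, to_v):
        node_table = self.use_table(NODE_TABLE)   # resolved once and reused below
        edge_table = self.use_table(EDGE_TABLE)   # resolved once and reused below
        node_table.append(from_v)
        node_table.append(to_v)
        edge_table.append((from_v, predicate, to_v))
```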
---

## 🟢 Low Priority / Nice to Have

### Efficient Serialization
**Issue:** `Record.marshal()` builds records with repeated string concatenation via `+`
**Fix:** Use a `bytearray` for efficient appends (see the sketch below)
**Effort:** Low, ~5% improvement

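A sketch of the proposed change; the field layout of `Record` is an assumption, and only the `bytearray`-versus-`+` pattern is the point:

```python
def marshal_concat(fields):
    """Current style: repeated `+` creates a new bytes object on every step."""
    out = b""
    for f in fields:
        out = out + f.encode() + b"\x00"
    return out


def marshal_bytearray(fields):
    """Proposed style: bytearray appends amortize to O(total length)."""
    buf = bytearray()
    for f in fields:
        buf.extend(f.encode())
        buf.append(0)                 # same 0x00 field separator as above
    return bytes(buf)
```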
### Configurable Auto-Flush
**Issue:** Flushing is currently all-or-nothing (batch mode on/off)
**Fix:** Add a config option for flush frequency, e.g. flush every N records (see the sketch below)
**Effort:** Low

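A hypothetical sketch of the flush-frequency option; the name `flush_every` and where the counter would hook into the store are assumptions:

```python
class FlushPolicy:
    """Decides when a write path should call flush()."""

    def __init__(self, flush_every=1):
        # flush_every=1 reproduces today's per-record flushing;
        # larger values trade durability for write throughput.
        self.flush_every = flush_every
        self._pending = 0

    def record_written(self):
        """Return True when the caller should flush now."""
        self._pending += 1
        if self._pending >= self.flush_every:
            self._pending = 0
            return True
        return False
```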
---

## ✅ Completed

### v3.1.0 (2024-12-10)
- [x] Batch flush mode: defer `flush()` during bulk inserts
- [x] `Graph.put_batch()` method for efficient bulk loading (usage sketch below)
- [x] Comprehensive benchmark suite (`test/benchmark.py`)
- [x] ~1.6x speedup for large batch inserts

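An assumed usage example for the new batch API; this roadmap does not spell out the import path or the exact signature of `put_batch()`, so both are guesses to be adjusted to the actual implementation:

```python
# Assumed usage: load a list of triples in one call so flush() is deferred
# until the whole batch is written (the v3.1.0 change).
from cog.torque import Graph   # import path assumed

g = Graph("social")

edges = [("alice", "follows", "bob"),
         ("bob", "follows", "carol"),
         ("carol", "follows", "alice")]

g.put_batch(edges)             # signature assumed
```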
---

## Benchmark Baselines (v3.1.0)

```
Chain graph (batch, 5000 edges): 4,377 edges/s
Social graph (12,492 edges):     3,233 edges/s
Dense graph (985 edges):         2,585 edges/s
Read performance:                20,000+ ops/s
```

Run benchmarks: `python3 test/benchmark.py`