What if we could redesign the cloud from the ground up?
- We would realize that the foundation of the modern cloud stack is small and can be maintained by a tiny team.
- Despite being fast, networking isn't free. The concept of "disaggregated storage and compute" is great for convenience but not for software efficiency.
- Even the best compilers struggle to vectorize high-level code, resulting in roughly 10× lower hardware utilization for computational workloads (see the SIMD sketch after this list).
- Most operating systems expose specialized mechanisms to accelerate I/O, yet 99% of modern software never uses them, paying roughly 10× higher latency for networking and storage (see the io_uring sketch after this list).
- Data-center servers often feature purpose-built accelerators, meaning no single, simple build toolchain is likely to handle bleeding-edge heterogeneous software.
- Reimplementing solutions in every language isn't feasible. A ubiquitous language with a well-defined standard and strong industry support, like C99 or C++17, should be used.
- Consolidating all cloud technology into a mono-repo increases component interdependencies, hindering the adoption of individual parts. A modular design with clear isolation is preferable.
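To make the vectorization point concrete, here is a minimal sketch, not Unum code, of an explicitly vectorized dot product next to its scalar equivalent. The function names `dot_scalar` and `dot_avx2` are hypothetical, and the kernel assumes an x86-64 CPU with AVX2 and FMA support.

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar baseline: one multiply-add per loop iteration.
float dot_scalar(float const *a, float const *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i != n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Hand-vectorized version: eight multiply-adds per iteration via AVX2 + FMA.
// Compile with, e.g., `-mavx2 -mfma`.
float dot_avx2(float const *a, float const *b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb, element-wise
    }
    // Reduce the eight partial sums in `acc` to a single float.
    __m128 low = _mm256_castps256_ps128(acc);
    __m128 high = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(low, high);
    sum4 = _mm_hadd_ps(sum4, sum4);
    sum4 = _mm_hadd_ps(sum4, sum4);
    float sum = _mm_cvtss_f32(sum4);
    // Handle the remaining (n % 8) elements with scalar code.
    for (; i != n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The explicit version processes eight floats per iteration with fused multiply-adds, the kind of throughput general-purpose compilers rarely reach from idiomatic high-level code.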
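To illustrate the I/O point, below is a minimal sketch of asynchronous file reading through Linux io_uring via the liburing helpers, one of the kernel interfaces mentioned in the footnotes. The file path is arbitrary and error handling is abbreviated, so treat it as an outline rather than production code.

```cpp
// Requires Linux 5.1+ and linking with `-luring`.
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) // 8-entry submission queue
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);  // arbitrary example file
    if (fd < 0)
        return 1;

    char buffer[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);       // grab a submission slot
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);   // describe the read
    io_uring_submit(&ring);                                    // hand it to the kernel

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                             // wait for completion
    if (cqe->res > 0)
        fwrite(buffer, 1, (size_t)cqe->res, stdout);
    io_uring_cqe_seen(&ring, cqe);                              // mark completion as consumed

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

Requests are queued and completed through shared rings rather than one blocking syscall per operation, which is what makes batched, low-latency storage and networking paths possible for the software that bothers to use them.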
Since 2015, Unum has been striving to meet all these conditions. We have developed a unified framework consisting of concise, low-level implementations for storage, computing, and AI modeling systems, all designed with efficiency in mind. This endeavor compelled us to build our own liquid-cooled clusters for R&D, collaborate closely with multiple cloud providers, and implement assembly-level optimizations rarely seen elsewhere in the software industry.
Today, some of our projects run on hundreds of millions of devices, trusted by unicorns, decacorns, trillion-dollar tech companies, governments, and even intelligence agencies. Our primary goal is to power the next generation of computing, focusing on applications in AI and computational science. Since 2022, we've been increasingly open-sourcing our work and look forward to sharing much more soon!
1. Most database management systems are built on top of just a few key-value stores, like RocksDB. Proximity graphs and algorithms like HNSW can replace most indexing data structures. Most networking is built on top of TCP/IP and relies on just a few algorithms. Similar statements hold true for numeric libraries, machine learning frameworks, and even the models built on top of them.
2. InfiniBand now powers the majority of TOP500 supercomputers, and Remote Direct Memory Access provides convenient abstractions for users, but its latency is still orders of magnitude higher than accessing local memory.
3. Our optimizations span SIMD instructions across the AVX2 and AVX-512 generations, NEON, SVE, SVE2, Intel and Apple AMX variants, SME, WMMA, and other NVIDIA extensions.
4. We utilize SPDK, DPDK, and io_uring for Linux kernel bypass.
5. The last 10 years of attempts to build heterogeneous compilers, like SYCL, have failed, so multiple tools have to be used in conjunction.
6. Many mechanisms exist for implementing language bindings. Unum generally focuses on Python, Rust, and JavaScript as the primary languages of machine learning and the web.