Introduce fast processor #1668

plafer · 2025-02-18T22:05:44Z

Work towards: #1558

The current, rather naive implementation achieves 40 MHz on the fibonacci benchmark, so 50 MHz should be quite doable, and maybe even 100 MHz (as hoped), at least for stack-heavy workloads. It's still unclear to me what kind of performance we can expect for workloads that need a lot of memory accesses.

bobbinth · 2025-02-19T01:50:53Z

The current rather naive implementation achieves 40 MHz on the fibonacci benchmark.

Very cool! I wonder where the bottlenecks are here (i.e., is it the match statement that selects between different instructions, or is it something else)? If it is the match statement, one alternative could be to keep an array of function pointers and, instead of matching, look up the function to execute in that array. Not sure if it will be much faster (or much slower) though.
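To make the comparison concrete, here is a minimal sketch of the two dispatch styles side by side. Everything in it is hypothetical and for illustration only: the `Vm` struct, the three opcodes, and the handler names are not taken from the actual processor, and which variant is faster would have to be measured, since the compiler may already lower a dense `match` to a jump table.

```rust
/// Hypothetical, minimal machine state; not the actual processor type.
pub struct Vm {
    pub stack: Vec<u64>,
}

/// Signature shared by every handler in the table.
type Handler = fn(&mut Vm);

fn op_noop(_vm: &mut Vm) {}

fn op_push_one(vm: &mut Vm) {
    vm.stack.push(1);
}

fn op_add(vm: &mut Vm) {
    // Missing operands are treated as zero in this sketch.
    let b = vm.stack.pop().unwrap_or(0);
    let a = vm.stack.pop().unwrap_or(0);
    vm.stack.push(a.wrapping_add(b));
}

/// Hypothetical opcode values; they double as indices into `HANDLERS`.
pub const NOOP: u8 = 0;
pub const PUSH_ONE: u8 = 1;
pub const ADD: u8 = 2;

/// Table of function pointers, indexed by opcode.
static HANDLERS: [Handler; 3] = [op_noop, op_push_one, op_add];

/// Dispatch through the function-pointer table.
/// Panics (index out of bounds) on an opcode that has no table entry.
pub fn run_table(vm: &mut Vm, program: &[u8]) {
    for &opcode in program {
        HANDLERS[opcode as usize](vm);
    }
}

/// Equivalent `match`-based dispatch, for comparison.
pub fn run_match(vm: &mut Vm, program: &[u8]) {
    for &opcode in program {
        match opcode {
            NOOP => op_noop(vm),
            PUSH_ONE => op_push_one(vm),
            ADD => op_add(vm),
            _ => panic!("unknown opcode {}", opcode),
        }
    }
}
```

Both functions produce the same stack for the same program, so they can be swapped in a benchmark without changing anything else. One trade-off to keep in mind: calls through a table are indirect and generally cannot be inlined, whereas the arms of a `match` can.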

bobbinth · 2025-02-21T01:50:31Z

processor/src/fast/mod.rs

+
+    fn execute_op(
+        &mut self,
+        operation: &Operation,


I wonder how much of an impact it has that Operation is pretty heavy. Specifically, I think the memory footprint of a single instance is something like 16 bytes. Maybe the compiler can optimize this away somehow, but if not, loading an operation may require multiple memory reads.
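A quick way to check the footprint is `std::mem::size_of`. The sketch below uses two hypothetical enums (neither is the real `Operation`): one whose variant carries a 64-bit immediate, and one that is only an opcode byte, under the assumption that immediates would then be stored elsewhere.

```rust
use std::mem::size_of;

/// Hypothetical operation type: one variant carries a 64-bit immediate,
/// so every instance needs room for the payload plus a discriminant.
#[allow(dead_code)]
pub enum HeavyOp {
    Add,
    Swap,
    Push(u64),
}

/// Hypothetical compact encoding: just an opcode byte.
/// Assumption: immediates would live in a separate side table.
#[allow(dead_code)]
#[repr(u8)]
pub enum CompactOp {
    Add,
    Swap,
    Push,
}

/// Size in bytes of the payload-carrying enum.
pub fn heavy_op_size() -> usize {
    size_of::<HeavyOp>()
}

/// Size in bytes of the opcode-only enum.
pub fn compact_op_size() -> usize {
    size_of::<CompactOp>()
}
```

On typical 64-bit targets `heavy_op_size()` comes out at 16 bytes (8 for the payload, plus the discriminant padded to the 8-byte alignment), while `compact_op_size()` is 1. Running the same `size_of` check against the real `Operation` would confirm or correct the 16-byte estimate.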

bobbinth · 2025-02-21T02:19:18Z

processor/src/fast/mod.rs

+                match self.stack.len() {
+                    // We're swapping ZERO with ZERO, which is a no-op
+                    0 => (),
+                    // the second element on the stack is implicitly ZERO, so swapping puts a
+                    // ZERO on top
+                    1 => self.stack.push(ZERO),
+                    _ => {
+                        let last = self.stack.len() - 1;
+                        // TODO(plafer): try swap_unchecked
+                        self.stack.swap(last, last - 1);
+                    },
+                }


I wonder if there is a way to avoid handling different cases for the current stack length. Maybe we pre-allocate a vector with a bunch of zeros so that we always have more than 16 values on the stack. Then, we could reduce this to something like:

let last = self.stack.len() - 1;
let ptr = self.stack.as_mut_ptr();
unsafe {
    std::ptr::swap(ptr.add(last), ptr.add(last - 1));
}

This should have something like a 10x smaller cycle count than the current code.
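A self-contained sketch of that idea, under stated assumptions: `PaddedStack`, `MIN_DEPTH`, and the use of `u64` in place of the field element type are all hypothetical, and the backing vector is assumed never to shrink below `MIN_DEPTH` elements (a `pop` would have to preserve that, and is left out here).

```rust
/// Number of always-present zero elements at the bottom of the stack
/// (assumed value for this sketch).
const MIN_DEPTH: usize = 16;

/// Hypothetical stack whose backing vector never holds fewer than
/// `MIN_DEPTH` elements, so the top two slots always exist.
pub struct PaddedStack {
    elements: Vec<u64>,
}

impl PaddedStack {
    /// Create a stack pre-filled with `MIN_DEPTH` zeros.
    pub fn new() -> Self {
        Self { elements: vec![0; MIN_DEPTH] }
    }

    /// Push a value on top of the stack.
    pub fn push(&mut self, value: u64) {
        self.elements.push(value);
    }

    /// Swap the top two elements without branching on the stack length.
    pub fn swap_top(&mut self) {
        let last = self.elements.len() - 1;
        // Invariant: len >= MIN_DEPTH >= 2, so `last - 1` cannot underflow.
        debug_assert!(last >= 1);
        let ptr = self.elements.as_mut_ptr();
        // SAFETY: both indices are in bounds by the invariant above, and
        // `ptr::swap` accepts any two valid, properly aligned pointers.
        unsafe { std::ptr::swap(ptr.add(last), ptr.add(last - 1)) };
    }

    /// Top element of the stack.
    pub fn top(&self) -> u64 {
        self.elements[self.elements.len() - 1]
    }

    /// Second element from the top.
    pub fn second(&self) -> u64 {
        self.elements[self.elements.len() - 2]
    }
}
```

Because the padding consists of zeros, the observable behavior matches the three-case `match` in the diff: swapping on an empty logical stack is a no-op, and swapping with a single element leaves a zero on top. Whether the unsafe version is measurably faster than a safe `self.elements.swap(last, last - 1)` on the padded vector is something a benchmark would need to show.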

plafer added 2 commits February 20, 2025 16:54

feat: introduce fast processor

4176bdf

use smallvec

066005d

plafer force-pushed the plafer-fast-processor branch from a3ad318 to 1d81af7 Compare February 20, 2025 22:40

bobbinth reviewed Feb 21, 2025


plafer added 3 commits February 21, 2025 10:08

opt: execute_basic_block_node executes macro instructions

0bedf80

tests for fast processor

63564fd

feat: add experiments

3c609ba

plafer force-pushed the plafer-fast-processor branch from 1d81af7 to 3c609ba Compare February 21, 2025 18:30
