Introduce fast processor #1668

Draft
wants to merge 5 commits into
base: next
Conversation

plafer
Contributor

@plafer plafer commented Feb 18, 2025

Work towards: #1558

The current rather naive implementation achieves 40 MHz on the fibonacci benchmark. So 50 MHz should be quite doable, and maybe even 100 MHz (as hoped), at least for stack-heavy workloads. It's still unclear to me what kind of performance we can expect for workloads that need a lot of memory accesses.

@bobbinth
Contributor

The current rather naive implementation achieves 40 MHz on the fibonacci benchmark.

Very cool! I wonder where the bottlenecks are here (i.e., is it the match statement that selects between different instructions, or is it something else)? If it is the match statement, one alternative would be to keep an array of function pointers and, instead of matching, look up the function to execute in that array. Not sure if it will be much faster (or much slower) though.
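To make the function-pointer idea concrete, here is a minimal sketch comparing table dispatch with `match` dispatch. Everything in it is an assumption for illustration: the `Op` enum, the `u64` stack, and the handler names are hypothetical stand-ins, not the VM's actual `Operation` type or processor API.

```rust
// Hypothetical, simplified opcode set. The real `Operation` enum has many
// more variants, and some of them carry data.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Op {
    Push1 = 0,
    Add = 1,
    Dup = 2,
}

// Every handler has the same signature so it can live in one table.
type Handler = fn(&mut Vec<u64>);

fn op_push1(stack: &mut Vec<u64>) {
    stack.push(1);
}

fn op_add(stack: &mut Vec<u64>) {
    // Missing operands are treated as zero, mirroring an implicitly
    // zero-padded stack.
    let b = stack.pop().unwrap_or(0);
    let a = stack.pop().unwrap_or(0);
    stack.push(a.wrapping_add(b));
}

fn op_dup(stack: &mut Vec<u64>) {
    let top = stack.last().copied().unwrap_or(0);
    stack.push(top);
}

// Indexed by the opcode discriminant, so the order must match the enum.
const HANDLERS: [Handler; 3] = [op_push1, op_add, op_dup];

/// Dispatch through the function-pointer table.
pub fn run_table(program: &[Op]) -> Vec<u64> {
    let mut stack = Vec::new();
    for &op in program {
        HANDLERS[op as usize](&mut stack);
    }
    stack
}

/// Dispatch through a `match`, for comparison.
pub fn run_match(program: &[Op]) -> Vec<u64> {
    let mut stack = Vec::new();
    for &op in program {
        match op {
            Op::Push1 => op_push1(&mut stack),
            Op::Add => op_add(&mut stack),
            Op::Dup => op_dup(&mut stack),
        }
    }
    stack
}
```

Both functions produce the same stack for the same program, so they can be benchmarked against each other directly. Note that a dense `match` over a fieldless enum is often compiled to a jump table already, and the indirect calls in `run_table` can prevent inlining, so only a measurement will say which is faster here.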

@plafer plafer force-pushed the plafer-fast-processor branch from a3ad318 to 1d81af7 Compare February 20, 2025 22:40

fn execute_op(
    &mut self,
    operation: &Operation,
Contributor


I wonder how much of an impact it has that the Operation is pretty heavy. Specifically, I think the memory footprint of a single instance is something like 16 bytes. Maybe the compiler can optimize this away somehow, but if not, loading an operation may require multiple memory reads.
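The size concern can be checked with `std::mem::size_of`. The sketch below uses two hypothetical enums, `FatOp` and `SlimOp`, which are assumptions for illustration and not the real `Operation` type: one carries payloads inline, the other is a bare one-byte opcode whose immediates would have to be stored elsewhere.

```rust
use std::mem::size_of;

// Hypothetical enum with inline payloads. The `u64` payload has no spare bit
// patterns, so the discriminant needs extra space on top of the 8 payload
// bytes (typically 16 bytes in total on 64-bit targets).
#[allow(dead_code)]
pub enum FatOp {
    Add,
    Swap,
    Push(u64),
    Emit(u32),
}

// Hypothetical fieldless variant: `repr(u8)` fixes the size at one byte.
// Immediates would have to live in a separate side table.
#[allow(dead_code)]
#[repr(u8)]
pub enum SlimOp {
    Add,
    Swap,
    Push,
    Emit,
}

/// Returns `(size of FatOp, size of SlimOp)` in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<FatOp>(), size_of::<SlimOp>())
}
```

Running the same `size_of` check on the actual `Operation` type would confirm or correct the 16-byte estimate. Whether a smaller encoding pays off depends on how often the payload is needed, which this sketch does not measure.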

Comment on lines +184 to +197
match self.stack.len() {
    // We're swapping ZERO with ZERO, which is a no-op
    0 => (),
    // the second element on the stack is implicitly ZERO, so swapping puts a
    // ZERO on top
    1 => self.stack.push(ZERO),
    _ => {
        let last = self.stack.len() - 1;
        // TODO(plafer): try swap_unchecked
        self.stack.swap(last, last - 1);
    },
}
Contributor


I wonder if there is a way to avoid handling different cases for the current stack length. Maybe we pre-allocate a vector with a bunch of zeros so that we always have at least 16 values on the stack. Then, we could reduce this to something like:

let last = self.stack.len() - 1;
let ptr = self.stack.as_mut_ptr();
unsafe {
    std::ptr::swap(ptr.add(last), ptr.add(last - 1));
}

Which should have something like a 10x smaller cycle count than the current code.
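A self-contained sketch of the pre-allocation idea follows, assuming a hypothetical `PaddedStack` type over `u64` values with a `MIN_DEPTH` of 16. These names, the element type, and the initial capacity are illustrative assumptions, not the processor's actual stack. The point is that the unchecked swap is only sound because the type maintains the invariant that the vector never holds fewer than `MIN_DEPTH` elements.

```rust
/// Hypothetical minimum depth: the stack always holds at least this many
/// elements, with zeros standing in for the implicit padding.
const MIN_DEPTH: usize = 16;

pub struct PaddedStack {
    // Top of the stack is the last element of the vector.
    elems: Vec<u64>,
}

impl PaddedStack {
    pub fn new() -> Self {
        let mut elems = Vec::with_capacity(1024);
        elems.resize(MIN_DEPTH, 0);
        Self { elems }
    }

    pub fn push(&mut self, value: u64) {
        self.elems.push(value);
    }

    /// Drops the top element, shifting in a zero at the bottom when needed so
    /// the length never falls below `MIN_DEPTH`.
    pub fn drop_top(&mut self) {
        self.elems.pop();
        if self.elems.len() < MIN_DEPTH {
            self.elems.insert(0, 0);
        }
    }

    /// Returns the element `depth` positions below the top (0 is the top).
    pub fn peek(&self, depth: usize) -> u64 {
        self.elems[self.elems.len() - 1 - depth]
    }

    /// Swaps the top two elements without a length check or bounds checks.
    pub fn swap(&mut self) {
        let last = self.elems.len() - 1;
        let ptr = self.elems.as_mut_ptr();
        // SAFETY: `elems.len() >= MIN_DEPTH >= 2` is maintained by `new`,
        // `push`, and `drop_top`, so `last` and `last - 1` are in bounds.
        unsafe {
            std::ptr::swap(ptr.add(last), ptr.add(last - 1));
        }
    }
}
```

With the padding in place, swapping when only one value has been pushed moves a zero to the top, which matches the behavior of the original three-way `match`. The 10x figure above is an estimate from the review, so it is worth confirming with a benchmark, since `Vec::swap` with bounds checks may already be close once the length branches are gone.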

@plafer plafer force-pushed the plafer-fast-processor branch from 1d81af7 to 3c609ba Compare February 21, 2025 18:30