Introduce fast processor #1668
base: next
Very cool! I wonder where the bottlenecks are here (i.e., is it the
fn execute_op(
    &mut self,
    operation: &Operation,
I wonder how much of an impact it has that Operation is pretty heavy. Specifically, I think the memory footprint of a single instance is something like 16 bytes. Maybe the compiler can optimize this away somehow, but if not, loading an operation may require multiple memory reads.
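As a rough illustration of where a 16-byte footprint could come from, here is a minimal sketch comparing two made-up enums. `CompactOp`, `PayloadOp`, and `op_sizes` are hypothetical names and are not the project's actual `Operation` type; the sizes assume a typical 64-bit target where `u64` is 8-byte aligned.

```rust
use std::mem::size_of;

/// Hypothetical fieldless opcode: only a discriminant, no payload.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CompactOp {
    Noop,
    Swap,
    Add,
}

/// Hypothetical opcode where one variant carries a 64-bit immediate.
/// The 1-byte tag plus the 8-byte-aligned payload pads the whole enum
/// out to 16 bytes on typical 64-bit targets.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PayloadOp {
    Noop,
    Swap,
    Push(u64),
}

/// Returns (size of CompactOp, size of PayloadOp) in bytes.
pub fn op_sizes() -> (usize, usize) {
    (size_of::<CompactOp>(), size_of::<PayloadOp>())
}
```

Calling `std::mem::size_of::<Operation>()` on the real type would confirm or refute the 16-byte estimate directly; whether that size matters in practice is something only profiling can answer.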
match self.stack.len() {
    // We're swapping ZERO with ZERO, which is a no-op
    0 => (),
    // the second element on the stack is implicitly ZERO, so swapping puts a
    // ZERO on top
    1 => self.stack.push(ZERO),
    _ => {
        let last = self.stack.len() - 1;
        // TODO(plafer): try swap_unchecked
        self.stack.swap(last, last - 1);
    },
}
I wonder if there is a way to avoid handling different cases for the current stack length. Maybe we could pre-allocate a vector with a bunch of zeros so that there are always more than 16 values on the stack. Then, we could reduce this to something like:
let last = self.stack.len() - 1;
let ptr = self.stack.as_mut_ptr();
unsafe {
    std::ptr::swap(ptr.add(last), ptr.add(last - 1));
}
This should have something like a 10x smaller cycle count than the current code.
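To make the pre-allocation idea concrete, here is a minimal self-contained sketch. `PaddedStack`, `MIN_DEPTH`, and `top_two` are hypothetical names, and `u64` stands in for the field element type; the sketch only supports pushing, so a real version would also need its pop path to preserve the minimum-depth invariant.

```rust
/// Number of zero elements pre-allocated at the bottom of the stack
/// (an assumed value for this sketch).
const MIN_DEPTH: usize = 16;

pub struct PaddedStack {
    // Invariant: values.len() >= MIN_DEPTH, because this sketch never pops.
    values: Vec<u64>,
}

impl PaddedStack {
    /// Creates a stack that already holds MIN_DEPTH zeros.
    pub fn new() -> Self {
        Self { values: vec![0; MIN_DEPTH] }
    }

    pub fn push(&mut self, value: u64) {
        self.values.push(value);
    }

    /// Swaps the top two elements without matching on the stack length.
    pub fn swap_top_two(&mut self) {
        let last = self.values.len() - 1;
        debug_assert!(last >= 1);
        let ptr = self.values.as_mut_ptr();
        // SAFETY: the invariant guarantees len >= MIN_DEPTH >= 2, so both
        // `last` and `last - 1` are in bounds of the vector's buffer.
        unsafe {
            std::ptr::swap(ptr.add(last), ptr.add(last - 1));
        }
    }

    /// Returns (top, second from top).
    pub fn top_two(&self) -> (u64, u64) {
        let last = self.values.len() - 1;
        (self.values[last], self.values[last - 1])
    }
}
```

On a fresh `PaddedStack`, `swap_top_two` exchanges two zeros, which matches the no-op case of the original `match`. The sketch does not measure anything, so any speedup would still need to be confirmed with a benchmark.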
Work towards: #1558
The current, rather naive implementation achieves 40 MHz on the Fibonacci benchmark. So achieving 50 MHz should be quite doable, and maybe even 100 MHz (as hoped), at least for stack-heavy workloads. It's still unclear to me what kind of performance we can expect for workloads that need a lot of memory accesses.
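For reference, a small sketch of how such a figure could be derived from a benchmark run, assuming "MHz" here means millions of VM cycles executed per second of wall-clock time. `throughput_mhz` and `nanos_per_cycle` are hypothetical helpers, not part of the project's benchmark harness.

```rust
use std::time::Duration;

/// Millions of VM cycles executed per second of wall-clock time.
pub fn throughput_mhz(cycles: u64, elapsed: Duration) -> f64 {
    cycles as f64 / elapsed.as_secs_f64() / 1e6
}

/// Average time budget per VM cycle, in nanoseconds, at a given throughput.
pub fn nanos_per_cycle(mhz: f64) -> f64 {
    1_000.0 / mhz
}
```

Under that assumption, 40 MHz leaves about 25 ns per VM cycle, and the 100 MHz target leaves about 10 ns.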