
Add support for AssemblyScript #35

Closed
Dudeplayz opened this issue Apr 16, 2020 · 109 comments
Labels
help wanted, performance

Comments

@Dudeplayz
Contributor

Dudeplayz commented Apr 16, 2020

Hey Uri,
We have already spoken about the possibilities to speed up the simulation. I am interested in adding support for AssemblyScript. I followed the recent discussions and speed-ups. The question is whether you still think it can bring some performance gains, even after the enhancements that were made?

@urish
Contributor

urish commented Apr 16, 2020

That's a good question!

One challenge with AssemblyScript is that the code can no longer be easily extended or modified from the JavaScript realm, and communication between JS and the AssemblyScript code can be quite costly, so we'll want to keep it to a minimum. For starters, I'd suggest trying to create a version of the demo project that works with AssemblyScript, so we can compare the performance.

If AssemblyScript does bring a notable performance improvement that can't be achieved by tuning the TS code, we can probably find a way to include an AssemblyScript binary in the releases, either in this repo or a dedicated repo. Once we have more data about the performance we can consider the best course of action.

@urish added the performance and help wanted labels on Apr 16, 2020
@Dudeplayz
Contributor Author

This sounds like a good plan. I will take a look at it!

@urish
Contributor

urish commented Apr 19, 2020

Thanks!

Another idea that I had would be to come up with an AVR → WebAssembly compiler, that is, to convert the raw AVR binary into WebAssembly code that does the same, so we don't have to pay the overhead of decoding each instruction as the program executes, and perhaps the JIT will be able to do a better job at optimizing the generated code.

@Dudeplayz
Contributor Author

Oh wow. This sounds really interesting! The question would be how to integrate the peripherals.

@urish
Contributor

urish commented Apr 19, 2020

That's a good question. I'd imagine having a bitmap or similar that indicates which memory addresses are mapped to peripherals. Whenever you update a memory address, you'd check the bitmap. If it has a peripheral mapped to it, then you'd call a WebAssembly function that resembles the writeData() function that we currently have.
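The bitmap check described here could look roughly like this in TypeScript. This is a hypothetical sketch: `MemoryBus`, `attachPeripheral`, and the 8 KB sizes are illustrative, not the actual avr8js API; only `writeData` echoes the function named above.

```typescript
// Illustrative sketch of the bitmap idea: one flag per data-memory address,
// checked on every write before bridging out to a peripheral handler.
type WriteHandler = (value: number, addr: number) => void;

class MemoryBus {
  private data = new Uint8Array(8192);
  // A real bitmap (1 bit per address) would be denser; one byte per
  // address keeps the sketch simple.
  private hasPeripheral = new Uint8Array(8192);
  private handlers = new Map<number, WriteHandler>();

  attachPeripheral(addr: number, handler: WriteHandler) {
    this.hasPeripheral[addr] = 1;
    this.handlers.set(addr, handler);
  }

  writeData(addr: number, value: number) {
    this.data[addr] = value;
    if (this.hasPeripheral[addr]) {
      // Only mapped addresses pay the lookup/bridging cost.
      this.handlers.get(addr)!(value, addr);
    }
  }

  readData(addr: number): number {
    return this.data[addr];
  }
}
```

The fast path (no peripheral mapped) is a single typed-array read, which is the point of the bitmap: the expensive JS↔WASM bridge is only crossed when needed.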

@Dudeplayz
Contributor Author

That should work. The peripherals would be in WebAssembly too? In this case, everything would run in WebAssembly, except the visuals, which have to stay in JavaScript. But then we again have the problem that all peripherals have to be converted to WebAssembly :/.

@urish
Contributor

urish commented Apr 19, 2020

We could also mix and match. I believe stuff like the timers, which have to run constantly (in some cases after every CPU instruction or so), will have to live in WebAssembly land. But some other, less frequent peripherals could possibly be bridged over to JavaScript land.

There's definitely much to explore here...

@Dudeplayz
Contributor Author

The thing is, it always depends on the workload. So yes, it is really interesting, but we need to start exploring somewhere. Maybe we will reach a point where we can say we are already faster than ever expected. On the whole, the host CPU running the simulation will usually be a lot faster than the simulated MCU; the gains from optimization mainly matter for slower end devices.

So, should we create a plan for where to start with testing and exploring the possibilities of each approach?

@urish
Contributor

urish commented Apr 19, 2020

Yes, and as you say, it's good to have some baseline to compare to.

Right now, the JS simulation has some things that can still be improved (e.g. the lookup table for instructions), and it runs pretty okay on modern hardware, achieving between 50% speed on mid-range mobile phones and 160%+ speed on higher-end laptops.

However, lower-end devices (such as the Raspberry Pi) only achieve a simulation speed of 5% to 10%.

So there is definitely room for improvement, especially if we consider the use case of simulating more than one Arduino board at the same time (e.g. two boards communicating with each other).

@Dudeplayz
Contributor Author

Ok, yes. For playgrounds etc. it would be really fun and interesting to have multiple boards running at once. So if this use case would otherwise be locked to higher-end devices, we need to improve it.
I also had some interesting ideas about simulating such boards with node.js and other JavaScript runtimes.

I am actually not familiar with the benchmark code. Can the benchmark run under all the mentioned approaches, or do we need a new benchmark to make a meaningful comparison?

@urish
Contributor

urish commented Apr 19, 2020

The current benchmark is pretty minimal - it runs a single-instruction program many, many times to compare different approaches to decoding.

I think a better benchmark would need:

  1. A larger program that runs for a specific number of CPU cycles and includes a variety of instructions, branches, etc.
  2. A more extensive benchmark program that uses peripherals in addition to just running code, so we can check the integration of everything together.
  3. If we want to run it in WebAssembly, we'd also need a runner in WebAssembly, though that shouldn't be too complicated.

I'd probably start with just the 1st, simpler benchmark, to get a feeling for whether the direction seems promising, and if it is, then we can devise a more extensive benchmark that will allow us to do a comprehensive comparison.

What do you think?

@Dudeplayz
Contributor Author

Starting with the 1st should be the best approach. Should I start looking at AssemblyScript, or at WebAssembly directly? With WebAssembly directly we also have the choice between AVR instruction translation and full interpretation (like the JavaScript version) in WebAssembly.

Maybe we can focus on the most needed assembler instructions to reduce the initial instruction count and focus the benchmark on those. That way we can see first results faster and then decide where to dig deeper, or whether there is already a clear winner?

@urish
Contributor

urish commented Apr 19, 2020

I believe that a WebAssembly interpreter (written in C or Rust) wouldn't be much different from AssemblyScript, but it's pretty easy to write one or two instructions, as you suggest, and compare the generated WAT (WebAssembly Text) between the different implementations. Ideally, if there is no significant difference, using AssemblyScript means we can probably keep one code base, which is preferable.

Here are some useful resources:

  1. WebAssembly Studio - A great environment for experimenting with different languages for WASM (C, Rust, AssemblyScript). If I'm not wrong, you can also see the generated code (so it can also be helpful for the AVR → WASM compiler).
  2. Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code - According to this article, I'd expect WASM to be about 60-70% the speed of native code. So if, for instance, we simulate a single AVR instruction with 30 WASM instructions, that'd be roughly the equivalent of 50 host CPU instructions, or a x50 slowdown (so about 20 MIPS for every 1000 MIPS of host CPU).
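The arithmetic in point 2 can be written out explicitly. The inputs (30 WASM instructions per AVR instruction, 60% of native speed) are the article-based estimates from above, not measurements:

```typescript
// Back-of-envelope estimate: each AVR instruction costs ~30 WASM
// instructions, and WASM runs at ~60% of native speed, so each AVR
// instruction costs roughly 30 / 0.6 = 50 host instructions.
const wasmPerAvr = 30;
const wasmVsNative = 0.6; // lower bound of the 60-70% estimate
const hostPerAvr = wasmPerAvr / wasmVsNative; // 50 host instructions

const hostMips = 1000;
const simulatedMips = hostMips / hostPerAvr; // 20 simulated MIPS
```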

@urish
Contributor

urish commented Apr 19, 2020

@gfeun (author of #19), you may find this discussion interesting :)

@gfeun
Contributor

gfeun commented Apr 19, 2020

Hey, I've been summoned. Yeah, I'm following along 😄

@Dudeplayz
Contributor Author

Dudeplayz commented Apr 19, 2020

Yes, I agree. If there is no real breaking point that requires translating the TS code, we should stay with TS and use the AssemblyScript compiler.

  1. The WebAssembly Studio looks great. I remember having found it before in my research.
    If you can wait, I will try to create a first comparison between the Rust, C and AssemblyScript approaches in the next 2 days. These languages are not my preferred ones, so I have to get familiar with the tooling 😅. I will try to create some sharable WebAssembly Studio workspaces and post them here. (And yes, the generated wasm code is visible.)

  2. If we get to the point of translating the AVR binaries to WebAssembly, we can get below the 30 WASM instructions. I think this project has some really interesting possibilities to get the best out of it. If we find a good combination of all these possibilities, we get the opportunity to support a really wide range of end devices and possible runtimes!
    Also, 1000 MIPS is no problem for a modern CPU. An RPi 4 could maybe even run at something around 70-100% or more (measured on some synthetic benchmark).

@Dudeplayz
Contributor Author

And hello @gfeun. Nice to meet you!

@urish
Contributor

urish commented Apr 20, 2020

Yes, definitely, sounds like a good plan. It's going to be interesting :)

@Dudeplayz
Contributor Author

Dudeplayz commented Apr 20, 2020

Hey, I've created 3 WebAssembly Studio (WAS) workspaces for C, Rust and AssemblyScript:
C - https://webassembly.studio/?f=u627pgs1r5q
Rust - https://webassembly.studio/?f=h80i9yrgjxa
AssemblyScript - https://webassembly.studio/?f=j5dlh1kn5s

The code and folder structure is based on the empty template workspace for each language. I tried to bring them all together in one workspace, but that is more complicated.
To start, I have implemented the same basic functions with an identical file structure in each of them. All the wat files (WAS transforms them transparently to and from wasm) look basically the same, except for some different orderings and the funny detail that Rust and C swap the operands of the add function.
Surprisingly, the AssemblyScript version is the smallest, measured by line count. Rust and C have
one additional line, (table $T0 1 1 anyfunc), which tells me nothing because I don't understand the file scheme yet. There is also a difference between Rust and AssemblyScript in the line (memory $memory (export "memory") 17), where AssemblyScript has 0 instead of 17.
C also has a few more lines, which are maybe only meta information that the others leave out.

I know the example functions are too simple to show meaningful differences. Can you take a look at them and tell me what you think? What would be a good example function to implement to find out something more meaningful?

My current opinion is that even if AssemblyScript turns out to have a significant performance gap at some point, we would always have the option to implement that specific feature in a different language.

(To see the wasm/wat code, run Build in the WebAssembly Studio)

For some reference, you can access https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format.

@Dudeplayz
Contributor Author

Oh, wow. I just realized that all the compilers have pre-evaluated the method calls in the main function 😅.

@urish
Contributor

urish commented Apr 21, 2020

Yes, the compiler seems to be very good at optimizing - it inlines the functions and then also precalculates the result of the expression. Pretty smart!

So far, it seems like for basic arithmetic we get roughly the same number of opcodes. What I'd try to do next is to implement a complete opcode: e.g. define an array to hold the data memory, and then implement an opcode that also reads and updates it, such as ADD (the registers reside in the data memory, and it also updates the SREG register, so that would be a memory write).

Does that make sense?

@Dudeplayz
Contributor Author

I would say it makes sense 😄. I will try it first with the AssemblyScript version and do some copy-and-paste of the original code. After that, I will convert it to the other two.

So for correct understanding:

  1. Implement the 'ADD' command
  2. Already implement basic opcode decoding?
  3. Small data memory to be used by 'ADD'
  4. Include the SREG register

@urish
Contributor

urish commented Apr 21, 2020

Indeed, we'd need some basic opcode decoding to extract the target registers out of ADD. Also, the existing TS code can probably be used without too many changes...

@Dudeplayz
Contributor Author

Ok, I will do so.
Should I create an additional repo/project for this testing code? I have also moved from WebAssembly Studio to IntelliJ (or any other IDE).

@Dudeplayz
Contributor Author

Tonight I had another idea: would it be possible to add some async command prefetching to reduce the time spent on opcode decoding?

@urish
Contributor

urish commented Apr 21, 2020

Yes, we can either create a different repo (if the code is entirely different), or a new branch here, in case the code is still the same (or auto-generated from the current code, like I did with the benchmark).

As for IDE, that makes sense. I use VSCode for this repo, so if you open it with VSCode you should get a list of recommended extensions (prettier, eslint, etc).

What do you mean by async command prefetching?

@Dudeplayz
Contributor Author

I currently mean only the testing code, i.e. the example AssemblyScript, Rust and C code. For the "real" implementation I would prefer a separate branch, with the goal of bringing it into master.

I will try to get warm with VSCode for dev purposes ;D.

We recently discovered the problem with the big if-else statement. The idea would be to evaluate that statement for the next opcode asynchronously. But this requires bringing the code into a format where that is possible.

@urish
Contributor

urish commented Jun 30, 2021

This is the basic algorithm:

deadline ← cpu.cycles + 250000 (or a similar number)
while cpu.cycles < deadline:
  execute next cpu instruction, updating cycles
  check if we need to run any peripheral callback

So WASM can export a function that gets the number of cycles and runs the code for that many cycles (more or less, it doesn't have to be super accurate), then yields control back to JS.
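Sketched in TypeScript for clarity (the real loop would live on the WASM side; `step` and `tickPeripherals` are placeholder names for the instruction executor and the peripheral check), the algorithm above could look like:

```typescript
interface Cpu {
  cycles: number;
}

// Run one chunk of simulation, then return control to the caller (the JS
// side), which can update the UI and schedule the next chunk.
function runChunk(
  cpu: Cpu,
  step: (cpu: Cpu) => void,            // executes one instruction, advances cycles
  tickPeripherals: (cpu: Cpu) => void, // runs any due peripheral callbacks
  chunkCycles = 250000
) {
  const deadline = cpu.cycles + chunkCycles;
  while (cpu.cycles < deadline) {
    step(cpu);
    tickPeripherals(cpu);
  }
}
```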

@Dudeplayz
Contributor Author

Ok. So the HEX files are compiled to WASM -> resulting in a WASM program containing all the instructions of the program.
Then a run loop (the above) executes these instructions and handles any peripheral callbacks.

Is this correct? I feel like I'm stuck somewhere.

@urish
Contributor

urish commented Jul 2, 2021

Is this correct? I feel like I'm stuck somewhere.

Yes. Then, this may open the door for further optimizations (e.g. skipping costly flag calculation if the next instruction discards the flags anyway), but let's first see if we can get the basic thing going and how much faster it is.

@Dudeplayz
Contributor Author

Ok. Does this approach need a rework of the CPU class? I think the first thing should also be trying to compile the project to WASM, and then adding the compilation of the HEX files. Or do you think another way is better?

@urish
Contributor

urish commented Jul 5, 2021

I think the first thing should also be trying to compile the project to WASM

This may need a lot more changes than just emitting WASM code directly.

You may find the AVR to JS compiler experiment useful as a reference. If I remember correctly, I started from instruction.ts and did some massive editing there.

The code that decodes the opcodes and their args will probably stay the same, but the part where you generate the code will probably be different, probably something along the lines of:

  ...
  } else if ((opcode & 0xfe08) === 0xf800) {
    /* BLD, 1111 100d dddd 0bbb */
    const b = opcode & 7;
    const d = (opcode & 0x1f0) >> 4;

    /* something that will generate the WASM equivalent of 
        data[d] = (~(1 << b) & data[d]) | (((data[95] >> 6) & 1) << b);

        i.e. convert the following pseudo-code into WASM 
        temp1 ← data[d]
        temp1 ← temp1 & ~(1 << b)
        temp2 ← data[95]
        temp2 ← temp2 >> 6
        temp2 ← temp2 & 1
        temp2 ← temp2 << b
        data[d] ← temp1 | temp2

        where each line in the above code is probably a single or two WASM instructions, and `~(1 << b)` is actually a constant (because we know the value of b at compile time)
    */;
  ...

I hope this is helpful!
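Continuing the example, here is a hedged sketch of what the "generate the WASM equivalent" part could emit for BLD. The emitter returns WAT text and assumes the data memory starts at offset 0 of linear memory; `emitBld` and the exact instruction sequence are illustrative, not the actual compiler:

```typescript
/* BLD, 1111 100d dddd 0bbb: copy the T flag (SREG bit 6, SREG at data[95])
   into bit b of register d. Emits WAT that computes
     data[d] = (~(1 << b) & data[d]) | (((data[95] >> 6) & 1) << b)
   Note how ~(1 << b) folds to a constant, since b is known at compile time. */
function emitBld(opcode: number): string {
  const b = opcode & 7;
  const d = (opcode & 0x1f0) >> 4;
  const mask = ~(1 << b) & 0xff;
  return [
    `i32.const ${d}`,   // address for the final store
    `i32.const ${d}`,
    `i32.load8_u`,      // temp1 = data[d]
    `i32.const ${mask}`,
    `i32.and`,          // temp1 &= ~(1 << b)
    `i32.const 95`,
    `i32.load8_u`,      // temp2 = data[95] (SREG)
    `i32.const 6`,
    `i32.shr_u`,
    `i32.const 1`,
    `i32.and`,          // temp2 = (SREG >> 6) & 1  (the T flag)
    `i32.const ${b}`,
    `i32.shl`,          // temp2 <<= b
    `i32.or`,
    `i32.store8`,       // data[d] = temp1 | temp2
  ].join('\n');
}
```

Each pseudo-code line from the comment above maps to one or two stack instructions, which is where the hoped-for win over interpretation comes from.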

@Dudeplayz
Contributor Author

Thank you. I will take a deeper look at it later this week.

@Dudeplayz
Contributor Author

Hi @urish ,
I hope you are doing well! I have made some progress compiling the library to WASM. Unfortunately, the compiled code does not behave the same as the vanilla library, and I'm currently struggling because I can't find the reason for this. Along the way I had to change some of the code because of missing types and the use of some higher-order functions that are not available in AssemblyScript. I also ran into trouble with the cross-coupling of the classes and with multiple classes in single .ts files. My goal is to use the compiled WASM code plus the glue code as a drop-in replacement for the current library, so it's a zero-cost performance boost for existing programs. So my question is whether you could refactor the code with regard to these compatibility issues, so that future updates don't break the WASM code.

Maybe you have some time so that we can talk about it in an online meeting?

Best regards!

@urish
Contributor

urish commented Nov 2, 2021

Hi Dario! Great to read that you're making progress!

Feel free to book some time on my calendar at https://urish.org/calendar

@urish
Contributor

urish commented Nov 9, 2021

@Dudeplayz are you joining the call?

@Dudeplayz
Contributor Author

@urish I am on the way, 3 min. Sorry!

@urish
Contributor

urish commented Nov 9, 2021

Alright, I'm waiting :-)

@Dudeplayz
Contributor Author

@urish thanks for the talk this week. I have found the reason for the failing opcodes. It is mainly an implicit type cast from u16 to u32: the u16 wraps around and is then cast to u32, which doesn't wrap again because it can hold larger numbers. The discussion directly in AssemblyScript can be found here: AssemblyScript/assemblyscript#2131. It seems that I even found a bug in AssemblyScript.
I already fixed the instructions by typing the opcode directly as u32 instead of u16. The test program now executes correctly. This still has to be tested with some larger programs; maybe you have some that you already used. Are all instructions covered by the unit tests?
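The pitfall can be reproduced in plain TypeScript by emulating the AssemblyScript wrapping with explicit masks (in AS the wraps happen implicitly, based on the declared types):

```typescript
// Mask-based stand-ins for the AS integer types.
const asU16 = (x: number) => x & 0xffff;
const asU32 = (x: number) => x >>> 0;

// In AS, a u16 computation wraps first...
const opcode = 0xffff;
const wrapped = asU16(opcode + 1);  // 0x0000: wraps within 16 bits
// ...and widening the already-wrapped value to u32 cannot restore it:
const widened = asU32(wrapped);     // still 0

// Typing the value as u32 from the start (the fix described above)
// keeps the full result:
const fixed = asU32(opcode + 1);    // 0x10000
```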

@Dudeplayz
Contributor Author

And here is a link for the portability feature you mentioned. The type casting is then a little bit different.

@Dudeplayz
Contributor Author

Dudeplayz commented Nov 12, 2021

I got it! The test program now runs without discrepancies. I also had to update the instruction.ts file, which I had skipped on the assumption that it hadn't changed, but that was wrong. Here is the instruction.ts file with the applied fixes using the portability approach. If you like, we can merge it into the main project soon. The changes can be found here: wokwi/avr8js-research@42f0b39. The only thing I haven't tested yet is how to import the portability AS library in normal TS.

I am now trying to get the jest unit tests running, which are throwing some errors due to my node.js environment I think.

@urish
Contributor

urish commented Nov 13, 2021

Hi Dario!

Congrats on spotting the issue. Most of the instructions are covered by the unit tests, but not all of them. But I think, if the unit tests pass and Blink also eventually works, that's already a very good starting point.

Thanks for sharing the modified instructions.ts. I merged your changes into a working branch, as-interop. I added an implementation of the u16/u32 functions, so the code still compiles/runs correctly with TypeScript. See commit b81a21d. If it looks good to you, I'm ready to merge it into master.

@Dudeplayz
Contributor Author

I'm having some trouble getting jest running with the assemblyscript/loader. I'm working on getting the unit tests working. Once I get them working, or get stuck, I will try to get the timer working so I can test the blink program.

I had a look, and the only thing I am not sure about is the import of the types file. The docs don't describe well how to get the portability layer working; they throw some statements at you and mention some projects to look at. Maybe we should wait with merging the AS things until I have some more parts finished. I think we either have to extend the AssemblyScript library or import it. I had a look at the portability class, and the functions it provides are designed to transform any number in a way that matches the behavior in AS/WASM (overflows, trimming, etc.). So starting with that would mean WASM becomes the preferred/limiting target.
For now, you could merge it as it is, and we can see later whether we can substitute the types file with the portability version.

It would also be nice if you could mention my contribution somewhere. Atm, it wouldn't be clear, as you make the commits. I hope that is ok :)

@urish
Contributor

urish commented Nov 16, 2021

I'm having some trouble getting jest running with the assemblyscript/loader

If you get hung up on it for too long, remember you can always run the tests without jest. You'd need to create some kind of expect function, and then you can strip "describe" and replace "it" with a console.log that prints the test name. Some work, but it might be a quick way to get a good feel for the tests.

  Maybe we should wait with merging the AS things until I have some more parts finished...
  For now, you could merge it as it is ...

I'm not sure - so do you advise to merge or wait?

In general, there are two parts that are very sensitive to performance: instructions and the count() function in the timers. If we introduce code that uses the compatibility library, we need to make sure it doesn't impact performance. My u8...u32 implementations are naïve, but they work here because the code doesn't assume that they affect the number in any way.
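For illustration, the two flavors of these helpers could look like the following. This is a sketch, not the exact code from the as-interop branch: the naïve versions are identity functions (safe as long as the calling code never relies on wrapping), while portable versions would reproduce the AS/WASM truncation:

```typescript
// Naïve TS stand-ins: no-ops, zero overhead, no wrapping.
const u8 = (x: number) => x;
const u16 = (x: number) => x;
const u32 = (x: number) => x;

// Portable versions (illustrative names): wrap exactly like AS/WASM.
const u8p = (x: number) => x & 0xff;
const u16p = (x: number) => x & 0xffff;
const u32p = (x: number) => x >>> 0;
```

Both sets agree on in-range values; they only diverge when a computation overflows, which is exactly the case the instruction fixes above were about.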

It would also be nice if you could mention my contribution somewhere. Atm, it wouldn't be clear, as you make the commits. I hope that is ok :)

Of course. I added a comment with your name at the top of instructions.ts. And if you feel like making the commit under your name - then sure, go for it. Then I can merge your commit in place of mine.

@Dudeplayz
Contributor Author

If you get hung up on it for too long, remember you can always run the tests without jest.

Thanks for the hint! I will do this now. I was very busy over the last 2 weeks, so I had to pause a bit.

I'm not sure - so do you advise to merge or wait?

If you don't plan to work on the instructions in the next few weeks, we can merge. Otherwise we could run into the problem that you can't test the compatibility with AS until we have copied/merged it into the research project to see whether the compiler accepts it.

Of course. I added a comment with your name at the top of instructions.ts. And if you feel like making the commit under your name - then sure, go for it. Then I can merge your commit in place of mine.

Ok, that would be nice, so I will make my own commit. I still have the problem that I am not familiar with the GitHub merge process 😅

So let's wait a bit until I get the tests running, and then I will try to create a merge request.

@urish
Contributor

urish commented Nov 26, 2021

So let's wait a bit until I get the tests running, and then I will try to create a merge request.

Sounds like a plan!

In general, a merge can be done in a few ways. The most straightforward one is when your branch places all its commits on top of the last commit in master; then these commits are simply copied over. Otherwise, there are a few options when I merge:

  1. Create a merge commit
  2. Squash
  3. Rebase

There's a nice book from @PascalPrecht that explains this, in case you want to better understand the process: https://rebase-book.com/

@Dudeplayz
Contributor Author

Hi @urish , I wish you a happy new year!
After being demotivated by the failing tests and the end-of-year stress, I now had the time and motivation to continue. I finally got the tests running, yay! Some tests are failing, again a problem of the int sizes. For example, ADC failed on the sum calculation because the u8 overflows to 0 and does not reach 256. So my first question would be: how well do you think your tests cover all possible cases? If I fix the failing tests one by one, would that be sufficient to be sure that the commands work as they should, or are other side effects (caused by the overflows) still possible? The second thing is that I had the idea to change all used numbers to u32, which should avoid any errors (and would make my recent changes obsolete) because they would be identical to the JS number size. So the overflow limits would be similar?

@urish
Contributor

urish commented Jan 8, 2022

Happy new year Dario, good to have you back!

how well do you think your tests cover all possible cases?

Probably just a selection. I'm not even sure all instructions are covered.

If I fix the failing tests one by one, would that be sufficient to be sure that the commands work as they should, or are other side effects (caused by the overflows) still possible?

I'd try to reason about the overflow issues you find, and just look at which other instructions might be prone to them. Then we can come up with a few additional test cases to cover these cases specifically. Or -

The second thing is that I had the idea to change all used numbers to u32, which should avoid any errors (and would make my recent changes obsolete) because they would be identical to the JS number size. So the overflow limits would be similar?

That could work. In terms of memory size, the difference between u8 and u32 should be negligible. So I see no reason not to go with u32. JavaScript uses double internally, but if I'm not wrong, they are truncated to i32/u32 when you apply bitwise operations (depending on the operation). So I say - go for it, and let's find out :)
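The truncation behavior mentioned here is easy to check: bitwise operators convert their operands to 32-bit integers (ToInt32 for most operators, ToUint32 for `>>>`), even though JS numbers are doubles:

```typescript
// JS numbers are 64-bit doubles, but bitwise operators work on 32 bits.
const big = 2 ** 32 + 5;
const truncatedSigned = big | 0;     // ToInt32 drops the high bits: 5
const truncatedUnsigned = big >>> 0; // ToUint32 likewise: 5
const negative = -1 >>> 0;           // 4294967295: the u32 view of -1
```

This is why u32-everywhere lines up naturally with how the existing JS code already behaves whenever it applies `&`, `|`, `>>` and friends.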

@Dudeplayz
Contributor Author

Happy new year Dario, good to have you back!

Thanks :)

Probably just a selection. I'm not even sure all instructions are covered.

I have already added some tests for ADD and the SUBs. I would leave it as it is for now. I think I discovered most of the reasons for the failures in the commands: always something with wrong int sizes. Later we should add the rest of the tests. The best approach there is also to cover the edge cases where the underlying data type could cause errors.

I'd try to reason about the overflow issues you find, and just look at which other instructions might be prone to them. Then we can come up with a few additional test cases to cover these cases specifically.

The problems I discovered and solved until now are:

  • u8 is used for a u16/u32 value calculation -> the value is truncated (ADD as the simplest example)
  • If a left bit shift should go from u8/u16 into u16/u32 range, the int size must also be adjusted beforehand
  • For the MULs, a u8 to u16 cast is required to hold the larger result value

In the fixing process I looked over and thought through some of the commands, and I haven't found another cause so far. If you have an idea which command could break, name it and I'll check it.
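The three problem classes listed above can be demonstrated in plain TypeScript, with masks standing in for the AS integer types (in AS the wrapping is implicit in the declared types):

```typescript
// Mask-based stand-ins for the AS integer types.
const u8 = (x: number) => x & 0xff;
const u16 = (x: number) => x & 0xffff;

// 1. u8 used for a wider calculation: the sum is truncated.
const truncatedAdd = u8(200 + 100);        // 44 instead of 300

// 2. Shifting before widening: the high bit falls off in the narrow type.
const shiftTooEarly = u16(u8(0x80 << 1));  // 0: the bit was lost in u8
const shiftAfterWiden = u16(0x80 << 1);    // 0x100: widen first, then shift

// 3. MUL needs a u16 result to hold the full 8x8-bit product.
const mulTruncated = u8(200 * 100);        // 32: wrong
const mulWidened = u16(200 * 100);         // 20000: fits in u16
```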

That could work. In terms of memory size, the difference between u8 and u32 should be negligible. So I see no reason not to go with u32. JavaScript uses double internally, but if I'm not wrong, they are truncated to i32/u32 when you apply bitwise operations (depending on the operation). So I say - go for it, and let's find out :)

For now, I think the commands are mostly fixed. The change to u32 is only possible inside the instructions, but there are still some drawbacks. The main problem is that the data array of the CPU is a Uint8Array, so every access to it always results in a u8 value. Some commands store the value in a variable first; there, just typing the variable as u32 would be fine. But there are also commands that use the result directly, and there casts are always necessary. This could blow up the code very fast. Because I think the commands are working now, I would prefer the following:

  1. Leave it as it is
  2. If errors occur, add the missing tests -> always good to cover everything, even if it could be avoided
  3. Find the reasons for the errors and apply the fixes
  4. If it is too much, refactor the commands to use u32

Another thing I came across: could we refactor all data array accesses to use the DataView? Then the data type is always clear from the access method on the DataView. Or is the overhead too high?

For now, I will continue getting cpu.spec.ts running. This means implementing the rest of the CPU class, which I had skipped because of the callbacks etc., since some glue code is required there as well. In the future, I think we should also update the peripherals to be compiled directly.
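The DataView idea sketched as code: one underlying buffer, with the access width made explicit by the method name instead of by casts at each call site:

```typescript
// One buffer, two views over the same memory.
const buffer = new ArrayBuffer(8192);
const data = new Uint8Array(buffer);   // current style: always yields u8
const view = new DataView(buffer);     // alternative: explicit-width access

data[0] = 0x34;
data[1] = 0x12;

// 16-bit read without manual byte combining (AVR is little-endian,
// hence the `true` for little-endian byte order):
const word = view.getUint16(0, true);  // 0x1234
```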

@urish
Contributor

urish commented Jan 14, 2022

Because I think the commands are working now, I would prefer the following:

Makes sense

Or is the overhead too high?

Good question. From my experience (I did some profiling a few months ago), it varies greatly between browsers. If I recall correctly, Chrome optimized DataView access and Firefox didn't (I think that's one of the reasons Wokwi runs faster on Chrome). But I guess you could try to do a quick profiling and see if it changed since.

In the future, I think we should also update the peripherals to be compiled directly.

When you get to that, Timers and SPI are good candidates. Timers are pretty complex (so they might take a fair amount of work to compile), but they can run as often as every CPU clock tick (so they can greatly affect performance). SPI, on the other hand, is much simpler, but when it's used it can run as fast as half the clock speed, so it can also become a bottleneck.

@Dudeplayz
Contributor Author

But I guess you could try to do a quick profiling and see if it changed since.

I will see if I can test it. If not, it shouldn't be a problem. It's just an optimization to unify the calls and make the code a bit cleaner, so I will take a look at the end if there is time left.

When you get to that, Timers and SPI are good candidates.

Thanks for the hint. I will do this. I am currently adding the interop for the eventCallbacks of the CPU. Then the CPU is finished, including its glue code.

So now to the next problem I have discovered: closures are currently not supported in AssemblyScript. Because functions can't be passed directly between the JS and WASM worlds, my plan was to use some glue methods that take a callbackId; the WASM code then calls back with this id, and the JS code executes the callback on the JS side.

Here is my current code:

// declared JS function to be called from WASM
export declare function callClockEventCallback(callbackId: u32): void;

// exported WASM function to be called from the JS side
export function addClockEvent(cpu: CPU, callbackId: u32, cycles: u64): AVRClockEventCallback {
    return cpu.addClockEvent(() => callClockEventCallback(callbackId), cycles);
}

The compilation fails with:

ERROR AS100: Not implemented: Closures

         return cpu.addClockEvent(() => callClockEventCallback(callbackId), cycles)

The implementation is still an open issue: AssemblyScript/assemblyscript#798.
Here are my ideas for working around it:

  1. Currently the EventCallback has no parameter, so I have no way to reference it when calling back into JS. If I add a parameter carrying a callbackId, generated and handled directly in the CPU class, it would be possible to pass it along. This would change the signature of addClockEvent: instead of returning the callback, it would return the callbackId.
  2. Building on the above, the AVRClockEventEntry could be passed to the callback instead, which would also give a clear reference; however, it contains some internals that are possibly not meant for public access.
  3. Combining both of the above: create a new interface/class AVRClockEvent that contains the calling information (the callbackId and more) and is passed to the callback. This would also allow extending it in the future.
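For illustration, the JS side of option 1 could look roughly like this — both function names here are hypothetical glue code, not existing avr8js API:

```typescript
// Illustrative sketch of option 1: the JS side keeps a registry of callbacks
// keyed by a numeric id, and the WASM side only ever sees the id.
type ClockEventCallback = () => void;

const clockEventCallbacks = new Map<number, ClockEventCallback>();
let nextCallbackId = 0;

// Called on the JS side before handing the id to the WASM addClockEvent export.
function registerClockEventCallback(callback: ClockEventCallback): number {
  const id = nextCallbackId++;
  clockEventCallbacks.set(id, callback);
  return id;
}

// Imported by the WASM module; invoked when the scheduled cycle count is reached.
function callClockEventCallback(callbackId: number): void {
  const callback = clockEventCallbacks.get(callbackId);
  if (callback) {
    callback();
  }
}
```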

What do you think? Is one of these the way to go or do you have another idea and hints?

@Dudeplayz
Contributor Author

Dudeplayz commented Jan 17, 2022

A similar problem exists with the hooks. Anywhere an anonymous function is assigned into an array, it becomes unusable, because functions can't be passed between JS and WASM. Currently the only solution I see is adding methods for it, like addWriteHook(addr, ...), which set the callback on the array/map. Right now the hooks are accessed directly from outside the CPU class. With add and remove functions, the API can be used identically from JS and WASM, with the glue code doing the mapping between them. Even though I see the future in WASM and think the whole project should be based on it some day, I believe it is useful to keep the APIs identical: it makes access and usage easier, and peripherals and hooks can be used from JS and WASM as needed.
Maybe you can give me an OK to refactor things as needed? I would also continue to ask for your opinion, since you know the performance-related details, it's your project, and I think you made some design choices with an eye to the future.
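To make the idea concrete, a sketch of such an add/remove hook API could look like this — all names and signatures here are illustrative, and the real avr8js write hooks differ in details:

```typescript
// Hypothetical sketch of a hook-registration API that works identically from
// JS and (through glue code) from WASM, instead of assigning anonymous
// functions into an array from outside the CPU class.
type WriteHook = (value: number, oldValue: number, addr: number) => boolean;

class HookRegistry {
  private writeHooks = new Map<number, WriteHook>();

  addWriteHook(addr: number, hook: WriteHook): void {
    this.writeHooks.set(addr, hook);
  }

  removeWriteHook(addr: number): void {
    this.writeHooks.delete(addr);
  }

  // Returning true means a hook handled the write itself.
  writeData(addr: number, value: number, oldValue: number): boolean {
    const hook = this.writeHooks.get(addr);
    return hook ? hook(value, oldValue, addr) : false;
  }
}
```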

@Dudeplayz
Contributor Author

Hello @urish,
I hope you are doing well!
I come with good news: I finished wrapping the CPU for the JS world, and all unit tests are passing. I tried to keep the APIs stable; for the JS-side API, that's the case. When we get to the point of integrating the new code, it should be a simple one-to-one replacement to get the (hopefully — not tested yet) performance improvements of WASM. Currently, all peripherals are still in JS (also with all tests passing 🎉). I will get started with the timer migration, as you suggested.
I am currently not sure how best to benchmark it — do you have a hint?
I also currently have the problem that assemblyscript/loader needs the /umd package to run with jest, but otherwise it needs the ESM package. I am not familiar enough with JS tooling to handle this at build time. Are there tricks to achieve it in code, e.g. an extra file that exports whichever one is currently needed? I hope you can help me here.

@urish
Contributor

urish commented Feb 12, 2022

Thank you very much Dario!

I think the next step would be, as you suggested, to benchmark the performance improvements. I would start with some computationally heavy program, such as calculating pi or the Mandelbrot set.
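As a starting point, the benchmark harness could be as simple as this sketch — the Mandelbrot kernel here just stands in for executing the simulated program, and nothing in it is avr8js-specific:

```typescript
// Minimal benchmark harness sketch: run a CPU-bound workload repeatedly for a
// fixed wall-clock budget and report runs per second. Compare the number for
// the JS build vs. the AssemblyScript/WASM build of the same kernel.
function mandelbrotIterations(cx: number, cy: number, maxIter: number): number {
  let x = 0;
  let y = 0;
  let iter = 0;
  while (x * x + y * y <= 4 && iter < maxIter) {
    const xNew = x * x - y * y + cx;
    y = 2 * x * y + cy;
    x = xNew;
    iter++;
  }
  return iter;
}

function benchmark(fn: () => void, budgetMs: number): number {
  let runs = 0;
  const start = Date.now();
  while (Date.now() - start < budgetMs) {
    fn();
    runs++;
  }
  return (runs * 1000) / (Date.now() - start);
}

// One "run" = a 1000-point slice of the Mandelbrot set at 100 iterations max.
const runsPerSecond = benchmark(() => {
  for (let i = 0; i < 1000; i++) {
    mandelbrotIterations(-2 + (i * 3) / 1000, 0.5, 100);
  }
}, 200);
console.log(`~${runsPerSecond.toFixed(0)} runs/sec`);
```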

Also, I have recently come across this great article diving into the minutiae of JS vs. WebAssembly performance, and how to specifically optimize AssemblyScript code.

After reading it, I'm not very optimistic that AssemblyScript will give us a big performance gain; if we are lucky, it might be somewhat faster than JavaScript. If that's the case, we can instead look at translating the AVR opcodes into WebAssembly, which may yield a much more significant performance gain (since we won't have to decode every instruction again and again).

In any case, having a benchmark (or a few) will be very helpful in guiding our way. So I suggest focusing on that, and once we're certain about the direction, we can start looking into the details of how to package everything (e.g. assemblyscript/loader).

@urish
Contributor

urish commented Jan 6, 2023

Closing due to inactivity.

@urish urish closed this as completed Jan 6, 2023