Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compilation JIT #88

Open
Lecrapouille opened this issue Jun 21, 2020 · 13 comments
Open

Compilation JIT #88

Lecrapouille opened this issue Jun 21, 2020 · 13 comments
Labels
implementation How to implement a Forth system question

Comments

@Lecrapouille
Copy link

Hi,
Seems to me that the compilation Just-In-Time (JIT) is an old technique (~ the 80s). Why this technique is not used nowadays in modern Forths which keeps using virtual machines and byte code. I'm talking about Forth not designed to live in bare metal but within an OS (ie Linux). After all, all script languages, even Lua uses this technique. I know only one Forth project using JIT. Why so few? I know the main idea on how to implement JIT and I read a basic example of stack language in an article (unfortunately directly diving inside code and the problems with the security of executing code from segments in Linux but not describing the main principles of JIT).

@mgcaret
Copy link

mgcaret commented Jun 22, 2020

What is gained by having a Forth compiled by a JIT? There are plenty of Forths that compile to native code. That’s a much simpler option compared to, for instance, firmforth (which I assume is what you found).

@scherrey
Copy link

scherrey commented Jun 22, 2020 via email

@MitchBradley
Copy link

I have been using threaded code Forth systems for years and I have never encountered a case where a more complicated compiling strategy would have been worth the tradeoffs. On PC-class machines, a threaded code Forth runs so blazingly fast that making it scorchingly fast is just pointless. On tiny tiny systems, native compilation has its advantages, but you have to go way down to AVR-class micros before that becomes compelling. ARM and ESP32 chips costing about $1 run a threaded code Forth fast enough for any reasonable purpose - typically faster than the I/O systems.

@Lecrapouille
Copy link
Author

@scherrey
Copy link

scherrey commented Jun 24, 2020 via email

@Lecrapouille
Copy link
Author

But now since I discovered https://github.com/PeterForth/ForthWin-Download using "subroutine threading" (implying the code is composed only from direct calls in machine code) when I edited this ticket I realized that JIT is not needed.

@larsbrinkhoff
Copy link
Member

I think there's a case for running native code out of flash for really tiny devices with say 1K RAM or less.

That said, maybe it's better to spend a few more cents for a part with more memory.

@MaxBarraclough
Copy link

MaxBarraclough commented Mar 11, 2021

@Lecrapouille

But now since I discovered https://github.com/PeterForth/ForthWin-Download using "subroutine threading" (implying the code is composed only from direct calls in machine code) when I edited this ticket I realized that JIT is not needed.

When optimizing native-code compilers are used with Forth, the result is that Forth code runs far faster than when threaded code is used. Whether the performance gains of native-code compilation are 'needed' will of course depend on your specific case.

@massung
Copy link
Member

massung commented Mar 11, 2021

When optimizing native-code compilers are used with Forth, the result is that Forth code runs far faster than when threaded code is used.

This has more to do with the target processor than native vs threaded. On CPUs that do not have severe penalties for consistently-mispredicted, indirect jumps, threaded code can be quite fast. I don't believe it will ever be as fast native (just as C won't ever be as fast as well written, hand-rolled assembly). But, a direct- or subroutine-threaded Forth with NEXT inlined and a reasonably good set of CODE words can be very fast indeed on many CPUs, including modern ones like the ARM (esp. in THUMB mode). And the paid cost of a little performance is often more than made-up for in size savings.

That said, since this discussion thread is about optimizations... On threaded Forths it's trivial to implement word-pair reduction optimizations, and those can end up being quite effective. For example:

OPTIMIZE: OVER OVER 2DUP
OPTIMIZE: OVER SWAP DUPD
OPTIMIZE: 2DUP SWAP ABBA

With the above rules, OVER OVER SWAP becomes 2DUP SWAP which becomes ABBA (yes, my personal Forths have used this word, and yes I named it after the band 😄).

@bshepherdson
Copy link

bshepherdson commented Mar 11, 2021

My FCC does a similar "phrasing" optimization. It uses fairly straightforward direct-threaded code as a baseline, but the compiler buffers up to 4 words, and then when it needs to emit one (or drain the queue because we're at a control flow boundary like IF or ;) it looks up the next 4, 3, and 2 words in tables of optimized phrases.

I collected the (fairly large) set of optimized phrases by running a slate of small benchmarks and some more real-world applications and having the compiler keep track of all the 4-, 3- and 2-word phrases it saw and how often. Then I optimized the most common/most optimizable ~64 of each length.

This gave it a serious speedup, and for a time it was neck-and-neck with Gforth-fast on its target platforms. Further work on Gforth has surpassed FCC substantially on most of the benchmarks, now, alas.

I should say that it differs from the above (OPTIMIZE: OVER OVER 2DUP) style in that the primitives and optimized phrases are hand-written assembly code. In addition to saving the NEXT overhead for very common phrases, many of the optimized phrases have way less memory traffic. (TOS is a register of course, and so reading a couple of values from the thread to do eg. an array index (( array index ) cells + @) can be reduced to a very few instructions.

@massung
Copy link
Member

massung commented Mar 11, 2021

In addition to saving the NEXT overhead for very common phrases, many of the optimized phrases have way less memory traffic.

👍

Yep! Your particular strategy seems quite nice. Furthering the discussion, there's nothing that requires "phrase" optimizations (or "reductions" as I called them) to be hand-rolled, CODE words. Although that's typically what they are.

As an example of where that wouldn't be the case... During an old work project that used subroutine threading that was a bit resource constrained, I came across situations like this (simplified for clarity):

: FOO ( a b -- c ) DROP ... ;
: BAR ( a -- c ) DUP FOO ;

So, while one might think that OPTIMIZE: DUP DROP NOP is ridiculous and would never occur in "real" code, it turns out that it occurred when calling words. In the above example, it took little effort to update the optimizer to match DUP FOO and instead of compiling CALL [DUP], CALL [FOO], compile CALL [FOO+CELL], completely bypassing the stack manipulation all together and saving some bytes in the process.

The above became even more common when talking about DUP IF and ELSE DROP.

@MaxBarraclough
Copy link

MaxBarraclough commented Mar 11, 2021

Does anyone know of a recent comparison of Forth performance, either between different Forth implementations or against, say, C?

This one is by Anton Ertl of Gforth fame, from 1996. I can't figure out how to read that results table, so I'll go with quotes instead. It shows Gforth being much slower than when Forth was translated to C via f2c and then compiled to native code.

Gforth is 3.5-6.5 times slower than f2c opt

It also states that:

These results show that researching better native code generation techniques is not just a waste of time and that there is still a lot to be gained in this area.

But again that's from 1996.

This one, run by the VFX Forth folks, shows a 2012 benchmark where gforth-fast took around 4 times as long as VFX Forth and iForth (both native-code-compiling Forths) to run its LZ77 compression benchmark.

This one from 2018 shows SwiftForth taking 131% as long as C++/g++, and Gforth taking 602% as long as C++/g++. The benchmark problem was to loop through the integers from 2 to 100,000 naively checking the primeness of each integer and printing the ones which are prime. I don't think it used gforth-fast though, and it might be unduly measuring startup time rather than throughput.

The comparisons linked from here are also ancient.

edit A new performance-comparison was just done (discussed here). gforth-fast took just over 6x as long as C. Was a bit surprised to see Python outperform Gforth, but I believe this is due to the Python script leaning heavily on the highly optimised Python standard library (written in C). In the Forth solution, all the heavy lifting is done in Forth.

@bfox9900
Copy link
Member

From the old-timer hobbyist..

This question is as old as Forth I think. It begs the question why would anyone use language(X) when it's not as fast as C?

The Forth ITC VM is a compromise of size and performance and as Mitch mentioned it is a pretty damned good compromise.
If you have the space DTC can give you more speed (~10 ..15%?) and faster linkage to CODE words.
If you need more space make a byte-code Forth. That's the power IMHO.

I went through the exercise of building a word CODE, ( xt --) that is the basis of a simple JIT that "steals" code words from the forth kernel and compiles them inline as headless code words, removing NEXT. The XT of these headless words is compiled into the colon definition. This can typically double the speed of critical areas of a program. You can optimize loops and branches in a similar manner with even more compiler complexity. However it is not optimized native code. It is still using the stack machine code on a register machine. (unless you use Forth hardware) There are still a large number PUSH/POP removals that could be done as the simplest optimization for example.

Sub-routine threading gives nice speed boost and simpler code inlining for intrinsic instructions, but on 16 bit machines the code is 2X bigger. On some older machines the sub-routine overhead was worse than DTC implementations. That is less relevant today of course where 32bits is the norm.

Native code generation is not too difficult but good optimized native code generation is not trivial IMHO and so all the same rules apply that one sees in C naturally.

So although **VFX and iForth are very fast they only optimize in small sections of the code at a time.

The big C compilers can and do perform multiple passes and so get the chance to optimize things that optimizing native code Forth compilers just don't look at. The C approach can make very fast code but I am told it can also cause *demons to fly out your nose . :-)

*un-defined behaviour

** Stephen Pelc has said they stopped comparing VFX to their old compilers when the speed differences hit 12X. :-)

@ruv ruv added question implementation How to implement a Forth system labels Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
implementation How to implement a Forth system question
Projects
None yet
Development

No branches or pull requests

10 participants