ByteCodeDesignDetails

There are several design choices here. First, do you want a bytecode interpreter in the target, or would you rather have the target execute and compile Forth code? A bytecode system that converts back to Forth code is simpler, but the result may be less compact. With a bytecode interpreter, on the other hand, you have to design all the details of special operations such as branching.

So the SimplerApproachToByteCodes is to convert Forth source to bytecodes according to a lookup table that says which bytecode goes with which name. The target can then convert the bytecode back to source code and either interpret it or compile it. All the hard work is already done by the Forth compiler. All the token compiler has to do is convert tokens back to source code, either one word at a time or, for parsing words, together with as much text as the parsing word parses.
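The lookup-table idea can be modelled in a few lines. This is only a sketch in Python rather than Forth: the table contents and the names `TOKEN_TABLE`, `tokenise` and `detokenise` are made up for illustration, and parsing words are left out entirely.

```python
# Hypothetical name-to-bytecode table; a real system would cover far more words.
TOKEN_TABLE = {"DUP": 0x01, "*": 0x02, "+": 0x03, "SWAP": 0x04, "DROP": 0x05}
NAME_TABLE = {code: name for name, code in TOKEN_TABLE.items()}


def tokenise(source):
    """Host side: replace each whitespace-delimited word with its bytecode."""
    return bytes(TOKEN_TABLE[word.upper()] for word in source.split())


def detokenise(bytecode):
    """Target side: rebuild source text for the Forth system to interpret or compile."""
    return " ".join(NAME_TABLE[code] for code in bytecode)


compact = tokenise("dup * swap drop")
print(compact.hex(), "->", detokenise(compact))
```

A word that is missing from the table raises `KeyError` here; a real tokeniser would have to decide what to do with unknown names and numbers.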

The main alternative is to make an actual ByteCodeTokenInterpreter. For each byte-token, the token interpreter looks up an execution token and executes it. There must be at least one token that does branches, and one that does exit. The token compiler decides how many tokens to skip or how many to branch back, and the interpreter then continues from the new spot. The token return stack keeps track of where in the bytecode to return to at the end of each definition. Some token definitions should have Forth names so they can be called by the Forth system. These need a colon definition that starts the token interpreter at the right spot. Literals, ." .( S" ABORT" etc, if they are included, need to parse strings on the target.

There are lots of complications available to include. Many of them provide special functionality but some only duplicate things the Forth compiler already does for you. Parsing is particularly a problem because parsing from bytecode will probably not work the same as parsing from Forth source. Similarly, if you have a construction that works like CREATE DOES> in bytecode it will not be the same as CREATE DOES> in Forth source.

You must distinguish among things that happen in the source code when you use the tokeniser, things that happen on the target when your bytecode parses bytecode, and things that happen on the target when your bytecode parses the input stream. It is simpler if you do only two of those three things.
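A token interpreter of this kind can be sketched as follows. This is a Python model, not Forth, and everything specific in it is an assumption made for the example: the token numbers, one-byte absolute call addresses, one-byte signed branch offsets measured from the offset byte, and a one-byte LIT.

```python
# Hypothetical token numbering, chosen only for this sketch.
EXIT, CALL, BRANCH, ZBRANCH, LIT, DUP, MUL, DEC = range(8)


def _dup(stack):
    stack.append(stack[-1])


def _mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


def _dec(stack):
    stack.append(stack.pop() - 1)


# Stand-in for the table of execution tokens on the target.
PRIMITIVES = {DUP: _dup, MUL: _mul, DEC: _dec}


def signed(byte):
    """Read one byte as a two's-complement offset in the range -128..127."""
    return byte - 256 if byte >= 128 else byte


def run(code, start, stack):
    """Interpret bytecode from `start` until EXIT is met with an empty return stack."""
    rstack = []              # token return stack: bytecode addresses to resume at
    ip = start
    while True:
        token = code[ip]
        ip += 1
        if token == EXIT:
            if not rstack:
                return stack
            ip = rstack.pop()
        elif token == CALL:
            rstack.append(ip + 1)        # resume after the address byte
            ip = code[ip]
        elif token == BRANCH:
            ip += signed(code[ip])       # offset is relative to the offset byte
        elif token == ZBRANCH:
            offset = signed(code[ip])
            ip += offset if stack.pop() == 0 else 1
        elif token == LIT:
            stack.append(code[ip])
            ip += 1
        else:
            PRIMITIVES[token](stack)


PROGRAM = bytes([
    DUP, MUL, EXIT,          # 0: SQUARE
    LIT, 7, CALL, 0, EXIT,   # 3: push 7, call SQUARE
    LIT, 3,                  # 8: COUNTDOWN starts with 3
    DEC, DUP,                # 10: loop body
    ZBRANCH, 3,              # 12: leave the loop when the copy is zero
    BRANCH, 0xFB,            # 14: otherwise jump back by 5 to address 10
    EXIT,                    # 16
])


def square(stack):
    """Named entry point: starts the token interpreter at the right spot."""
    return run(PROGRAM, 0, stack)
```

Here `run(PROGRAM, 3, [])` leaves 49 on the stack and `run(PROGRAM, 8, [])` counts down to 0. The wrapper `square` plays the role of the colon definition that gives a token definition a name the host system can call.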


So your Forth source code has limitations. If you need 15377 it takes a bytecode to represent that constant. And you must redefine every parsing word you use. You will want to redefine : to create a new bytecode that has a definition built with previous bytecodes. You might want a second variation on : that creates a named definition in the target. But mostly it's easier to avoid parsing words than to redefine them.

Additional complications:

* Numbers. You can have a LIT bytecode, followed by a number of bytes that represent a number according to some defined format. The bytes might fit together into a binary number, either big-endian or little-endian. Or there might be a string that gets parsed to generate the number. The string could have a count byte at the start or a delimiter at the end. If you parse a string you avoid having to define your format for binary numbers.

You can have multiple bytecodes to represent numbers that are different sizes. Or you could have one bytecode that requires a second byte to give the size. This takes extra space every time you use a number, but it saves bytecodes.
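Two of these literal formats, sketched in Python as a model. The token values `LIT_BIN` and `LIT_TEXT` are hypothetical, the binary form uses the "one bytecode plus a size byte" variant with two's-complement bytes, and the text form assumes a count byte and decimal digits.

```python
LIT_BIN = 0x10    # hypothetical: LIT, size byte, then binary bytes
LIT_TEXT = 0x11   # hypothetical: LIT, count byte, then decimal digits as text


def encode_binary(value, byteorder="little"):
    """Encode a signed integer as LIT_BIN, a size byte, and the number's bytes."""
    size = (value.bit_length() + 8) // 8      # always leaves room for a sign bit
    return bytes([LIT_BIN, size]) + value.to_bytes(size, byteorder, signed=True)


def encode_text(value):
    """Encode an integer as LIT_TEXT, a count byte, and its decimal digits."""
    digits = str(value).encode("ascii")
    return bytes([LIT_TEXT, len(digits)]) + digits


def decode_literal(code, pos, byteorder="little"):
    """Return (value, next position) for the literal that starts at `pos`."""
    token, length, start = code[pos], code[pos + 1], pos + 2
    payload = code[start:start + length]
    if token == LIT_BIN:
        return int.from_bytes(payload, byteorder, signed=True), start + length
    if token == LIT_TEXT:
        return int(payload.decode("ascii")), start + length
    raise ValueError("not a literal token")


print(encode_binary(15377).hex())   # 4 bytes: token, size, 0x11, 0x3c
print(encode_text(15377).hex())     # 7 bytes: token, count, then "15377"
```

For 15377 the binary form costs four bytes and the text form seven, which is the space trade-off described above. On a real target the text form would be parsed by the Forth number conversion and so would follow BASE; the decimal-only `int()` here is a simplification.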

* Extra codes. You can use one bytecode to say the next byte is a code from a different table. With one such byte you get 255 one-byte bytecodes and 256 two-byte bytecodes. With two of them you get 254 one-byte bytecodes and 512 two-byte bytecodes. Or you could have 255 one-byte bytecodes, 255 two-byte bytecodes, and 256 three-byte bytecodes. Etc. Ideally the codes you use most will be one-byte codes and the more rarely used will be the longer ones. Or you can have a token that says to switch to one of 256 other bytecode systems, and then you use the other system until you switch again. You could have a core of perhaps 128 words that are common to many bytecode systems, and switch only the other half of the codes.
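The counts above can be checked with a little arithmetic. The helper below is a hypothetical Python sketch: it takes the number of escape codes reserved in each table at each depth and returns how many codes of each length are left for real words.

```python
def capacity(escapes_per_level):
    """Count usable bytecodes of each length.

    escapes_per_level[i] is the number of escape codes reserved in every
    table at depth i. The deepest tables reserve none.
    Returns [one-byte codes, two-byte codes, ...].
    """
    counts = []
    tables = 1                      # number of tables reachable at this depth
    for escapes in escapes_per_level:
        counts.append(tables * (256 - escapes))
        tables *= escapes           # each escape code opens one more table
    counts.append(tables * 256)
    return counts


print(capacity([1]))      # [255, 256]
print(capacity([2]))      # [254, 512]
print(capacity([1, 1]))   # [255, 255, 256]
```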

* Source parsing. You can have a bytecode that uses a string. It can either have a byte (or two bytes etc) to say how long the string will be, or it can have a delimiter to tell it the string is complete. You can have as many bytecodes as you like that use strings, so that your bytecode can include .( ." ABORT" S" etc. 
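Both ways of embedding a string can be sketched in Python. The token value `DOT_QUOTE` and the helper names are hypothetical, and the delimiter is assumed to be the double quote, as it is for ." in source.

```python
DOT_QUOTE = 0x20   # hypothetical bytecode standing for ."


def read_counted(code, pos):
    """String stored as a count byte followed by that many characters."""
    count = code[pos]
    start = pos + 1
    return code[start:start + count].decode("ascii"), start + count


def read_delimited(code, pos, delimiter=ord('"')):
    """String stored as characters followed by a closing delimiter byte."""
    end = code.index(delimiter, pos)
    return code[pos:end].decode("ascii"), end + 1


counted = bytes([DOT_QUOTE, 5]) + b"hello"
delimited = bytes([DOT_QUOTE]) + b'hello"'

# In both cases the string starts right after the token at position 0.
print(read_counted(counted, 1))
print(read_delimited(delimited, 1))
```

Both calls return the text and the position of the next bytecode. A one-byte count limits the string to 255 characters, and a delimiter means the string cannot contain that byte, which is why the choice has to be part of the bytecode design.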

* Bytecode parsing while reading bytecode. Suppose your source code does something like ['] FOO EXECUTE . To get the same result with bytecode, the bytecode needs to have the string " FOO" and when the bytecode is compiled, *then* the target finds the execution token and compiles it as a literal. Alternatively, if the ticked word has a bytecode you could get the execution token from the known bytecode.
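The two alternatives can be put side by side in a Python model. All the names here (`TICK_NAME`, `TICK_TOKEN`, `FOO_TOKEN`, the two lookup tables) are assumptions for illustration; on a real target the name lookup would be a dictionary search and the token lookup would use the interpreter's own token table.

```python
TICK_NAME = 0x30    # hypothetical: followed by a count byte and the word's name
TICK_TOKEN = 0x31   # hypothetical: followed by the ticked word's own bytecode
FOO_TOKEN = 0x40    # hypothetical bytecode assigned to FOO


def foo(stack):
    """Stand-in for the target's definition of FOO."""
    stack.append(42)


TARGET_DICTIONARY = {"FOO": foo}   # name -> execution token
TOKEN_XT = {FOO_TOKEN: foo}        # bytecode -> execution token


def resolve_tick(code, pos):
    """Return (execution token, next position) for a tick sequence at `pos`."""
    if code[pos] == TICK_NAME:
        count = code[pos + 1]
        start = pos + 2
        name = code[start:start + count].decode("ascii")
        return TARGET_DICTIONARY[name], start + count
    if code[pos] == TICK_TOKEN:
        return TOKEN_XT[code[pos + 1]], pos + 2
    raise ValueError("not a tick token")


xt, _ = resolve_tick(bytes([TICK_NAME, 3]) + b"FOO", 0)
stack = []
xt(stack)            # the EXECUTE step
print(stack)         # [42]
```

The name form works for any word the target can find but costs the length of the name; the token form costs two bytes but only works for words that have a bytecode.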

* Bytecode that parses at execution time. _This is a gotcha._ If you write a scaffolding word that parses source code while the application is built, or if you write a word that parses text as part of the application, how will the bytecode system tell the difference? Perhaps the same code is used both ways. But there's a big difference between parsing strings inside bytecode versus parsing strings from the input buffer. It's possible to build complicated double-use words that can tell which they are doing and which do the right thing either way. Much simpler to require the programmer to make a distinction. Do you really need a bytecode system that can handle any Forth code that anybody throws at it? If not, you save a lot of bother by requiring programmers to occasionally work around limitations. So you can have a bytecode that corresponds to ' and that will do ' on the target input buffer. You can have a different bytecode that does ' on source code and saves a bytecode that the bytecode reader will convert to a target execution token. Choose which one you want. If you don't need them you can leave either or both out of your own bytecode system.

* CREATE DOES> . If you want to run CREATE DOES> on your host system you can do it in code that will not be converted to bytecode. Or you can run a version of CREATE DOES> that makes child words in bytecode. Or you can run a version of CREATE DOES> that makes parent words in bytecode and child words using the interpreter of the target. These are three different things that work like CREATE DOES> . If you need more than one of them, they need different names at least.