Skip to content

Source §1 and §2

martin-henz edited this page May 1, 2020 · 1 revision

This page serves as a developer guide for the virtual machines running Source §1 and §2 in a machine language similar to SVML. SVML has been properly defined in a separate page SVML Specification.

Table of Content

Source §1

Components

In general, the entire machine consists of a compiler and a virtual machine. The compiler first takes in the Abstract Syntax Tree output by the parse() function in Source §4, and compiles it to a sequence of machine code. The virtual machine then runs the machine code accordingly.

Compiler

The compiler takes in a Source §1 program string in the form of:

const input = "\
let x = 0;     \
x;             \
";

which it will parse and compile. In the compilation stage, all the references should be resolved, and no further additional interference on the references should be done by the virtual machine. Below are some notable functions that we would discuss on how the compiler works.

Notable functions:

Starting point: parse_and_compile(program_string)

The compiler creates an empty array for the generated machine instructions to live in. An insert_pointer keeps a pointer to where the machine instruction should be inserted. A START instruction marks the start of the program, followed by making the entire program as a nullary function and running it, creating the global environment. (LDF, CALL, DONE) Refer to lecture note for more on this.

It parses the entire program string into an AST, and compiles it. After the compilation, a compiled instruction set is returned.

push_to_compile(compile_task)

Simply push a task to the bottom of the queue of tasks to be compiled for further compilation.

compile(expr, index_table, insert_flag)

compile takes in an expression in the AST, the current cumulative index table which stores all the references so far, and an insert_flag to compile. At the time of writing, insert_flag is only true for the wrapper function of the entire program, in which the return of the entire program is handled accordingly.

The following expressions are compiled to a primitive machine instruction:

  1. number => LDCN
  2. boolean => LDCB
  3. undefined => LDCU
  4. [name] => LD

The rest of the expressions:

  1. boolean_expression
  2. conditional_expression
  3. primitive_application
  4. application
  5. function_definition
  6. sequence
  7. constant_declaration
  8. return_statement

Are handled differently. Refer to the code.

add_N-nary_instruction(op_code, ...args)

Generally, a machine instruction is nullary (without any arguments). In the case of CALL, LDF etc. which carry payload, you will have to use other add functions.

Virtual Machine

The virtual machine takes in a compiled program from the compiler and starts executing the instructions. Refer to the lecture notes for more details on how each instruction should work. Here, we list the features of the virtual machine without implementation details.

Memory management: node-based heap model

Unlimited heap space is assumed in this. For more details on memory management, please refer to the Garbage Collector documentations. This is modelled after C heap management, where each node in the memory is not protected (node corruption might happen if VM handles it wrongly).

Nodes are generally organized in the following structure:

Memory Offset Value
0 Node tag
1 Node size
2 First child offset
3 Last child offset
4 First entry (if any)
... ...

where the children of the node refers to the references to other nodes it keeps. Should any future work requires modification to reference (e.g. reference forwarding in Garbage Collection), such reference should be included in the range of the children offset.

If a node has no child, first child offset should be larger than the last child offset.

Below are the existing nodes for easy reference:

  • -100 NUMBER_NODE
  • -101 BOOL_NODE
  • -102 ENVIRONMENT_NODE
  • -103 CLOSURE_NODE
  • -104 RTS_FRAME_NODE
  • -105 OS_NODE
  • -106 UNDEFINED_NODE

Instruction set

Below are the instructions defined in the virtual machine:

  • 0: START
  • 1: LDCN // followed by: number
  • 2: LDCB // followed by: boolean
  • 3: LDCU
  • 4: PLUS
  • 5: MINUS
  • 6: TIMES
  • 7: EQUAL
  • 8: LESS
  • 9: GREATER
  • 10: LEQ
  • 11: GEQ
  • 12: NOT
  • 13: DIV
  • 14: POP
  • 15: ASSIGN // followed by: index of value in environment
  • 16: JOF // followed by: jump address
  • 17: GOTO // followed by: jump address
  • 18: LDF // followed by: max_stack_size, address, env extensn count
  • 19: CALL
  • 20: LD // followed by: index of value in environment
  • 21: RTN
  • 22: DONE

Source §2

Upgrades from Source §1

1. Add in support for all the pre-defined constants and functions in compliance to Source §2

The pre-defined constants and functions are inserted in the form of prelude program that is run before the user program is actually run. And there are 2 types of predefined functions: the ones that could be built within source, and those that utilizes underlying Source functions.

Predefined constants simply exists in this format:

const math_consts = "\
const math_E = 2.718281828459045;\
...\
";

Predefined functions able to be built from Source 1:

const predefined_functions = "\
function is_boolean(v) {\
    return v === true || v === false;\
}\
...\
";

The predefined functions relying on the underlying Source functions (e.g. display, error etc.) or those that rely on the machine level execution cannot be implemented in a similar fashion. Hence, we handled it with a premitive injection method. To achieve abstraction, we have divided this injection into four places such that only one place holds the single source of truth.

1. Injected primitive function table

const primitives = list(
    list("PAIR", 40, pair, "pair", list("addr", "addr"), "pair"),
    list("HEAD", 41, head, "head", list("pair")        , "addr"),

where the list of primitives contains entries for the injected function. Each entry is organized in the following way: list(name, opcode, underlying_func, func_name, input_types, output_type).

  • name: name of the function to be displayed
  • opcode: unique opcode to be assigned by developer and used during runtime
  • underlying_func: underlying Source function to be invoked to handle the execution
  • func_name: function name for the injection (explained in ...)
  • input_types: list of types of the arguments (variadic function should have "var")
  • output_type: type of the output of the function

In order to put in more primitive funcitons in the future (if Source language has more primitives), the developer only has to insert one more entry into this table in the format above.

It is followed by a series of auxiliary functions built for accessing and processing its content.

2. Injected function generation

The table is processed by the function generate_injected_prim_func_code(entry). The function generated will take the form of the following:

// foo is the func_name
function foo(x0, x1...) {
    return foo;
}

This might not seem to make sense, as the return statement involves some names that are not defined (at the moment). Here, the compiler actually processes this function application differently. When the function is parsed into an AST and is being compiled, the compiler first looks up the injective primitive table to check if the function is actually one of the injected primitive functions. Thanks to the fact that Source does not support function overloading, we can actually look up the name here. (Future note: if Source starts supporting function overloading, this part has to change too) In a sense, the compiler intercepts this function declaration and handles it differently from other normal function declarations

The function compile_injected_primitive takes care of the compilation of the injected function, which simply looks up for the opcode and adds the same number of LD instructions as the arguments with the correct index, and add the opcode to the machine.

3. AST injection

However, the return part gets a bit tricky. Right now, we implment this part with a hack, knowing that this part of the functions would not matter in the ultimate user program execution as the hack only affects the intermediate state of the AST. The compiler detects that the function declaration is a primitive one, and will then inject the return AST with keyword "injected":

["return_statement", [["name", [<injected_func>, null]], null]]

to

["return_statement", ["injected", [["name", [<injected_func>, null]], null]]]

This is to allow user to still write such code that returns the injected primitive function:

function foo() {
    return <injected_func>;
}

where the return statement will be compiled as per normal. With this injection, the compiler can simply add a RTN opcode following the complied code from the above. The rest of the work is handled by the virtual machine.

A notable challenge here is the variadic function. Since the Source parser parse() does not allow spread operator ... like JavaScript, we have to bypass it. Our implementation involves reading var as the argument, e.g. foo(var), from the injected primitive function. Upon reading the var argument, the compiler will generate machine instruction LDV that loads as many times as required. This part is handled in the virtual machine.

4. Virtual Machine handling of the injected primitive instructions

Future work

Re-implement the current virtual machines to be SVML compatible

Implement tail call

Implement block compilation

Clone this wiki locally