diff --git a/docs/Commentary/Assembly.rst b/docs/Commentary/Assembly.rst index 79bb83b..27b58c1 100644 --- a/docs/Commentary/Assembly.rst +++ b/docs/Commentary/Assembly.rst @@ -1,30 +1,31 @@ Assembly ######## -A lot of languages aim for "no runtime overhead". But this is unattainable, even C structs and arrays have to use memcpy occasionally. Stroscot merely aims to be as fast as C, which means compiling to good C is sufficient. But C is somewhat ill-defined and assembly can be faster in some cases, so it would be nice to do assembly directly as well. +Creating a high-level programming language with absolutely no runtime overhead is a challenging task. For example C has runtime overhead: the generated instruction sequences may not be optimal, memory management with malloc/free may be slower than a custom allocator, data type layouts may not be optimal, function calls follow the C calling convention which is not necessarily the fastest, and the standard library may be slower than hand-written routines. + +So writing code in assembly can be advantageous. It is best to think of a high-level programming language as a macro assembler with particularly complex macros - this begs the question of why most languages make it so hard to use inline assembly. Even in a JIT-style execution model, some amount of code is compiled to assembly. The difference is there are also jumps into the interpreter, which sort of "glues together" regions of compiled code. Architectures ============= -The first step in dealing with assembly is to decide which instruction set architectures to support. I couldn't find a list of processor architectures by popularity, but from `this quora answer `__ and checking it by googling numbers of units sold for other random ISAs, the two primary architectures are x86-64 AMD64 (desktops) and ARMv8-A A64 (mobile devices). +The first step in dealing with assembly is to decide which instruction set architectures to support. I couldn't find a list of processor architectures by popularity, but from `this quora answer `__ and checking it by googling numbers of units sold for other random ISAs, the two primary architectures are x86-64 AMD64 (desktops) and ARM64 (mobile devices). Others to consider as well: -* ARMv9-A A64: It's released, devices expected in 2022. Very similar to v8 so should be able to share the code as a microarchitecture. Verdict: on the roadmap -* C: compilation to a self-contained C program makes porting much easier, and obviates the need for many of these architectures. Verdict: on the roadmap. Note though that this is only compiling to a subset of C - not every C program can be produced. And some things like tail calls are really hard to encode in C, or add significant compilation overhead, so it can't be the only option. +* C: compilation to a self-contained C program makes porting much easier, and obviates the need for many of these architectures. Verdict: on the roadmap. Note though that this is only compiling to a subset of C - not every C program can be produced. For example, jumps (tail calls) are hard to encode in C - you either do one massive function with goto's, or `trampolining `__, or non-portable TCO conventions like Manticore's JWA convention. * WASM: it still doesn't support `tail calls `__. Given the lack of progress it seems like a low priority. Verdict: Contributor. * LLVM: The bitcode format may be worth targeting at some point. 
Per blog posts the API is much more unstable than the IR, and generating the IR in memory and parsing it is about as fast as using the API. Verdict: Contributor. * RISC-V: There are $100-ish dev boards listed at https://riscv.org/exchange/boards/. No non-dev systems yet. It's a relatively simple ISA, similar to ARM. Verdict: Contributor * 32-bit ARM: Old phones, the Raspberry Pi Zero. The XML database is similar. Verdict: Contributor. * 32-bit x86: Old desktop PCs. From a time/effort perspective it seems cheaper to buy a new computer instead of writing support for these. Verdict: C backend or contributor. -* POWER: `Raptor `__ sells expensive systems. Much more niche than RISC-V. Verdict: C backend. +* POWER: `Raptor `__ sells $5K-ish systems. Much more expensive and niche than RISC-V. Verdict: C backend. * MIPS: the company that develops it went bankrupt and is now doing RISC-V. There are consumer systems available in China (Loongson), but the rumor is that they too are moving to RISC-V or else to their own architecture LoongArch. Verdict: C backend. * z/Architecture: really expensive, weird OS. Verdict: C backend. * SPARC: It's end-of-life but I guess you can still buy servers second-hand. Verdict: C backend. From a design perspective supporting 2 architectures is not much different from supporting 10, it's just a larger set of cases, but 10 is 5x the work of 2. ARM support will be tested through QEMU, x86 natively. There are also CI services that could work (Drone). Code bloat is an issue but keeping each ISA in its own folder should avoid drift. -In addition to the basic ISAs, there are also extensions and `microarchitectures `__ to consider. `PassMark `__ has a list of CPU shares, it's probably wildly skewed to gaming but it's better than nothing. The data on CPU cycles, ports, etc. is rather detailed and has to be generated by running benchmarking programs, so it will probably depend on user submissions; for now I'll use my own CPU (AMD A6-3650 APU). +In addition to the basic ISAs, there are also extensions and `microarchitectures `__ to consider. For example ARM64 is divided into v8-A, v9-A, and others. `PassMark `__ has a list of CPU shares, it's probably wildly skewed to gaming but it's better than nothing. The data on CPU cycles, ports, etc. is rather detailed and has to be generated by running benchmarking programs, so it will probably depend on user submissions; for now I'll use my own CPU (AMD A6-3650 APU). Operating systems ================= @@ -33,66 +34,41 @@ In planned order: 1. Linux for AMD64, because it's what I'm typing on now 2. Android for ARM, because it's my phone and it's easy to hook up -3. Windows for AMD64, which I can emulate with WINE and get access to fairly easily +3. Windows for AMD64, which I can emulate with WINE and test fairly easily We'll exclude Apple for now because their OS documentation sucks, they charge $100/year for a "developer license", and their anti-competitive practices mean that they would probably find some way to shut Stroscot down once Stroscot starts being a serious competitor with Swift. Of course there is nothing stopping someone else from jumping through all the hoops needed to placate Apple and making a port. Instruction database ==================== -Data sources ------------- +x86 has a lot of instructions - somewhere around 1000 unique mnemonics, and many variants of those instructions. ARM too has at least a thousand instruction variants. 
With so many, it is clear that a structured database of instruction information is needed. -The basic goal is to have official data sources where possible and otherwise generate it automatically via measurement, that way new processors / ISAs can be added quickly. In terms of data sources for ISAs, for x86 the official sources are `Intel's SDM `__ / `AMD's Architecture Programmer's Manual `__, which use English and pseudocode and have numerous typos (if the experiences of others hold true). Also they are only distributed as PDFs. Parsing the PDFs is a lot of work. `EXEgesis `__ uses a hacky Xpdf parser but has some amount of effort invested by Google. `x86doc `__ uses pdfminer to generate HTML which seems like a more friendly starting point. +Goals +----- -More structured but less official are x86 instruction databases: +* 99% Completeness - it is not too hard to cover all of the instructions mentioned in official sources, and all of the "undocumented" instructions discovered so far by tools such as sandsifter and haruspex. But outside of this, it is impossible to be complete - there are simply too many bit patterns. sandsifter/haruspex take days to run and do not even explore the full instruction space, making assumptions about the format of instructions. But these tools have confirmed that there are many undocumented instructions. Therefore, it must be assumed that the database is incomplete - more instructions may be discovered in the future. We should therefore allow raw bit patterns not present in the database, ``instr('f0 0f')`` or similar, throughout the pipeline. -* `Intel XED `__ (`file `__). This might as well be official, although it is technically its own open-source project I think Intel uses it internally. -* LLVM `x86 tables `__ -* NASM `instruction table `__ -* `GNU Assembler (gas) `__ -* `iced `__ -* `OSACA `__ is AGPL licensed and very incomplete -* `Ghidra `__, seems to have semantics -* emulators: https://github.com/colejohnson66/rbx86, https://bochs.sourceforge.io/ -* https://github.com/asmjit/asmjit, https://github.com/bitdefender/bddisasm, https://github.com/dyninst/dyninst, https://github.com/herumi/xbyak, qemu/capstone, https://github.com/diegocarba99/bagheera, https://github.com/mongodb-labs/disasm, zydis, https://github.com/MahdiSafsafi/AMED, https://github.com/nidud/asmc -* Go assembler https://cs.opensource.google/go/go/+/master:src/cmd/internal/obj/x86/avx_optabs.go;l=1791?q=vfixupimmss&ss=go -* https://github.com/Barebit/x86reference/blob/master/x86reference.xml +* Accuracy - Generally, all data should either come directly from official sources or measurement, and be automatically generated. This allows adding new instructions, processors, and microarchitectures as quickly as they become available. Furthermore it is easy to verify the information by checking it against the sources or rerunning the tool. -Semantics: - -* For ARM, we have official XML `Machine Readable Architecture instruction tables `__, which is nice-ish XML, and the code has been validated against ARM's conformance suite. There is a toy disassembler `hs-arm `__ using the tables. EXEgesis also parses the XML. `asl-interpreter `__ runs the descriptions which are written in a special language. - -* For x86-64 some academics have created a `formal X86-64 semantics `__ containing most of the userspace Haswell instructions. It was mostly manually written and has been checked with fuzzing. It is written in the K Framework syntax. 
It is missing concurrency, crypto (AES), supervisor/privileged, x87 floating-point, MMX, and also has a bug where it rounds too much with fused multiply-add floating point precision. - - -Timing: - -* https://github.com/e12005490/valgrind_timing/tree/117292a3a94f843c173bdb53e4933c6b79570240/variable_time_instructions -* I think EXEgesis or llvm-exegesis generates timings on the current x86 CPU -* ARM: ? +* Consistency - the database should have a consistent format, structure, and representation, so that it can be easily used in the compiler. This format should be documented for accessibility. Definition of an instruction ---------------------------- An instruction is a finite sequence of binary data (generally some number of bytes). The general idea is that instructions are a syntactic unit above bits, like words in a character string. Except unlike words, there's no instruction separator character; instructions are all run together like ``afewinstructions``. Segmenting ARM instructions is simple because they are all 32 or 64 bits. For x86, the length varies from 1 to 15 bytes and is affected by almost all parts of the instruction. `sandsifter `__ can determine the length of the first instruction in some bytes by finding an index for which ``seq|uence`` does not trigger a page fault, but ``se|quence`` does (where ``|`` is a page boundary). `haruspex `__ is even more tricky and examines the microcode speculation buffer performance counters to see how many nops after the byte sequence were speculated. With these tools we can segment arbitrary data into x86 instructions, assuming access to the processor. -We cannot build a 100% complete and verified list of instructions. As far as verification, there are simply too many; sandsifter/haruspex take days to run and do not even explore the full instruction space, making assumptions about the format of instructions. As far as completeness, these tools have been run on various processors and confirmed that there are many undocumented instructions, and there are likely more instructions that will be found in the future. But building a database that is 99% complete and probabilistically verified is not too hard; we take the officially documented instruction patterns, manually add whatever undocumented instructions we can find, and fuzz it a bit with sandsifter to check decoding. - -Still though, we should design for our instruction database being incomplete, and allow a syntax for writing raw instructions, ``instr('f0 0f')`` or similar. It's similar to a ``.db`` statement that allows include file headers or data in the ``.data`` section, but is meant specifically for executable data. Unfortunately with these raw instructions, if they are not in the database, many compiler optimizations are useless: pipelining, register allocation, etc. So for an optimizing compiler we need instruction metadata, like clobbered registers, cycles, possible traps, etc. There are generally sensible defaults in the absence of information (all registers clobbered, no reordering, all traps possible, etc.), but specifying this information allows optimizing raw instructions just as well as database instructions. - Templates --------- -Listing instructions out exhaustively one-per-line would be too much data due to combinatorial explosion, so instead we have a list of templates, each of which can turned into an instruction by filling in the holes. Following Xed we can call the data that is filled in "explicit operands". 
The explicit operands are distinguished bitstrings and can refer to registers, addresses, and immediate values. We choose each template so that it has similar behavior regardless of what is chosen for its explicit operands. +Listing instructions out exhaustively one-per-line would be too much data due to exponential explosion, so instead we have a list of templates, each of which can turned into an instruction by filling in the holes. Following Xed we can call the data that is filled in "explicit operands". The explicit operands are distinguished bitstrings and can refer to registers, addresses, and immediate values. We choose each template so that it has similar behavior regardless of what is chosen for its explicit operands. -The templates should have names. For automatically generating them it could be a hash of the template string, or else the smallest unique opcode prefix or something. But really we want to use the mnemonics from the docs. +The templates should have names. For automatically generating them from sandsifter data, it could be a hash of the template string, or else the smallest unique opcode prefix, or something. But really we want to use the mnemonics from the official docs where possible, for standardization. -Intel has variable-length instructions and from the docs seems to use byte-based templates, for example 64-bit ADCX is ``66 0F 38 F6 ``. The REX has 3 bits of operand data; the modrm is an operand and can be 1-6 bytes (register or memory with optional SIB/displacement). We could parse the Intel docs for this (EXEgesis + handling all the weird encoding stuff), but I think extracting Xed's `iform list `__ and using Xed for encoding is the way to go. It doesn't match the docs 1-1 but it saves on sanity - e.g. the separate memory / register templates abstract over the complications of MODRM. +Intel in their docs seems to use byte-based templates, for example 64-bit ADCX is ``66 0F 38 F6 ``. The REX is 1 byte with 3 bits of varying operand data; the modrm is an operand and can be 1-6 bytes (register or memory with optional SIB/displacement). -ARM has fixed length instructions and uses a bit-based format, for example A64 ADDS is ``sf 0101011 shift* 0 Rm**** imm6***** Rn**** Rd****``. Here each name is an operand and the stars represent extra bits of the operand - the operand is a fixed-length bitstring. hs-arm `seems `__ to pull out this information just fine, although its operand names are a little weird. +ARM has fixed length instructions and uses a bit-based format, for example A64 ADDS is ``sf 0101011 shift* 0 Rm**** imm6***** Rn**** Rd****``. Here each name is an operand and the stars represent extra bits of the operand - the operand is a fixed-length bitstring. -So the information for each template is: +A basic schema for each instruction template is: * form name (string) * explicit operands (list) @@ -105,67 +81,144 @@ So the information for each template is: * memory (size b, v, etc.) * encoding function ``[Operands] -> Bits`` -* Xed's isa_set field -* the condition on CPUID for this instruction to work -* the valid modes (32-bit, 64-bit, real, protected, etc.) -The isa_set field and fields after are because there are lots of overlapping sets of instructions and maintaining one master set is easier than duplicating the data. 
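+As a concrete illustration of this schema, here is a hypothetical sketch (in Python, not the actual database format - the record fields and the encoding function are illustrative assumptions) of what a single template record for the A64 ADDS (shifted register) form above might look like:
+
+.. code-block:: python
+
+    def encode_adds_shifted(sf, shift, rm, imm6, rn, rd):
+        """Encoding function [Operands] -> Bits: pack the operand bitstrings
+        into the fixed 32-bit pattern sf 0101011 shift 0 Rm imm6 Rn Rd."""
+        word = ((sf << 31) | (0b0101011 << 24) | (shift << 22) | (0 << 21)
+                | (rm << 16) | (imm6 << 10) | (rn << 5) | rd)
+        return word.to_bytes(4, "little")
+
+    adds_shifted = {
+        "form": "ADDS_shifted_register",   # name taken from the mnemonic
+        "operands": [                      # explicit operands and their widths in bits
+            ("sf", 1), ("shift", 2), ("Rm", 5), ("imm6", 6), ("Rn", 5), ("Rd", 5)],
+        "encode": encode_adds_shifted,
+    }
+
+    # ADDS X0, X1, X2 (64-bit, no shift): prints "200002ab",
+    # i.e. the word 0xAB020020 in little-endian byte order
+    print(adds_shifted["encode"](1, 0, 2, 0, 1, 0).hex())
+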
+Metadata +-------- -Affected state --------------- +Although we should allow the possibility of having no data about an instruction besides its bit pattern, many compiler optimizations depend on having more data, such as pipelining, register allocation, instruction scheduling, and instruction selection. Basic list of data: -Another important piece of data is the affected state, i.e. the list of read/written flags/registers/etc. This is used in instruction reordering/scheduling and register allocation to minimize `data hazards `__. +* conditions under which instruction is valid (processor, CPUID, valid modes such as 32-bit, real, protected, etc.) +* timing -The affected things depend on the instruction (and the operands). Where can we get this info? + * latency - number of clock cycles required for execution + * throughput - rate at which the instruction can be executed (cycles/instr) + * whether execution time is data dependent (for cryptography) + * micro-ops - if the instruction is broken down + * functional units - ALU, multiplier, divider, and load/store units -It seems possible to automatically determine by fuzzing (weighted towards special cases like 0 and 1). But it's probably really slow and the result is somewhat suspect - it can't determine that a flag/register becomes undefined, and it may miss reads/writes that happen in rare circumstances. +* affected state - read/conditionally read/written/clobbered flags/general-purpose registers/stack pointer/memory/FP regs/SIMD regs/program counter - used in instruction reordering/scheduling and register allocation to minimize `data hazards `__. "undefined" in the context of Intel means "arbitrary bit-pattern may be returned". This is distinct from C's UB but matches LLVM's "undef". +* possible traps/exceptions +* pseudo-resources - to represent non-deterministic instructions, like random number generation, monitors, etc. +* semantics - the mathematical function of state + pseudo-resources -> state represented by the instruction +* Category: nop, movement, arithmetic, logic, floating point, string, cryptography, SSE, AVX, control flow, I/O, system call/privilege rings/virtualization, concurrency, (atomics, fences), cache control (prefetch/barrier), performance monitoring/debugging, virtual memory, interrupts/exceptions/traps -In the Intel docs there is a little ``(r,w)`` or ``(r)`` after the operands. But this doesn't include everything. The rest can be found by scanning the English text, but unless we use NLP this will only give a list of affected things and not read/write info. Xed has info on read/written standard flags. But it abbreviates other flag registers - for example (per the Intel documentation) VFIXUPIMMSS reads MXCSR.DAZ and conditionally updates MXCSR.IE and MXCSR.ZE, but Xed just records a MXCSR attribute. LLVM similarly just has ``USES = [MXCSR]``. NASM and gas don't seem to have flag information at all. iced does have flag info but no MXCSR. The K semantics don't have MXCSR. So I guess Xed is the best data source but we will have to use EXEgesis somehow to scrape the affected flags, and then manually mark them as read/write/conditional or just leave it at coarse reordering information. +To handle novel instructions, there are generally sensible defaults in the absence of information (all registers set to arbitrary bit patterns, all traps possible, non-deterministic "anything goes" semantics, etc.). -For ARM modifying asl-interpreter should give info on flags etc. 
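+To make the "sensible defaults" concrete, here is a hypothetical Python sketch of such a metadata record with its conservative defaults (the field names and representation are illustrative assumptions, not a finalized schema):
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+
+    @dataclass
+    class Metadata:
+        valid: object = "unknown"    # conditions (CPUID, modes) not known
+        latency: object = None       # cycles; unknown
+        reads: object = "all"        # assume any register/flag/memory may be read
+        writes: object = "all"       # assume any register/flag/memory may be clobbered
+        traps: object = "all"        # assume any trap/exception is possible
+        semantics: object = None     # None = non-deterministic "anything goes"
+
+    # a raw bit pattern not present in the database gets only the defaults,
+    # so the optimizer must treat it as a full barrier
+    unknown_instr = Metadata()
+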
+Since we represent sets of instruction using templates, the metadata is conceptually specified for a given template as a function ``[Operands] -> Metadata``. -So the affected state database schema: +Data sources +------------ -* form name -* affected things (list) +For x86, the official sources are `Intel's SDM `__ / `AMD's Architecture Programmer's Manual `__, which use English and pseudocode and have numerous typos (if the experiences of others hold true). Also they are only distributed as PDFs. Parsing the PDFs is a lot of work. `EXEgesis `__ uses a hacky Xpdf parser but had some amount of effort invested in it by Google before they stopped maintaining it. `x86doc `__ uses pdfminer to generate HTML which seems like a more friendly starting point. - * type: +More structured but less official are x86 instruction databases: - * explicit operand (+ index) - * fixed register - * pseudo resource - * flag bit +* `Intel XED `__ (`file `__). This might as well be official - although it is technically its own open-source project, I think Intel uses it internally. +* LLVM `x86 tables `__ +* NASM `instruction table `__ +* `GNU Assembler (gas) `__ +* `iced `__ +* `OSACA `__ is AGPL licensed and very incomplete +* `Ghidra `__, seems to have semantics +* emulators: https://github.com/colejohnson66/rbx86, https://bochs.sourceforge.io/, QEMU +* https://github.com/asmjit/asmjit, https://github.com/bitdefender/bddisasm, https://github.com/dyninst/dyninst, https://github.com/herumi/xbyak, qemu/capstone, https://github.com/diegocarba99/bagheera, https://github.com/mongodb-labs/disasm, zydis, https://github.com/MahdiSafsafi/AMED, https://github.com/nidud/asmc +* Go assembler https://cs.opensource.google/go/go/+/master:src/cmd/internal/obj/x86/avx_optabs.go;l=1791?q=vfixupimmss&ss=go +* https://github.com/Barebit/x86reference/blob/master/x86reference.xml - * read: read / not read / conditionally read / unknown - * written: +Overall I think extracting Xed's `iform list `__ and using Xed for encoding is the way to go. It doesn't match the docs 1-1 but it saves on sanity - e.g. the separate memory / register templates abstract over the complications of MODRM. - * value: constant, copied from input, input + constant, undefined/reserved, complex computation - * written with value, not written, conditionally written with value, unknown +We also need instruction semantics. Some academics have created a `formal X86-64 semantics `__ containing most of the userspace Haswell instructions. It was mostly manually written and has been checked with fuzzing. It is written in the K Framework syntax. It is missing concurrency, crypto (AES), supervisor/privileged, x87 floating-point, MMX, and also has a bug where it rounds too much with fused multiply-add floating point precision. I don't know how to manipulate K language stuff but the actual instruction semantics is pretty simple so maybe something could be hacked together. -* possible exceptions +For affected state, there are some choices. In the Intel docs, there is a little ``(r,w)`` or ``(r)`` after the operands, that EXEgesis picks up, but this doesn't include everything. For example (per the Intel documentation) VFIXUPIMMSS reads MXCSR.DAZ and conditionally updates MXCSR.IE and MXCSR.ZE, but these are not in the thing. Xed has info on read/written standard flags. But it abbreviates other flag registers - for VFIXUPIMMSS, Xed just records a MXCSR attribute. LLVM similarly just has ``USES = [MXCSR]``. NASM and gas don't seem to have flag information at all. 
iced does have flag info but no MXCSR. The K semantics don't have MXCSR. So I guess Xed is the best data source but we will have to use EXEgesis somehow to scrape the affected flags from the instruction description, and then manually mark them as read/write/conditional or just leave it at coarse reordering information. It might be also possible to automatically determine it by fuzzing (weighted towards special cases like 0 and 1). But it's probably really slow and the result is somewhat suspect - it can't determine that a flag/register becomes undefined, and it may miss reads/writes that happen in rare circumstances. -Instructions with no data have all possible affected things present, with read/write unknown. +For ARM, we have official XML `Machine Readable Architecture instruction tables `__. It includes both the encoding and the semantics, and the code has been validated against ARM's conformance suite. There is a toy disassembler `hs-arm `__ using the tables. hs-arm `seems `__ to pull out the template information just fine, although its operand names are a little weird. `asl-interpreter `__ runs the descriptions, which are written in a special language - it should be possible to run this to get affected state. -Pseudo-resource includes things like load-link/store-conditional. LDXR sets monitors (write) and STXR checks monitors (read). A second LL clears the monitor so LL is actually read/write. Anyway the monitor is a pseudo resource, because it's not a register. +Timings: -"undefined" in the context of Intel means "arbitrary bit-pattern may be returned". This is distinct from C's UB but matches LLVM's "undef". +* https://www.agner.org/optimize/instruction_tables.ods +* https://uops.info/ +* I think EXEgesis or llvm-exegesis generates timings +* There are some papers on using ML and measured timings to predict block performance, probably more accurate than instruction-level data -Classification --------------- +Foreign code +============ -We can classify instructions: +One specific set of assembly sequences we care about is calling code in other languages, particularly C. Many functions use the C calling convention, such as C memory allocation and Windows system calls. The most straightforward way to call these is to look up the symbol's in the object file, set up the stack and registers appropriately for the calling convention, and jump to it. It is a bit annoying for example as we must allocate space for the C stack; Go uses 4MB or so. Also the calling conventions are messy - we will have to extract them from libffi or `LLVM `__. -* Data usage: does it read or write flags/general-purpose registers/stack pointer/memory/FP regs/SIMD regs/program counter? -* Category: nop, movement, arithmetic, logic, floating point, string, cryptography, SSE, AVX, control flow, I/O, system call/privilege rings/virtualization, concurrency, (atomics, fences), cache control (prefetch/barrier), performance monitoring/debugging, virtual memory, interrupts/exceptions/traps, +Another solution is to create a stub C file with a method declared with a fixed, known calling convention containing the desired FFI code, This C file can then be compiled to assembly via clang or GCC. Then this assembly can then be processed by identifying the method in the output and converting the contents of that method and its dependencies to Stroscot's IR, essentially decompiling it into Stroscot. 
The IR can then be optimized to remove the overhead of the intermediary known calling convention, directly assigning registers and returning values. So for example if we wanted to compile the ceiling function, ceil, we'd create a stub method stub_ceil, something like: -Performance ------------ +.. code-block:: C -the data present in LLVM for instruction scheduling (such as uops, execution ports/units, and latencies), + fastcall double stub_ceil(double in) { + return ceil(in); + } -If PSTATE.DIT is 1 the execution time is independent of the values. +So then we would compile that to assembly, decompile it to Stroscot, and patch it into the IR. Compared to calling the symbol, this approach is much more flexible - it handles: -Memory model ------------- +* all calling conventions, and pretty much all languages (as they have FFI's to be called from C) +* functions defined as macros +* inlined functions +* C++ template expansion +* writing arbitrary code, instead of just calling pre-defined functions - and it's all optimized with the language's native compiler *and* Stroscot's decompiler+assembler + +So the stub function approach will handle pretty much everything - it is robust. The calling convention details are baked into the compiled stub, so with no optimization, we can simply include the stub as a blob and call the compiled stub using the known calling convention. And it is not too hard to analyze the assembly, remove intermediate register assignments, and ensure that calling a symbol with the C calling convention optimizes to a direct call just like the simple approach. The costs? We have to decompile assembly to IR, and also we still have to implement at least one C calling convention, albeit we can choose the simplest/easiest one. + +A third approach is to use the stub method but compile to LLVM IR instead of assembly. LLVM has a more structured representation, representing calling conventions and so on explicitly, so we have to implement all the calling conventions again like with the symbol approach. We also have to translate LLVM IR to Stroscot IR, a bit easier than decompiling aassembly as there are only 67 instructions. It may also be easier to optimize, as LLVM's SSA form means we do not have to recover clobbered register information. It is also possible to use the Clang API to directly process C/C++ to LLVM in-memory, which should be faster than generating assembly via a separate process. + +Eventually, it would be good to support all of the methods. For the initial implementation, the LLVM stub seems the priority, as compiling C/C++ robustly and efficiently is the main goal. For simple cases, the compiled stub will likely consist of a call instruction and nothing else, so we can work on supporting that instruction and gradually add more support. So the initial work consists of implementing the C default calling convention and calling Clang/LLVM, giving functionality as good as the direct-symbol approach with about as much work. Then we can explore assembly stubs and direct symbol calls later . + +Now for importing a whole file, like a C header file, it's a bit more involved, and we do actually have to work at the source level. The stub method let us evaluate any snippet of code, but we have to determine the symbols, types, and so on to use in those snippets. So we have to read the header file, process each declaration, and generate a stub for each. Or multiple, in the case of a C++ template function. 
Again, some things that may look like functions to the C programmer may actually be declared using macros or other techniques, but in this context there is not enough information to determine the desired Stroscot <-> foreign language mapping. So this process can be semi-automated at most; it will be able to bind the functions in the common cases but programmers will have to write the stub code and signature manually in complicated cases. + +And then of course there's linking - you have to specify the object file that you're going to link with. Since Stroscot uses the JIT execution model, you can just specify the object file at run time as a file path, and Stroscot will load that file and link with it. That complicates things, because it means the file needs to be available both when you're testing the program and when you're running it - the whole model depends on having a running system. For example, to test an embedded system you have to hook the embedded system up to a faster computer over the network, so that the fast computer can do the heavy lifting like compilation and optimization. So that's a bit of complexity. + +Operations +========== + +To abstract the ISA we consider the instructions from a functional perspective - these functions are called "operations". Operations are exposed in Stroscot as intrinsic functions. This allows using a familiar syntax. + +Operations don't reference registers; the operations all take/return temporaries. Since all registers/flags/etc. can be stored/loaded to memory, temporaries are conceptually an immutable bitstring of a fixed bitwidth. These bitwidths vary by the instruction: x86 uses 1, 8, 16, 32, 64, 80, 128, 256, 512, etc. (for flags, segment registers, general-purpose registers, FPU registers, MMX/SSE/AVX). + +For example the operations corresponding to x86-64 DIV, ADD, and ADC with 64-bit operands look like: + +:: + + divide (src : B64) (high : B64) (low : B64) = + divisor = src + dividend = high ++ low + if divisor == 0 + fault DE + else + quotient = dividend / divisor + if quotient >= 2^64 + fault DE + else + { quotient = quotient, remainder = dividend mod divisor } + + add (src1 : B64) (src2 : B64) = + dest = src1 + src2 + ... compute flags ... + { dest, OF, SF, ZF, AF, CF, PF } + + adc (src1 : B64) (src2 : B64) (cf : B1) = + dest = src1 + src2 + cf + ... compute flags ... + { dest, OF, SF, ZF, AF, CF, PF } + +Accessing memory is handled by a separate operation - but in the ISA x86 has combined read-add instructions: + +:: -A memory model is needed to determine if reordering data writes will change the behavior of a concurrent program. + read : Addr -> {B64 | B32 | B16 | B8} + read a = + if noncanonical a -- https://stackoverflow.com/questions/25852367/x86-64-canonical-address + if referencesSSsegment a + fault SS(0) + else + fault GP(0) + else + if unaligned a && enabled alignment_checking + fault AC(0) + else if not_in_physical_memory a + fault PF(fault-code) + else + memory[a] diff --git a/docs/Commentary/BuildSystem.rst b/docs/Commentary/BuildSystem.rst index 0520e3c..cf3ecbd 100644 --- a/docs/Commentary/BuildSystem.rst +++ b/docs/Commentary/BuildSystem.rst @@ -523,6 +523,13 @@ in the cache daemon. Reproducibility =============== +traceability - It should be clear where the binary came from and how it was produced. +automated reproducibility - Users can run one command to reproduce all binaries to verify that they have not been tampered with.
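+As a sketch of what that one command could look like (the provenance fields, build invocation, and helper names here are assumptions for illustration, not the actual build system interface):
+
+.. code-block:: python
+
+    import hashlib, subprocess, tempfile
+
+    def verify(provenance):
+        """Rebuild a binary from its recorded provenance and compare hashes."""
+        with tempfile.TemporaryDirectory() as src:
+            subprocess.run(["git", "clone", "--quiet", provenance["repo"], src], check=True)
+            subprocess.run(["git", "-C", src, "checkout", "--quiet", provenance["revision"]], check=True)
+            subprocess.run(provenance["build_command"], cwd=src, check=True)
+            with open(f"{src}/{provenance['output']}", "rb") as f:
+                digest = hashlib.sha256(f.read()).hexdigest()
+        return digest == provenance["expected_sha256"]
+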
+ +Example: "this package was built by running on revision of the distribution's package repository." A user can then easily reproduce the binary by fetching the specified sources and running the specified command. This build will in most cases depend on a previous generation of built binaries. Thus, we get a chain of verifiable binaries stretching back in time. + +Binaries may also come from upstream, with no clear production process. This would typically be the case when a language does not have a clear bootstrap process. The distribution should at least clearly label the provenance of the binary, e.g. "this binary was downloaded from https://upstream-compiler.example.org/upstream-compiler-20161211-x86_64-linux.tar.xz at time DDD with hash 123". + Reproducibility check --------------------- @@ -542,6 +549,11 @@ If you know that some rule is not reproducible (e.g. it generates a random signi Similarly, some files may not be redistributable (copyright license or similar), these are reproducible but the data cannot be stored in the cloud cache. +Bootstrap +--------- + +So that there is not a chicken-and-egg problem, it is important that the build system code itself can be bootstrap-built with a shell script or other simple executable. Probably there should be a special mode of the build system's build system that emits the build system instructions as a shell script. + Daemon-less mode ================ diff --git a/docs/Commentary/Checklist.rst b/docs/Commentary/Checklist.rst index 139a7e2..6fc42e5 100644 --- a/docs/Commentary/Checklist.rst +++ b/docs/Commentary/Checklist.rst @@ -310,9 +310,13 @@ Anything more complex is a language or library feature (I don't distinguish the Adoption ~~~~~~~~ -Language and feature designers sometimes explicitly focus on adoption. But a widely-adopted language becomes ossified, as nobody want their code broken. Hence SPJ's motto of "avoid success at all costs". But this seems too drastic - most languages have adopted "change management" processes that allow some level of evolution while maintaining adoption. Automatic migration (as in early Go) and the compiler supporting multiple language versions/dialects at once should allow a wide variety of changes to be made with minimal breakage. +How many users should Stroscot have? Well, as with SPJ's motto of "avoid success at all costs", there is such a thing as too popular. A widely-adopted language becomes ossified, as nobody wants their code broken. This can be addressed by developing "language change/evolution management" tools, like automatic migration (as in early Go) and the compiler supporting multiple language versions/dialects at once. These should allow all sorts of changes to be made with minimal breakage to users, even if the language is significantly popular, while still adding minimal delay and overhead to language development. Explicitly, I do not want governance procedures/processes like PEPs or the Rust council, which prevent language changes on a social level - never solve a problem through social means when there is a technical solution. -Another tendency, particularly for languages backed by industry, is to get semi-popular very quickly and then suddenly drop off the radar a few years later. This is due to being "all hype" and not really adding anything new. At least in the early days, there is some benefit to discouraging adoption, via burdensome installation requirements or frequent breaking changes.
Although it slows adoption in the short term, such policies strengthen the community by forcing members to participate fully or not at all. Those who remain find that their welfare has been increased, because low-quality "what's going on'" content is removed and feedback loops are shorter. The overall language design benefits as result, and can evolve much faster. (Compare: strict religions that prohibit alcohol and caffeine consumption and modern technology, a random guy `pruning `__ the 90% of members who have not posted a message in the past two weeks from his website) +So with that out of the way, growth is really a social problem. Do I want to spend my days reading PR's and writing comments, as Linus Torvalds does, or spend my days coding? Well, I am not really that great a coder. I type code slowly and over-design. Honestly it would be great to design by English. But it is not like everyone will drop what they are doing and be at my beck and call. It is an exchange of interests - Stroscot will have to provide some value to users, and they will have to judge that contributing to Stroscot's vision is better than using other software. Still though, for individuals that do decide to contribute to Stroscot, I will not turn them away. I think once the technical tools for dealing with adoption are in place, SPJ's motto is in fact wrong and success is necessary and desirable. + +Then there is the question of whether to focus on adoption. I think this is like performance - it definitely matters, it definitely contributes to long-term language progress, and it directly affects financial success (in terms of donations / visibility). So it is worth tracking. But like performance, it is often premature to expend significant effort on adoption. Like buying ads for the language - probably a waste of money compared to improving error messages or some such. Focusing on the core goals of Stroscot like functionality, minimality, learnability, and concision will naturally lead to user adoption in the long term. With link aggregators and a decent website, it is possible to go from zero to 100,000 users in a week (c.f. `hitting the front page `__). But it takes "the perfect storm" of user interests, informative website, and positive comments and votes. I think one simple mark of progress is that the project becomes interesting enough that *someone else* - unrelated to the project - submits the project to a link aggregator. That is probably the point at which it is worth devoting attention to adoption (as opposed to learnability). A language need at least 10-15 years of development before charting on the TIOBE index or PyPL. Long-term, it is more important to avoid fading into obscurity than to shoot for #1. + +Another problem, particularly for languages backed by industry, is that they get semi-popular very quickly, and then suddenly drop off the radar a few years later. This is due to being "all hype" and not really adding anything new. At least in the early days, there is some benefit to discouraging adoption, via burdensome installation requirements or frequent breaking changes. Although it slows adoption in the short term, such policies strengthen the community by forcing members to participate fully or not at all. Those who remain find that their welfare has been increased, because low-quality "what's going on'" content is removed and feedback loops are shorter. The overall language design benefits as result, and can evolve much faster. 
(Compare: strict religions that prohibit alcohol and caffeine consumption and modern technology, a random guy `pruning `__ the 90% of members who have not posted a message in the past two weeks from his website) But with this approach, one must be careful that the programming language still provides sufficient value to hold at least some amount of users - otherwise there is no feedback at all. The barriers to adoption must also be reasonable, and similarly barriers to prevent people from leaving are probably limited to implicit ones like language lock-in. It is not worth discouraging users too strongly, as these attempts can backfire with blog posts such as "my terrible experience trying to use Stroscot" or "my terrible experience trying to get rid of Stroscot", destroying what little reputation the language may have built up. Although discouraging adoption may be the policy, each individual user's interaction with the community should be engaging and issues that appear should actually be addressed. @@ -321,7 +325,7 @@ There are not really any best practices to encourage adoption but :cite:`meyerov * Numerous people have made efforts to design programming languages, but almost all of these have failed miserably in terms of adoption. Success is the exception to the rule. Contrariwise, others observe that language usage follows a "fat tail" distribution, meaning that failure is not as bad an outcome as one might expect and even a "failed" language can have some popularity. * Successful languages generally have convoluted adoption paths, suggest that extrinsic factors are influential. (TODO: How influential? Top 10? Top 100?) * Language failures can generally be attributed to an incomplete understanding of users' needs or goals. -* Evolution or re-invention, by basing a design on existing experiences, increases understanding. +* Evolution or re-invention, by basing a design on existing experiences, increases understanding. I suspect that most languages will need at least 5-10 years of development before reaching their first stable release, followed by another 5 years or so before they start to take off - and that is assuming they are lucky enough to actually take off, as there are many languages that instead fade into obscurity. * Surveying the literature is often advocated but rarely or never followed to a rigorous standard. The main sticking point is that it is difficult to evaluate language features accurately except by attempting to use them in a new language. * In the `diffusion of innovation `__ model, innovation is communicated over time through different channels along a network. Adoption is a 5-step linear process for each node: @@ -332,7 +336,17 @@ 5. Confirmation: an individual finalizes the adoption decision, such as by fully deploying it and publicizing it. Encouraging such individuals to publish experience reports can start the adoption cycle over and cause the language to spread further. The limiting factor here is observability, the ability to get results. * Power - A language needs a unified design, and generally this means designating a single person to make the final decisions. Essentially, a single person weighing priorities based on their knowledge of the market and pain points is more effective than group voting. In a committee, nobody feels responsible for the final result, so each person does a shallow analysis of surface costs and benefits.
In contrast, an individual feels empowered and really has the incentive to understand the issues deeply and design an effective solution. Absent mitigating factors such as a strong committee chair or shared vision, group design generally result in terrible `"kitchen sink" `__ languages. These languages have an incoherent design, with many features that sort of work, but no particular attraction to any actual users. "Kitchen sink" languages are generally short-lived due the the difficulty of implementing an endless stream of special-case features and maintaining the resulting complex, sprawling codebase. Of course, so long as the power structure is clear, delegation of roles and duties is quite reasonable, e.g. designating a person for data analysis. -* Evidence - Everyone has opinions, but if there's a disagreement, it doesn't help much in making a decision. Although common, "The CEO said so" is not really a good reason to choose a particular design. A survey of users who've actually used a language for a while is probably the highest form of evidence. Automated analysis of source code repositories and developer communications is perhaps a bit better, although the automation introduces its own biases. Experience reports from language designers are also high-quality evidence, although a little less reliable. Mathematical theory and logic stands on their own, I guess I could verify it with Coq or something but generally a proof is a proof. Anecdotal reports I would say are medium-quality, as the plural of anecdote `is `__ data (the "not" version `appeared later `__). All of these require filtering out the opinions from the evidence, and there is some error in evolving and repurposing insights from one language to a new language. Feedback and testing (discussed next) ensure that Stroscot is developing properly, and can confirm hypotheses but may fail at broad design considerations. Absent strong evidence, arguing via toy code examples seems pretty decent, although can suffer from "cherry-picking" meaning that the design may not work in practice for code dissimilar to the examples. Flix suggests evaluating features against a list of principles, but I tried it and generally the principles are too vague or unrelated to be useful. Also, the choice of principles is subject to bias. I would say the biggest principle for Stroscot is inclusivity, because figuring out how to include a feature means the feature must actually be seriously considered, whereas in other languages it is easy to fall into the trap of "don't know, don't care". +* Evidence - Everyone has opinions, but if there's a disagreement, opinions don't help much in making a decision. Although common, "The CEO said so" is not really a good reason to choose a particular design. I would rank evidence as follows: + + * Mathematical theory and logic stand on their own, I guess I could verify it with Coq or something but generally a proof is a proof. + * Semi-automated analysis of source code repositories and developer communications, with manual inspection/validation of the results + * A survey of users who've actually used a language for a while. + * Experience reports from language designers are also high-quality evidence. There is some error in evolving and repurposing insights from one language to a new language. + * Anecdotal reports I would say are medium-quality, as the plural of anecdote `is `__ data (the "not" version `appeared later `__). 
It requires filtering out the opinions - what we want are claims, supported or unsupported, rather than simply "I don't like it". + * Testing / prototyping can confirm hypotheses but may fail at identifying broad design considerations. + * Arguing via toy code examples seems pretty decent, although can suffer from "cherry-picking" meaning that the design may not work in practice for code dissimilar to the examples. + * Flix suggests evaluating features against a list of principles, but I tried it and generally the principles are too vague or unrelated to be useful. Also, the choice of principles is subject to bias. I would say the biggest principle for Stroscot is inclusivity, because figuring out how to include a feature means the feature must actually be seriously considered, whereas in other languages it is easy to fall into the trap of "don't know, don't care". + * Feedback - It is quite useful to get feedback from potential users and other, early and often. Feedback, unlike the designer, is not impacted by project history or the designer's preconceptions. The `Pontiac Aztek `__ checked all the boxes regarding functionality, and had the highest customer satisfaction ratings for those who drove it, but every time the focus groups looked at it, they said "it's the ugliest car in the world and we wouldn't take it as a gift". Per `Bob Lutz `__, managers at GM ignored the focus groups, and the Aztek was a flop - barely anybody bought it, because it was indeed too ugly (although it did develop a cult following). However, just showing around a design and asking "what do you think?" has several problems. First, people's opinions change as they are exposed more - maybe their gut reaction is that they hate it, but if they spend an hour trying it out, they'd love it. The solution is measure, measure, measure - for example, an initial poll and a poll after a tutorial. Another useful trick is limiting the stimulus to what is under study - if syntax is not relevant, don't present any syntax, and then the discussion will naturally focus on semantics. If the "feel" of the language is being discussed, present a collage of concepts. Second, unstructured responses usually answer the wrong question - what matters is estimating how the design impacts certain business objectives and success criteria, but maybe the interviewee will spend half an hour discussing a tangent. This can be addressed by structuring and timeboxing the feedback with a rubric, and perhaps explaining some background with a video. Of course, qualitative feedback is most important, so the questions should still be open-ended. It is also best to speak to interviewees individually, rather than in a group, so that their opinions do not influence or dominate each other. Individual discussion is more likely to present a balanced opinion, whereas groups can pile on negative feedback. OTOH, a group does enunciate the overall consensus more clearly, and e.g. submitting to HN is a convenient way of getting group but not individual feedback, unles a survey link or similar is included. * Testing - When qualitative considerations are silent, decisions must be made on quantitative grounds. The standard for websites is A/B testing: allocate some traffic to version A, and some to version B, and measure metrics such as average time to completion of task. A little more complex is a stochastic k-armed bandit test with Thompson sampling, which allows testing arbitrarily variants and also automatically reduces testing of poor-performing variants. 
We can do this for a language too, with some difficulty: get a random ID from the service, randomize choices, measure metrics in the compiler, report back, have a privacy policy and ensure GPDR complance, require the ID so as to generate customized documentation, and voila. Given that the audience is programmers it probably makes sense to allow overriding the arm selection. @@ -361,7 +375,7 @@ So that's for the initial implementation. Now in Facebook's case, they already h So overall, what does performance mean for language design? Mainly that the language should have both high-level and low-level constructs, and the compiler should have a lot of optimization smarts. But these are already expressed in the "don't surrender the adequate representation of a single datum of experience" and "tracking the progress of technology" principles. Hence, as a design principle, performance is simply not that important a consideration. As long as Stroscot implements the best optimizations, it will naturally meet or beat the performance of C compilers on C-like programs, and similarly for other language styles. For cases where there is no strong existing referent, like logic programming, the simple presence of the feature to begin with is the attraction, and performance is secondary as well. Google has done some work on build speeds; the main performance-focused feature there will be fine-grained incremental compilation to reduce compile times. -But secondary does not mean unimportant. It is certainly worth fostering a "performance culture", with performance tests, profiling tools, and so on, for both the compiler and libraries. Still, in the near term, the project is still in its "rapid prototyping" phase and hence the compiler's potential performance is not a consideration. +But secondary does not mean unimportant. It is certainly worth fostering a "performance culture", with performance tests, profiling tools, and so on, for both the compiler and libraries. Still, in the near term, the project is still in its "rapid prototyping" phase and hence the compiler's performance is not a consideration. The potential for performance is though, e.g. optimal register allocation is considered a "feature" rather than a "premature optimization" hence will be implemented without concern for its potential performance impact (runtime speedup or compiler slowdown). Goals ===== @@ -514,7 +528,7 @@ PyPL index (top 28) * most portable/widespread language. runs on just about every piece of silicon (although some require specialized compilers) * language of most OS's, hence used for FFI stuff * statically compiled, compilers are very efficient. -* unsafe pointers, common to see memory corruption and security vulnerabilities. valgrind, smart fuzzing, and static analysis have allowed catching these with great difficulty. Also there is the Boehm GC, used by many people who don't want to deal with memory management. +* difficult to work with - unsafe pointers, common to see memory corruption and security vulnerabilities. valgrind, smart fuzzing, and static analysis have allowed catching these with great difficulty. Also there is the Boehm GC, used by many people who don't want to deal with memory management. * header files slow down compilation as they have to be read many times during compilation * paradigm: imperative @@ -572,6 +586,8 @@ PyPL index (top 28) * ownership model/borrow checker has been found difficult to use by several studies (`1 `__, `2 `__, `https://dl.acm.org/doi/pdf/10.1145/3510003.3510107`__). 
Also it is incomplete - can't even write linked lists without `endless pain `__. In practice Rust programmers `end up `__ using reference counting or GC to ensure memory safety in complex cases * concurrency safe, but async suffers from "borrow checker"-itis and uses atomic reference counting * has claimed to be "blazing fast" since `2014 `__. But we see in `the one test of this claim on Google `__ that C is faster. +* learning experience circa 2015 was "rough" +* compatibility pendulum has swung towards "too stable" - many changes that "should" be made for a better language that can't be * paradigm: imperative 13. Kotlin @@ -749,7 +765,9 @@ TIOBE Next 22 34. D -* C/C++ style but different. Never really took off AFAICT. +* C/C++ style but different. +* Never really took off AFAICT - rift caused by the differences between D version 1 and 2, a general lack of learning resources and packages +* garbage collected * many features that have been incorporated into C++, others that haven't like scope guards * paradigm: imperative diff --git a/docs/Commentary/Code-Generation.rst b/docs/Commentary/Code-Generation.rst index c578124..30f8d81 100644 --- a/docs/Commentary/Code-Generation.rst +++ b/docs/Commentary/Code-Generation.rst @@ -7,6 +7,10 @@ Code generation is NP-hard and affects performance a lot. There are three main o #. Register allocation - Spilling, Rematerialization #. Instruction scheduling - Code motion, Block ordering +Generally it is required to write a native code generator, linker, C standard library, and so on. Zig and Roc are doing it, learn from them. It is important to ensure that such bespoke components outperform the established alternatives, and are state of the art. Few developers actually work in this area so it is not hard to beat them in terms of design. But actually getting it to work takes years. In contrast interpreters/bytecode generators are really easy to implement, focus on them first. + +It can be tempting to focus heavily on providing a fast implementation, such as a fast and memory efficient compiler, and one can easily spend months working on this. But it should only be done after the compiler is well-tested and the standard library is near-complete. Then you can start benchmarking as 20%-30% of effort. + Constrained optimization ======================== @@ -24,91 +28,41 @@ Although optimality may not be often used in industrial code, there's at least o Optimizations ============= -calling convention/ABI - accumulate arguments on the stack vs popping them - omit frame pointer - tail recursive calls - scalar replacement of aggregates - removal of unused parameters - replacement of parameters passed by reference by parameters passed by value. - omitting unexported functions - automatic constructor/destructor implementation - null pointer dereference behavior - Emit function prologues only before parts of the function that need it, rather than at the top of the function. - alignment - functions, loops, jumps, labels + +Calling convention/ABI + +* accumulate arguments on the stack vs popping them +* omit frame pointer +* tail recursive calls +* scalar replacement of aggregates +* omitting unexported functions +* automatic constructor/destructor implementation +* null pointer dereference behavior +* Emit function prologues only before parts of the function that need it, rather than at the top of the function. 
+* alignment - functions, loops, jumps, labels + +register allocation + +* swing modulo scheduling +* allow allocating "wide" types non-consecutively +* Graph coloring algorithm: Chow's priority, Chaitin-Briggs +* region size: loop, function +* rematerialization +* use registers that are clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls. +* Use caller save registers if they are not used by any called function. +* rename registers to avoid false dependencies + instruction scheduling - reorder instructions - region size: basic blocks, superblocks - schedule instructions of the same type together + +* reorder instructions +* region size: basic blocks, superblocks +* schedule instructions of the same type together + instruction selection - combining instructions, e.g. fused multiply-add, decrement and branch - transformation of conditional jumps to branchless code (arithmetic and conditional moves) - remove redundant instructions -register allocation - swing modulo scheduling - allow allocating "wide" types non-consecutively - Graph coloring algorithm: Chow's priority, Chaitin-Briggs - region size: loop, function - rematerialization - use registers that are clobbered by function calls, by emitting extra instructions to save and restore the registers around such calls. - Use caller save registers if they are not used by any called function. - rename registers to avoid false dependencies -optimizations - replace standard functions with faster alternatives when possible - inlining - deduplication of constants, functions, code sequences (tail merging / cross jumping) - common subexpression elimination (CSE) - dead code/store eliminate (DCE/DSE) - conditional dead code elimination (DCE) for calls to built-in functions that may set errno but are otherwise free of side effects - global constant and copy propagation - constant propagation - which values/bits of values passed to functions are constants, function cloning - value range propagation - like constant propagation but value ranges - sparse conditional constant propagation (CCP), including bit-level - elimination of always true/false conditions - move loads/stores outside loops - loop unrolling/peeling - loop exit test - cross-jumping transformation - constant folding - specializing call dispatch (possible targets, likely targets, test/branch) - Code hoisting - evaluate guaranteed-evaluated expressions as early as possible - copy propagation - eliminate unnecessary copy operations - Discover which variables escape - partial/full redundancy elimination (PRE/FRE) - modified/referenced memory analysis, points-to analysis, aliasing - strips sign operations if the sign of a value never matters - convert initialization in switch to initialization from a scalar array - termination checking - loop nest optimizer based on the Pluto optimization algorithms. It calculates a loop structure optimized for data-locality and parallelism. 
- graphite - loop distribution, loop interchange, unroll, jam, peel, split, unswitch, parallelize, copy variables, inline to use first iteration values, predictive commoning, prefetch - final value replacement - loop to calculation using initial value and number of loop iterations - explode structures to scalars in registers - vectorization - loop vectorization, basic block vectorization, cost free (for debugging), likely faster, or code size - reorder blocks, duplicate blocks, partition into hot/cold to improve paging and cache locality - specialization of division operations using knowledge of the denominator - -profiling: - generate approximate profile of new/modified code, guessing using heuristics - cold functions, functions executed once, loopless functions, min/max/average number of loop iterations, branch probabilities, values of expressions in program, order of first execution of functions - AutoFDO profile https://perf.wiki.kernel.org/ - -live-patching: depending on optimizations, all callers maybe impacted, therefore need to be patched as well. - - -floating-point variables - register or memory. - on machines such as 68881 and x86, the floating registers keep excess precision. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. - fast: allow higher precision / formula transformations if that would result in faster code. it is unpredictable when rounding to the IEEE types takes place and NaNs, signed zero, and infinities are assumed to not occur. - standard: follow the rules specified in ISO C99 or C++; both casts and assignments cause values to be rounded to their semantic types - strict: rounding occurs after each operation, no transformations - exception handling, mode handling - -Magic numbers: - search space sizes - Increasing values mean more aggressive optimization, making the compilation time increase, but with diminishing improvement in runtime execution time. Generally a formula producing a boolean "try optimization" answer or an integer "maximum number of possibilities to consider". - memory limit - If more memory than specified is required, the optimization is not done. - analysis skipping - ignore objects larger than some size - ratios - if inlining grows code by more than this, cancel inlining. tends to be overly conservative on small functions which can increase by 300%. +* combining instructions, e.g. fused multiply-add, decrement and branch +* transformation of conditional jumps to branchless code (arithmetic and conditional moves) +* remove redundant instructions Register allocation =================== diff --git a/docs/Commentary/Compiler-Library.rst b/docs/Commentary/Compiler-Library.rst index cfef956..7157cb8 100644 --- a/docs/Commentary/Compiler-Library.rst +++ b/docs/Commentary/Compiler-Library.rst @@ -1,7 +1,9 @@ Compiler library ################ -For the language to be useful it must have an implementation, and ideally a self-hosting implementation. In this framework the compiler, and hence the language, are really just another set of library modules. I will unimaginatively call this library the "compiler library". Per the standard library evolutionary design process, the compiler library will likely serve as the prototype for the standard library, so some investment is worthwhile. +For the language to be useful it must have an implementation, and ideally a self-hosting implementation. 
In this framework the compiler, and hence the language, are really just another set of library modules. I will unimaginatively call this library the "compiler library". Per the standard library evolutionary design process, the compiler library will likely serve as the prototype for the standard library, so some investment is worthwhile. Writing a full-featured standard library is important for testing the compiler and getting the "feel" of the language. + +It is particularly important to have regression tests from the beginning, to catch near-infinite loop bugs and so on. Synthesizing ============ @@ -198,6 +200,15 @@ For fixed-precision integers and floating point, the operations work in stages: Commonly, the rounding is considered part of the operation, and the rounding mode is just fixed to some ambient default, but this is not optimal with respect to performance. Herbie provides a different approach. Given a real-valued expression and assumptions on the inputs, Herbie produces a list of equivalent computations, and computes their speed and accuracy for various choices of machine types and rounding. The programmer can then choose among these implementations, selecting the fastest, the most accurate, or some trade-off of speed and precision. The question is then how to expose this functionality in the language. The obvious choice is to make the rounding operation explicit. In interpreted mode arbitrary-precision is used, at least to the precision of the rounding, and in compiled mode Herbie is used. Or something like that. +floating-point variables + +* register or memory. +* on machines such as 68881 and x86, the floating registers keep excess precision. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. +* fast: allow higher precision / formula transformations if that would result in faster code. it is unpredictable when rounding to the IEEE types takes place and NaNs, signed zero, and infinities are assumed to not occur. +* standard: follow the rules specified in ISO C99 or C++; both casts and assignments cause values to be rounded to their semantic types +* strict: rounding occurs after each operation, no transformations +* exception handling, mode handling + Matrix multiplication ===================== @@ -663,3 +674,225 @@ Function pipelines Haskell has function composition ``(.)`` and Julia has the "pipe" operator ``(|>)``. According to `YSAGuide `__ pipelines like ``a . b . c`` are bad style and one should instead use intermediate results, ``\x -> { a1 = a x; b1 = b a1; c1 = c b1; return b1 }``, except with better named variables than ``x,a1,b1,c1``. The reason given is that debugging composed functions in the REPL is hard and clutters stacktraces. This sounds like a debugger problem - function pipelines are shorter and easier to read. + +Values +====== + +These are derived values, so they don't belong on the Values page. + +Lists +----- + +A list is a function ``Nat -> Value|EndOfList``, with the proviso that if ``list i = EndOfList`` then ``list (i+k) = EndOfList`` for all ``k>=0``. Basic list syntax is the usual syntactic sugar for list values. + +:: + + [] // empty list + arr = [a, b, c] + +Heterogeneous lists are possible, ``[1,"x",int32 3]``. + +In Stroscot tuple is synonymous with list. + + +Arrays +------ + +An indexing scheme is a function ``IndexValue -> [0..length]``. An array is a term ``array scheme list``. A typed array also specifies an element type, ``typed_array type scheme list``. 
Tensors are arrays that use compound values as indexes. Sparse arrays use indexing schemes that map many indices to one value. + +Strings +------- + +A string is a list of bytes of a given length. Subtypes include C null-terminated strings, UTF-8 encoded strings, and characters (UTF-8 encoded strings containing exactly one grapheme cluster). Also filenames and paths. + +Bitvectors +---------- + +A bitvector is the symbol ``bits`` applied to a list of bits, ``bits [1,0,1]``. + +Date/time +========= + +Various types of complete and incomplete date/time values. C.f. ISO 8601, `Wikipedia `__ + +* Timescale: TAI, TT, GPS, TCB, TCG, TDB, UT1, TST, LTST, UTC ``Z``, utc offset ``±hh:mm``, civil timezone, smeared UTC (3+ variants) +* Instants: + + * Named instants: Rata Die, Unix epoch, GPS epoch, ... (absolute or dependent on timescale) + * Offset time: offset, unit, instant + * complete date-time ``±YYYYY-MM-DDThh:mm:ss.ss`` + timescale + * time record: a collection of chronological key-value information sufficient to specify an instant (includes timescale). A date may be specified based on a calendar such as proleptic Gregorian or the many other calendars. + +* duration: a time difference as a collection of time fields (including a timescale) +* partial date: a collection of time fields (including timescale) that has the semantics of replacing or overriding the fields of an instant with those of the partial date +* recurrence: RRule as documented in the iCalendar RFC. (https://dateutil.readthedocs.io/en/stable/rrule.html) + +Records +======= + +A record is an ordered list of key-value pairs, ``record {a = 1, b = 2, c = 3}``. + +Maps +---- + +A map is an unordered set of key-value pairs, with the restriction that each key has one value. ``map {a = 1, b = 2, c = 3}`` + +A table is a map from tuples to values. + +Multimap +-------- + +A multimap is an unordered bag of key-value pairs. ``multimap {a = 1, a = 1, b = 2, c = 2, c = 3}`` + +Bags +==== + +A bag is a function ``numInstances : Any -> Nat``. + +Pointers +======== + +Pointers are just a wrapper for particular bit patterns (integers), like ``pointer 0xdeadbeef``. You can do integer arithmetic and turn it into a pointer, but at least on x86-64 not all 64-bit integers are valid pointers. + +References +========== + +References are a term ``ref r123``, where ``r123`` is a symbol. Most symbols are autogenerated inside the reference creation operation ``ref``, but there is no technical restriction on which symbols are allowed. + +Numbers +======= + +A decimal number consists of a sign and a list of digits, a decimal marker, and more digits. A floating-point number is similar but adds a base and a separately specified exponent. A rational number is a ratio of two integers. A complex number is a real and an imaginary component, or a polar angle and magnitude, in both cases real numbers. + +Mathematical structures +======================= + +A polynomial is a list or array of terms, where each term is the product of a number (coefficient) and zero or more variables, each raised to a power. A truncated power series is a polynomial in a single variable, starting from a constant and with terms increasing up to some power. A generating function is an infinite (formal) power series. + +An equation is a term where the top term is ``=`` and the left and right are mathematical expressions. A system of equations is a list of equations with ``and`` applied.
A differential equation is an equation including partial derivatives or derivatives with respect to time. + +A probability distribution is a function from a sigma-algebra to a real number in [0,1]. + +A special function is a function over the real or complex numbers, usually taking one or two arguments and producing one output. Modular forms are special functions. + +Fields/rings/ideals are generally represented as simple symbols or terms. + +Piecewise/interpolated functions are terms that contain the data as a list. + +An optimization problem is a set (the search space, usually R^n or Z^n or so on with some constraints) together with an objective function mapping each possibility to a real number, that is either minimized or maximized. + +Intervals +========= + +An interval has a start value, an end value, and two flags for whether each side is open or closed. There is also a "ball" representation as the middle and distance to each side. The values can be numbers but also other things like date/times. + +Units +===== + +Units are symbols or terms. Quantities are terms ``quantity number unit``. See the Units page. + +Multimedia +========== + +I guess we use SVG as the main format for graphics, plots, and geometric objects. Raster images are just arrays of pixel colors. Similarly audio files and time-series data are arrays of samples. + +Graphs +====== + +Directed graphs are a set of vertices and a set of (vertex,vertex) pairs. Undirected graphs are the same but with the proviso that if (a,b) is present then (b,a) must be as well. + +Chemicals +========= + +There are several ways of depicting these: + +* IUPAC name / CAS number - string/symbol +* `SMILES `__ - a string/term +* Chemical markup language - XML +* An annotated graph. Each bond (edge) has metadata, and the vertex stores 3D coordinates. + +Databases +========= + +A database connection is a term with file handle data and other stuff. A database query is a term too. A database result is a list or a term or whatever. + +Regular expression +================== + +A regular expression is a term with symbols for each formalism, like "or", concatenation, capture groups, quantifiers, the wildcard character, character classes, and so on. + +Finite-state machine +==================== + +An FSM is represented as some rewrite rules; each state/input rewrites to some new state. + +GIS +=== + +A geographic point has a geographic reference and 3 coordinates. Files may store vector and raster data. Raster is just an array with point information. Vector data may be points, lines, multi-lines (strings), or polygons. Topology information encodes topological connections between different shapes. + +Quantum computing +================= + +Similarly to how a classical computation maps a RAM of n bits to another RAM of n bits, a quantum computation (gate) over n qubits is a unitary matrix of size 2^n by 2^n. A more structured representation involves forming quantum circuits out of qubit wires and quantum gates. There are also measure and set operations to interact with classical computations. + +Logical propositions +==================== + +These are just formulas, terms and binders built using the logical connectives (and, or, not, forall, exists, etc.). With sequent calculus we also allow the turnstile as a symbol. + +Infinite values +=============== + +A potentially-infinite expression is a function ``f : Nat -> Value`` that is Cauchy (under the term distance metric).
Such expressions are considered equivalent to any other Cauchy sequence that is distance 0 away in the limit. There is a canonicalization or evaluation operation on potentially-infinite expressions; it either resolves the expression to a finite term, or proves that the term is not finite and resolves the expression to a canonical "simplest form" Cauchy sequence, or else errors. + +Generally such expressions are generated by evaluating with "fuel". For example, computing ``let fib x y = x:(fib y (x+y)) in fib 0 1`` with progressively more fuel generates the sequence ``0:*, 0:1:*, 0:1:1:*, 0:1:1:2:*``, etc. (where ``*`` is a long expression where evaluation ran out of fuel). This is Cauchy and not equivalent to any finite term, so it will be canonicalized as a "simplest form" Cauchy sequence, maybe as something like ``infiniteValue $ \n -> let fib x y = x:(fib y (x+y)) in take n (fib 0 1)``. + +One well-behaved subset of infinite values is the rational infinite values, which can be specified as the solution of a system of equations, for example ``let x=1:x in x``. These can be verified to be in normal form by ensuring no reduction in any equation. + +Programming with infinite data structures is simple, natural, and practical. But it is more complex than finite data structures. Computers do not have an infinite amount of memory, so these structures are not physically represented in memory, but rather symbolically. It is generally slower to access individual elements, and the execution engine may exhibit high resource usage if the computation is complex. Overall, performance is more unpredictable compared to a finite data structure. Some computations may not work at all. A simple rule of thumb is "demand" - the execution engine analyzes which parts of the structure are required, and generally a computation will work well if only a finite portion needs to be evaluated. But there are also tricks to evaluate computations with an infinite demand. If nothing works, the execution engine fails. But there are many tricks implemented - generally, if you can do it by hand the computer can do it too. + +Take for example an infinite list like the natural numbers starting with 1. We can try some operations: + +- when we print this infinite list, it is going to exceed our maximum output. So the right thing is to summarize the data - something like "[1,2,3,...]", where the ... indicates output was omitted. There are many heuristics to use for summarizing output. +- summing the infinite list, this computes the limit as the list goes to infinity, so it is infinity. +- filtering the even numbers from an infinite list: this gives us the infinite list of all the even numbers. +- filtering all the numbers less than 100: this again requires computing a limit, should be simple. +- take - taking a certain number of elements of the list + +timeouts - is it better to just churn through a computation and let the user do Ctrl-C, or to enforce some limit like 5s per command? If the user is running a long computation they will be very annoyed at an arbitrary timeout. What is more useful is "progress monitoring" - detecting "hangs" where evaluation is definitely not advancing productively. + +Infinite values separate control and data - first you construct an infinite value (data), then you decide what to do with it (control).
In many languages, if you wanted to generate a list of 10 primes, and a list of 100 primes, you'd need to write a function to generate a finite list, and call it twice with a different amount of stuff to generate. But with more complicated usages of infinite data structures, it's not that simple - for example, if you want the first ten even numbers, with an infinite list it's just taking elements from the filtered list, but without, you would have to back-propagate the demand to generate 1 through 20. In general there is no clear way to do it up front by allocating a finite list. + +Numbers +======= + +Normal numbers are simply terms, but a real number is a function ``ω+1 -> {-1,+1}`` excluding certain trivial sequences (c.f. `surreal numbers `__ S_ω "the field of real numbers" and sign expansion representation), and similarly p-adic numbers need special handling as the different metric completion of the rational numbers. + +Modules +======= + +Modules are a set of rewriting rules together with exported and hidden symbols. These symbols are qualified to the module's value, in something of a recursive knot. The knot can be resolved by expanding the module out as an infinite term. + +Qualified symbols +================= + +A qualified symbol or identifier is a term consisting of a bare symbol together with a module. A literal looks like ``qsym sym123``. Normally you use the period operator ``.`` and write ``module.sym``, but the period operator is a function so this is an expression not a value. Examples of qualified symbols include the predefined symbols of the prelude, like ``null``, ``true``, and ``false`` - the latter two forming the boolean type. + +Portable operations +=================== + +Across platforms, hardware assembly instructions vary and many will likely not exist. So we need to abstract this. The API structure is that we have "native" modules providing direct hardware instructions, which give compile-time errors on attempts to use it on non-native platforms. Then on top we have a portable library that provides a cross-platform interface using switch statements like ``case platform of A -> implA; B -> implB; ...``. Most hardware operations are ubiquitous, so it makes sense to expose them directly as portable operations. Actually, it is hard to think of a reason not to expose a hardware primitive - probably every assembly instruction for our supported platforms should be exposed as a portable operation. + +Addition wraps on all 64-bit hardware in the same way so only needs one portable operation. Other instructions like division have differing behavior, so we can provide 0-returning (ARM native) and ``DivideByZero`` exception-throwing (Intel native) division as portable operations. There is also the intersection of these functions with signature ``Int -> Int\{0} -> Int``, which is useful when we want zero-overhead on multiple platforms. Ideally the compiler will be able to prove that the divisor is non-zero and so both of the 0-returning/exception-throwing versions will compile to one instruction on both ARM and x86. + +Then we also want portable APIs at a higher level, such as the limb-multiply routine in libGMP. + +Supporting different platforms (Linux, macOS, Windows, etc) is hard: + +* certain things aren't portable (e.g. the assembly used for switching thread stacks) +* differences between Linux distributions - llvm-sys doesn't compile on Alpine Linux, and so Inko doesn't support Alpine Linux for the time being. 
+* macOS ARM64 runners cost $0.16 per minute, and isn't available for forks / third-party contributors +* Some platforms aren't supported on CI runners at all, like FreeBSD, so you need to use qemu or similar software to run FreeBSD in a VM + +Recommendation: err on the side of not supporting platforms, and document this, instead of sort-of-but-not-quite supporting it. \ No newline at end of file diff --git a/docs/Commentary/Compiler.rst b/docs/Commentary/Compiler.rst index 1edc806..7f71c0f 100644 --- a/docs/Commentary/Compiler.rst +++ b/docs/Commentary/Compiler.rst @@ -72,7 +72,7 @@ writeFile (compileToExecutable main) where it's clear that compileToExecutable is doing the work. -the nanopass framework is pretty +the nanopass framework is pretty Model ===== @@ -192,6 +192,8 @@ E.g. overloading/dispatch can be implemented in a variety of ways, specialized f Profile-guided optimization is an effective solution to this lack of information: we instrument a binary with counters for the various questions we might ask, and generate a profile with the answers. We might need to run a binary several different times to get good coverage so we also need a way to combine profiles together, i.e. profiles form a commutative monoid. Profiles themselves introduce a "Heisenbug" problem: we cannot measure the detailed performance of an unprofiled program, and turning profiling off may change the performance significantly. The solution is to build with profiling support for almost all of the compilation pipeline. We should only omit profiling instructions for non-profiled builds at the assembly level. And if we use hardware-assisted sampling profiling then we don't even need profiling instructions, in many cases, so profiling can simply be always enabled. Still, if we are using profile information all the time and making major decisions based on it, it is important to be mostly accurate even on the initial run, so a good approximation is also key. (TODO: approximation of profiles is probably a whole research area, explore) +Direct Method Resolution: Optimizing method calls to assembly jumps to specific addresses during execution + Optimization variables ====================== @@ -220,8 +222,19 @@ build stage 2 compiler with the stage 1 compiler using the stage 1 package datab Complex bootstrap ================= -Actually bootstrapping is more complex. The compiler is really two components, an interpreter and a specializer. The input program can take arguments. The interpreter can take arguments (dialects, libraries). The specializer can take arguments (bytecode, optimization instructions, plugins). The output program can take arguments (compiled objects, runtime components such as libc or a garbage collector). All of these arguments and options aren't handled easily. +Software is bootstrappable when it does not depend on a binary seed, i.e. a seed that is not built from source. The “trusting trust” attack is only a symptom of an incomplete or missing bootstrap story - if every program is built from source, the attack is impossible. In practice, every software needs some bootstrap binaries, but the number and size of binary seeds should be a bare minimum. + +For example Guix uses bootstrap-seeds (hex0 binaries), bootar (extract tar), and a static build of GNU Guile 2.0.9 (for build scripts / utilities). Then it builds gash (Scheme implementation of bash), https://github.com/oriansj/stage0-posix, and GNU Mes. Mes is a mutually self-hosting Scheme interpreter, C compiler, and C runtime library. 
Maybe you don't trust GNU Guile as the bootstrap. You can use "diverse double-compiling" and substitute the Scheme implementation of your choice as the bootstrap host implementation. For example GNU Mes itself. As the build is reproducible and depends minimally on the build host, the resulting GNU Mes should be identical regardless. GNU Mes can thus be regarded as a high-assurance bootstrap seed, that pretty much verifies itself. From GNU Mes, Guix then builds tcc (patched TinyCC), old gzip/make/patch, gcc 2.95 + GNU tools, gcc 4.9.4 + GNU tools, and finally modern gcc and the rest of the software stack. + +So that is interesting and all, but how do we bootstrap Stroscot? Building a "self-hosted" compiler is a real challenge. You need to maintain at least two compilers (one to bootstrap your self-hosted compiler, and the self-hosted compiler itself). There is really a combination of strategies: + +* Chaining a prior build - we see from the gcc build that chaining prior builds is a valid strategy whenever there is a fundamental change in the build requirements / compiler language (such as GCC changing from C to C++). In fact it is technically valid to use the "natural bootstrap process" - build each commit from the version of the previous commit, down to the initial bootstrap. But it is a bit slow - to reproduce a build at commit N you have to build roughly N binaries. It is also fragile - what do you do with a commit that breaks the build? It is better to have a manually-specified custom chain. It is important to specify the bootstrap chain within the compiler repo, directly or as a commit hash of a different repo, so you don't run into git bisect issues like "I checked out an old commit but it uses a different bootstrap process so it doesn't build". + +* Seed compiler code - We can generate lower-level code from the source code, such as C, Java, Haskell, WASM, or a custom bytecode. The code can be generated automatically from the main compiler's source, as a backend target, but it is not clear if this is sufficiently verifiable - I guess it depends on how readable the code is and whether it can be matched efficiently with the original code. For example, much of the code is devoted to optimizing, backends, error messages, caching, and the language server, which is not necessary for bootstrapping. It is also possible to write this seed compiler code by hand, but then you have to maintain two compilers. + +* Seed interpreter/VM - Bootstrapping from machine code with Hex0 is possible but it makes a lot more sense for portability and sanity to use a higher-level language as the initial seed. We could use GNU Mes, GCC, the JVM, WASM, Haskell, etc. as the seed language. The key is that the interpreter/VM can process the seed compiler output. It does not need to be particularly optimized, it just has to bootstrap an initial self-hosted version - e.g. it most likely does not have to free memory. Practically it will be a recent self-hosted optimized build that is used as the final step of the chain, for git bisect etc. +Actually bootstrapping is more complex. The compiler is really two components, an interpreter and a specializer. The input program can take arguments. The interpreter can take arguments (dialects, libraries). The specializer can take arguments (bytecode, optimization instructions, plugins). The output program can take arguments (compiled objects, runtime components such as libc or a garbage collector). All of these arguments and options aren't handled easily.
Like platforms, probably it is easiest to bootstrap x86 first and then build other platforms by cross-compiling. Compile-time code execution =========================== @@ -369,3 +382,4 @@ Functions generally assume a fixed set of types and a fixed memory representatio State is also an issue because the memory manager must be aware of the local state of a piece that reloaded and avoid leaking memory. In the case of handles such as an OpenGL context the desirable behavior is to transfer them over to the new code, but if the initialization code is changed then the handle should instead be closed and re-initialized. So we see some sort of incremental program execution going on. +live-patching: depending on optimizations, all callers may be impacted, therefore need to be patched as well. diff --git a/docs/Commentary/Concurrency.rst b/docs/Commentary/Concurrency.rst index 3e3a7fb..360d454 100644 --- a/docs/Commentary/Concurrency.rst +++ b/docs/Commentary/Concurrency.rst @@ -142,3 +142,7 @@ But still, for complex data science computations we might want automatic paralle Haskell's "par" is interesting, but too fine-grained to be efficient. You have to manually add in a depth threshold and manually optimize it. It's just as clear to use explicit fork/join operations, and indeed the ``rpar/rpar/rseq/rseq`` pattern proposed in `the Parallel Haskell book `__ is just fork/join with different naming. As far as the actual task granularity, Cliff Click says the break-even point is somewhere around the middle of the microsecond range, thousands of cycles / machine code instructions. Below that the overhead for forking the task exceeds the speedup from parallelism, but above you can make useful progress in another thread. + +https://en.wikipedia.org/wiki/Communicating_sequential_processes + + diff --git a/docs/Commentary/Dispatch.rst b/docs/Commentary/Dispatch.rst index 38c9497..ec547c6 100644 --- a/docs/Commentary/Dispatch.rst +++ b/docs/Commentary/Dispatch.rst @@ -266,6 +266,7 @@ This depends on the ``lub`` primitive defined in `Conal's post `__, it is better to use destructors to release the resource, rather than using finally blocks. It is quite natural that acquiring a resource returns a resource handle with an accompanying destructor for that handle, whereas there is not necessarily a natural point to place a finally clause. For example, with finally, the following are problematic: + +* Returning a handle - the caller must remember to use finally to free the returned handle +* Overlapping handles - suppose we want Alloc-A Alloc-B Free-A Free-B. finally requires nesting so we will have to do some complex layout, like ``try { try { a = acquire; b = acquire } finally { release a } } finally { release b }``. It is unnatural and prone to error. + +In realistic systems, there are far more resource acquisitions (and destructions) than kinds of resources. Comparing the two, we see that ``a = acquireAndCreateDestructor; use a`` is significantly simpler and less code than ``try { a := acquire; use a } finally { release a }``. Therefore, Stroscot follows C++ in using RAII / destructors rather than having finally.
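To make the destructor argument concrete, here is a rough C++ sketch of the two problematic cases from the list above - returning a handle and overlapping handles. The ``acquire``/``release`` functions and the ``Handle`` wrapper are invented names for illustration, not an existing API; they just spell out the RAII pattern::

   #include <cstdio>

   // Hypothetical resource API standing in for "acquire"/"release" above.
   int acquire(const char* name) { std::printf("Alloc-%s\n", name); return 42; }
   void release(const char* name, int) { std::printf("Free-%s\n", name); }

   // RAII wrapper: acquiring the resource also installs its destructor.
   class Handle {
   public:
     explicit Handle(const char* name) : name_(name), h_(acquire(name)) {}
     ~Handle() { close(); }
     Handle(Handle&& o) noexcept : name_(o.name_), h_(o.h_), live_(o.live_) { o.live_ = false; }
     Handle(const Handle&) = delete;
     Handle& operator=(const Handle&) = delete;
     // Early release, for overlapping lifetimes like Alloc-A Alloc-B Free-A Free-B.
     void close() { if (live_) { release(name_, h_); live_ = false; } }
   private:
     const char* name_;
     int h_;
     bool live_ = true;
   };

   // Returning a handle needs no finally; the destructor travels with the value.
   Handle open_a() { return Handle("A"); }

   int main() {
     Handle a = open_a(); // Alloc-A
     Handle b("B");       // Alloc-B
     a.close();           // Free-A, giving the Alloc-A Alloc-B Free-A Free-B order
     // ... keep using b ...
   }                      // Free-B when b goes out of scope

No nesting of cleanup clauses is needed; each acquisition carries its own cleanup, and the awkward orderings fall out of ordinary scoping plus an explicit early ``close``.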
+ no-throw ======== @@ -582,15 +592,16 @@ Callers have to code to handle the exceptions, so they need to know which except There are several warnings that check exception lists: -* unused-exception - an exception or exception set is listed, but there is no way to throw it +* unreachable-exception - an exception or exception set is listed, but there is no way to throw it * unlisted-exception - an exception may be thrown on a given input, but is not contained in the return type -* duplicate-exception - supposing the return type is ``E1|E2|R``, both ``E1|R`` and ``E2|R`` are valid signatures +* duplicate-exception - for example, supposing the return type is ``A|B|C``, ``A`` is duplicate if ``B|C`` also lists all exceptions +* overlapping-exception - for ``A|B``, warns if any exception is in both ``A`` and ``B`` -Sample signature styles (enforced by the compiler where relevant): +Sample signature styles (can be enforced by the compiler with the warnings): -1. ``precise`` - the set of thrown exceptions is listed in the signature. All possible exceptions given the types of the arguments are listed, and no unreachable exceptions are allowed in the list. -2. ``lower`` - a set of definitely thrown exceptions are listed, but other exceptions may be thrown -3. ``upper`` - like precise, all possible exceptions must be listed, but unreachable excpetions may also be listed +1. ``precise`` - the set of thrown exceptions is listed in the signature. All possible exceptions given the types of the arguments are listed (no unlisted exceptions), and no extraneous exceptions are allowed in the list (no unused or overlapping exceptions). +2. ``lower`` - a set of definitely thrown exceptions are listed, but other exceptions may be thrown (no unreachable exceptions; duplicate exceptions only if A is a subset of B|C) +3. ``upper`` - like precise, all possible exceptions must be listed, but unreachable exceptions may also be listed (no unlisted or duplicate exceptions) With ``lower`` it is not possible to say that a function doesn't throw, but with the other two it is. @@ -610,7 +621,7 @@ The ``lower`` style of signature doesn't require any synonyms because exceptions With ``upper`` a synonym style is to define one exception set ``LibraryException`` with all the common exceptions your library throws (overflow, divide by zero, out of memory, etc.) and use that in each signature. It is not too hard to maintain a single exception set for a library. It's a little better than Java's ``throws Exception`` because the exception set is finite, but requires almost as little maintenance as ``lower``. Exceptions that people should care about can be documented by adding them redundantly to the signature, ``DivideByZero|LibraryException``. And exceptions that aren't thrown can be asserted by removing them, e.g. ``LibraryException\DivideByZero``. Application code can use set operations to build a combined set, ``AppException=(Library1Exception|Library2Exception)\(HandledException1|HandledException2)``. -With ``precise``, the style I came up with is to have a built-in compiler function ``exceptions _`` that computes the exception set of each function. Then for the actual signature you can write a self-referential signature ``a : ... -> Int | exceptions a``, if you don't want to make any guarantees about exception behavior, or ``Int | (exceptions a \ SomeException)``, to say that ``SomeException`` is not thrown, or ``Int | (exceptions a | SomeException)``, to say that ``SomeException`` is definitely thrown. 
``exception x`` is somewhat magical is that it knows the rest of the signature and scopes the list of exceptions appropriately, e.g. for the signature ``x : Int -> Int | ExceptionA``, ``exceptions x = ExceptionA``, but for the signature ``x : Bool -> Bool | ExceptionB``, ``exceptions x = ExceptionB``, and similarly in the signature ``x : Int | Bool -> Int | Bool | exceptions x``, ``exceptions x = ExceptionA | ExceptionB``. +With ``precise``, the style I came up with is to have a built-in compiler function ``exceptions _`` that computes the exception set of each function. Then for the actual signature you can write a self-referential signature ``a : ... -> Int | exceptions a``, if you don't want to make any guarantees about exception behavior, or ``Int | (exceptions a \ SomeException)``, to say that ``SomeException`` is not thrown, or ``Int | (exceptions a | SomeException)``, to say that ``SomeException`` is definitely thrown. ``exception x`` is somewhat magical is that it knows the rest of the signature and scopes the list of exceptions appropriately, e.g. for the signature ``x : Int -> Int | ExceptionA``, ``exceptions x = ExceptionA``, but for the signature ``x : Bool -> Bool | ExceptionB``, ``exceptions x = ExceptionB``, and similarly in the signature ``x : Int | Bool -> Int | Bool | exceptions x``, ``exceptions x = ExceptionA | ExceptionB``. Or maybe it is simpler to use ``Exception`` which is just the type of all exceptions. With ``precise`` you can also write a specification without referencing ``exceptions a``. doing a "full list" of all the component exceptions, or a "computed list" writing the set as a computation of child functions. So if ``a`` returns ``Int`` normally and calls ``b`` and ``c`` and catches ``SomeException`` from ``b``, then the computed list would be ``a : Int | (exceptions b \ SomeException) | exceptions c``. Both types of list cost some thought but ensure reliability as every exception is accounted for. A full list ensures that control flow is local because newly thrown exceptions must be caught or added to the list for every method in the chain. A computed list does not list exceptions that propagate through the function, so is less verbose. To newly throw an exception, it only needs to listed where it is thrown and where it is caught. @@ -620,7 +631,6 @@ So with ``lower`` or the self-referential ``precise`` style, no extra work is re The full list style is attractive for small projects, but as Gunnerson says, for large projects this requires too much maintenance and thus decreases productivity and code quality. But there are various viable alternatives, with varying levels of precision. - Java checked exceptions ----------------------- @@ -809,8 +819,9 @@ Interruptible cleanup Interruptible cleanup actions - the interaction of async exceptions and cleanups. A cleanup function which may block and should be interruptible to avoid a long delay in execution. -When closing a file one often wants to flush buffers (fsync). So there are 3 variants of hClose: -* The flush marks a checkpoint, and should retry until complete regardless of interruptions +When closing a file one often wants to flush buffers (fsync). There are at least 3 variants of hClose w.r.t. fsync: + +* The flush marks a checkpoint, and should retry fsync until success regardless of interruptions * The flush is unnecessary, just close the file * The flush is productive but interruptible (EINTR), and should not be retried on interrupt. 
This avoids the situation where the flush takes a long time and the thread is unkillable. Note that it requires two async exceptions to kill the thread, one to enter the cleanup handler and another to interrupt the flush. @@ -837,7 +848,7 @@ On Windows there is `no direct equivalent `__. -* Windows 10 RS5 adds "Special User APCs" (QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC) which are delivered quickly via a kernel-mode APC, but then run as a user APC. The kernel-level API passes in a CONTEXT argument containing the registers from before the APC, like Linux's signal handler, but the documented API doesn't have this info. It may be possible to get it somehow with the documented API, or we can live dangerously and use the kernel-level API. +* Windows 10 RS5 adds "Special User APCs" (QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC, `source `__) which are delivered quickly via a kernel-mode APC, but then run as a user APC. The kernel-level API passes in a CONTEXT argument containing the registers from before the APC, like Linux's signal handler, but the documented API doesn't have this info. It may be possible to get it somehow with the documented API, or we can live dangerously and use the kernel-level API. System calls on Windows are implemented with layers of C in between, so the handler has to ensure the C code completes to ensure proper cleanup. So it walks the stack and overwrites the first user-mode frame with an exception handling information frame, skipping internal Windows stack frames. This functionality is also useful on Linux, if we're using glibc. diff --git a/docs/Commentary/FunctionalLogic.rst b/docs/Commentary/FunctionalLogic.rst index e69de29..fbf606e 100644 --- a/docs/Commentary/FunctionalLogic.rst +++ b/docs/Commentary/FunctionalLogic.rst @@ -0,0 +1,90 @@ +Functional logic +################ + +Curry, Verse, Flix, Oz, ... all functional logic languages. + +Values. A value v is either a variable x or a head-normal form hnf. A variable counts as a value because an expression may evaluate to an as-yet-unknown logical variable. +A head-normal form hnf is a conventional value. +Expressions e include values v and other constructs: + sequencing 𝑒𝑞; e. An "expression or equation" 𝑒𝑞 is either an ordinary (possibly failing) expression e, or an equation v = e; the syntax ensures that equations can only occur to the left of a “; ”. + ∃x. e - existential, introduces logical variable + fail - yield no values + choice e1 || e2 - yield multiple values (e1 followed by e2) + one{e} - if e fails, fails, otherwise returns the first of the values yielded by e + all{e} - reifies choices as a tuple; n choices mean tuple of length n +A program 𝑝 ::= one{e} is a closed expression e (freevars(𝑒) = ∅) taking the first result. If the expression fails, the program fails. + +Rewrite rules: +∃x1 x2 ··· xn. e means ∃x1. ∃x2. ···∃xn. e +x := e1; e2 means ∃x. x = e1; e2 +⟨e1, ···, en⟩ means x1 := e1; ···; xn := en; ⟨x1, ···, xn⟩ where xi are fresh (again skip rebinding values) +e1 = e2 means‡ x := e1; x = e2; x where x is fresh (skip if e1 is a value v and the equation is not the last in a sequence) +if (∃x1 ···xn. e1) then e2 else e3 means (one{(∃x1 ···xn. e1; 𝜆⟨⟩. e2) || (𝜆⟨⟩. e3)})⟨⟩ +Note: In the rules marked with a superscript 𝛼, use 𝛼-conversion to satisfy the side condition + + +A multi-equation pattern match such as +function pat1 = body1 +function pat2 = body2 + +desugars to + +function := λp.((∃x1 ··· xn. p = pat1; body1) || (∃x1 ··· xn. 
p = pat2; body2)) + +Primops and literals: +Hnfs include integer constants and primitive operators +, > +e1 + e2 means add⟨e1, e2⟩ +e1 > e2 means gt⟨e1, e2⟩ +app-add add⟨k1, k2⟩ −→ k3 where 𝑘3 = 𝑘1 + 𝑘2 +app-gt gt⟨k1, k2⟩ −→ k1 if 𝑘1 > 𝑘2 +app-gt-fail gt⟨k1, k2⟩ −→ fail if 𝑘1 ⩽ 𝑘2 +u-lit k1 = k2; e −→ e if 𝑘1 = 𝑘2 + +Lambdas: +A head-normal form hnf includes a lambda 𝜆x. e. +Expressions e include applications v1 v2 +e1 e2 means f := e1; x := e2; f x, where f,x are fresh (skip rebinding values) +𝜆⟨x1, ···, xn⟩. e means 𝜆p. ∃x1 ··· xn. p = ⟨x1, ···, xn⟩; e p fresh, n ⩾ 0 +app-beta𝛼 (𝜆x. e) (v) −→ e{v/x} if 𝑥 ∉ fvs(v) +u-lambda a=b is stuck if a or b is a lambda + +Tuples: +A head-normal form includes a tuple ⟨v1, ···, vn⟩. +app-tup ⟨v0, ···, vn⟩(v) −→ ∃x. x = v; (x = 0; v0) || ··· || (x = n; vn) fresh x ∉ fvs(v, v0, ···, vn) +app-tup-0 ⟨⟩(v) −→ fail +u-tup ⟨v1, ···, vn⟩ = ⟨v′1, ···, v′n⟩; e −→ v1 = v′1; ···; vn = v′n; e +all-choice all{v1 || ··· || vn } −→ ⟨v1, ···, vn⟩ +all-value all{v} −→ ⟨v⟩ +all-fail all{fail} −→ ⟨⟩ + +Failure: +u-fail hnf1 = hnf2; e −→ fail if no unification +u-occurs x = V [ x ]; e −→ fail if V ≠ □ (i.e., all but x=x fail) +fail-elim 𝑋 [ fail] −→ fail +one-fail one{fail} −→ fail +choose-r fail || e −→ e +choose-l e || fail −→ e + +Existential: +exi-elim ∃x. e −→ e if x ∉ fvs(e) +eqn-elim ∃x. 𝑋 [ x = v; e ] −→ 𝑋 [ e ] if x ∉ fvs(𝑋 [ e ]) and v ≠ V [ x ] +exi-float𝛼 𝑋 [ ∃x. e ] −→ ∃x. 𝑋 [ e ] if 𝑥 ∉ fvs(𝑋 ) +exi-swap ∃x. ∃y. e −→ ∃y. ∃x. e + +Equality: +subst 𝑋 [ x = v; e ] −→ (𝑋 {v/x}) [ x = v; e{v/x} ] if v ≠ V [ x ] +hnf-swap hnf = v; e −→ v = hnf ; e +var-swap y = x; e −→ x = y; e if x ≺ y + +Sequences: +seq-swap 𝑒𝑞; x = v; e −→ x = v; 𝑒𝑞; e unless (𝑒𝑞 is y = v′ and y ⪯ x) +val-elim v; e −→ e +seq-assoc (𝑒𝑞; e1); e2 −→ 𝑒𝑞; (e1; e2) +eqn-float v = (𝑒𝑞; e1); e2 −→ 𝑒𝑞; (v = e1; e2) + +Choice: +one-value one{v} −→ v +one-choice one{v || e} −→ v +choose-assoc (e1 || e2) || e3 −→ e1 || (e2 || e3) +choose SX [𝐶𝑋 [ e1 || e2 ] ] −→ SX [𝐶𝑋 [ e1 ] || 𝐶𝑋 [ e2 ] ] + diff --git a/docs/Commentary/IR.rst b/docs/Commentary/IR.rst index a4ab9ba..fc98c34 100644 --- a/docs/Commentary/IR.rst +++ b/docs/Commentary/IR.rst @@ -24,10 +24,59 @@ Unlike Java bytecode, the IR is target-specific. Specific instances of the IR wi For debugging and diagnostic purposes, it is important to define a reverse transformation back from IR to AST, so that error messages can display the relevant code. For example, inlining may expand a macro, and the error in the macro may require showing bits of the IR for context. The AST generated from the IR does not have to be the original AST, it should instead accurately depict the IR. But the IR should capture source locations and other information necessary for good error messages. -Transformations -=============== - -As mentioned in the goals section, the in-memory form of IR should be suitable for all of the various transformations analyses. A `talk `__ by Graydon Hoare on compilers mentions the paper :cite:`allenCatalogueOptimizingTransformations1971`. He says we need 8 optimization passes to get 80% of GCC/LLVM performance: Inline, Unroll (& Vectorize), CSE, DCE, Code Motion, Constant Fold, Peephole. An ideal IR makes these optimizations as cheap as possible. +Optimizations +============= + +As mentioned in the goals section, the in-memory form of IR should be suitable for all of the various transformations and optimizations. 
An ideal IR makes these optimizations very cheap - not free (otherwise the initial IR wouldn't be a faithful representation of the source) but not more than a few operations either. In particular having to do a scan of the entire IR for a transformation is a no-no. + + + +* removal of unused parameters +* replacement of parameters passed by reference by parameters passed by value. +* replace standard functions with faster alternatives when possible +* inlining +* deduplication of constants, functions, code sequences (tail merging / cross jumping) +* common subexpression elimination (CSE) +* dead code/store eliminate (DCE/DSE) +* conditional dead code elimination (DCE) for calls to built-in functions that may set errno but are otherwise free of side effects +* global constant and copy propagation +* constant propagation - which values/bits of values passed to functions are constants, function cloning +* value range propagation - like constant propagation but value ranges +* sparse conditional constant propagation (CCP), including bit-level +* elimination of always true/false conditions +* move loads/stores outside loops +* loop unrolling/peeling +* loop exit test +* cross-jumping transformation +* constant folding +* specializing call dispatch (possible targets, likely targets, test/branch) +* Code hoisting - evaluate guaranteed-evaluated expressions as early as possible +* copy propagation - eliminate unnecessary copy operations +* Discover which variables escape +* partial/full redundancy elimination (PRE/FRE) +* modified/referenced memory analysis, points-to analysis, aliasing +* strips sign operations if the sign of a value never matters +* convert initialization in switch to initialization from a scalar array +* termination checking +* loop nest optimizer based on the Pluto optimization algorithms. It calculates a loop structure optimized for data-locality and parallelism. +* graphite - loop distribution, loop interchange, unroll, jam, peel, split, unswitch, parallelize, copy variables, inline to use first iteration values, predictive commoning, prefetch +* final value replacement - loop to calculation using initial value and number of loop iterations +* explode structures to scalars in registers +* vectorization - loop vectorization, basic block vectorization, cost free (for debugging), likely faster, or code size +* reorder blocks, duplicate blocks, partition into hot/cold to improve paging and cache locality +* specialization of division operations using knowledge of the denominator + +Magic numbers: + +* search space sizes - Increasing values mean more aggressive optimization, making the compilation time increase, but with diminishing improvement in runtime execution time. Generally a formula producing a boolean "try optimization" answer or an integer "maximum number of possibilities to consider". +* memory limit - If more memory than specified is required, the optimization is not done. +* analysis skipping - ignore objects larger than some size +* ratios - if inlining grows code by more than this, cancel inlining. tends to be overly conservative on small functions which can increase by 300%. + + + + + A `talk `__ by Graydon Hoare on compilers mentions the paper :cite:`allenCatalogueOptimizingTransformations1971`. He says we need 8 optimization passes to get 80% of GCC/LLVM performance: Inline, Unroll (& Vectorize), CSE, DCE, Code Motion, Constant Fold, Peephole. 
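To give a flavor of what one of these passes looks like in isolation, here is a minimal constant-folding sketch over a toy expression IR, written in C++. The node shapes and helper names are invented for illustration and are not Stroscot's IR; a real pass would also have to preserve source locations and sharing as noted above::

   #include <iostream>
   #include <memory>
   #include <string>

   // Toy expression IR: literals, variables, and two binary operators.
   struct Expr;
   using ExprPtr = std::shared_ptr<Expr>;
   struct Expr {
     enum class Op { Lit, Var, Add, Mul } op;
     long value = 0;    // for Lit
     std::string name;  // for Var
     ExprPtr lhs, rhs;  // for Add/Mul
   };

   ExprPtr lit(long v) { auto e = std::make_shared<Expr>(); e->op = Expr::Op::Lit; e->value = v; return e; }
   ExprPtr var(std::string n) { auto e = std::make_shared<Expr>(); e->op = Expr::Op::Var; e->name = std::move(n); return e; }
   ExprPtr bin(Expr::Op op, ExprPtr a, ExprPtr b) { auto e = std::make_shared<Expr>(); e->op = op; e->lhs = a; e->rhs = b; return e; }

   // Constant folding: rewrite bottom-up, collapsing operators whose operands are both literals.
   ExprPtr fold(const ExprPtr& e) {
     if (e->op == Expr::Op::Lit || e->op == Expr::Op::Var) return e;
     ExprPtr l = fold(e->lhs), r = fold(e->rhs);
     if (l->op == Expr::Op::Lit && r->op == Expr::Op::Lit)
       return lit(e->op == Expr::Op::Add ? l->value + r->value : l->value * r->value);
     return bin(e->op, l, r);
   }

   void show(const ExprPtr& e) {
     switch (e->op) {
       case Expr::Op::Lit: std::cout << e->value; break;
       case Expr::Op::Var: std::cout << e->name; break;
       default:
         std::cout << "(";
         show(e->lhs);
         std::cout << (e->op == Expr::Op::Add ? " + " : " * ");
         show(e->rhs);
         std::cout << ")";
     }
   }

   int main() {
     // (2 + 3) * (x + (4 * 10))  folds to  (5 * (x + 40))
     ExprPtr prog = bin(Expr::Op::Mul,
                        bin(Expr::Op::Add, lit(2), lit(3)),
                        bin(Expr::Op::Add, var("x"), bin(Expr::Op::Mul, lit(4), lit(10))));
     show(fold(prog));
     std::cout << "\n";
   }

The other passes in the list are of course much more involved, but the overall shape - a small local rewrite applied wherever it fires, ideally without rescanning the whole IR - is similar.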
Reduction --------- @@ -56,17 +105,22 @@ CSE Common subexpression "elimination" is actually identifying identical expressions in the IR and giving them a shared representation in an IR graph. It is related to graph reduction, which per :cite:`balabonskiUnifiedApproachFully2011`, can be characterized as giving each term in the unshared IR a label, and using an implementation such that all terms with the same label are represented as a single object (node) and reduced as a unit. The specific technique to identify duplicate expressions is "hash-consing". Hash-consing can be applied incrementally, so that CSE can be applied continuously as other transformations are applied. One issue is merging alpha-equivalent expressions, :cite:`maziarzHashingModuloAlphaEquivalence2021`, which can be dealt with by encoding variable backedges as paths through a spanning tree. :cite:`mauborgneRepresentationSetsTrees1999` gives an algorithm identifying sharable components in cyclic graphs, useful for recursive structures. -As optimal reduction is also a term labelling, there should be an "optimal hash-consing" technique that identifies the Levy-labelling of terms with maximal sharing. More formally, there are three ways to define the equivalence relation of optimal reduction. The first is Levy labelling - take an initial term with unique atomic labels for every subterm, perform reductions according to a special label-generation rule, then observe which labels are equivalent in the result. The second is extraction, which maps a redex and its history to its origin. The third is the zig-zig relation, the smallest equivalence relation containing the "copy of" relation, based purely on reduction history. All of these relations are equivalent. But to make the semantics as a reduction graph tractable, all of these are defined with respect to an initial lambda term. For compile-time usage though, we would like the maximal equivalence - a "hash consing" algorithm which takes an unlabelled term and produces the labelling with maximal sharing. In the zig-zag relation, we would like to equate all redexes that can be produced by copying from any initial term, not +As optimal reduction is also a term labelling, there should be an "optimal hash-consing" technique that identifies the Levy-labelling of terms with maximal sharing. More formally, there are three ways to define the equivalence relation of optimal reduction. The first is Levy labelling - take an initial term with unique atomic labels for every subterm, perform reductions according to a special label-generation rule, then observe which labels are equivalent in the result. The second is extraction, which maps a redex and its history to its origin. The third is the zig-zag relation, the smallest equivalence relation containing the "copy of" relation, based purely on reduction history. All of these relations are equivalent on redexes. But to make the semantics as a reduction graph tractable, all of these are defined with respect to an initial lambda term. For compile-time usage though, we would like the maximal equivalence - a "hash consing" algorithm which takes an unlabelled term and produces the labelling with maximal sharing. In the zig-zag relation, we would like to equate all redexes that can be produced by copying from any initial term, not + +The reduction ``(\x. E[x]) e --> E[e]`` shows that it will share all identical expressions, just as CSE with graph reduction.
But it will also share an expression and its reduction, hence computing the labelling is at least of complexity :math:`\Sigma^0_1`. Let's avoid that by only considering terms pre-reduced to normal form. Are there any other equivalences besides normal CSE? + -The reduction ``(\x. E[x]) e --> E[e]`` shows that it will share all identical expressions, just as CSE with graph reduction. But it will also share an expression and its reduction, hence computing the labelling is at least of complexity :math:`\Sigma^0_1`. - To show that a maximal labelling exists, we need a join property of the form "for a term+history a, and another term+history b, there is a term+history c with all equivalences from a and also those from b". + Consider each case of lambda term: -Let's avoid that by only considering terms pre-reduced to normal form. Consider each case of lambda term: +* Bound variable with bound variable: a bound variable may equate with all of its other occurrences. But since the term is reduced, it cannot unify with anything else - each unique variable must have a unique label in the initial history. For example, for ``λy.y (λx.xx)``, the two bound appearances of ``x`` may be equated from an initial term ``λx.(λz.zz)x``. But we know that ``x`` and ``y`` cannot be equated, and similarly in ``λy.λy.x y``. -* Bound variable: a bound variable may unify with all of its other occurrences. But since the term is reduced, it cannot unify with anything else - each unique variable must have a unique label in the initial history. For example ``(λx.xx)((λy.y)`` +* Lambda abstraction: A lambda abstraction cannot equate Suppose we have ``λx.M`` and ``λy.N``. + + + To show that a maximal labelling exists, we need a join property of the form "for a term+history a, and another term+history b, there is a term+history c with all equivalences from a and also those from b". @@ -123,9 +177,9 @@ CPS does expose control flow as continuation values, but it has problems. First, The CBV CPS encoding is quite annoying, e.g. :cite:`downenSequentCalculusCompiler2016` it inverts nested function calls ``map f (map g xs)`` as ``λk.map g (λh.h xs (λys.map f (λh'.h' ys k)))``. Per :cite:`maurerCompilingContinuations2016` this makes CSE harder, e.g. ``f (g x) (g x)`` vs ``g (\xv. g (\yv. f k xv yv) x) x``. Also rewrite rules are harder to apply. Even CBV has an encoding - :cite:`flanaganEssenceCompilingContinuations1993` point out that "realistic" CBV CPS compilers mark source function calls as using special continuation closures to allow efficient calls. The call-by-need transform is worse - :cite:`okasakiCallbyneedContinuationpassingStyle1994` describes how the thunk graph itself must be represented in the CPS term. It does have the benefit that the term graph is built incrementally, by gluing together subgraphs generated on demand by reduction, but the graph is still obfuscated as imperative code. :cite:`kennedyCompilingContinuationsContinued2007` states assigning names to continuations is really a benefit, but doesn't discuss the other drawbacks of the encoding. -:cite:`sabryReasoningProgramsContinuationpassing1992` demonstrated that CBV CPS was reversible, and proved that beta-eta-reduction in CPS corresponded to the A-reductions plus call-by-value reduction on the original term. Hence, per :cite:`flanaganEssenceCompilingContinuations1993`, many compilers adopted reducing the expression to A normal form between other transformations as a replacement for CPS. 
However, per :cite:`kennedyCompilingContinuationsContinued2007`, ANF is not closed under beta-reduction - inlining can create nested lets, which then have to be "renormalized", floated out or rearranged via "commuting conversions". Similarly, the A-reduction ``E (if x then a else b) = if x then E a else E b`` duplicates the evaluation context, and as such is commonly left out. The workaround is to introduce a "join point", the ``e`` in ``let e z = ... in if x then e a else e b``. But join points are essentially continuations, second-class in that they are represented by function bindings. Even if specifically marked, they are fragile, in that per :cite:`maurerCompilingContinuations2016` the case-of-case transformation must handle join points specially, and similarly other transformations must preserve join points (e.g. not inlining the join point. Furthermore, they are not really functions, requiring a special calling convention to compile efficiently. As Kennedy says, "Better to start off with a language that makes continuations explicit." +:cite:`sabryReasoningProgramsContinuationpassing1992` demonstrated that CBV CPS was reversible, and proved that beta-eta-reduction in CPS corresponded to the A-reductions plus call-by-value reduction on the original term. Hence, per :cite:`flanaganEssenceCompilingContinuations1993`, many compilers adopted reducing the expression to A normal form between other transformations as a replacement for CPS. However, per :cite:`kennedyCompilingContinuationsContinued2007`, ANF is not closed under beta-reduction - inlining can create nested lets, which then have to be "renormalized", floated out or rearranged via "commuting conversions". Similarly, the A-reduction ``E (if x then a else b) = if x then E a else E b`` duplicates the evaluation context, and as such is commonly left out. The workaround is to introduce a "join point", the ``e`` in ``let e z = ... in if x then e a else e b``. But join points are essentially continuations, second-class in that they are represented by function bindings. Even if specifically marked, they are fragile, in that per :cite:`maurerCompilingContinuations2016` the case-of-case transformation must handle join points specially, and similarly other transformations must preserve join points (e.g. not inlining the join point). Furthermore, they are not really functions, requiring a special calling convention to compile efficiently. As Kennedy says, "Better to start off with a language that makes continuations explicit." -So both CPS and ANF suck. Fortunately, :cite:`downenSequentCalculusCompiler2016` presents Sequent Core, which retains the advantages of first-class continuations while avoiding the drawbacks. Sequent Core does not force choosing an evaluation order. A nested function application is munged a little, but the order is not inverted and CSE still works. ``Cut`` glues together graph pieces, but is itself part of the graph, hence does not need to encode its sub-graphs. Functions reduce to join points and other types of sequent, rather than the reverse. Reduction is uniformly cut-elimination, and does not require renormalization. +So all of CPS, ANF, and joint points suck. Fortunately, :cite:`downenSequentCalculusCompiler2016` presents Sequent Core, which retains the advantages of first-class continuations while avoiding the drawbacks. Sequent Core does not force choosing an evaluation order. A nested function application is munged a little, but the order is not inverted and CSE still works. 
``Cut`` glues together graph pieces, but is itself part of the graph, hence does not need to encode its sub-graphs. Functions reduce to join points and other types of sequent, rather than the reverse. Reduction is uniformly cut-elimination, and does not require renormalization. Downen at al. implemented Sequent Core, but the implementation was complex. I think though that this complexity was mainly due to the need to interface with GHC and closely follow the design of GHC's original System F-like Core. A novel implementation focusing on a "clean" sequent logic, and emphasizing duality and symmetries, like Stroscot's, should be able to avoid this implementation complexity. I asked SPJ and he was like "go for it." SSA represents code as procedures containing imperative blocks, and can't express higher-order features. But, per :cite:`appelSSAFunctionalProgramming1998`, SSA code blocks map directly to mutually recursive tail-called functions, with the procedure as a distinguished function. Indeed, the Swift Intermediate Language adopted block arguments instead of φ-nodes. SSA's basic blocks identify "small" functions that can be compiled easily, but this pass can be replicated in any IR. The other aspect of SSA, single static variable assignment, is reflected in a pass that remove mutable variables by replacing them with additional input and output arguments. diff --git a/docs/Commentary/Learning.rst b/docs/Commentary/Learning.rst index b36f980..9123273 100644 --- a/docs/Commentary/Learning.rst +++ b/docs/Commentary/Learning.rst @@ -46,8 +46,9 @@ Neuroscience We can distinguish many tasks in programming: -* writing code -* skimming code +* designing code - often happens purely in the head. Programmers take long walks or similar to figure out a problem, and then when they got home, they simply write out the code they have thought of. The "hard part" happens away from the computer. +* writing code - mechanical typing, hard to tell if it's very cognitive +* skimming code - relies on the language centres of the brain, the ventral lateral prefrontal cortex. (citation needed) * reading code carefully for bugs * learning a new programming language @@ -57,6 +58,12 @@ Programming misconceptions aren't likely to release dopamine, For experienced programmers, a specific "Stroscot for Y programmers" guide series should be sufficient to retrain programmers away from their bad habits. +working memory - when you’re thinking of a sequence of events, you need to keep the line of logical reasoning held in your working memory + +flow state - it’s easy to think ‘just one more compile’ and end up staying up all night. + +coding provides an immediate feedback loop with testing and seeing the results. This releases Dopamine (citation needed) + TODO: check out the Unified Learning Model book Content diff --git a/docs/Commentary/Logic.rst b/docs/Commentary/Logic.rst index db3eda7..fc3842f 100644 --- a/docs/Commentary/Logic.rst +++ b/docs/Commentary/Logic.rst @@ -10,8 +10,6 @@ Logic \newcommand{\multimapboth}{\mathbin{\mathrm{⧟}}} \newcommand{\bang}{{\mathrm{!}}} \newcommand{\whim}{{\mathrm{?}}} - \newcommand{\bangc}{{\mathrm{!}}} - \newcommand{\whimc}{{\mathrm{?}}} \newcommand{\ocin}{\mathrel{\raise{-1pt}{\mathrm{!}}\mathord{\in}}} \definecolor{mygray}{RGB}{156,156,156} \newcommand{\sk}[1]{{\color{mygray} #1}} @@ -25,33 +23,51 @@ The use of sequent calculus is inspired by :cite:`downenSequentCalculusCompiler2 Between classical, intuitionistic, and linear logic, I went with linear logic. 
It's the most expressive, in that intuitionistic and classical logic can be encoded fairly naturally, but linear logic has more operators. Intuitionistic logic is traditional and has a direct machine implementation, but there is an operational semantics for linear logic :cite:`mackieGeometryInteractionMachine1995` and linear logic makes the expensive "copy term" operation more explicit. In intuitionistic logic, copying can happen anywhere in the reduction, which is harder to handle. The "boxes" of linear logic which demarcate copying boundaries are supposedly the basis of optimal reduction :cite:`guerriniTheoreticalPracticalIssues1996`. +There are also other logics similar to Girard's linear logic, like deep inference. Most papers on deep inference seem to add the `mix rule `__ which corresponds to assuming :math:`1 \leftrightarrow \bot`. This doesn't seem attractive compared to plain linear logic - it proves contradiction as a theorem, hence loses the embedding of classical logic. `This page `__ mentions that mix holds in various models of linear logic such as coherent spaces and the game-theoretic semantics, but models are usually a simplification and there are models such as the syntactic model where mix doesn't hold. :cite:`strassburgerDeepInferenceExpansion2019` presents a deep inference system for non-mix MLL but says extending it to full LL is future work. + +There is also the question of if, having removed weakening and contraction, the remaining structural rules, exchange and associativity, should be dropped. Linear logic gives the idea of the propositions being in named slots, as e.g. :math:`\vdash A^a, B^b`` is for almost all purposes the same as :math:`\vdash B^b, A^a``, and only in some narrow cases would we want to differentiate them. This associative array semantics corresponds well to the RAM model. In contrast, dropping exchange gives non-commutative or ordered logic, leading to a stack or list on each side. But :cite:`shiVirtualMachineShowdown2005` shows that a register model is much better for an implementation - the extra stack swapping instructions are more overhead than the additional register names. Stack-like access is just too restrictive. Similarly, dropping associativity gives a tree-like semantics, and trees are not iterable in constant time. The number of operators would explode because every tree structure / stack index would create a new operator. Hence linear logic is the clear winner. But, there seem to be reasonable ways of embedding linear logic in non-associative / non-commutative logic by adding associative / commutative modalities. :cite:`blaisdellNonassociativeNoncommutativeMultimodal2022` If there was a popular logic with such an embedding, then we could switch from linear logic to that. But per :cite:`millerOverviewLinearLogic2004` "no single [non-commutative] proposal seems to be canonical at this point." + :cite:`downenSequentCalculusCompiler2016` say "Sequent Core lies in between [intuitionistic and classical logic]: sometimes there can only be one conclusion, and sometimes there can be many." Specifically, terms and types are intuitionistic (single conclusion), but commands, continuations and bindings are written to allow multiple propositions. Ret/Ax, WR/Name, and Jump/Label all introduce right-weakening. 
Linear logic seems a lot cleaner than this mess, but also lies between intuitionistic and classical logic, in that (like intuitionistic) there is no excluded middle, but also (like classical) there is duality and no limitation on the number of conclusions. -Linear logic can be formulated as two-sided, one-sided, or intuitionistic. I chose two-sided because again, it's the most expressive. Intuitionistic logic doesn't typically have negation, but it can be encoded as :math:`A \to \bot`, pretty bad for trying to simplify connectives. One-sided might seem attractive, as on first glance it makes formulas more uniform, but it actually doesn't really improve complexity because the one-sided logic has the same number of connectives, and the duals require special handling - e.g. an application of identity looks like :math:`\vdash \Sigma X^\bot, \Pi X` instead of :math:`\Pi X \vdash \Pi X`. So we end up with a representation of a connective as having a boolean for polarity, whereupon there is not much issue with adding a boolean for side as well to obtain the two-sided sequent. The two-sided sequent formulation preserves intent, e.g. the definition of refutations as sequents :math:`A \vdash` is not distinct from that of a proof in the one-sided calculus. Taking the two-sided logic as basic, the one sided logic can be formulated as a transformation that moves everything to the right side with negation and pushes down negations to atomic statements by applying duality. Such a transformation is of course useful, but forcing it to be applied from the start does not seem to serve much purpose. +Linear logic can be formulated as two-sided, one-sided, or intuitionistic. I chose two-sided because again, it's the most expressive. The two-sided sequent formulation preserves intent, e.g. the definition of proofs as sequents :math:`\vdash A` is lost in the one-sided calculus as everything is of that form. Taking the two-sided logic as basic, the one sided logic can be formulated as a transformation that moves everything to the right side with negation and pushes down negations to atomic statements by applying duality. Such a transformation is of course useful, but forcing it to be applied from the start does not seem to serve much purpose. Intuitionistic logic doesn't have negation - it has to be encoded as :math:`A \to \bot`, pretty bad for trying to simplify connectives. -There are also other logics similar to Girard's linear logic, like deep inference. Most papers on deep inference seem to add the `mix rule `__ which corresponds to assuming :math:`1 \leftrightarrow \bot`. This doesn't seem attractive compared to plain linear logic - it proves contradiction as a theorem, hence loses the embedding of classical logic. `This page `__ mentions that mix holds in various models of linear logic such as coherent spaces and the game-theoretic semantics, but models are usually a simplification and there are models such as the syntactic model where mix doesn't hold. :cite:`strassburgerDeepInferenceExpansion2019` presents a deep inference system for non-mix MLL but says extending it to full LL is future work. +There is one advantage to one-sided, namely it reduces the number of positions, so a naive representation goes from two arrays of formulas/slots to one. But the one-sided logic still has dual connectives - for example an application of identity of par looks like :math:`\vdash A \par B, A^\bot \otimes B^\bot` instead of :math:`A \par B \vdash A \par B`. 
So we also should combine dual connectives, by adding a flag for polarity. But then we might as well add a boolean for side as well, to recover the two-sided sequent logic. So we could represent a rule as (connective,polarity,side), where polarity is positive or negative, side is left or right, and the connectives themselves are unpolarized and unsided. This is how they are presented in the reference: the propositions are on the left and right of the sequent, and each connective is annotated with plus or minus. The only minor constraint of this notation is that left propositions are listed before right propositions, which is fine since the list of sequent propositions is unordered.

-There is also the question of if, having removed weakening and contraction, the remaining structural rules, exchange and associativity, should be dropped. Linear logic gives the idea of the propositions being in named slots, as e.g. :math:`\vdash A^a, B^b`` is for almost all purposes the same as :math:`\vdash B^b, A^a``, and only in some narrow cases would we want to differentiate them. This associative array semantics corresponds well to the RAM model. In contrast, dropping exchange gives non-commutative or ordered logic, leading to a stack or list on each side. But :cite:`shiVirtualMachineShowdown2005` shows that a register model is much better for an implementation - the extra stack swapping instructions are more overhead than the additional register names. Stack-like access is just too restrictive. Similarly, dropping associativity gives a tree-like semantics, and trees are not iterable in constant time. The number of operators would explode because every tree structure / stack index would create a new operator. Hence linear logic is the clear winner. But, there seem to be reasonable ways of embedding linear logic in non-associative / non-commutative logic by adding associative / commutative modalities. :cite:`blaisdellNonassociativeNoncommutativeMultimodal2022` If there was a popular logic with such an embedding, then we could switch from linear logic to that. But per :cite:`millerOverviewLinearLogic2004` "no single [non-commutative] proposal seems to be canonical at this point."
+But polarity+side is not quite canonical as we have for example that the rule :math:`𝕁^+_L` is the same structure as :math:`𝕁^-_R`, and similarly :math:`𝕁^-_L` is the same as :math:`𝕁^+_R`. In the compiler, we would like to only have one datatype alternative for each rule structure, and then have a flag for distinguishing the remaining side/polarity bit. Looking at the rules:
+
+* Class 1:
+
+  * J+R/J-L is a data constructor or function application
+  * !+R/!-L is a box
+  * Q+R/Q-L is also pretty much a box
+
+* Class 2:
+
+  * !d-R/!d+L is an accessor, and weakening/contraction follow similar structure
+  * J-R/J+L is a case statement
+  * Q-R/Q+L is substituting a value in a derivation at the type level
+
+Per ChatGPT, we can classify these as "build"/"break", thus getting our desired bit. Actually we will use specific rule names most of the time, so we don't need the bit. Then we can use the side as the second bit (as that is syntactic and easier to read from a sequent than polarity).

Type annotations
================

In :cite:`downenSequentCalculusCompiler2016`, TyApp is written in natural deduction style. Actually, it seems type annotations just sort of float around; they should not be part of the sequents at all. In my presentation of logic, terms are the proof trees, and propositions the types.
So when it is written that ∀ and ∃ consume terms with types, like ∀R :math:`(\Gamma, a : \kappa \to v : \tau) \vdash (\Gamma \to \Lambda a : \kappa. v : \forall a : \kappa . \tau)`, :math:`a : \kappa` is just an assertion, and I would instead just write :math:`(\Gamma \to \tau) \vdash (\Gamma \to \forall a : \kappa . \tau)`, where :math:`a` has no free occurrence in :math:`\Gamma` and is ambiently assumed to be of some type. Stroscot is a unityped language, so the types can just be ignored for now. -Jumbo connectives -================= +Jumbo connective +================ -Based on :cite:`levyJumboLcalculus2006`, Stroscot aims for the largest allowable set of operators. In particular we generalize into two jumbo operators, :math:`\Sigma` (sigma) and :math:`\Pi` (pi). The generalized :math:`\Pi` rule is similar to Levy's rule except it allows multiple conclusion propositions. The generalized :math:`\Sigma` rule is the dual of :math:`\Pi`. We have indexed variables :math:`A_{ij}` and :math:`B_{ik}` where :math:`0 \leq i < N, 0 \leq j < m_i, 0 \leq k < n_i`. We call :math:`N` the length of the jumbo type and the list :math:`[(m_i,n_i)]` the jumbo-arity. +Based on :cite:`levyJumboLcalculus2006`, Stroscot aims for the largest allowable set of operators. In particular we have one connective, the "jumbo" :math:`𝕁` (double-struck capital J, decided per ChatGPT), which comes in two polarities :math:`𝕁^+` and :math:`𝕁^-`. :math:`𝕁^+` is positive, and on the left (input) side it is data, like Levy's :math:`\Sigma` tuple rule. :math:`𝕁^-` is negative, and on the right acts like Levy's :math:`\Pi` function rule. In both cases though they are more general, allowing multiple conclusions, dualization, and other features of linear logic. Each :math:`𝕁` variant is defined relative to a case structure, like :math:`𝕁 [(a, [A_{11},A_{12}] \vdash [B_{a1}]),(b, [A_{b1}] \vdash []), (c,[]-[B_{c1}])]` We have a list of labels, :math:`l_i, 1 \leq i \leq N`, and each label has left and right sequent propositions :math:`A_{ij}` and :math:`B_{ik}` where :math:`1 \leq j \leq m_i, 1 \leq k \leq n_i`. Following Levy, we call :math:`N` the length of the jumbo type and the list :math:`[(m_i,n_i)]` the jumbo-arity. -In :cite:`downenSequentCalculusCompiler2016`, TLK/TRK correspond to the intuitionistic jumbo-style Sigma operator. But they have no corresponding Pi operator, but rather Case/Deflt, which have no effect on the sequent type, hence no logical meaning. They say they considered adding "general existential types" which seems to be a Pi operator in disguise, but mapping to GHC Core was too difficult and required a "heavy-handed encoding". Since Stroscot is a new language, we can have a full set of jumbo operators to do all sorts of case-matching. +In :cite:`downenSequentCalculusCompiler2016`, TLK/TRK correspond to an intuitionistic-style one-sided :math:`𝕁^+` operator. But they have no corresponding :math:`𝕁^-` operator, but rather Case/Deflt, which have no effect on the sequent type, hence no logical meaning. They say they considered adding "general existential types" which seems to be a :math:`𝕁^-` operator in disguise, but mapping to GHC Core was too difficult and required a "heavy-handed encoding". Since Stroscot is a new language, we can have a full set of jumbo operators to do all sorts of case-matching. -When the RHS of :math:`\Sigma` is nonempty we get terms with holes, that can be pattern-matched by filling the holes, e.g. `difference lists `__. 
(TODO: check that this actually gives efficient concatenation)
+When the RHS of :math:`𝕁^+` is nonempty we get terms with holes, which can be pattern-matched by filling the holes, e.g. `difference lists `__. (TODO: check that this actually gives efficient concatenation)

-The dual of implication is called "subtraction" or "difference" and is denoted :math:`-`. For an ADT, the RHS of the difference is empty, i.e. ``a A | b B1 B2 | C`` looks like :math:`\Sigma [(a, [A]-[]),(b, [B_1, B_2]-[]), (c,[]-[])]`. This follows :cite:`wadlerCallbyvalueDualCallbyname2003` and :cite:`crolardFormulaeastypesInterpretationSubtractive2004` but is flipped compared to Pi's implication.
+The dual of implication is called "subtraction" or "difference" and is denoted :math:`-`. For an ADT, the RHS of the difference is empty, i.e. ``a A | b B1 B2 | C`` looks like :math:`𝕁^+ [(a, [A] \vdash []),(b, [B_1, B_2] \vdash []), (c,[] \vdash [])]`. This follows :cite:`wadlerCallbyvalueDualCallbyname2003` and :cite:`crolardFormulaeastypesInterpretationSubtractive2004` but is flipped compared to Pi's implication.

-The jumbo connectives have the nice "unpacking" property that any combination of :math:`\Sigma` connectives is equivalent to a single :math:`\Sigma` connective, and likewise for :math:`\Pi`.
+The jumbo connectives have the nice "unpacking" property that any combination of :math:`𝕁^+` connectives is equivalent to a single :math:`𝕁^+` connective, and likewise for :math:`𝕁^-`.

-The index :math:`i` in Levy's presentation is a tag drawn from a finite subset of a countable set of labels. But we can draw it from a subset of the universal set and hence get dependent types that depend on terms. In particular :math:`\Pi` gives a dependent function type and :math:`\Sigma` gives a dependent pair type.
+The index :math:`i` in Levy's presentation is a tag drawn from a finite subset of a countable set of labels. But we can draw it from a subset of the universal set and hence get dependent types that depend on terms. In particular :math:`𝕁^-` gives a dependent function type and :math:`𝕁^+` gives a dependent pair type.

Common connectives
==================
@@ -75,22 +91,22 @@ The negations and shifts show up in polarized / focused linear logic. We use the

Programming types
-----------------

-With the programming types we see the justification for the jumbo types: they can represent abstract data types (ADTs). Even though we can encode :math:`\Pi,\Sigma` using the common connectives:
+With the programming types we see the justification for the jumbo types: they can represent abstract data types (ADTs). Even though we can encode :math:`𝕁` using the common connectives:

.. math::

-    \Pi [(\#t_1,[A_{1,1},A_{1,2},\ldots] \multimap [B_{1,1},B_{1,2},\ldots]),\ldots] \equiv (\smash{\stackrel{-}{\neg}} A_{1,1} \par \smash{\stackrel{-}{\neg}} A_{1,2} \par \ldots \par B_{1,1} \par \ldots) \land \ldots
+    𝕁^- [(\#t_1,[A_{1,1},A_{1,2},\ldots] \vdash [B_{1,1},B_{1,2},\ldots]),\ldots] \equiv (\smash{\stackrel{-}{\neg}} A_{1,1} \par \smash{\stackrel{-}{\neg}} A_{1,2} \par \ldots \par B_{1,1} \par \ldots) \land \ldots

-    \Sigma [(\#t_1,[A_{1,1},A_{1,2},\ldots] \multimap [B_{1,1},B_{1,2},\ldots]),\ldots] \equiv (A_{1,1} \otimes A_{1,2} \otimes \ldots \otimes \smash{\stackrel{+}{\neg}} B_{1,1} \otimes \ldots) \lor \ldots
+    𝕁^+ [(\#t_1,[A_{1,1},A_{1,2},\ldots] \vdash [B_{1,1},B_{1,2},\ldots]),\ldots] \equiv (A_{1,1} \otimes A_{1,2} \otimes \ldots \otimes \smash{\stackrel{+}{\neg}} B_{1,1} \otimes \ldots) \lor \ldots

With the encoding, we lose the free-form tags and have to use strings like "RRRRRL".
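For example, under this encoding (a sketch, ignoring units) a three-constructor ADT ``a A | b B | c C`` with no negative components collapses to a nested sum:

.. math::

    𝕁^+ [(a,[A] \vdash []),(b,[B] \vdash []),(c,[C] \vdash [])] \equiv A \lor (B \lor C)

so the tag ``a`` becomes the path "L", ``b`` becomes "RL", and ``c`` becomes "RR".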
This leads to unbalanced proof trees and a general lack of expressiveness of the proof language. -Lambdas have several conflicting definitions, so I have left them out. :cite:`maraistCallbynameCallbyvalueCallbyneed1995` says call by name lambdas are :math:`\Omega_N = \bang \Omega_N \to \Omega_N`, and call by value lambdas are :math:`\Omega_V = \bang (\Omega_V \to \Omega_V)`. :cite:`filinskiLinearContinuations1992` agrees on call by name lambdas, but says call by value lambdas are :math:`\Omega_V = \bang (\Omega_V \to \whim \Omega_V)`. A paper I forgot to cite says that these all expand to the same infinite type so are equivalent, making me wonder if linear logic actually represents reduction strategy at all. +Lambdas have several conflicting definitions, so I have left them out. :cite:`maraistCallbynameCallbyvalueCallbyneed1995` says call by name lambdas are :math:`\Omega_N = \bang^+ \Omega_N \to \Omega_N`, and call by value lambdas are :math:`\Omega_V = \bang^+ (\Omega_V \to \Omega_V)`. :cite:`filinskiLinearContinuations1992` agrees on call by name lambdas, but says call by value lambdas are :math:`\Omega_V = \bang^+ (\Omega_V \to \bang^- \Omega_V)`. A paper I forgot to cite points out that most of these expand to the same infinite type so are equivalent, making me wonder if these types actually represent differing reduction strategies at all. -Exponentials -============ +Exponential +=========== -There are two S4 modalities !/bang/"of course" (positive/affirmative) and the dual ?/whim/whimper/"why not" (negative). +The exponential is an S4 modality. Again it comes in two polarities. These are usually written ! (bang/"of course", positive) and ? (whim/whimper/"why not", negative) but to make polarity explicit we write :math:`\bang^+` (equivalent to !) and :math:`\bang^-` (equivalent to ?). ! was chosen as the base symbol because use of ? is rather rare. Contraction ----------- @@ -102,14 +118,9 @@ Subexponentials Standard linear logic only contains bang and whim. But per :cite:`nigamAlgorithmicSpecificationsLinear2009` these are not "canonical" - if we introduce two modalities :math:`\bang_1, \bang_2` with distinct but identical rules, we cannot prove :math:`\bang_1 A \equiv \bang_2 A`. So in keeping with the maximalist approach we present here the logic with subexponentials. The subexponentials function as annotations on exponentials, in that we can erase all the subexponential labels to a single standard exponential, and we can infer subexponential labels, computing the minimal subexponential structure necessary for the program to work. Subexponentials whose only operations are promotion/dereliction can be deleted from the program. :cite:`danosStructureExponentialsUncovering1993` -For notation, there's a lot of choices. I propose the following: +For notation, there's a lot of choices. I propose :math:`\bang^\pm_S` (with :math:`S = (m,x) \in (M,X)`), where :math:`m` is in an index set :math:`M` and :math:`x \in X, X = P(\{c, w\})`. :math:`\bang^\pm` are the well-known full exponentials :math:`m=\cdot,x=\{c,w,d\}`, as is standard. -* :math:`\bang^x_m,\whim^x_m`, :math:`\bang_{(m,x)},\whim_{(m,x)}`, or more simply :math:`\bang_S,\whim_S` (with :math:`S = (m,x) \in (M,X)`) is the full notation for a subexponential, where :math:`m` is in an index set :math:`M` and :math:`x \in X, X = P(\{c, w\})`. -* The "well-known" subexponentials are :math:`\bang^x,\whim^x`, where the index :math:`m=\cdot \in M` is omitted. These allow easily omitting various operations. 
-* The "full" subexponentials are :math:`\bang_m,\whim_m`, and have :math:`x=\{c,w,d\}` - this allows easily writing distinguished subexponentials. -* Combining these conventions, :math:`\bang,\whim` are the well-known full exponentials :math:`m=\cdot,x=\{c,w,d\}`, as is standard. - -To use these we must define a relation :math:`\leq` on :math:`(M,X)` such that :math:`((M,X),\leq)` is a poset abd satisfies two more requirements: +To use these we must define a relation :math:`\leq` on :math:`(M,X)` such that :math:`((M,X),\leq)` is a poset and satisfies two more requirements: #. :math:`(m,x) \leq (n,y)` implies :math:`x\subseteq y`. #. if :math:`x \subseteq y`, then :math:`(m,x) \leq (m,y)` for all :math:`m`. @@ -118,49 +129,48 @@ Justifying these: * Reflexivity ensures the identity theorem. * Transitivity and the first requirement ensure cut elimination. -* Antisymmetry ensures that if :math:`\bang^x_m A \equiv \bang^y_n A` then :math:`m=n` and :math:`x=y`, so that we do not have duplicate notation for a particular modality. +* Antisymmetry ensures that if :math:`\bang^+_M A \equiv \bang^+_N A` then :math:`m=n` and :math:`x=y`, so that we do not have duplicate notation for a particular modality. * The second requirement is not strictly necessary, but makes the notation more coherent. If it is not satisfied we can simply split the index :math:`m` into two or more indexes :math:`m_i`. -The rule for promotion requires that :math:`(z,o)\leq (x_i,m_i)` and :math:`(z,o)\leq (y_i,n_i)` for the elements of the context. +The rule for promotion requires that :math:`Z \leq M_i` and :math:`Z \leq N_i` for the elements of the context. .. math:: :nowrap: \begin{array}{cc} - \rule{\overrightarrow{\bang^{x_i}_{m_i} \Gamma_i } \vdash A, \overrightarrow{\whim^{y_i}_{n_i}\Delta_i} }{\overrightarrow{\bang^{x_i}_{m_i} \Gamma_i } \vdash \bang^z_o A, \overrightarrow{\whim^{y_i}_{n_i}\Delta_i}}{\bang} + \rule{\overrightarrow{\bang^+_{M_i} \Gamma_i } \vdash A, \overrightarrow{\bang^{-}_{N_i}\Delta_i} }{\overrightarrow{\bang^{+}_{M_i} \Gamma_i } \vdash \bang^{+}_Z A, \overrightarrow{\bang^{-}_{N_i}\Delta_i}}{\bang^+} & - \rule{\overrightarrow{\bang^{x_i}_{m_i} \Gamma_i } , A\vdash \overrightarrow{\whim^{y_i}_{n_i}\Delta_i} }{\overrightarrow{\bang^{x_i}_{m_i} \Gamma_i }, \whim^z_o A \vdash \overrightarrow{\whim^{y_i}_{n_i}\Delta_i}}{\whim} - + \rule{\overrightarrow{\bang^+_{M_i} \Gamma_i }, A \vdash \overrightarrow{\bang^{-}_{N_i}\Delta_i} }{\overrightarrow{\bang^{+}_{M_i} \Gamma_i } ,\bang^{-}_Z A \vdash \overrightarrow{\bang^{-}_{N_i}\Delta_i}}{\bang^-} \end{array} -Dereliction must be allowed, otherwise the identity rule does not hold. +Dereliction is allowed unconditionally, so that the identity rule holds. .. math:: :nowrap: \begin{array}{cc} - \rule{\sk{\Gamma}, A \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^x_m A \vdash \sk{\Delta}}{\bang d} - & \rule{\sk{\Gamma} \vdash A, \sk{\Delta} }{\sk{\Gamma} \vdash \whim^x_m A, \sk{\Delta}}{\whim d} + \rule{\sk{\Gamma}, A \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^+_M A \vdash \sk{\Delta}}{\bang^+ d} + & \rule{\sk{\Gamma} \vdash A, \sk{\Delta} }{\sk{\Gamma} \vdash \bang^-_M A, \sk{\Delta}}{\bang^- d} \end{array} -Weakening requires :math:`w \in x`. +Weakening requires :math:`w \in \text{snd} M`. .. 
math:: :nowrap: \begin{array}{cc} - \rule{\sk{\Gamma} \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^x_m A \vdash \sk{\Delta}}{\bang w} - & \rule{\sk{\Gamma} \vdash \sk{\Delta} }{\sk{\Gamma} \vdash \whim^x_m A, \sk{\Delta}}{\whim w} + \rule{\sk{\Gamma} \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^+_M A \vdash \sk{\Delta}}{\bang^+ w} + & \rule{\sk{\Gamma} \vdash \sk{\Delta} }{\sk{\Gamma} \vdash \bang^-_M A, \sk{\Delta}}{\bang^- w} \end{array} -Contraction requires :math:`c \in x` +Contraction requires :math:`c \in \text{snd} M` .. math:: :nowrap: \begin{array}{cc} - \rule{\sk{\Gamma}, \overrightarrow{\bang^x_m A, \bang^x_m A, \cdots} \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^x_m A \vdash \sk{\Delta}}{\bang c_n} - & \rule{\sk{\Gamma} \vdash \overrightarrow{\whim^x_m A, \whim^x_m A, \cdots}, \sk{\Delta} }{\sk{\Gamma} \vdash \whim^x_m A, \sk{\Delta}}{\whim c_n} + \rule{\sk{\Gamma}, \overrightarrow{\bang^+_M A, \bang^+_M A, \cdots} \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^+_M A \vdash \sk{\Delta}}{\bang^+ c} + & \rule{\sk{\Gamma} \vdash \overrightarrow{\bang^-_M A, \bang^-_M A, \cdots}, \sk{\Delta} }{\sk{\Gamma} \vdash \bang^-_M A, \sk{\Delta}}{\bang^- c} \end{array} We also allow quantification over subexponentials, as in :cite:`nigamAlgorithmicSpecificationsLinear2009`. @@ -168,25 +178,25 @@ We also allow quantification over subexponentials, as in :cite:`nigamAlgorithmic Modalities ---------- -Because of the equivalences :math:`\bang \bang A \equiv \bang A, \whim\whim A \equiv \whim A, \bang \whim \bang \whim A \equiv \bang \whim A, \whim \bang \whim \bang A \equiv \whim \bang A`, there are only 7 modalities created from combining exponentials. They have the relationships as follows, where an arrow :math:`A \to B` means :math:`\vdash A \to B` is provable: :cite:`coniglioEqualityLinearLogic2002` +Because of the equivalences :math:`\bang^\pm \bang^\pm A \equiv \bang^\pm A, \bang^\pm \bang^\mp \bang^\pm \bang^\mp A \equiv \bang^\pm \bang^\mp A`, there are only 7 modalities created from combining exponentials. They have the relationships as follows, where an arrow :math:`A \to B` means :math:`\vdash A \to B` is provable: :cite:`coniglioEqualityLinearLogic2002` .. graphviz:: digraph G { rankdir=LR - "!A" -> "A" - "A" ->"?A" - "!A" -> "!?!A" -> {"!?A","?!A"} -> "?!?A" -> "?A" + "+A" -> "A" + "A" ->"-A" + "+A" -> "+-+A" -> {"+-A","-+A"} -> "-+-A" -> "-A" subgraph C { rank=same - "A","!?A","?!A" + "A","+-A","-+A" } } More generally with subexponentials: -* For :math:`(x,m)\geq(y,n)`, :math:`\bang^x_m \bang^y_n A \equiv \bang^x_m A \equiv \bang^y_n \bang^x_m A`, and identically with :math:`\whim`. -* For :math:`(x,n)\leq(z,p)` and :math:`(y,o)\leq(w,m)` we can prove :math:`\bang^w_m \whim^x_n \bang^y_o \whim^z_p A \equiv \bang^w_m \whim^z_p A`. +* For :math:`M \geq N`, :math:`\bang^+_M \bang^+_N A \equiv \bang^+_M A \equiv \bang^+_N \bang^+_M A`, and identically with :math:`\bang^-`. +* For :math:`N \leq P` and :math:`O \leq M` we can prove :math:`\bang^+_M \bang^-_N \bang^+_O \bang^-_P A \equiv \bang^+_M \bang^-_P A`. With distinguished subexponentials the possible modalities become infinite, for example alternating patterns like :math:`\bang_1 \bang_2 \bang_1 \bang_2` cannot be simplified unless there is a relation in the poset between 1 and 2. But of course if we erase the index labels then :math:`\bang \bang \bang \bang A \equiv \bang A`. 
Due to this, I think keeping the "well-known" subexponentials in the IR is fine and informative, but distinguished subexponentials should be avoided except during a pass that reduces exponential strength and eliminates unnecessary exponentials. @@ -203,29 +213,31 @@ It seems easier to follow :cite:`lafontLinearLogicPages` and classify proofs by * :math:`x, \ldots, z \vdash A`, where :math:`x,\ldots,z` are free variables representing propositions (most likely appearing in :math:`A`), is a proof pattern of :math:`A`. Similarly :math:`x, \ldots, z, A \vdash` is a refutation pattern. * :math:`\Gamma \vdash A` is a term with result type :math:`A` -:math:`\Sigma_R` constructs a proof from a collection of proofs and refutations, while :math:`\Pi_L` constructs a refutation from a collection of proofs and refutations. +:math:`𝕁^+_R` constructs a proof from a collection of proofs and refutations, while :math:`𝕁^-_L` constructs a refutation from a collection of proofs and refutations. Polarized logic =============== -Following :cite:`lafontLinearLogicPages` we say a proposition :math:`A` is positive if :math:`A \equiv \bang A`. Of course the direction :math:`\bang A \vdash A` is trivial by dereliction. +Following :cite:`lafontLinearLogicPages` we say a proposition :math:`A` is positive if :math:`A \equiv \bang^+ A`. Dually we say :math:`A` is negative if :math:`A \equiv \bang^- A`. :math:`A` is positive iff :math:`\neg A` is negative, and vice-versa. -Going through the rules on :math:`A \vdash \bang A`: +The direction :math:`\bang^+ A \vdash A` is trivial by dereliction, so when considering a specific proposition we only must show :math:`A \vdash \bang^+ A`. + +In general, :math:`𝕁^+ [ \vec A_i \vdash \vec B_i]` is positive, and :math:`𝕁^- [ \vec A_i \vdash \vec B_i ]` is negative, if :math:`A_i` are all positive and :math:`B_i` are all negative, hence the assignment of plus and minus for :math:`𝕁`. + +Going through the rules on :math:`A \vdash \bang^+ A`: * :math:`F,1` are positive -* :math:`\bang A` is positive for any :math:`A` +* :math:`\bang^+ A` is positive for any :math:`A` * :math:`A \lor B,A \otimes B` are positive if :math:`A,B` are positive * :math:`T,\bot` are not positive -* :math:`A \land B` is positive if :math:`A` is positive and :math:`\bang A \vdash B`, e.g. if :math:`B` is a theorem +* :math:`A \land B` is positive if :math:`A` is positive and :math:`\bang^+ A \vdash B`, e.g. if :math:`B` is a theorem * :math:`A \par \bot` is positive if :math:`A` is positive. More generally, :math:`A \par B` is positive if :math:`A` is positive, :math:`B \vdash \bot` (i.e., :math:`B` is a refutation), and :math:`A\vdash A,B`. -Dually we say :math:`A` is negative if :math:`A \equiv \whim A`. :math:`A` is positive iff :math:`\neg A` is negative, and vice-versa. In general, :math:`\Sigma [ \vec A_i - \vec B_i]` is positive and :math:`\Pi [ \vec A_i \multimap \vec B_i ]` is negative if :math:`A_i` are all positive and :math:`B_i` are all negative. - We say that a proposition is polarized if it is either positive or negative. Unpolarized propositions such as :math:`T \otimes \bot` exist, although every tautology is positive and every absurdity is negative. By strong consistency, no proposition can be both negative and positive. Girard defined a syntactic rather than semantic notion of polarity, based on classifying connectives. By his definitions, :math:`A \par \bot` and :math:`A\otimes 1` remove polarity from a positive resp. negative proposition. But the semantic definition seems more useful. 
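As a small worked example of the semantic definition (a sketch; the rule names are informal), :math:`1` is positive because :math:`1 \vdash \bang^+ 1` is derivable: the right rule for :math:`1` gives :math:`\vdash 1` with an empty context, promotion then applies because the context is empty, and the left rule for :math:`1` introduces the hypothesis. Reading the steps left to right:

.. math::
    :nowrap:

    \begin{array}{ccc}
    \rule{}{\vdash 1}{1_R} &
    \rule{\vdash 1}{\vdash \bang^+ 1}{\bang^+} &
    \rule{\vdash \bang^+ 1}{1 \vdash \bang^+ 1}{1_L}
    \end{array}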
-Lafont also defines regular types :math:`A \equiv \whim \bang A`, but considering that there are 6 non-trivial modalities including a dual modality :math:`\bang \whim` this seems too limited. There is not a clear justification for this definition; maybe some paper in the literature explains it, but a simple search "regular linear logic" didn't uncover it. +Lafont also defines regular types :math:`A \equiv \bang^- \bang^+ A`, but considering that there are 6 non-trivial modalities including a dual modality :math:`\bang^+ \bang^-` this seems too limited. There is not a clear justification for this definition; maybe some paper in the literature explains it, but a simple search "regular linear logic" didn't uncover it. Cartesian types --------------- @@ -257,7 +269,7 @@ In addition, Filinski adds uniqueness: :math:`w : A \vdash 1` and :math:`c : A \ We can similarly define co-cartesian types :math:`A` that can be copied and discarded on the right with theorems :math:`w : 1 \vdash A` and :math:`c : A \otimes A \vdash A` and are a commutative monoid. -I'm not sure if there is a simple description of all cartesian types, but of course :math:`\bang/\whim` are cartesian/co-cartesian, and :math:`Sigma [ \vec A_i - \vec B_i]` and and :math:`\Pi [ \vec A_i \multimap \vec B_i ]` are cartesian/co-cartesian if :math:`A_i` are all cartesian and :math:`B_i` are all co-cartesian. So ADTs in general - booleans, integers, lists, trees - are all cartesian. Our earlier example :math:`C` is in general not cartesian because, although copy and discard can be proven, it uses the included functions and those won't necessarily satisfy the laws. +I'm not sure if there is a simple description of all cartesian types, but of course :math:`\bang^\pm` are cartesian/co-cartesian, and :math:`𝕁^+ [ \vec A_i \vdash \vec B_i]` and and :math:`𝕁^- [ \vec A_i \vdash \vec B_i ]` are cartesian/co-cartesian if :math:`A_i` are all cartesian and :math:`B_i` are all co-cartesian. So ADTs in general - booleans, integers, lists, trees - are all cartesian. Our earlier example :math:`C` is in general not cartesian because, although copy and discard can be proven, it uses the included functions and those won't necessarily satisfy the laws. Filinski says in 3.2 that all cartesian types are positive. Hence, for cartesian / co-cartesian types, because the proof of polarity is natural, we can be generous with exponentials and use them whenever we have such a type, without fear of changing the program semantics. This allows a more faithful representation of copy operations in the proof structure, avoiding implicit copying such as ``\x -> case x of True -> (True,True); False -> (False,False)``. @@ -291,6 +303,8 @@ We explicitly list the identity theorem where it is needed: Cut elimination for quantifiers is sound because the number of quantifiers in the sequent decreases. +The assignment of :math:`\forall` as positive is logical, based on analogy with :math:`\bang^+`. For example, we can prove :math:`A \vdash \bang^+ \forall x. A` for closed positive propositions :math:`A`. This is shown as a cut on the assumption :math:`A \vdash \bang^+ A` and on the easily provable :math:`\bang^+ A \vdash \bang^+ \forall x. A` + Logic translations ================== @@ -300,11 +314,11 @@ The translation from intuitionistic logic to linear logic decorates every propos .. 
math:: - \left[\prod \limits_{i} \left(\overrightarrow{A_i} \multimap \overrightarrow{B_i}\right)\right]_I &= \prod \limits_{i} \left(\overrightarrow{\bang\left[A_i\right]_I} \multimap \overrightarrow{\bang\left[B_i\right]_I}\right) + \left[𝕁^- \limits_{i} \left(\overrightarrow{A_i} \multimap \overrightarrow{B_i}\right)\right]_I &= \prod \limits_{i} \left(\overrightarrow{\bang\left[A_i\right]_I} \multimap \overrightarrow{\bang\left[B_i\right]_I}\right) - \left[\sum \limits_{i} \left(\overrightarrow{A_i} - \overrightarrow{B_i}\right)\right]_I &= \sum \limits_{i} \left(\overrightarrow{\bang\left[A_i\right]_I} - \overrightarrow{\bang\left[B_i\right]_I}\right) + \left[𝕁^+ \limits_{i} \left(\overrightarrow{A_i} - \overrightarrow{B_i}\right)\right]_I &= \sum \limits_{i} \left(\overrightarrow{\bang\left[A_i\right]_I} - \overrightarrow{\bang\left[B_i\right]_I}\right) -We can translate classical logic into intuitionistic logic by decorating every proposition and subproposition with :math:`\neg\neg` and moving the right to the left with another negation, i.e. :math:`\Gamma \vdash \Delta \Rightarrow \Gamma', \neg \Delta' \vdash`. Thus the translation of classical logic into linear logic decorates like :math:`\neg \bang (\neg \bang A) \equiv \whim \bang A`. +We can translate classical logic into intuitionistic logic by decorating every proposition and subproposition with :math:`\neg\neg` and moving the right to the left with another negation, i.e. :math:`\Gamma \vdash \Delta \Rightarrow \Gamma', \neg \Delta' \vdash`. Thus the translation of classical logic into linear logic decorates like :math:`\neg \bang^+ (\neg \bang^+ A) \equiv \bang^- \bang^+ A`. These two decoration translations preserve proof structure, in the sense that every intuitionistic/classical proof tree can be converted to a linear logic proof tree, and the reverse as well if the linear logic proof tree's sequent is the result of the proposition translation. There are other "uniform" translations, like in :cite:`danosStructureExponentialsUncovering1993`, but they aren't as simple. @@ -329,7 +343,7 @@ The naive set theory definition comes from :cite:`shirahataLinearSetTheory1994` Paradoxes --------- -It seems from playing with some examples that forbidding circular definitions is sufficient to prevent Russell's paradox and Curry's paradox. For example with :math:`R = \{x\mid \whim \neg(x \in x)\}`, :math:`\{x\mid F \} \in R` is defined (and provable) but :math:`R \in R` is circular hence not defined. So we cannot write the premise of Russell's paradox. We could try to work around this with a proposition like :math:`t\in R \land t= R`. This is not circular, but it is not sufficient to derive a paradox, as in order to apply contraction we have to use a substitution that produces :math:`R \in R`. Curry's paradox :math:`X=\{x\mid x\in x \to Y\}` similarly contains :math:`x\in x` and hence the notation :math:`X \in X` is circular and therefore forbidden as well. +It seems from playing with some examples that forbidding circular definitions is sufficient to prevent Russell's paradox and Curry's paradox. For example with :math:`R = \{x\mid \bang^- \neg(x \in x)\}`, :math:`\{x\mid F \} \in R` is defined (and provable) but :math:`R \in R` is circular hence not defined. So we cannot write the premise of Russell's paradox. We could try to work around this with a proposition like :math:`t\in R \land t= R`. 
This is not circular, but it is not sufficient to derive a paradox, as in order to apply contraction we have to use a substitution that produces :math:`R \in R`. Curry's paradox :math:`X=\{x\mid x\in x \to Y\}` similarly contains :math:`x\in x` and hence the notation :math:`X \in X` is circular and therefore forbidden as well. More formally, suppose the logic is inconsistent, i.e. there is a derivation :math:`\vdash \bot`. This must be derived from the empty sequent. Then what does cut elimination do? Either it completes, in which case we get a contradiction because no rule derives the empty sequent, or there's some infinite chain of cut elimination. I claim the infinite chain only happens if there is a circular definition involved and the identity or substitution rules are on the sides. Hence, forbidding circular definitions in the identity and cut rules solves the issue. @@ -349,7 +363,7 @@ Equality The axioms of reflexivity, substitution, etc. can take a variety of modalities as in :cite:`coniglioEqualityLinearLogic2002`, some of them corresponding with intuitionistic and classical notions of equality. For sets we use linear weak extensional equality. Alternatively we could use intuitionistic equality :math:`A\overset{!}{=}B \defeq !(A=B)`, then substitution is :math:`A\overset{!}{=}B, \phi \vdash \phi[A/B]`. But the linear equality seems more useful. -Proof of the substitution property: For :math:`\Pi` we use the right rule to split into cases for each tag, then we use contraction/weakening on :math:`\bang(A=B)` to match the number of A's/B's in the case, then the left rule to split into each A and B, giving each branch a copy of the hypothesis. :math:`\Sigma` is similar but with the left first. For exponentials, quantifiers, and set comprehension we simply do left/right in the correct order. Then at the end we use the hypothesis to change :math:`A[x/a]` on the left or right to :math:`B[x/b]`, or else weakening to remove the hypothesis followed by the identity. +Proof of the substitution property: For :math:`𝕁^-` we use the right rule to split into cases for each tag, then we use contraction/weakening on :math:`\bang(A=B)` to match the number of A's/B's in the case, then the left rule to split into each A and B, giving each branch a copy of the hypothesis. :math:`𝕁^+` is similar but with the left first. For exponentials, quantifiers, and set comprehension we simply do left/right in the correct order. Then at the end we use the hypothesis to change :math:`A[x/a]` on the left or right to :math:`B[x/b]`, or else weakening to remove the hypothesis followed by the identity. Recursion and infinite structures ================================= @@ -376,55 +390,21 @@ Alternately let can be encoded as a record and recursion via a fixed-point combi fix_poly fl = fix (\self -> map ($ self) fl) -To implement ``fix`` we can use the variant of the Y combinator :math:`\lambda f.(\lambda x.x x) (\lambda x.f (x x))`. To type it we need the cyclic/recursive type :math:`Wr = \Pi[(^w, Wr, r)]` (in the sense of an infinite, regular tree). BOHM uses a fan/duplication node combined with a loop. +To implement ``fix`` we can use the variant of the Y combinator :math:`\lambda f.(\lambda x.x x) (\lambda x.f (x x))`. To type it we need the cyclic/recursive type :math:`Wr = 𝕁^-[(^w, Wr, r)]` (in the sense of an infinite, regular tree). BOHM uses a fan/duplication node combined with a loop. 
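As a concrete sketch in plain Haskell (using a recursive ``let`` for ``fix``, since the untyped Y combinator does not typecheck directly), ``fix_poly`` applied to a mutually recursive even/odd pair looks like this::

    -- Ordinary fixed point of a single function.
    fix :: (a -> a) -> a
    fix f = let x = f x in x

    -- Fixed point of a list of mutually recursive definitions:
    -- each definition is given the whole list of results.
    fix_poly :: [[a] -> a] -> [a]
    fix_poly fl = fix (\self -> map ($ self) fl)

    -- The "record" here is just a two-element list [even', odd'].
    evenOdd :: [Integer -> Bool]
    evenOdd = fix_poly
      [ \[_, odd']  -> \n -> n == 0 || odd'  (n - 1)   -- even
      , \[even', _] -> \n -> n /= 0 && even' (n - 1)   -- odd
      ]

    -- prints [True,False]
    main :: IO ()
    main = print (map (\f -> f 10) evenOdd)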
-Graph reduction ---------------- - -A better method, following :cite:`jonesImplementationFunctionalProgramming1987` chapter 12, is to -In Stroscot, we instead simply allow (regular) infinite proof trees. We construct "infinite" as a terminal coalgebra - our proof trees turn into fixed points of systems of formal equations :cite:`karazerisFinalCoalgebrasAccessible2011`, as opposed to actually being infinite. We represent the system of equations explicitly with the use-def rules, which also allow naming computations. Although the regular restriction means we can't represent some values directly such as the Fibonacci sequence, most of the types we care about are regular, like the lambda calculus or lists, and similarly some infinite values like ``x = 1 : x``. +A better method, following :cite:`jonesImplementationFunctionalProgramming1987` chapter 12, is to simply allow (regular) infinite proof trees. We construct "infinite" as a terminal coalgebra - our proof trees turn into fixed points of systems of formal equations :cite:`karazerisFinalCoalgebrasAccessible2011`, as opposed to actually being infinite. We represent the system of equations explicitly with the use-def rules, which also allow naming computations. Although the regular restriction means we can't represent some values directly such as the Fibonacci sequence, most of the types we care about are regular, like the lambda calculus or lists, and similarly some infinite values like ``x = 1 : x``. Infinite structures can be paradoxical, e.g. we can prove :math:`\vdash\bot` using cut on the proposition :math:`A=\neg A`. Cut elimination will often fail to complete, but there is a progress property in the sense that the cut can always be pushed down and eliminate an identity rule or two matching logical rules. - is better than because it allows the graph reduction method used in GHC to work. - - -These probably aren't needed, the use-def and infinite structures and types encode recursion better and we can use GHC's graph reduction model (below). - - - give each definition node a static integer. Then the root is a distinguished definition. Assuming the static data is stored on disk and paged in/out as needed, we can minimize runtime memory use in a compiler pass by introducing as many use-def indirections as possible, one for every sequent in the derivation. This also makes the connections between rules uniform. But having lots of indirections is inefficient so a later pass would remove indirections that will be immediately used (chunkification). - -The optimal fixedpoint algorithm outlined in :cite:`shamirFixedpointsRecursiveDefinitions1976` (10.18, PDF pages 240-242) is a variation of Tarjan's strongly connected component algorithm. Cuts between two definitions ``f x`` are memoized in a list, and if the SCC algorithm finds a component ``f x -> let g = ... in g (f x)`` then this component is solved. If it has a unique solution then that's the answer, otherwise ``f x`` diverges and is replaced with a ``RecursionError`` or ``AmbiguousError``. We assume the solver allows uninterpreted "holes", so that the SCC can be solved before its sub-computations. - -For comparison, to compute the least fixed point we would maintain a "working graph" and incrementally unfold the definition when encountered. But with the optimal fixed point we first reduce the definition to a value while copying other definitions in. 
- -The solver is an SMT solver on the predicate ``SAT(y == g y)``, and for uniqueness ``UNSAT(y == g y && y != y0)`` where ``y0`` is the first solution found. We exclude error values as possible solutions since the recursion error will be more informative. - -The posets the paper uses appear to be pointed directed-complete partial orders `(cppo's) `__. - -Hashing -======= - -To hash the graphs we can use the tree structure of the sequent derivations. Each upward slot in a node is hashed with a fixed value and each downward slot is hashed with a value corresponding to the path through the derivation tree followed by the label of the upward slot. It is written as a single DFS traversal with the leaves as base case that stores the hashed subtree and a map from edge name to partial path. - -Hashing infinite graphs is harder, we have to hash each SCC as a unit. See :cite:`mauborgneIncrementalUniqueRepresentation2000`. - -Primitives -========== - -Primitives (integers) can be handled by hacking special cases into Cut; we add primitive functions of type PiR that use the arguments provided by PiL during a cut, and also literals, special values of type SigmaR. Alternately we can use a specialized proof trees: 64-bit integers are represented as a sigma type with 2^64 possibilities. So addition is represented as a case expression, where each case contains another case expression, and then each case constructs the integer corresponding to the addition. There is a lot of fan-out at each step, which would require 2^128 values to represent, clearly infeasible. So although this is the conceptual representation, the actual representation has no fan-out for the cases - instead the case nodes create symbolic variables ``a`` and ``b``, and the constructed value has the tag ``a+b``. - Confluent reduction =================== -Reduction of our linear logic trees is not confluent, but only because of commuting cuts. If we drop the black edges and only consider proof nets, then the system is confluent. A cut only interacts with other cuts at identity rules, but with a cut-identity-cut pattern it doesn't matter which cut reduces with the identity. (TODO: prove this formally) - -Since reduction is confluent, it does not change anything to reduce in non-normal order for a time. The reduction will still terminate when going back to normal order. So terminating reductions can always be performed and even non-terminating reductions can be reduced somewhat. Hence during compilation we want to reduce the program as much as possible - ideally the compiled core should be cut-free. We can detect diverging terms and replace them with error terms. But we can't eliminate cuts involving complex recursion, so have to create a heap or a stack allocation. For example the Fibonacci list ``let fibs = 0 :: 1 :: zipWith (+) fibs (tail fibs) in { repeat forever { n <- readInt; print (fibs !! n) } }``, this needs some kind of reduction graph or memo stack involved. +Reduction (cut-elimination) of our linear logic trees is not confluent, but only because of commuting cuts. If we drop the black sequent parent-child edges, and only consider proof nets, then the system is confluent. A cut only interacts with other cuts at identity rules, but with a cut-identity-cut pattern it doesn't matter which cut reduces with the identity. 
(TODO: prove this formally) Levels ====== -For the implementation of optimal reduction we can add level indices to the terms in the promotion and dereliction rules of :math:`\bangc/\whimc`, as in :cite:`martiniFineStructureExponential1995` and :cite:`guerriniTheoreticalPracticalIssues1996`. Conceptually all terms have indices, but we can recover the indices in a proof tree by propagating the indices from the promotion/dereliction rules up/down according to the criteria that the indices involved in all non-:math:`\bangc/\whimc` promotion/dereliction rules must be the same. +For the implementation of optimal reduction we can add level indices to the terms in the promotion and dereliction rules of :math:`\bang`, as in :cite:`martiniFineStructureExponential1995` and :cite:`guerriniTheoreticalPracticalIssues1996`. Conceptually all terms have indices, but we can recover the indices in a proof tree by propagating the indices from the promotion/dereliction rules up/down according to the criteria that the indices involved in all non-:math:`\bang` promotion/dereliction rules must be the same. To handle level indices in infinite trees, we store the difference function ``\a -> a + (j-i)`` and recover the levels by tracing from the root of the derivation tree (which is always level 0) and applying the difference function when encountered. @@ -433,20 +413,19 @@ The level of a context is the maximum of the levels of its terms, 0 if it is emp .. math:: \begin{array}{ccc} - \rule{\bangc\Gamma^i \vdash A^j, \whimc\Delta^i }{\bangc\Gamma^i \vdash \bangc A^i, \whimc\Delta^i}{\bangc}_{j = i+1} - & \rule{\sk{\Gamma^i}, A^i \vdash \sk{\Delta^i} }{\sk{\Gamma^i}, \bangc A^j \vdash \sk{\Delta^i}}{\bangc d}_{j\leq i} - & \rule{\sk{\Gamma}, \overrightarrow{\bangc A, \bangc A, \cdots} \vdash \sk{\Delta} }{\sk{\Gamma}, \bangc A \vdash \sk{\Delta}}{\bangc c_n} + \rule{\bang^+\Gamma^i \vdash A^j, \bang^-\Delta^i }{\bang^+\Gamma^i \vdash \bang^+ A^i, \bang^-\Delta^i}{\bang^+}_{j = i+1} + & \rule{\sk{\Gamma^i}, A^i \vdash \sk{\Delta^i} }{\sk{\Gamma^i}, \bang^+ A^j \vdash \sk{\Delta^i}}{\bang^+ d}_{j\leq i} + & \rule{\sk{\Gamma}, \overrightarrow{\bang^+ A, \bang^+ A, \cdots} \vdash \sk{\Delta} }{\sk{\Gamma}, \bang^+ A \vdash \sk{\Delta}}{\bang^+ c_n} \end{array} .. math:: \begin{array}{ccc} - \rule{\bangc\Gamma^i, A^j \vdash \whimc\Delta^i }{\bangc\Gamma^i, \whimc A^i \vdash \whimc\Delta^i}{\whimc}_{j = i+1} - & \rule{\sk{\Gamma^i} \vdash A^i, \sk{\Delta^i} }{\sk{\Gamma^i} \vdash \whimc A^j, \sk{\Delta^i}}{\whimc d}_{j \leq i} - & \rule{\sk{\Gamma} \vdash \overrightarrow{\whimc A, \whimc A, \cdots}, \sk{\Delta} }{\sk{\Gamma} \vdash \whimc A, \sk{\Delta}}{\whimc c_n} + \rule{\bang^+\Gamma^i, A^j \vdash \bang^-\Delta^i }{\bang^+\Gamma^i, \bang^- A^i \vdash \bang^-\Delta^i}{\bang^-}_{j = i+1} + & \rule{\sk{\Gamma^i} \vdash A^i, \sk{\Delta^i} }{\sk{\Gamma^i} \vdash \bang^- A^j, \sk{\Delta^i}}{\bang^- d}_{j \leq i} + & \rule{\sk{\Gamma} \vdash \overrightarrow{\bang^- A, \bang^- A, \cdots}, \sk{\Delta} }{\sk{\Gamma} \vdash \bang^- A, \sk{\Delta}}{\bang^- c_n} \end{array} - To handle level mismatches we might also need lifting operators. The conditions are unclear. .. math:: @@ -457,4 +436,4 @@ To handle level mismatches we might also need lifting operators. The conditions \rule{\Gamma^i, A^j \vdash \Delta^i }{\Gamma^i, A^i \vdash \Delta^i}{\text{lift}_L}_{j > i} \end{array} -In practice I went with a different approach that generates matching ``Dup`` nodes, so the levels aren't needed. 
\ No newline at end of file +In practice I have had trouble interpreting these levels operationally, it seems the levels have no relationship to actual optimal reduction. \ No newline at end of file diff --git a/docs/Commentary/Memory-Management.rst b/docs/Commentary/Memory-Management.rst index 5f66874..5dd6e7c 100644 --- a/docs/Commentary/Memory-Management.rst +++ b/docs/Commentary/Memory-Management.rst @@ -31,7 +31,7 @@ I guess it is possible that someone will have two versions of a function, one th Pointers ======== -Pointers are the low-level API, they can interface with the OS or other languages (mainly C). I did a study of memory APIs and concluded that memory is best modeled as the global mutable array ``Memory = Map (Word,BitIdx) Status``. The status allows storing metadata, it's `a complex ADT `__ which has various states like unallocated, committed, etc. The array is indexed at the bit level because that's the granularity `Valgrind's Memcheck `__ uses, but most of the status will be the same for a byte or page as the memory allocators / OS operations work at higher granularity. +Pointers are the low-level API, they can interface with the OS or other languages (mainly C). I did a study of Windows/Linux memory APIs and concluded that memory is best modeled as the global mutable array ``Memory = Map (Word,BitIdx) Status``. The status allows storing metadata, it's `a complex ADT `__ which has various states like unallocated, committed, etc. The array is indexed at the bit level because that's the granularity `Valgrind's Memcheck `__ uses, but most of the status will be the same for a byte or page as the memory allocators / OS operations work at higher granularity. It is simple enough to maintain "extra" status bits, and instrument memory functions to check the status of memory before operating. This is essentially what Valgrind does. With this it is possible to identify many common errors, like double free, use after free, access to undefined memory, and null pointer dereferencing. But there is still the possibility of overflowing a buffer into an adjacent allocation, or more generally `type punning `__ by reading some memory as a format it was not written with. These sorts of possibilities are intrinsic to the "big array of bits" model, and many low-level hacks rely on such functionality, so I would say to use references if you want to avoid such things. But of course someone can easily add bounds-checking etc. on top of the basic pointer model as a library. @@ -82,10 +82,15 @@ Eliminating a pointer write requires proving that the address is never read befo CHERI ----- -CHERI pointers are 129-bit, consisting of a 1-bit validity tag, bounds, permissions, object type, and actual pointer. Valid pointers may only be materialized in a register or memory by transforming an initial unbounded pointer obtained from the OS. This means that the simple model of pointers as integers is no longer valid. Instead, a pointer is the combination of an integer address and a capability. The `CHERI C/C++ API `__ represents capabilities as ``void*``, addresses as ``vaddr_t``, and +CHERI pointers are 129-bit, consisting of a 1-bit validity tag, bounds, permissions, object type, and actual pointer. Valid pointers may only be materialized in a register or memory by transforming an initial unbounded pointer obtained from the OS. This means that the simple model of pointers as integers is no longer valid. Instead, a pointer is the combination of an integer address and a capability. 
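+As a way to make this split concrete, here is a minimal Haskell-flavored sketch of a pointer as an (address, capability) pair with a checked load. The field names and the flat, uncompressed capability record are illustrative assumptions only - actual CHERI packs the bounds, permissions, and object type into the 128-bit capability, with the validity tag held out of band:
+
+::
+
+  import Data.Word (Word64)
+
+  data Perm = PermLoad | PermStore | PermExecute deriving (Eq, Show)
+
+  -- illustrative uncompressed capability: validity tag, bounds, permissions
+  data Capability = Capability
+    { capValid :: Bool
+    , capBase  :: Word64   -- lower bound of the authorized region
+    , capTop   :: Word64   -- upper bound (exclusive)
+    , capPerms :: [Perm]
+    } deriving Show
+
+  -- a pointer is an integer address paired with the capability that authorizes it
+  data Pointer = Pointer { ptrAddr :: Word64, ptrCap :: Capability } deriving Show
+
+  data Fault = TagViolation | BoundsViolation | PermissionViolation deriving Show
+
+  -- check whether an n-byte load through the pointer would be allowed
+  checkLoad :: Word64 -> Pointer -> Either Fault ()
+  checkLoad n (Pointer a c)
+    | not (capValid c)                  = Left TagViolation
+    | a < capBase c || a + n > capTop c = Left BoundsViolation
+    | PermLoad `notElem` capPerms c     = Left PermissionViolation
+    | otherwise                         = Right ()
+
+A dereference in this model is "check the capability, then do an ordinary load at the integer address", which is the sense in which the capability travels alongside the address rather than replacing it.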
The `CHERI C/C++ API `__ represents capabilities as ``void*`` and addresses as ``vaddr_t``. I tried to read further, but the model is complicated, essentially implementing a GC to avoid dangling pointers, so I am not sure it will ever become mainstream. +Persistent memory +----------------- + +The pointer API, assembly wrapping, and OS calls cover using persistent memory via standard file APIs or memory-mapped DAX. Memory is volatile while persistent memory is not, so persistent memory is faster storage, not weird RAM. And storage is complex enough that it seems best handled by libraries. Making the memory management system memkind-aware seems possible, like memory bound to NUMA nodes. + References ========== @@ -104,21 +109,6 @@ Foreign operations like OS calls require a pointer to a memory address, because But, in the fallback case of storing a few words, where a memory allocation is appropriate, the reference translates directly to a pointer allocation. The memory is configured to trap on stray user-level access, so that only the compiler-generated code has access. Even in this case, though, the reference's internal value is not the pointer itself, rather there is a more complex strategy of using a "handle" identifier that allows moving the data around after it is allocated. -Representation --------------- - -A lot of languages have a fixed or default memory representation for values, e.g. a C struct, a Haskell ADT, and a Python object are always laid out in pretty much the same way. The more systems-level languages allow controlling the layout with flags, for example Rust has `type layout `__ which allows specifying the size and alignment/padding of its fields, and also C compatibility. But these flags are't really that powerful. Here's some examples of what can't be done: - -* specify the in-memory order of fields differently from their logical order -* turn array-of-structs into struct-of-arrays -* flattening a datatype, like ``Either Bool Int`` into ``(Bool,Int)``, or representing a linked list as a contiguous series of records. -* NaN-boxing and NuN-boxing (`ref `__ `2 `__), representing the JS ``Any`` type as a single 64-bit word. -* parsing network packets into structured data - -Maybe some of these could be addressed by flags, but from the last two, it is clear that we are really looking for a general-purpose memory serialization interface. I looked at `Data.Binary `__, `store `__, and :cite:`delawareNarcissusCorrectbyconstructionDerivation2019` and came up with the most general API I could, ``Write = Alloc (Size,Align) (Addr -> Write) | Store, Store = Map Addr MaskedWord`` and ``Unpack a = Maybe Addr -> Read -> a, Read = Map Addr Word``. This allows masked writes and multiple or fixed allocation addresses, but does not allow failing to read the value back. Also the ``pack`` function allows passing arbitrary side-band data to the ``unpack`` function. - -Maybe though, it is still not general enough, we should just have lens-like functions like ``write : Memory -> a -> Memory`` and ``read :: Memory -> a``. There still need to be constraints though, like that you get back what you wrote and non-interference of writes. - .. _finalizers: Finalizers @@ -306,40 +296,20 @@ Portable mmap: * C++: https://github.com/mandreyel/mio * Rust: https://github.com/RazrFalcon/memmap2-rs +Representation +============== -Model -===== - -For memory management we have to consider values, called objects. Pointers are manually freed and hence don't need to be managed. 
- - - - -An invalidate queue is more like a store buffer, but it's part of the memory system, not the CPU. Basically it is a queue that keeps track of invalidations and ensures that they complete properly so that a cache can take ownership of a cache line so it can then write that line. A load queue is a speculative structure that keeps track of in-flight loads in the out of order processor. For example, the following can occur - - CPU speculatively issue a load from X - That load was in program order after a store to Y, but the address of Y is not resolved yet, so the store does not proceed. - Y is resolved and it turns out to be equal to X. At the time that the store to Y is resolved, that store searches the load queue for speculative loads that have issued, but are present after the store to Y in program order. It will notice the load to X (which is equal to Y) and have to squash those instructions starting with load X and following. - -A store buffer is a speculative structure that exists in the CPU, just like the load queue and is for allowing the CPU to speculate on stores. A write combining buffer is part of the memory system and essentially takes a bunch of small writes (think 8 byte writes) and packs them into a single larger transaction (a 64-byte cache line) before sending them to the memory system. These writes are not speculative and are part of the coherence protocol. The goal is to save bus bandwidth. Typically, a write combining buffer is used for uncached writes to I/O devices (often for graphics cards). It's typical in I/O devices to do a bunch of programming of device registers by doing 8 byte writes and the write combining buffer allows those writes to be combined into larger transactions when shipping them out past the cache. - - -Allocator -========= - -ultimate allocator - steal features from all other allocators. It's one of those well-researched areas where a few percent lives. Substitution isn't really an option but maybe some components could be pluggable. Thread safe but values are pure and references can be determined to be thread-local so lots of optimizations. - -We want to automatically determine the number of allocation regions and their size to maximize locality. - -locate memory leaks - places where allocated memory is never getting freed - memory usage profiling - -Handling OOM gracefully - non-allocating subset of language. Should be enough to implement "Release some resources and try again" and "Save the user's work and exit" strategies. Dumping core is trivial so doesn't need to be considered. - -Layout is usually defined by its size, alignment, padding/stride, and field offsets, but this only specifies the representation of simple flat records. With enumerations, there is the question of how to encode constants. It gets even more complicated with ADTs, like JS's `value type `__, and the choices often impact performance significantly. Finally there is the use of pointers. It complicates the memory management a bit to handle non-contiguous memory layouts, but the algorithms all deal with pointer trees anyway so I don't think it's intractable. +A lot of languages have a fixed or default memory representation for values, e.g. a C struct, a Haskell ADT, and a Python object are always laid out in pretty much the same way. The more systems-level languages allow controlling the layout with flags, for example Rust has `type layout `__ and also C compatibility. Layout is then defined by its size, alignment, padding/stride, and field offsets. 
Now it's great to have a compact representation of the memory layout - but only if you can actually write the memory layout you want using these features. But these flags aren't really that powerful. Here are some examples of what can't generally be done with the current memory DSLs: -The pack/unpack idea is similar to the `store library `__ and the encode/decode functions used by Narcissus :cite:`delawareNarcissusCorrectbyconstructionDerivation2019`. +* specify the in-memory order of fields differently from their logical order +* specifying how to encode enumeration constants (per struct it appears in) +* turn array-of-structs into struct-of-arrays +* flattening a datatype, like ``Either Bool Int`` into ``(Bool,Int)``, or representing a linked list as a contiguous series of records. +* storing some parts via pointer indirections (non-contiguous memory layout) +* NaN-boxing and NuN-boxing (`ref `__ `2 `__), representing the JS ``Any`` type as a single 64-bit word. +* parsing network packets into structured data -Maybe some of these could be addressed by flags, but from the last two, it is clear that we are really looking for a general-purpose memory serialization interface. I looked at `Data.Binary `__, `store `__, and :cite:`delawareNarcissusCorrectbyconstructionDerivation2019`. Narcissus is too complex IMO: :: @@ -367,13 +337,46 @@ Since writing these serialization functions all the time would be tedious, we ca The translation to use pack is pretty simple: every value is wrapped in a call to pack, the result is stored as a tuple ``(cell,unpack)``, and every usage applies unpack to the cell. The translation uses whatever pack is in scope; pack can be overridden like any other implicit parameters. The unpack functions will end up getting passed around a lot, but function pointers are cheap constants, and constant propagation is a thing, so it shouldn't be an issue. +So finally the most general API is ``Write = Alloc (Size,Align) (Addr -> Write) | Store, Store = Map Addr MaskedWord`` and ``Unpack a = Maybe Addr -> Read -> a, Read = Map Addr Word``. This allows masked writes and multiple or fixed allocation addresses, but does not allow failing to read the value back. Also the ``pack`` function allows passing arbitrary side-band data to the ``unpack`` function. Maybe though, it is still not general enough; we should just have lens-like functions like ``write : Memory -> a -> Memory`` and ``read :: Memory -> a``. There still need to be constraints though, like that you get back what you wrote and non-interference of writes. + +Now we also want to allow optimization of the memory representation. Consider some data points - if there is only one possible value, then the compiler should optimize this to a constant and not store it at all. If there are two possible values, the compiler should probably use a boolean flag and again hard-code the values as constants. If the potential values include all values of a given type (and nothing else), then the compiler should use the representation for that type. If the potential values include a given type, and also members of another type, then the compiler should use the most narrowly-defined representation that contains both of those types. And it should consider whether it can choose the representation of the union type so as to minimize the amount of conversion needed for the more commonly used type (as in NaN/NuN-boxing). If the potential values can be anything, then the compiler should use the universal representation.
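+To make the ``Write``/``Unpack`` API above concrete, here is a minimal Haskell-flavored sketch with a single-word ``Bool`` packer as the example. The names, the word-granular maps, and the fixed 8-byte allocation are assumptions for illustration rather than a settled design:
+
+::
+
+  import Data.Word (Word64)
+  import qualified Data.Map as Map
+
+  type Addr  = Word64
+  type Size  = Word64
+  type Align = Word64
+
+  -- a masked store: only the bits set in the mask take effect
+  data MaskedWord = MaskedWord { mwValue :: Word64, mwMask :: Word64 }
+
+  -- pack side: either request an allocation and continue, or emit a batch of stores
+  data Write = Alloc (Size, Align) (Addr -> Write)
+             | Store (Map.Map Addr MaskedWord)
+
+  -- unpack side: given the allocated address (if any) and readable memory, rebuild the value
+  type Unpack a = Maybe Addr -> Map.Map Addr Word64 -> a
+
+  -- example: a Bool packed as one word in a fresh 8-byte allocation
+  packBool :: Bool -> (Write, Unpack Bool)
+  packBool b = (write, unpack)
+    where
+      write = Alloc (8, 8) $ \a ->
+                Store (Map.singleton a (MaskedWord (if b then 1 else 0) maxBound))
+      unpack (Just a) mem = Map.findWithDefault 0 a mem /= 0
+      unpack Nothing  _   = False   -- illustrative default when no allocation was requested
+
+A flattened ``Either Bool Int`` or a NaN-boxed value would be packed the same way, just with a different store map and a different ``unpack`` function, which is where the representation choices above come in.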
+ +The process of fixing the memory representation of a program can be modeled as follows. We start with a program that passes around values. Then we insert conversion operations: on every declaration, we insert a conversion to binary, and on every use, we insert a conversion from binary. As the binary representation is defined so that a read of a write is the identity, this transformation does not change the meaning of the program. Then we additionally write this binary representation to memory on the declaration, and read this binary representation from memory on use. Again this does not change the semantics due to the non-interference of writes property. Although in reality it could change the semantics: maybe a cosmic ray or something could change what we have written. But at this point, our program operates purely on memory and does not have any values floating around. + +Model +===== + +For memory management we have to consider values, called objects. Pointers are manually freed and hence don't need to be managed. + + + + +An invalidate queue is more like a store buffer, but it's part of the memory system, not the CPU. Basically it is a queue that keeps track of invalidations and ensures that they complete properly so that a cache can take ownership of a cache line so it can then write that line. A load queue is a speculative structure that keeps track of in-flight loads in the out of order processor. For example, the following can occur: + + CPU speculatively issues a load from X + That load was in program order after a store to Y, but the address of Y is not resolved yet, so the store does not proceed. + Y is resolved and it turns out to be equal to X. At the time that the store to Y is resolved, that store searches the load queue for speculative loads that have issued, but are present after the store to Y in program order. It will notice the load to X (which is equal to Y) and have to squash those instructions starting with load X and following. + +A store buffer is a speculative structure that exists in the CPU, just like the load queue and is for allowing the CPU to speculate on stores. A write combining buffer is part of the memory system and essentially takes a bunch of small writes (think 8 byte writes) and packs them into a single larger transaction (a 64-byte cache line) before sending them to the memory system. These writes are not speculative and are part of the coherence protocol. The goal is to save bus bandwidth. Typically, a write combining buffer is used for uncached writes to I/O devices (often for graphics cards). It's typical in I/O devices to do a bunch of programming of device registers by doing 8 byte writes and the write combining buffer allows those writes to be combined into larger transactions when shipping them out past the cache. + + +Allocator +========= + +ultimate allocator - steal features from all other allocators. It's one of those well-researched areas where a few percent lives. Substitution isn't really an option but maybe some components could be pluggable. Thread safe but values are pure and references can be determined to be thread-local so lots of optimizations. + +We want to automatically determine the number of allocation regions and their size to maximize locality. + +locate memory leaks - places where allocated memory is never getting freed - memory usage profiling + +Handling OOM gracefully - non-allocating subset of language.
Should be enough to implement "Release some resources and try again" and "Save the user's work and exit" strategies. Dumping core is trivial so doesn't need to be considered. + A derived pointer is a reference plus an offset. When the address and layout of the object is known we can store the derived pointer as the sum of the value address and offset, allowing direct pointer dereferencing. But since the address is known we could also just store the derived pointer as the offset, so it's only useful if computing the sum is necessary and expensive. An object can be treated as an array, N[i] and N.length. The array part of shared memory is necessary because there is a double-word CAS operation on x86 (CMPXCHG16B), and also for efficiency. -Supporting persistent memory: The pointer API, assembly wrapping, and OS calls cover using persistent memory via standard file APIs or memory-mapped DAX. Memory is volatile while persistent memory is not, so persistent memory is faster storage, not weird RAM. And storage is complex enough that it seems best handled by libraries. Making the memory management system memkind-aware seems possible, like memory bound to NUMA nodes. With persistent memory only word-sized stores are atomic, hence the choice of shared memory as an array of words. https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory says that there are in fact atomic x86 load/store instructions on the byte level. diff --git a/docs/Commentary/Modules.rst b/docs/Commentary/Modules.rst index 79034f0..e3f6ac3 100644 --- a/docs/Commentary/Modules.rst +++ b/docs/Commentary/Modules.rst @@ -10,9 +10,9 @@ Modules are reusable components containing code that are themselves first-class Stroscot's module system is based on F#, OCaml, Standard ML, Agda, 1ML, MixML, Backpack, and Newspeak's module systems. (TODO: reread the systems to see if I missed anything. But they're mostly type systems, rather than functionality) -A module is a set of definitions - a set rather than a list because the order is not relevant. Some definitions are declared "visible" and are exposed/exported in the signature. We write: ``export a, b, c`` at the top of the module, with ``a, b, c`` a list of exported names. +A module is a set of definitions - a set rather than a list because the order is not relevant. These definitions use various symbols. Some symbols are declared "visible" and are exposed/exported in the signature. We write: ``export a, b, c`` at the top of the module, with ``a, b, c`` a list of exported names. -Definitions not in the signature are considered "internal". Normally they should not be accessed, but since every rule has an exception Stroscot allows accessing these definitions through the ``_internal`` pseudo-module, like ``a._internal.b``. Thus definitions are always accessible somehow. This avoids Java's issue of reflection allowing access to "private" data. If you really want to lock an API down then you can clear out ``_internal`` by defining an immediately invoked function expression that first computes the module as a value and then returns a separate module that only re-exports the public definitions. But it is probably better to make the methods themselves use a secret password, token, crypto, etc. to prevent unauthorized access. +Symbols not in the signature are considered "internal". 
Normally they should not be accessed, but since every rule has an exception Stroscot allows accessing these symbols through the ``_internal`` pseudo-module, like ``a._internal.b``. Thus definitions are always accessible somehow. This avoids Java's issue of reflection allowing access to "private" data. If you really want to lock an API down then you can clear out ``_internal`` by defining an immediately invoked function expression that first computes the module as a value and then returns a separate module that only re-exports the public definitions. But it is probably better to make the methods themselves use a secret password, token, crypto, etc. to prevent unauthorized access. Modules can be nested within other modules, to an arbitrary depth. Modules can also re-export other modules by importing the modules or by individual copy definitions using the dot syntax, ``a = C.a``. diff --git a/docs/Commentary/Intrinsics.rst b/docs/Commentary/OpPrims.rst similarity index 55% rename from docs/Commentary/Intrinsics.rst rename to docs/Commentary/OpPrims.rst index f6e939b..bce5514 100644 --- a/docs/Commentary/Intrinsics.rst +++ b/docs/Commentary/OpPrims.rst @@ -1,14 +1,39 @@ -Intrinsics -########## +Operational primitives +###################### -Stroscot should work well with existing code written in other languages, either through natively importing and using that code or through easy-to-use bridges or interfaces. +As used in Stroscot, operational primitives refer to the stateful operations that form the building blocks of imperative programs. Examples include memory operations (read, write), foreign function calls, compiler intrinsics, and OS system calls. It would also be possible to call them "primitive operations", but this term is less precise and could be read as including elements of the runtime such as lambda reduction and term rewriting. + +There is an operational interpretation of every expression in Stroscot. For example, the operational interpretation of a value is returning that value. The operational interpretation of addition on two machine integers consists of storing the integers to memory or fixed registers, executing the ``add`` assembly instruction, appropriately handling any error conditions or traps, packaging up the result as a value, and returning it. And so on. Generally, a program may be viewed as assembly instruction sequences interleaved together with higher-level "glue" code. During optimization, one goal is to convert and reduce as much of this "glue" code as possible into assembly. Each switch into "glue" code corresponds to a jump back into the interpreter, with associated overhead. + +Assembly +======== + +Operational primitives are naturally represented as assembly instruction sequences or templates. After all, the CPU cannot execute anything else. So with a built-in assembler, we have a complete set of operational primitives. By the nature of the definition, these are hardware and platform specific. + +Ideally we would model primitives as deterministic functions from input machine state to output machine state. We can use a `CRIU image `__ to model the machine state (at least on Linux - Windows is left as an exercise). This allows more control than a traditional ELF image + exit code, as it captures the complete state of a process in the middle of its execution, like a debugger would. But the behavior of almost all assembly instructions can be affected by unpredictable external factors. 
Therefore we model primitives as functions that take an input CRIU state and an additional "recording" of external factors and produces an output CRIU state. Examples of recorded factors: + +* signal interrupts (any interruptible instruction) +* memory order (for any instruction reading/writing shared memory) +* system call results (for syscalls) +* spurious failures (for load-linked/store-conditional) +* CPU cycle count (rdtsc) +* random numbers (rdrand) +* current CPU / core id (cpuid) +* FPU state (floating-point) +* chip ID (if the instruction outputs undefined or "reserved for future definition" registers) -Hardware operations +Generally speaking, all assembly instructions are deterministic after recording and controlling these factors - otherwise, programs would not execute reliably. For most instructions there is a "default" recording that can be assumed (no signals, no memory interactions, etc.), so it would be possible to formulate these instructions without a recording, but as rdrand etc. are also instructions it is easier to formalize all instructions and instruction sequences as taking a recording. + +Recordings are not just theoretical; there are programs that implement record/replay. They use various techniques, such as ptrace/breakpoints (rr - single threaded), intercepting DLL calls (Replay.io - mainly for JS), dynamic instrumentation (PinPlay, Undo.io - multithreaded), and machine virtualization (research-level; a bit problematic as it requires emulating a whole system). + +When we are optimizing, we often want to replace one instruction sequence with another. For example, we may want to redo the register allocation, or replace an instruction sequence with a faster one. So we need a semantics for these instruction sequences that allows us to determine if two instruction sequences are equivalent, and then we can define operational primitives as equivalence classes of instruction sequences. In general, an instruction sequence may have arbitrary effects, and may be a complete program. So it is easier to think about comparing programs, and then we can define instruction sequences as equivalent if they have the same behavior when embedded in appropriate programs. Conceptually, comparing programs is simple: run the programs and see if they do the same thing. But programs on modern systems have a lot of parts. + +Abstracted assembly =================== -Hardware instructions are a specific series of assembly instructions on a specific platform. This fixes the behavior in all cases quite strongly - we can always set up the machine to a given state and see what it does. Generally speaking the machine can be simulated deterministically as a function from machine state to machine state - otherwise programs would not execute reliably. We can examine emulator projects such as QEMU or a formal ISA semantics to get a good idea of what each instruction does. Due to out-of-order execution the execution time of each instruction is nondeterministic; this is not modeled. +The nature of assembly is that it is a bit messy; we have to deal with register allocation and recordings and so forth. It is more convenient if we assume a fixed calling convention, say that all data for the operation (including the recording or decision to record de novo) is stored and returned in memory. Since all registers/flags/etc. 
can be stored/loaded to memory, and record/replay can be implemented on an instruction level, this does not lose any expressiveness - it merely adds significant overhead to executing the instruction. But in return it means operations work on immutable bitstrings rather than machine states. Generally these bitstrings are of a fixed, known width, such as 1, 8, 16, 32, 64, 80, 128, 256, 512, etc. (for flags, segment registers, general-purpose registers, FPU registers, MMX/SSE/AVX). + -To abstract the ISA we consider each hardware instruction as a deterministic function from inputs to outputs - these functions are called "operations". Operations don't reference registers, the operations all take/return temporaries. Since all registers/flags/etc. can be stored/loaded to memory, temporaries are conceptually an immutable bitstring of a fixed bitwidth. These bitwidths vary by the instruction: x86 uses 1, 8, 16, 32, 64, 80, 128, 256, 512, etc. (for flags, segment registers, general-purpose registers, FPU registers, MMX/SSE/AVX). The precise definition of inputs and outputs requires some care. E.g. for floating-point the FPU state must be considered an input, and RDRAND must have the hardware RNG state as input. Some registers are left in a state that the Intel SDM calls "reserved for future definition" and "undefined" - these must be excluded from outputs, or the precise chip-specific behavior determined and the chip ID considered part of the platform ID. But with suitable munging, operations are well-defined. Operations are exposed in Stroscot as intrinsic functions. This allows using Stroscot's typical syntax. For example the operations corresponding to x86-64 DIV, ADD, and ADC with 64-bit operands look like: @@ -55,15 +80,6 @@ Accessing memory is handled by a separate operation - but in the ISA x86 has com else memory[a] -Ideally we will expose every assembly instruction as a hardware instruction, for our supported platforms. - -Portable operations -=================== - -On non-native platforms, the given assembly instructions for a hardware operation will likely not exist. Generally it is better, rather than directly translating instruction-by-instruction, to create a portable API at a higher level, such as the limb-multiply routine in libGMP. - -So the API structure is that we have "native" modules providing native hardware instructions, which give compile-time errors on attempts to use it on non-native platforms, and a portable library that provides a cross-platform interface using switch statements like ``case platform of A -> implA; B -> implB; ...``. Some hardware operations are ubiquitous, so it makes sense to expose them directly as portable operations. Addition wraps on all 64-bit hardware in the same way so only needs one portable operation. Other instructions like division have differing behavior, so we can provide 0-returning (ARM native) and ``DivideByZero`` exception-throwing (Intel native) division as portable operations. There is also the intersection of these functions with signature ``Int -> Int\{0} -> Int``, which is useful when we want zero-overhead on multiple platforms and can prove that the divisor is non-zero. But ideally the compiler will be able to optimize the conditionals out of the 0-returning/exception-throwing versions, giving the native version without requiring a separate function. - Runtime and OS calls ==================== @@ -82,6 +98,8 @@ The syscalls themselves take / modify C structs. 
So regardless of whether we lin FFI calls ========= +Stroscot should work well with existing code written in other languages, either through natively importing and using that code or through easy-to-use bridges or interfaces. + The semantics of a call are inherently system/ABI dependent, to the point of not being captured in a target triple. The semantics thus have to be described at the call site. But the data format doesn't really matter as the call instruction will most likely be wrapped / generated. Maybe libffi can help. basic FFI types: ``()``, ``bool``, ``int8``, ``int16``, ``int32``, ``int64``, ``float``, ``double``, ``pointer`` @@ -151,3 +169,44 @@ Interop with C/C++ is a good target feature. There are varying approaches (in in * `dragonffi `__ again uses clang but it works by compiling code snippets. This allows the full range of C/C++ to be used. I think the dragonffi approach is the best, since it's the most powerful and least error prone. There is some effort to analyze the result of the compilation and integrate it with the rest of Stroscot, but deep integration with an existing C/C++ compiler seems better than trying to write one from scratch. + + +------- + + +Usually these are modeled using primitive operations, e.g. file descriptors are allocated with the open syscall rather than declaratively as ``{ fd1 = inode 1234 }``. But the more state we model as state, the more powerful our debugging tools get. A traditional debugger has no way to undo closing a file. However, a filestate-aware debugger can reopen the file. The less we view the program as an I/O machine the easier it is to use high-bandwidth interfaces such as io_uring to perform bulk state changes - describing what rather than how is the hallmark of a high-level language. Of course, in most cases the program will use state in a single-threaded manner and it will simply be compiled to the primitive operation API by the automatic destructive update optimization. + + + + operational primitive as a function from input machine state to output machine state. This is actually a function because we can always set up the machine to a given state and see what it does. + + + Generally speaking the machine can be simulated deterministically as a function from machine state to machine state - otherwise programs would not execute reliably. We can examine emulator projects such as QEMU or a formal ISA semantics to get a good idea of what each instruction does. Due to out-of-order execution the execution time of each instruction is nondeterministic; this is not modeled. + +Yes, there are projects and tools that focus on ensuring reproducible execution, particularly by controlling and managing different aspects of the execution process. Some of these projects include: + +1. **rr (Record and Replay Debugger)**: rr is a lightweight tool that enables the recording and deterministic replaying of execution traces of multi-threaded programs. It allows for the precise replication of program execution, helping in the identification and debugging of complex issues. + +2. **Pernosco**: Pernosco provides a cloud-based collaborative debugging platform that allows developers to record, replay, and analyze the execution of complex software systems. It enables teams to collaboratively investigate and debug issues in a reproducible manner. + +3. **Pin Play**: Pin Play is an extension of the Pin dynamic binary instrumentation framework that enables the record and replay of the execution of parallel programs. 
It allows for the deterministic reproduction of thread schedules and memory accesses, aiding in debugging and analysis. + +4. **Deterministic Parallel Java (DPJ)**: DPJ is a programming model and runtime system that emphasizes determinism in parallel and concurrent Java programs. It provides constructs and mechanisms for controlling the execution of parallel threads, ensuring predictable and reproducible outcomes. + +5. **Chaos Engineering Tools**: While not specifically focused on reproducibility, Chaos Engineering tools such as Chaos Monkey, developed by Netflix, and similar tools aim to test the resiliency of systems by inducing controlled failures. These tools can help uncover non-deterministic behaviors in distributed systems, leading to improved reliability and predictability. + +These projects contribute to ensuring reproducible execution by providing tools and mechanisms to control and manage the concurrent execution of threads, handle I/O operations, and manage random number generation, thereby enabling the deterministic and consistent behavior of programs across different runs and environments. + + +Store state +----------- + +Most papers limit themselves to keeping the values of mutable variables in the store. But conceptually the state of a program could include the state of the computer, the stock market, quantum fluctuations, etc. - all information within the chronological past of a program. But practically we are limited to state that we can read and write deterministically. In particular the read operation must satisfy the associative array definition: + +:: + + read k (write j v D) = if k == j then v else read k D + read k emptyStore = MissingValue + +So one constraint to be a variable is that the state must be accessible. So for example the kernel limits us - we do not have full control over peripheral devices or processes not related to ours. We can represent this by shadowing access-controlled variables and returning ``WriteFailed`` for inaccessible variables. + diff --git a/docs/Commentary/Parsing.rst b/docs/Commentary/Parsing.rst index b40df13..6601293 100644 --- a/docs/Commentary/Parsing.rst +++ b/docs/Commentary/Parsing.rst @@ -136,6 +136,14 @@ It is also important to have some kind of demand-driven / streaming interface. P Another idea Per Vognsen brings up is that, for indexing, the full file-level AST is not needed. Only some relevant subsection needs to be parsed, and the rest just needs start/end of each declaration but no interior detail. He proposes "random access" to the AST, building a dictionary of name -> (src span, decl kind). But his performance hacks assume all legal top-level declarations must have a keyword in column 0, which isn't necessarily true +For purposes of the language server, we want several properties of the syntax tree: + +* holds all the source information in full fidelity - can reconstruct source text +* best-guess parse tree if the program is incomplete or malformed, representing skipped or missing tokens in the syntax tree. +* immutable and thread-safe, persistent data structure +* nodes: grammatical construct (declarations, statements, clauses, and expressions), lexical token (keywords, identifiers, literals, and punctuation), trivia (whitespace, comments, and preprocessor directives) + * structure: constructs contain constructs and tokens (never leaf node). tokens are always leaves. trivia is stored as leading/trailing attribute of tokens. +* stores source position spans (beginning/end position). 
can be bytes, codepoints, line/column. The convention is that zero-length is between two characters. Compilation =========== diff --git a/docs/Commentary/Profiling.rst b/docs/Commentary/Profiling.rst index 56253aa..3d4870f 100644 --- a/docs/Commentary/Profiling.rst +++ b/docs/Commentary/Profiling.rst @@ -163,52 +163,44 @@ Benchmarking A good suite of benchmarks include microbenchmarks like optcarrot, small nontrivial programs like the language shootout tasks, and large applications like a production webserver. All of them are useful for detecting regressions. -There are several timing methods: +What is the ideal length of a benchmark? :cite:`suhEMPExecutionTime2017` measured programs of different runtimes - since the tasks were very similar, it is reasonable to try to fit standard deviation / runtime as a function of runtime. Going through the equations in Google Sheets and EMPv5 data (Table XIX), the linear law has R^2=0.84, overestimates all deviations up to 1000s, and barely fits the few long runs. The quadratic has R^2 = 0.931 and fits the data past 4s pretty well but again overestimates the 1,2,4s point. Also it makes little sense as a function, why would the deviation decrease past 15,000 s? The exponential and log functions are terrible. Finally, the power law has R^2 = 0.861. On a log-log graph it has a decent pattern of over/undershoot - it goes exactly through the 1s point, underestimates the next few, over estimates the next few, and finally underestimates the least few. The graph is mostly flat from 10s to 512s, so perhaps the first few results were unusually good and the last were affected by undiscovered sources of large errors. Plotting max-min as a function of duration (again log-log) we see this is the case - 1s is unusually low, then there is a mostly linear patch up to 1000s, then 2000-16,000s jumps sharply upwards. Again the power series is the best fit, R=0.817, by a long shot, with the next highest being the quadratic with R^2=0.572. Limiting just to 4s-1024s the fit is improved and the residuals look random for a power fit to max-min. For the standard deviation, excluding the outliers, a linear fit actually looks better than a power law. So we conclude that an ideal benchmark is 4s-1024s. And practically, shorter is better, so aiming for 4-8s is probably best. In particular the standard deviation is 23.634 t^0.505 without noise mitigation and 0.589241 t^0.342 with noise mitigation, where t is the runtime in seconds. -If we can modify the program, we can use the rdtsc machine instruction to obtain the num- -ber of clock cycles required by the program (a clock cycle on modern machines is a fraction of a -nanosecond), which is quite repeatable. If we want ET (the three at the bottom right in the figure), -there are a variety of timing methods available on Linux, to be examined shortly. -If the program is not modifiable, such as if one is measuring a proprietary program, ET -can be measured by an obvious timing tool such as time or Java’s popular timing API -(System.currentTimeMillis()). Using ET, however, turns out to yield variability of about -0:9%, which is not much better than what the sophisticated TTP achieves on programs with I/O. -(The results given here and in the figure for execution time measurement protocol (EMP) concern -INC8, which as we will describe shortly is a compute-only program running for about 8 s.) -The purpose of this article is to show how to reduce this variability by over an order of magni- -tude, to under 2 ms for INC8, or 0:02%. 
As we discuss in Section 8, more accurate measurements -can produce better prediction models for execution time and can also reveal previously undetected -phenomena within the operating system. -Relying on ET measurement methods, therefore, may not be appropriate in circumstances in -which it is important to know exactly how much actual time was spent only for the process. What -is needed is a comprehensive timing protocol that provides both high resolution and low overhead, -while eliminating extraneous factors. +There are several metrics: - such as the CPU frequency scaling. They are noisy, to the level of 2-3%, and so must be analyzed using statistical techniques for most improvements (in the initial phase, when 10%+ speedups are easy to obtain, this can probably be skipped). Some optimizations have small effects and it requires a lot of repeated runs to detect whether they are an improvement, but this doesn't mean the optimization is not useful. It may show good effects if you specifically write a microbenchmark for the optimization. The effects may show up more clearly in other metrics, such as CPU cycle count, machine code size, or amount of time spent in optimized code. There are techniques for mitigating/normalizing noise. I had a paper but lost it, it linked to a Github repo with a harness that looked interesting, but there are many papers exploring this area. +* clock cycles, as measured by the rdtsc machine instruction or logic analyzer +* wall clock time, as measured by stopwatch, the time command, or Java’s System.currentTimeMillis() +* machine code size +* one of the above for a specific program part, such as a method (as measured by profiling or sampling) -:cite:`suhEMPExecutionTime2017` says to deactivate as many daemons as possible, activate the NTP daemon, lock the frequency and voltage, use an up-to-date Linux version, and discard runs during which a daemon ran for longer than some cutoff (as measured by increased process time). +Different methods have different resolution (reporting capability), granularity (division of report into program parts), precision (variation betwen runs), and accuracy (closeness to "actual value", a difficult concept to define in benchmarking). -What is the ideal length of a benchmark? :cite:`suhEMPExecutionTime2017` measured programs of different runtimes - since the tasks were very similar, it is reasonable to try to fit functions of runtime. Going through the equations in Google Sheets and EMPv5 data (Table XIX), the linear law has R^2=0.84, overestimates all deviations up to 1000s, and barely fits the few long runs. The quadratic has R^2 = 0.931 and fits the data past 4s pretty well but again overestimates the 1,2,4s point. Also it makes little sense as a function, why would the deviation decrease past 15,000 s? The exponential and log functions are terrible. Finally, the power law has R^2 = 0.861. On a log-log graph it has a decent pattern of over/undershoot - it goes exactly through the 1s point, underestimates the next few, over estimates the next few, and finally underestimates the least few. The graph is mostly flat from 10s to 512s, so perhaps the first few results were unusually good and the last were affected by undiscovered sources of large errors. Plotting max-min as a function of duration (again log-log) we see this is the case - 1s is unusually low, then there is a mostly linear patch up to 1000s, then 2000-16,000s jumps sharply upwards. 
Again the power series is the best fit, R=0.817, by a long shot, with the next highest being the quadratic with R^2=0.572. Limiting just to 4s-1024s the fit is improved and the residuals look random for a power fit to max-min. For the standard deviation, excluding the outliers, a linear fit actually looks better than a power law. So we conclude that an ideal benchmark is 4s-1024s. And practically, shorter is better, so aiming for 4-8s is probably best. +Many variables affect precision and add noise. Often, we can control the variable to a fixed value, and remove its variance. But this may mean our results are unrealistic; for example frequency scaling is not usually turned off. A more clever approach is to measure the variable or proxies for the variable, and characterize the runtime as a function of the variable. This does generally require more benchmarking runs in order to gather the necessary statistical power. And even with many variables measured or controlled, there will still be noise. Still, it brings up what we want to measure: average? minimum? some theoretical ideal? +In the initial stages of optimization, no statistical profiling is needed as 10%+ speedups are obvious. But as optimizations become more complex, speedups become smaller and powerful statistical techniques are of great help. Generally the task must be at least 5-10x larger than the measurement precision to get useful results without statistics. Thus, if the task is 10 msec, then the measurement technique must have better than 1 to 2 msec precision. This is only a rule of thumb; more precision is always better, but less precise measurements can be compensated for by doing more of them and analyzing statistics. - With this, the standard deviation goes from 23.634 t^0.505 to 0.589241 t^0.342 where t is the runtime in seconds, a 40x absolute improvement for 1s programs along with a 1.5x improvement in the nonlinear factor. -* +Variables include: + +* CPU frequency and voltage scaling (proxy: performance counters :cite:`snowdonAccurateRuntimePrediction2007`) +* execution of other programs/daemons (load) +* CPU affinity +* OS version +* CPU temperature +* filesystem cache, swap usage +* processor-specific optimizations +* Environment block contents. It contains strings which are generally consistent on one system but vary across systems, such as the username. It can lead to certain locations being aligned or misaligned, causing significant variations at run-time. It is mainly the size of the env block that matters. +* ASLR - can be disabled globally or per-process basis with ``setarch -R``. But e.g. command-line arguments affect memory layout even when ASLR is disabled. Also the seed cannot be controlled. Most people run with ASLR enabled. + -Probably it is better to do a regression of daemon runtimes and frequency scaling vs. out * stabilizer - this randomizes memory layout, allowing the measurement of a layout-independent performance. Of course, if one is optimizing the memory layout then randomization is counterproductive. 
* `nanoBench `__ - this has a kernel module to allow reading hardware performance counters for microbenchmarks without incurring much overhead * `benchExec `__ - this is very similar to the tool I remember, but focused on limiting resources rather than getting precise measurements * `hyperfine `__ - this does some basic warmup and statistics, useful as a baseline for the bare minimum of Benchmarking +* `temci `__ sets up an environment -The coarse-grain techniques are generally software-oriented and provide mea- surements with millisecond resolution. They are good for quick estimates of utilization. -The fine-grain techniques are more elaborate and use specialized debugging hardware or logic analyzers, to provide microsecond resolution measurements. - -Execution time can be measured in different ways. -The resolution is the limitation of the timing hardware's reporting capability - for example a stopwatch may report times with 0.01 sec resolution, /proc reports in ticks (microseconds), rtdsc reports in cycles (a few ns), and a logic analyzer might have 50 nsec resolution. There is also the granularity, whether times are reported for the progam, subroutine, loop, line of code, or assembly instruction. -Precision is the variation in measurement from one run to the next. A perfectly precise method would deliver the same result for the same benchmark on every run. Usually precision is measured as the standard deviation of a large number of measurements. Generally, the variations from run to run are much larger than the resolution. -Accuracy is the closeness of the measured value to the actual value. The actual value is generally not known in benchmarking, so it is difficult to measure accuracy. Generally it is assumed that the timing method is not systematically biased, so it can be estimated as the precision. +profiling: -Generally the task must be at least 5-10x larger than the measurement precision to get useful results. Thus, if the task is 10 msec, then the measurement technique must have better than 1 to 2 msec precision. This is only a rule of thumb; more precision is always better, but less precise measurements can be compensated for by doing more of them. +* generate approximate profile of new/modified code, guessing using heuristics +* cold functions, functions executed once, loopless functions, min/max/average number of loop iterations, branch probabilities, values of expressions in program, order of first execution of functions +* AutoFDO profile https://perf.wiki.kernel.org/ -:cite:`snowdonAccurateRuntimePrediction2007` shows that frequency scaling can be accounted for by regression equations on performance counters, but there is still noise on the level of 2-7%. \ No newline at end of file diff --git a/docs/Commentary/Programs.rst b/docs/Commentary/Programs.rst index 0e3199e..9ce567a 100644 --- a/docs/Commentary/Programs.rst +++ b/docs/Commentary/Programs.rst @@ -1,7 +1,7 @@ Exemplary programs ################## -A programming language is no good unless some significant programs have been written in it. This is mainly so that dark corners of the language have been explored and debugged, but there is also a "halo effect" of "I like program X, therefore I like programming language L that X is written in". Also, every popular PL solves a specific set of programming problems really well, in specific domains. So get the set of problems right, and the design for the PL will reveal itself naturally. Or solve all the programming problems and win the game. 
Once Stroscot gets users it can switch to problem-driven design: work with real-world users who have problems, abstract and implement solutions for them. +A programming language is no good unless some significant programs have been written in it. This is mainly so that dark corners of the language have been explored and debugged - unit tests don't work nearly as well as real applications. Such programs also act as a showcase for your language, making it easier for potential users to develop an understanding of what an average project in your language might look like. There is also a "halo effect" of "I like program X, therefore I like programming language L that X is written in". Also, every popular PL solves a specific set of programming problems really well, in specific domains. So get the set of problems right, and the design for the PL will reveal itself naturally. Or solve all the programming problems and win the game. Once Stroscot gets users it can switch to problem-driven design: work with real-world users who have problems, abstract and implement solutions for them. * Fibonacci: The Best Benchmark Ever. Every new language announce uses it. Bonus points if recursive is faster than iterative. diff --git a/docs/Commentary/Reduction.rst b/docs/Commentary/Reduction.rst index 4f7f58d..24c99fe 100644 --- a/docs/Commentary/Reduction.rst +++ b/docs/Commentary/Reduction.rst @@ -10,6 +10,36 @@ An extension of graph reduction is optimal reduction. In optimal reduction, ter Although BOHM and Lambdascope have created interaction net machines where a Levy family beta-reduction is an atomic operation, it does not seem that these interaction net machines are actually that efficient. In :cite:`guerriniOptimalImplementationInefficient2017` they find a quadratic slowdown of the fan-fan swap interactions, relative to normal order reduction on lambdas. Furthermore in :cite:`aspertiParallelBetaReduction2001` they find that a Levy family beta-reduction step can take time that is not elementary recursive. It seems like an easier path to optimal reduction is to track the families directly and perform family reductions individually using an efficient non-sharing algorithm such as :cite:`shiversBottomupVreductionUplinks2004` or a limited-sharing algorithm such as graph reduction. Such an approach gives up the exponential speedups of sharing contexts, but has a much more understandable cost model. It is still "optimal" in the sense that outermost family reduction performs the minimal number of family reductions, and the overall reduction follows this strategy, hence the only inefficiency is the need to evaluate some unnecessary individual reductions as part of the family. Potentially, there is the possibility to use the homogeneity of the redexes within a family to speed up within-family reduction, but most likely not. +Since reduction is confluent, it does not change anything to reduce in non-normal order for a time. The reduction will still terminate when going back to normal order. So terminating reductions can always be performed and even non-terminating reductions can be reduced somewhat. Hence during compilation we want to reduce the program as much as possible - ideally the compiled core should be cut-free. We can detect diverging terms and replace them with error terms. But we can't eliminate cuts involving complex recursion, so have to create a heap or a stack allocation. 
For example the Fibonacci list ``let fibs = 0 :: 1 :: zipWith (+) fibs (tail fibs) in { repeat forever { n <- readInt; print (fibs !! n) } }``, this needs some kind of reduction graph or memo stack involved. + + +Def/use +======= + +Graph reduction using def/use is not hard - assign each definition node a static integer identifier. Then the root is a distinguished definition, which is mutable, while everything else is copied into the root graph when reduction goes past its use point. Assuming the definition data is stored on disk and paged in/out as needed, we can minimize runtime memory use in a compiler pass by introducing as many use-def indirections as possible, one for every sequent in the derivation. This also makes the connections between rules uniform. But having lots of indirections is inefficient so a later pass would remove indirections that will be immediately used (chunkification). + +The optimal fixedpoint algorithm outlined in :cite:`shamirFixedpointsRecursiveDefinitions1976` (10.18, PDF pages 240-242) is a variation of Tarjan's strongly connected component algorithm. Cuts between two definitions ``f x`` are memoized in a list, and if the SCC algorithm finds a component ``f x -> let g = ... in g (f x)`` then this component is solved. If it has a unique solution then that's the answer, otherwise ``f x`` diverges and is replaced with a ``RecursionError`` or ``AmbiguousError``. We assume the solver allows uninterpreted "holes", so that the SCC can be solved before its sub-computations. + +For comparison, to compute the least fixed point we would maintain a "working graph" and incrementally unfold the definition when encountered. But with the optimal fixed point we first reduce the definition to a value while copying other definitions in. + +The solver is an SMT solver on the predicate ``SAT(y == g y)``, and for uniqueness ``UNSAT(y == g y && y != y0)`` where ``y0`` is the first solution found. We exclude error values as possible solutions since the recursion error will be more informative. + +The posets the paper uses appear to be pointed directed-complete partial orders `(cppo's) `__. + + +Hashing +======= + +To hash the graphs we can use the tree structure of the sequent derivations. Each upward slot in a node is hashed with a fixed value and each downward slot is hashed with a value corresponding to the path through the derivation tree followed by the label of the upward slot. It is written as a single DFS traversal with the leaves as base case that stores the hashed subtree and a map from edge name to partial path. + +Hashing infinite graphs is harder, we have to hash each SCC as a unit. See :cite:`mauborgneIncrementalUniqueRepresentation2000`. + +Primitives +========== + +Primitives (integers) can be handled by hacking special cases into Cut; we add primitive functions of type PiR that use the arguments provided by PiL during a cut, and also literals, special values of type SigmaR. Alternately we can use a specialized proof trees: 64-bit integers are represented as a sigma type with 2^64 possibilities. So addition is represented as a case expression, where each case contains another case expression, and then each case constructs the integer corresponding to the addition. There is a lot of fan-out at each step, which would require 2^128 values to represent, clearly infeasible. 
So although this is the conceptual representation, the actual representation has no fan-out for the cases - instead the case nodes create symbolic variables ``a`` and ``b``, and the constructed value has the tag ``a+b``. + + Linear logic ============ diff --git a/docs/Commentary/State.rst b/docs/Commentary/State.rst index 6017ccf..6c0a734 100644 --- a/docs/Commentary/State.rst +++ b/docs/Commentary/State.rst @@ -169,7 +169,7 @@ For ``mfix`` vs ``mdo``, there are many implicit laws promoted by the ``mdo`` no Unfortunately, the only known monad satisfying right shrinking is the lazy state monad ``s -> (a,s)`` (and its restrictions, the output monad ``Monoid w => (a,w)``, reader monad ``p -> a``, identity monad ``a``, and trivial monad ``Void``). Erkok has a proof 3.1.6 that the trivial monad ``const Void`` is the only monad satisfying left-strictness ``undefined >>= f = undefined``, right shrinking, strictness 2.1.1, and various other properties. The setup is the above right shrinking rule where ``f xs = return (1 : xs); g xs = case xs of [x] -> return x; _ -> return 1``. He says "It is easy to see that [the LHS mdo block] must yield bottom by the strictness property". Expanding on this, if we start with ``(z,w) = undefined``, then after one loop we get ``z=1:undefined`` and ``g z = undefined``, so the overall function returns ``undefined`` by the left-strictness property, so therefore by strictness 2.1.1 the overall ``mfix`` is also undefined. But of course, if we start with the fixed point of the RHS, ``(z,w) = (repeat 1,1)``, we get that back even in the LHS. So Erkok's proof relies on strictness and ``mfix`` producing the least fixed point. Using similar arguments about bottom, there is a proof that Haskell `cannot have a State monad `__. Really, the discussion should focus on the monad's behavior for defined values and total functions, and not discuss bottom or seq at all. I think it's best to suppose that the premises of the proof are incorrect, but the fact remains that the only known right-shrinkable monad is the lazy state monad. Absent other monads, it seems the lazy state monad is really the canonical implementation of ``MonadFix``, similar to how lists are the only non-trivial implementation of ``MonadZip`` and everything else is just lifting. -But Erkok had to write a thesis, so of course he can't just say "lazy state is MonadFix" and leave it there. Erkok proposes to leave out right shrinking and other properties to obtain a more general definition. The main issue with this is that ``mfix`` for the lazy state monad is no longer unique - there is a "shallow" mfix operations which simply apply the function to the bottom state. Erkok's definition of ``mfix`` for I/O is shallow in this sense. For ADT-style monads, ``mfix`` simply follows the structure of the monad (c.f. `GHC.Generics instances `__). These operations are kind of useful for +But Erkok had to write a thesis, so of course he can't just say "lazy state is MonadFix" and leave it there. Erkok proposes to leave out right shrinking and other properties to obtain a more general definition. The main issue with this is that ``mfix`` for the lazy state monad is no longer unique - there is a "shallow" mfix operations which simply apply the function to the bottom state. Erkok's definition of ``mfix`` for I/O is shallow in this sense. ``mfix`` cannot be implemented in general for ADT-style monads. (c.f. `GHC.Generics instances `__). 
These operations are kind of useful for data U1 p = U1 -- lifted version of () data (:+:) f g p = L1 (f p) | R1 (g p) -- lifted version of Either @@ -326,8 +326,7 @@ Due to the quantification, the operations on ``Ran`` are restricted. In particu We call the values in ``M m`` continuations, and the values in ``m r`` actions. A continuation represents "the future of the program". Executing a continuation plugs this future into a program description with a hole - usually there is one hole, but the continuation can discard the future or run it multiple times. The implementation can compile continuations to jumps under most circumstances and closures otherwise, so the execution model is also conceptually simple. Continuations are the basis in formal denotational semantics for all control flow, including vanilla call flow, loops, goto statements, recursion, generators, coroutines, exception handling, and backtracking. This allows a uniform and consistent interface. Continuations are more powerful than goto. - -``Codensity`` is quite efficient - the case analysis is pushed to the monad's operations, and there is no pile-up of binds - all uses of the underlying monad's bind are right-associated. It converts the computation to continuation-passing style. In particular free tree-like monads :cite:`voigtlanderAsymptoticImprovementComputations2008` and `MTL monad stacks `__ are much cheaper when implemented via Codensity. As a contrary point, in the `case `__ of the Maybe monad an ADT version seemed to be faster than a Church encoding. Unfortunately hpaste is defunct so the code can't be analyzed further. It's not clear if the "CPS" version mentioned was actually Codensity. +``Codensity`` is quite efficient compared to most ADT-style monads - the case analysis is pushed to the monad's operations, and there is no pile-up of binds - all uses of the underlying monad's bind are right-associated. It converts the computation to continuation-passing style. In particular free tree-like monads :cite:`voigtlanderAsymptoticImprovementComputations2008` and `MTL monad stacks `__ are much cheaper when implemented via Codensity. As a contrary point, in the `case `__ of the Maybe monad an ADT version seemed to be faster than a Church encoding. Unfortunately hpaste is defunct so the code can't be analyzed further. It's not clear if the "CPS" version mentioned was actually Codensity. SPJ also suspects that deeply nested continuations will not optimize properly compared to the world tokens. :cite:`meyerovichSocioPLTPrinciplesProgramming2012` mentions that generators and coroutines (one-shot continuations) have been preferred to multi-shot continuations, and if you read :cite:`elizarovKotlinCoroutinesDesign2021`, they say "The main reason for this is believed to be the inherent complexity of the continuation-based code, and the difficulty of making it performant." But here we are simply implementing continuations as lambdas, so there is not really any more complexity added, and it seems safe to assume that an efficient lambda implementation (e.g. using optimal reduction) will also lead to efficient continuations, although perhaps it will need some tweaking. @@ -538,6 +537,8 @@ Haskell uses a state monad ``IO a = s -> (# s, a #))`` for implementing I/O, whe * It is not clear what the token represents: a thread? a core? a state? The semantics of an infinite program like ``x = write "x" >> x`` is tricky to specify - it is not really a function at all. 
* An I/O operation is an abstract function which makes it quite difficult to inspect IO values or implement simulations of I/O such as `PureIO `__. +The one advantage it has (per SPJ) is speed - the world token is 0-bytes so has no calling convention overhead. + Logic programming ================= @@ -621,20 +622,35 @@ Still though, a gap is a gap, so to get performance we must provide laziness or Automatic destructive update ============================ -Although pure programs do not have operators for destructive update, they can still express similar programs using a copying update operation that returns a new data structure with the relevant index modified. For `example `__ counting the frequency of byte values within a block of binary data: +Although non-side-effectful programs do not have operators for destructive update, they can still express similar programs using a copying update operation that returns a new data structure with the relevant index modified. For `example `__ counting the frequency of byte values within a block of binary data: :: - freq (b : Binary) = scanr (\arr x -> update x (+1) arr) (repeat 256 0) b + freq (b : Binary) = foldl (\arr x -> update x (+1) arr) (repeat 256 0) b - -- expands to: +Maybe you are not familiar with ``foldl`` - in the core language, this would expand to a recursive function: + +:: freq (b : Binary) = go b (repeat 256 0) where go (x:rest) arr = go rest (update x (+1) arr) go [] arr = arr -The issue is that a naive implementation of "update" copies the entire array, using O(n) memory and time. :cite:`hudakAggregateUpdateProblem1985` shows that with a compiler analysis (hereafter called "automatic destructive update") a language can provide O(1) update-copy operations. The compiler searches through possible evaluation orders for an evaluation order that never accesses the old version of data after updating, and transforms such "single-threaded" programs to destructively update, giving the speedup. Programming with pure arrays in a "single-threaded" style is at least as expressive as imperative arrays - per Hudak, all the natural translations of imperative algorithms are single-threaded. Some of :cite:`okasakiPurelyFunctionalData1998`'s lazy data structures have a similar single-threaded use condition for amortized good performance, so the single-threaded condition seems reasonable. Also well-defined Ocaml programs that use side effects must be single-threaded, else there is a race condition. +This can be compared with a solution in Java - in Stroscot, rebinding the ``arr`` parameter on each recursive call takes the place of Java's mutable array: + +.. code-block:: Java + + int[] calculateFrequency(byte[] data) { + int[] arry = new int[256]; + for(int i = 0; i < data.length; i++) { + arry[data[i]]++; + } + return arry; + } + + +The issue with the functional version is that a naive implementation of "update" copies the entire array, using O(n) memory and time. :cite:`hudakAggregateUpdateProblem1985` shows that with a compiler analysis (hereafter called "automatic destructive update") a language can provide O(1) update-copy operations. The compiler searches through possible evaluation orders for an evaluation order that never accesses the old version of data after updating, and transforms such "single-threaded" programs to destructively update, giving the speedup. Programming with pure arrays in a "single-threaded" style is at least as expressive as imperative arrays - per Hudak, all the natural translations of imperative algorithms are single-threaded.
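To make the "single-threaded" condition concrete, here is a small Haskell sketch (illustrative only - ``update`` is written as a plain copying function and a list stands in for the array): the first definition never touches the old array after updating it, so every step could be compiled to an in-place write, while the second keeps the old array alive and genuinely needs the copy.

.. code-block:: haskell

   -- A copying update on an immutable "array" (a list, for simplicity).
   update :: Int -> (a -> a) -> [a] -> [a]
   update i f xs = [ if j == i then f x else x | (j, x) <- zip [0 ..] xs ]

   -- Single-threaded: the old version of 'arr' is never used after the update,
   -- so each step could be performed destructively.
   freq :: [Int] -> [Int]
   freq b = go b (replicate 256 0)
     where
       go []         arr = arr
       go (x : rest) arr = go rest (update x (+1) arr)

   -- Not single-threaded: 'arr' is still used after being updated,
   -- so the naive O(n) copy cannot be avoided.
   withAndWithout :: [Int] -> Int -> ([Int], [Int])
   withAndWithout arr x = (arr, update x (+1) arr)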
Some of :cite:`okasakiPurelyFunctionalData1998`'s lazy data structures have a similar single-threaded use condition for amortized good performance, so the single-threaded condition seems reasonable. Also well-defined Ocaml programs that use side effects must be single-threaded, else there is a race condition. Roc and Koka seem to be going down the automatic destructive update route via alias analysis and ref-counting optimizations. It seems like a great optimization and it does not seem too hard to allow marking array uses as single-threaded so the compiler warns if it has to copy. @@ -717,7 +733,7 @@ Most papers limit themselves to keeping the values of mutable variables in the s read k (write j v D) = if k == j then v else read k D read k emptyStore = MissingValue -So one constraint to be a variable is that the state must be accessible. So for example the kernel limits us - we do not have full control over peripheral devices or processes not related to ours. We can represent this by using shadowing access-controlled variables and returning ``WriteFailed`` for inaccessible variables. +So one constraint to be a variable is that the state must be accessible. So for example the kernel limits us - we do not have full control over peripheral devices or processes not related to ours. We can represent this by shadowing access-controlled variables and returning ``WriteFailed`` for inaccessible variables. Conveniently the CRIU project has a `list `__ of what's in the state of a Linux user process. We reproduce it here: diff --git a/docs/Commentary/Symbolic.rst b/docs/Commentary/Symbolic.rst index 1272ccb..c676876 100644 --- a/docs/Commentary/Symbolic.rst +++ b/docs/Commentary/Symbolic.rst @@ -1,42 +1,63 @@ Symbolic computing ################## -Symbolic computation is mainly distinguished by what it is not: it is not "numerical computation" with machine precision numbers, where one arithmetic operation translates to one machine code instruction. Symbolic computation can provide exact solutions, perform algebraic manipulation, prove mathematical properties, and solve equations. In education, it allows using mathematical numbers without complications such as approximation error and number formats. In industry, it allows manipulating and analyzing complex equations in robust ways, and provides more accuracy and precision in cases where it is required. +Symbolic computation is mainly distinguished by what it is not: it is not "numerical computation" with machine precision numbers, where one arithmetic operation translates to one machine code instruction. Symbolic computation can provide exact solutions, perform algebraic manipulation, prove mathematical properties, and solve equations. In education, it allows using mathematical numbers without complications such as approximation error and number formats. In industry, it allows manipulating and analyzing complex equations in robust ways, and provides more accuracy and precision in cases where required. -Numerical computation is typically much faster. In many tasks such as scientific simulations, data analysis, and engineering design, the approximation errors cancel out or are negligible and numerical computation delivers acceptable results in orders of magnitude less time. But often, these numerical equations are obtained with the help of symbolic computation. Many real-world problems require both numerical and symbolic analysis. 
Symbolic computing is thus a valuable tool, and Stroscot would be lacking if it left symbolic computing out of its functionality. +Machine precision numerical computation is typically much faster. In many tasks such as scientific simulations, data analysis, and engineering design, the approximation errors cancel out or are negligible and numerical computation delivers acceptable results in orders of magnitude less time. But often, these numerical equations are obtained with the help of symbolic computation. Many real-world problems require both numerical and symbolic analysis. Symbolic computing is thus a valuable tool, and Stroscot would be lacking if it left symbolic computing out of its functionality. Features ======== -One simple way of characterizing features is by the values (normal forms) they add to the language and then the operations on those values. There is a wide range of values supported by CAS's such as Mathematica, Maple, Magma, and SAGE: symbolic expressions (terms, including symbols, variables, and binders so as to write derivatives, integrals, and summations), numbers to a specified (arbitrary) precision, machine-precision numbers, rational numbers, (computable) real numbers, complex numbers, strings, lists, sets, polynomials, truncated power series, systems of equations, probability distributions, special functions, records, tables, units, arrays, matrices, tensors (including sparse), functions, graphics, images, plots, files, directories, directed and undirected graphs and networks, dates and times, geometric objects, piecewise and interpolated numerical functions, sound and audio data, signal (time-series) data, algebraic number fields/rings/ideals (finite fields, algebraic extensions, and elements thereof), crytographic keys, combinatorial objects (permutations, combinations, and partitions), generating functions, modular forms, p-adic numbers, combinatorial designs (block designs, Latin squares, orthogonal arrays), databases, logical connectives (and, or, not). +One simple way of characterizing features is by the values (normal forms) they add to the language and then the operations on those values. There is a wide range of values supported by CAS's such as Mathematica, Maple, Magma, and SAGE: symbolic expressions (terms - trees, including symbols, variables, and constants, and binders so as to write derivatives, integrals, and summations), numbers to a specified (arbitrary) precision, machine-precision numbers, rational numbers, (computable) real numbers, complex numbers, strings, lists, sets, polynomials, truncated power series, systems of equations, probability distributions, special functions, records, tables, units, arrays, matrices, tensors (including sparse), functions, graphics, images, plots, files, directories, directed and undirected graphs and networks, dates and times, geometric objects, piecewise and interpolated numerical functions, sound and audio data, signal (time-series) data, algebraic number fields/rings/ideals (finite fields, algebraic extensions, and elements thereof), crytographic keys, combinatorial objects (permutations, combinations, and partitions), generating functions, modular forms, p-adic numbers, combinatorial designs (block designs, Latin squares, orthogonal arrays), databases, logical connectives (and, or, not, forall, exists). -Then we need algorithms and operations on these values. For the numbers, it is mainly arithmetic and special functions. 
For expression trees, there is simplification (and related transformations such as distributing, combining, and factoring), applying algebraic rules or equations (pattern matching and substitution, such as automatic differentiation), and equation solving. More complex algorithms such as automatic integration combine these in interesting ways. +Then we need algorithms and operations on these values. For the numbers, it is mainly arithmetic and special functions. For expression trees, there is simplification (and related transformations such as distributing, combining, and factoring), applying algebraic rules or equations (pattern matching and substitution, such as automatic differentiation), and equation solving. More complex algorithms such as automatic integration combine these in interesting ways. Then there are SMT solvers which can take almost any formula and determine if it is satisfiable. -"Robust" numerical evaluation is bounding a result to within a given accuracy or tolerance. This is contrasted with arbitrary-precision or machine-precision evaluation where each step is done within a given accuracy or tolerance, but numerical error may accumulate over the course of the computation. In this latter case, augmenting the computation with interval arithmetic or other error tracking methods may reveal a wide range for the actual answer and the produced answer may be quite different from the actual answer. Generally robust calculations use arbitrary-precision calculations internally but adaptively increase the precision so as to avoid accumulating too much error in intermediate calculations. Robust numerics should be formally verified that they produce accurate results. +Design +====== -:: +Following Stroscot's philosophy of "minimal core", as little as possible of the implementation should be in the language core. We will need to represent symbolic expressions - the easiest thing is to make these syntax trees, so that algorithms are macros that operate on syntax. But of course specific algorithms will operate on restricted subsets of syntax trees. - e = cos(1)^2 + (sin(1)^2 - 1) +Oscar Benjamin has been working on speeding up SymPy and has written about `how he would redesign SymPy `__. It is certainly worth taking his design points into consideration. According to Benjamin, fast symbolic computation depends on implementing three things efficiently: robust numerical evaluation, symbolic manipulation, and computational algebra. - double e # Machine precision (double) floating point - --> 5.551115123125783e-17 +Calculations +============ - mpf e {digits=50} # Arbitrary precision floating point - --> mpf('1.3363823550460978230702682335757409394075984400525243e-51') +Consider the expression ``e = cos(1)^2 + (sin(1)^2 - 1)``. Evaluation can be done in several ways. - robust_evalf e {digits=50} # Robust floating point - --> sympy.Float(-(1 +- 1)*2^(-531)) +* The simplest and most common is machine-precision evaluation, where each operation is done to within a given tolerance (1 ulp usually). Numerical error can accumulate and there is no recourse other than to restructure the computation. ``evalDouble e = double 5.551115123125783e-17`` +* In arbitrary-precision evaluation (mpmath), each operation is done to a specified, configurable tolerance / precision. But error may still accumulate and cause catastrophic cancellation and other issues. This can be addressed by manually increasing the precision. 
``mpf e {digits=50} = mpf('1.3363823550460978230702682335757409394075984400525243e-51')`` +* In interval arithmetic, each number's upper and lower bounds are tracked. Accumulated errors within computations are visible as large bounds relative to the number. The computation can be retried with more precision if it is not sufficiently accurate. ``interval e {digits=3} = {low = 0, high = 1e-3 }`` +* In robust numerical evaluation (sympy.evalf, Arb), the result is guaranteed to be within a given accuracy or tolerance. The precisions needed for each intermediate computation are calculated using back-propagation of error and estimates of the actual value. Robust numerics are generally formally verified to produce accurate results. ``robust_evalf e {digits=50} = Float(-(1 +- 1)*2^(-531))`` An algorithm to determine if an expression is equal to 0 would also fall under robust numerics. +* In symbolic computation, the expression is simplified through rewrite rules. The expression may be returned unchanged, or reduce to a number, depending on the expression and simplification rules. ``simplify e = 0`` - simplify e # exact/symbolic - --> 0 +Typically, operations on tree-expressions and polynomials use a combination of robust numeric evaluation and symbolic computation. Robust numerical evaluation uses a mixture of machine-precision, arbitrary-precision, and interval arithmetic. +Polynomials +=========== +Polynomials consist of a finite summation of terms, where each term is a product of a coefficient and powers of variables with integer exponents (generally positive). The coefficients are drawn from some domain (set), which can be integers, rationals, or symbolic expressions, while the variables are drawn from a different (finite) set of expressions. A univariate polynomial can be stored as a dense list of coefficients, while for multivariate polynomials it is generally more efficient to use a sparse representation. It is better to store all the coefficients using a uniform representation, so that operations do not have to check types. There is a conversion function that takes an arbitrary expression and converts it to the most efficient polynomial representation. -Another good feature is logic programming - it would be nice to allow equation solving to be seamlessly integrated with logic programming using a satisfiability-modulo-theories (SMT) solver. Similarly, one should be able to write a loop that sums the numbers from 1 to n, and then solve for the n that produces a desired value. This leads nicely into theorem proving and verification tools. +One question is whether the variables should form a basis. For example, with ``sin(t)`` and ``cos(t)``, the expression ``sin^2(t)+cos^2(t)-1`` is zero, even though its coefficient matrix ``{(2,0): 1,(0,2): 1,(0,0): -1}`` is not. It certainly makes it easier if the algorithms can assume that only the 0-coefficient polynomial is 0, but in general this doesn't hold. The conclusion is that polynomials with orthogonal variables should be a subset. Arbitrary expressions should be usable as the "variables" of the polynomial as well as the coefficients. -Design -====== +Staging +======= -Following Stroscot's philosophy of "minimal core", as little as possible of the implementation should be in the language core. We will need to represent symbolic expressions - the easiest thing is to make these syntax trees, so that algorithms are macros that operate on syntax. But of course specific algorithms will operate on restricted subsets of syntax trees. 
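As a sketch of what this could look like (Haskell as a stand-in; all names are illustrative): the expression tree is just data, and each way of evaluating it is a separate function applied at a later stage. The running example ``cos(1)^2 + (sin(1)^2 - 1)`` is built once and can then be fed to a machine-precision evaluator, a simplifier, or any other interpretation.

.. code-block:: haskell

   -- A minimal symbolic expression tree (far smaller than a real CAS core).
   data Expr
     = Num Rational
     | Add Expr Expr
     | Mul Expr Expr
     | Sin Expr
     | Cos Expr
     deriving (Eq, Show)

   -- Staged use: constructing the tree performs no evaluation or simplification.
   e :: Expr
   e = Add (Mul (Cos (Num 1)) (Cos (Num 1)))
           (Add (Mul (Sin (Num 1)) (Sin (Num 1))) (Num (-1)))

   -- One interpretation: machine-precision evaluation (e evaluates to about 5.55e-17).
   evalDouble :: Expr -> Double
   evalDouble (Num r)   = fromRational r
   evalDouble (Add a b) = evalDouble a + evalDouble b
   evalDouble (Mul a b) = evalDouble a * evalDouble b
   evalDouble (Sin a)   = sin (evalDouble a)
   evalDouble (Cos a)   = cos (evalDouble a)

   -- Another interpretation: a toy bottom-up simplifier that only folds constants.
   simplify :: Expr -> Expr
   simplify (Add a b) = case (simplify a, simplify b) of
     (Num x, Num y) -> Num (x + y)
     (a', b')       -> Add a' b'
   simplify (Mul a b) = case (simplify a, simplify b) of
     (Num x, Num y) -> Num (x * y)
     (a', b')       -> Mul a' b'
   simplify x = x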
+Many systems do "automatic" simplification in that for example ``print(cos(pi/4)) = sqrt(2)/2``. Maybe this simplification is what you wanted, but if it is not (e.g. you wanted to evaluate the cosine with machine precision) you are stuck - the information about the original expression has been lost. Simplification and conversion is in general an expensive, lossy operation. Therefore the main API should be staged: first you construct an expression tree / polynomial / etc., and then you apply one or more functions to specify how to transform and evaluate it. This sort of API is precise and allows fine control over evaluation via tree manipulation, avoiding unexpected costs and unnecessarily repeated simplifications. Auto-simplifying operations can be defined on top of the staged API, like ``(AutoSimplify a) + (AutoSimplify b) = AutoSimplify $ simplify (a + b)``, and are useful if that's the desired behavior. But it would be a mistake to expose only auto-simplifying operations. -Oscar Benjamin has been working on speeding up SymPy and has written about `how he would redesign SymPy `__. It is certainly worth taking his design points into consideration. According to Benjamin, fast symbolic computation depends on implementing three things efficiently: robust numerical evaluation, symbolic manipulation, and computational algebra. +To look at +========== + +* SMTLIB2 +* internal datatypes of z3 +* logic programming - solve for input to program that produces value +* matrix manipulation library +* pde solver +* https://github.com/uwplse/ruler +* http://www.sc-square.org/CSA/welcome.html They're trying to bridge cas and smt. +* linear system in a finite field +* BLAS - just because both CAS and BLAS have A in them, does not mean they are the same thing. +* https://herbie.uwplse.org/ +* https://egraphs-good.github.io/egglog/?example=herbie +* https://flintlib.org/ - well-designed per Oscar Benjamin +* Maxima, Axiom and its two (!) forks, SymPy, REDUCE, Symbolics.jl, Mathics, SageMath, FORM, Yacas, Xcas, Oscar.jl, Macaulay2, Singular, `emmy `__ - Axiom (or one of its forks) is probably the best-designed "winner" that everybody should copy (Lisp-based) +* Mathlib - could lead to a standardized language/format for machine-readable mathematics, like JSON or SMTLIB https://leanprover-community.github.io/ diff --git a/docs/Commentary/Syntax.rst b/docs/Commentary/Syntax.rst index e012d90..9ce0a82 100644 --- a/docs/Commentary/Syntax.rst +++ b/docs/Commentary/Syntax.rst @@ -3,6 +3,8 @@ Syntax Stroscot's syntax should be clear, readable, friendly, and consistent. It should be easy to learn and use for those who are new to programming. The language should be accessible to a wide range of users, including those with disabilities. Furthermore the syntax should be flexible, extensible, and customizable, so that developers can tailor aspects of the language to their specific needs and preferences. +For now S-expressions are probably sufficient, there is not a huge need for nice syntax. Syntax is one of the most boring aspects of building a language - there's not a lot you can innovate upon (except maybe using a large language model for parsing) + Customization ============= @@ -40,7 +42,7 @@ Stefik used a mixture of CS university students (freshman year novices and junio * programming experience. Stefik found that programmers will list experience with individual languages even when reporting no experience total, so it seems best to skip an overall experience question and ask individually with a language matrix. 
Since not all languages will be listed we'll have an "All other programming languages" catch-all at the end. Stefik presumably asked an empty-box "how many years of experience do you have with X" question. But a matrix requires choices. Taking his mean experience reported as the data points, for experienced programmers, dividing into 5 buckets we have 0-0.03,0.03-0.09,0.9-0.22,0.28-0.48,0.67-1.76. For inexperienced programmers, we have 0-0 as a large bucket and then the remaining range is split into 0-0.01,0.01-0.03,0.05-0.09,0.12-0.23. Combining experienced and inexperienced, the 0-0 bucket stays and we have 4 more buckets 0.01-0.03 (±0.11-0.16), 0.03-0.09 (±0.18-0.31), 0.11-.28 (±0.41-1.06),0.39-1.76 (±.0.7-1.87) Translating into familiar units, these buckets are no experience, <11 days, 11-33 days, 40-102 days (1.32-3.36 months), 4.68+ months. Given the wide uncertainties we can round to no experience, < 2 weeks, 2 weeks to 1 month, >1 month and <4 months, or 4+ months. It's not as accurate as the empty-box but hitting a circle on mobile is much easier. In terms of the survey of :cite:`siegmundMeasuringModelingProgramming2014`, it is a quantified version of the self-estimation that rates experience as 1-5. We can sum the (approximate) years of experience to obtain overall years of experience, which should lead to Siegmund's "professional programming experience" factor, which IMO is the correct factor to use to measure programming experience. But we would have to do another validation study with university students to verify that this metric correlates with the ones in Siegmund's study. In fact though I am mainly going to use the metric as a binary variable, novice vs. not, so it's not too important. As far as languages, Study 1 had HTML, C++, Java, Matlab, JavaScript, Basic, COBOL reported with experience for non-programmers and Study 2 was similar with the addition of PHP. Considering programmers among both studies, we would add C#, Python, Ruby, and COBOL as <1 month languages, and Perl and FORTRAN as <2 week languages. Meanwhile the SO language list top 15 is JavaScript, HTML/CSS, Python, SQL, Java, Node.js, TypeScript, C#, Bash, C++, PHP, C, Powershell, Go, Kotlin, with a clear break in popularity from C at 21% to Powershell at 10%. The question asked for "extensive development work in the past year" though so is not really a question about which languages are most likely to have beginner exposure. Contrariwise TIOBE does not consider HTML a programming language because it is not Turing complete, but does list SQL. We do not want to list too many languages, because marking lots of languages as "no experience" is tedious, but a good selection is key to defining the question and ensuring the catch-all is answered accurately. One design point would be to preselect "no experience" for all but the catch-all, solving the tedium issue, but the survey tool would have to support this. -* reading experience: as an alternative to age, being able to gauge the vocabulary of the +* reading experience: as an alternative to age, being able to gauge the vocabulary of the * Have you heard of the Stroscot or Quorum programming languages before this survey? Yes/no. This is a question Stefik says he wished he had asked, to avoid confounding results. 
(2 questions actually) diff --git a/docs/Commentary/TermRewriting.rst b/docs/Commentary/TermRewriting.rst index f16f01d..a932382 100644 --- a/docs/Commentary/TermRewriting.rst +++ b/docs/Commentary/TermRewriting.rst @@ -80,7 +80,7 @@ Infinite reduction is useful because it is "smoother" than finite reduction - no The idea is to extend our set of terms to include infinite terms, defined as the `metric completion `__ of finite terms with a distance function :math:`2^{-n}` if the n-th level of the terms is the first level where a difference appears and 0 if the terms are equal. By convention the top level is level zero. This definition is equivalent to a co-inductive definition of terms, i.e. the largest set consisting of term-parts whose subterms are co-inductive terms. -This set, like the real numbers, is uncountably large and includes terms with no finite description. A more tractable subset is the rational terms, finite cyclic structures (called rational because rational numbers have the similar property that their decimal expansion repeats when written out), but these can't represent non-repeating structures like the list of fibonacci numbers. The implementation will have to use some computable approximation. Practically most programs will only deal with finite or rational terms so performance on other types of terms is not critical. +This set, like the real numbers, is uncountably large and includes terms with no finite description. Actually we should be able to map the set of finite and infinite terms to the real interval [0,1]: do a breadth-first traversal of the term tree and encode the sequence of symbols using arithmetic encoding. Properly defined, the mapping should be a computable, continuous bijection that imposes a total order on terms. TODO: verify. There is then a canonical representation of each real number, such as `this continued fraction representation `__, so also a canonical representation of each infinite term. Although, again like the reals, it is computationally more tractable to use an unevaluated-expression-tree symbolic representation rather than an infinite sequence of numbers. There are various extensions of the transitive closure to infinitary reduction, so the question arises as to which one to use. :cite:`kahrsInfinitaryRewritingClosure2013` discusses several and provides an ordering so that each is mostly a proper subset of the next (not sure about P* subset bi-infinite). Many of these use the monotonic closure X*, which is the least fixedpoint of the function G defined as G(R) = X(R) union R, which by the (transfinite) Kleene fixed-point theorem exists and is the limit/union of the sequence :math:`X^0 = \emptyset, X^{n+1} = G(X^n), X^\delta = \bigcup_{\alpha < \delta} X^\alpha`. diff --git a/docs/Commentary/Time.rst b/docs/Commentary/Time.rst index d5dc33c..40e73b9 100644 --- a/docs/Commentary/Time.rst +++ b/docs/Commentary/Time.rst @@ -46,7 +46,7 @@ UTC's only real advantage is its ubiquity - there are lookup tables (leap second Epochs ====== -The simplest method of specifying an instant is to give it a name. A named instant is called an epoch. There are numerous epochs, too many to list here. The epochs may be a fixed instant in a given standard, such as the GPS epoch, or be dependent on a time standard to interpret as an instant, such as Rata Die. +The simplest method of specifying an instant is to give it a name. A named instant is called an epoch. There are numerous epochs, and more being added (e.g.
Jonathan Blow used the date of the Apollo 13 landing), but `Wikipedia `__ lists some. The epochs may be a fixed instant in a given standard, such as the GPS epoch, or be dependent on a time standard to interpret as an instant, such as Rata Die. :: diff --git a/docs/Commentary/Tools.rst b/docs/Commentary/Tools.rst index a3a29f1..9e248ea 100644 --- a/docs/Commentary/Tools.rst +++ b/docs/Commentary/Tools.rst @@ -72,14 +72,27 @@ Language server VSCode comes with an extensive protocol for language integration, LSP. Other IDEs do support LSP to some extent, but generally not fully and you have to install plugins. There's a `specification `__. As an overview, an LSP server can provide: -* syntax highlighting (the docs say you have to use TextMate grammars too, but I think LSP alone is actually performant enough) +* syntax highlighting (the docs say you have to use TextMate grammars too, but from some examples it looks like LSP alone can be performant enough) * tooltips * autocomplete suggestions -* navigation outline +* navigation outline / object browser * debugger integration -* navigate to definition +* "navigate to definition" command +* find all references +* quick information / signature help * compiler errors/warnings/fixes -* renaming or refactoring actions +* rename symbol +* refactoring actions - extract method +* edit and continue +* execute code snippets (REPL) + +Per `this post `__ a language server serves two goals: quickly processing new edits to source files, while also answering queries quickly. There are several potential designs: + +* Map Reduce - split analysis into a relatively simple indexing phase (per file, in parallel), and a separate full analysis phase (global). Example: Java, the indexer. +* Precompiled Headers - declaration before use, all declaration in headers or equivalent interface files, snapshot the compiler's state immediately after imports for each compilation unit. +* Query-based - all function calls inside the compiler are instrumented to record which other functions were called during their execution. The recorded traces are used to implement fine-grained incrementality. If after modification the results of all of the dependencies are the same, the old result is reused. If a function is re-executed due to a change in dependency, the new result is compared with the old one. If despite a different input they are the same, the propagation of invalidation stops. + +As Stroscot is dynamic, only a query-based approach is sufficiently general to work. The main drawback is extra complexity and slower performance (fine-grained tracking of dependencies takes time and memory). The performance can be alleviated by fine-tuning cache invalidation details and omitting some items from the cache, while the complexity is here to stay. Notebooks --------- @@ -116,6 +129,25 @@ Let's assume we have symbols, then there are lots of operations available from a * REPL / patching: evaluate pure expression in context of state, evaluate arbitrary code in current state (e.g. set variable to value), replace definition, hot-reload code changes, jump to address, return early from function. Pedantically, the patched state has no real history so the debugger should only be able to run forward from the state, but we can graft the patched state onto the old state to avoid losing context. 
* IPC: send signal, modify files + +searching for watchpoints/breakpoints +build database of everything that happened +replaying different segments of the execution in parallel +applying different kinds of instrumentation to the same segments + +instruction-accurate recording of your software - syscalls, shared memory, signal timing +this can be played forward and will always behave the same way +several approaches - Just-In-Time instrumentation of machine code (Undo.io), ptrace (rr, https://pernos.co), https://replay.io + +to run backwards, we need more information - like if memory is overwritten, what was the value before? Unfortunately, if we recorded all memory changes explicitly, it would be slow and use up a lot of storage. Therefore debuggers use "Finnegan search" ("poor old Finnegan had to begin again...") - they start one or more forks of the process, and run these forks forward up until the desired target is reached. Usually there is a recent snapshot available so only that slice has to be re-executed during reverse stepping. The parallelism is mainly useful for larger breakpoints/watchpoints where the event of interest can be further back and multiple snapshots may have to be examined. + + +we only have snapshots and the minimal information needed to run forward deterministically. So for most tasks, like breakpoints, we need to recompute intermediate states, like memory contents. + +JIT instrumentation +hardware performance counters - not available in containers/virtual machines + + Debugging by querying a database of all program state by Kyle Huey The State Of Debugging in 2022 by Robert O’Callahan Debugging Backwards in Time (2003) by Bil Lewis diff --git a/docs/Commentary/Types.rst b/docs/Commentary/Types.rst index fcaf82e..21a5f02 100644 --- a/docs/Commentary/Types.rst +++ b/docs/Commentary/Types.rst @@ -23,7 +23,7 @@ Overall, while I would be justified in calling Stroscot's sets types, as it is o Is Stroscot typed? ================== -By a "typed" programming language, someone generally means that the language has a type system or type discipline. Wikipedia defines a type system as "a set of rules that assigns a property called a type to the various constructs of a computer program". This phrasing assumes each construct has exactly one "principal" type. But more complex type systems don't have this property. For example with a Haskell GADT ``data R where R : { unR : Int } -> R Int``, ``unR`` may have the type ``forall a. R a -> a`` or ``forall a. R a -> Int``, and these have no unifying type. Haskell just picks a type ``unR : R Int -> Int`` in this case. Really the mapping from expressions to types is many-to-many. Particularly, with Stroscot's definition of types as sets, any set containing the value is a valid type. For example, ``1 : Int``, but also ``1 : {1}`` and ``1 : Any``. So certainly Stroscot has some ambient notion of "the valid types for a value", but in Stroscot's case this is just set membership so it is a little vague in determining whether Stroscot has a type system. ChatGPT lists some features of type systems we can look for: type annotations, type errors, type compatibility, type conversion, type hierarchy. +By a "typed" programming language, someone generally means that the language has a type system or type discipline that ensures type safety. Wikipedia defines a type system as "a set of rules that assigns a property called a type to the various constructs of a computer program". This phrasing assumes each construct has exactly one "principal" type. 
But more complex type systems don't have this property. For example with a Haskell GADT ``data R where R : { unR : Int } -> R Int``, ``unR`` may have the type ``forall a. R a -> a`` or ``forall a. R a -> Int``, and these have no unifying type. Haskell just picks a type ``unR : R Int -> Int`` in this case. Really the mapping from expressions to types is many-to-many. Particularly, with Stroscot's definition of types as sets, any set containing the value is a valid type. For example, ``1 : Int``, but also ``1 : {1}`` and ``1 : Any``. So certainly Stroscot has some ambient notion of "the valid types for a value", but in Stroscot's case this is just set membership so it is a little vague in determining whether Stroscot has a type system. ChatGPT lists some features of type systems we can look for: type annotations, type errors, type compatibility, type conversion, type hierarchy, type safety. The purpose of type annotations is clear if we consult `Robert Harper `__. Per him, a type system should allow "stating and enforcing the invariant that the value at a particular program point must be a certain type", in the form of a type declaration. Stroscot's set membership assertions have a specific syntax for type declarations and type restrictions, that accomplish exactly this purpose. There are also annotations for declaring the type of a value, for example ``int8 1`` is different from ``int32 1``. @@ -31,6 +31,8 @@ Type errors are also clear. Stroscot's type declarations have an operational sem Type compatibility follows naturally from type errors, for example assembly instructions can only operate on certain bitwidths. Stroscot has a natural form of type conversion and type hierarchy, subtype inclusion, but this is actually not too powerful and most conversions are explicit. Still though, we can talk about the normal data types like integers, floats, etc., so most programmers will find it natural. +Type safety prevents data corruption or unexpected behavior due to type mismatches. Stroscot's dynamic semantics ensures that no strange errors like this occur, so Stroscot is type safe. + It is also worth looking for "untyped" programming languages - ChatGPT lists assembly language, machine code, and Brainfuck. The particular feature of these languages is that all data is represented as bits, and there are no other types. Stroscot does have a related property of "untyped memory access", where for example a floating point number can be stored to memory and read back as an integer. But I would say that because Stroscot has types for non-bit data, like structs for example, it is not untyped. The conclusion therefore is that Stroscot, like almost every language made these days, is a typed language. @@ -58,43 +60,37 @@ Dynamic programming languages allow flexible and adaptable coding practices. But * Late-binding: Choosing methods at the latest possible moment during program execution. For example, method selection may depend on the real-time types of involved objects, the current state of the source code files (hot-reloading), complex dispatch conditions based on properties of the data, and/or values of unrelated variables in the code. * Flexible variables: Allowing variables to accommodate any possible data value during program execution. * Direct execution: Executing source code with a single command, without the need for an intermediate compilation step -* Metaprogramming: Writing macros (code that manipulates other code), and executing code with eval. 
-* Runtime modification: adding and modifying methods and properties, monkey patching, module loading at runtime +* Metaprogramming, reflection and runtime modification: Writing macros (code that manipulates other code). Executing code with eval. Viewing, adding, and modifying methods and properties. Monkey patching. Module loading at runtime. * Good: As a rule of thumb, dynamic languages are good, all other things being equal. -8. **Aspect-Oriented Programming:** Some dynamic languages support aspect-oriented programming, where you can modify the behavior of specific code segments (aspects) independently, allowing for cross-cutting concerns like logging or error handling to be added dynamically. - -9. **Dynamic Class Creation:** Dynamic languages often allow you to create new classes at runtime. This is particularly useful for scenarios where you want to create new classes based on data-driven specifications or user inputs. - -10. **Dependency Injection and IoC Containers:** Some dynamic languages support runtime dependency injection and inversion of control (IoC) containers, allowing you to change the behavior and dependencies of objects without modifying their source code. - -It's important to note that while these capabilities can offer significant flexibility and power, they also come with potential complexities and challenges, including maintenance difficulties and potential performance impacts. Additionally, misuse of runtime modification features can lead to code that is harder to understand and debug. Therefore, careful consideration and judicious use are recommended when employing such dynamic features. - -For example, you can add or remove methods and attributes from objects dynamically. - -2. **Limited Runtime Modifications:** While Java's reflection allows you to access and invoke methods and constructors at runtime, it doesn't provide the same level of flexibility as dynamic languages for modifying behavior, such as adding or removing methods dynamically. - -In summary, dynamic programming languages go beyond the introspection and access capabilities offered by static languages' reflection features. They allow you to modify and extend behavior more freely at runtime, making them particularly powerful for scenarios where adaptability and runtime customization are crucial. On the other hand, static languages like Java focus on providing information about class structures and allow for limited runtime interactions, but these interactions are still subject to compile-time type checking and restrictions. - - -Here are some key characteristics of dynamic programming languages: +Stroscot has these features, so can be considered dynamic. -3. **Reflection:** Dynamic languages often provide features that allow programs to inspect and manipulate their own structure and behavior at runtime. This is known as reflection and enables powerful meta-programming techniques. - -Some examples of dynamic programming languages include Python, Ruby, JavaScript, PHP, and Perl. These languages are well-suited for tasks where rapid development and flexibility are important, but they might sacrifice some performance optimizations that statically-typed languages offer through their compile-time checks and optimizations. - - Dynamic Typing: On the other hand, "typed" languages can also refer to languages with dynamic typing, where variable types are determined at runtime. While dynamic typing offers flexibility, it can lead to runtime errors if incompatible types are used together. 
- - Strong Typing: "Typed" languages typically exhibit strong typing, which means that type conversions and interactions between different types are strictly controlled. This prevents unintended type-related errors and promotes safer code. +Is Stroscot strong? +=================== - Weak Typing: In contrast, some programming languages are weakly typed, allowing for implicit type conversions and interactions between different types with less strict control. "Typed" languages usually lean towards strong typing to ensure code robustness. +Per `Plover `__, strong typing has at least 8 definitions: - Type Inference: Many "typed" languages feature type inference, a mechanism that allows the compiler or interpreter to automatically deduce the data type of a variable based on its initialization and usage. This reduces the need for explicit type annotations while maintaining the benefits of static typing. +#. type annotations are associated with variable names, rather than with values. +#. it contains compile-time checks for type constraint violations, rather than deferring checking to run time. +#. it contains compile or run-time checks for type constraint violations, as opposed to no checking. +#. conversions between different types are forbidden, as opposed to allowed. +#. conversions between different types must be indicated explicitly, as opposed to being performed implicitly. +#. there is no language-level way to disable or evade the type system, such as casts. +#. it has a complex, fine-grained type system with compound types, as opposed to only a few types, or only scalar types +#. the type of its data objects is fixed and does not vary over the lifetime of the object, as opposed to being able to change. - Type Annotations: In "typed" languages, developers often annotate variables and functions with explicit type information. These annotations serve as documentation and aid in code readability, while also enabling the type checker to catch potential errors. +Going through these: - Type Safety: "Typed" programming languages prioritize type safety by preventing operations that could lead to data corruption or unexpected behavior due to type mismatches. This contributes to more predictable and reliable software. +#. Stroscot allows type annotations on values, and specifying the set of values allowed in a mutable variable. It also allows type annotations on functions (by name). But, types are defined as sets of values, so overall it is more correct to say they are associated with values. Verdict: Not satisfied +#. Stroscot does check at compile-time. Verdict: Satisfied +#. Stroscot also allows deferring checks to run-time. But it always performs checks. Verdict: Satisfied +#. It would make no sense to not allow conversion from float to double or similar. You could always write such a function yourself. Verdict: Not satisfied +#. Stroscot does not do implicit conversions - either a value is a member of a type unmodified, or it is not a member. Verdict: Satisfied +#. Stroscot follows its runtime semantics in all cases. Verdict: Satisfied +#. Stroscot allows unions, intersections, dependent types, and so on. Verdict: Satisfied +#. Values are immutable and do not change. Stroscot does allow stateful types though. If the value of a reference is modified, it may stop being a member of a stateful type. This is solved by not using stateful types. Verdict: Satisfied +Overall, Stroscot meets 6/8 definitions so you could say it is 75% strongly typed. 
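As a small illustration of definitions 4 and 5 - conversions are available, but they must be written explicitly - here is a Haskell stand-in (Haskell is used only because its behavior here is well known; the Stroscot spelling may differ):

.. code-block:: haskell

   -- An explicit widening conversion; nothing coerces Float to Double silently.
   widen :: Float -> Double
   widen = realToFrac

   area :: Float -> Double
   area r = let r' = widen r in pi * r' * r'

   -- Without the conversion the definition is rejected:
   -- area r = pi * r * r   -- type error: Float where Double is expected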
Type inference ============== @@ -181,16 +177,11 @@ In general, trying to prove any non-trivial property will find all the bugs in a "Soft typing" is similar to model checking, but uses failure of type inference instead of model checking. This means it cannot prove that it actually found an error, and it must stay within the boundaries of type systems, an open research problem. The verification approach is well-explored and its algorithm produces three types of outcomes: hard errors, passing programs, or verification algorithm failure. Similar to Haskell's "deferred type errors" flag, hard errors can still be ignored, but they will trigger an error at runtime. Similar to soft type checking, verification algorithm failure can be ignored - these may or may not trigger an error. -Roles +Notes ===== GHC's roles are just an optimization for ``coerce``. There are better ways to implement optimizations. It seems like a dirty hack to solve a pressing problem. I think Stroscot can get by without them. - - -* strongly typed - `8 definitions `__, all different. It's the semantic equivalent of "amazing", i.e. "My language is strongly typed" == "My language is amazing". Again discussed solely in the "Types" page. - - `This post `__ says "a [gradual] type is something that describes a set of values that have a bunch of operations in common", i.e. value space plus behavior. Stroscot's sets don't have behavior so are not gradual types. But I would also add that Stroscot is optionally typed, because in Stroscot, the `Zero one infinity rule `__ applies. A program can run without any type declarations, with one declaration for the root of the program, or with any amount of type declarations scattered through the program. The no type declarations is an "untyped" setting and ensures there is a complete operational semantics distinct from the type system. The one type declaration enables checking the program for bad behavior, and ruling out common errors such as typos. The unlimited/infinite declarations allows using the power of static verification to its fullest, and may require many iterations of tweaked declarations to get right. diff --git a/docs/Commentary/Values.rst b/docs/Commentary/Values.rst new file mode 100644 index 0000000..bcb4d99 --- /dev/null +++ b/docs/Commentary/Values.rst @@ -0,0 +1,75 @@ +Values +###### + +A value represents an element within the `universe of discourse `__. As Stroscot aims to maximize functionality, the goal is to have an unfettered universe of discourse in which all applications of words are understood according to their common conditions and meanings. The notion of value in Stroscot is thus more a linguistic property, like "proper noun", than a semantic property. Specifically we can define a value in Stroscot as follows: + +* Canonical representation - A value has a canonical representation, usually via text (UTF-8) but perhaps a binary file format like a PNG or network packet. This representation should generally be used consistently for writing and displaying the value. It can also be hashed and compared for equality. Generally, equality is obvious, but for lambdas etc. I will define it as observable equality. + +* Variations - A value may have minor variations in syntax or notation that do not change its meaning. For example, "1/3" and "⅓" represent the same rational number value, but they differ in orthography, or a list (tuple) may be written as ``[1,2]`` or ``(1,2)``, or a regex value may be written as ``new Regex("[0-9]+")``.
+ +* Context-Independence: Values maintain their meaning and significance across different contexts and languages. Beyond variations in syntax, they are not dependent on the context in which they appear, the specific programming language used, or compilation options. + +* Non-Reducibility: Ideally, values are normal forms, meaning they are not subject to evaluation, reduction, or computation. For example, "1/3" and "let x=3 in 1/x" may be equivalent under reduction, but only "1/3" is considered a value because it cannot be further reduced or evaluated. WHNF is not sufficient to ensure a value, e.g. ``[1,undefined]`` reduces to ``undefined`` hence is not a value. In practice, reduction rules are dependent on context, so we take a liberal perspective and accept values as "values" even if they are not normal forms in all contexts. + +On account of their canonical representation, the simplest concrete representation of a value is as a string, but in practice values will generally be represented in memory as more efficient formats, for example a list as a contiguous array of pointers - see :ref:`Memory`. This representation must be lossless, in the sense that converting from a string to the format and then back to the string should be the identity. + +Values are immutable (as `Rich Hickey says `__). In terms of memory management, values are just data - they can be copied freely, and discarded once they will no longer be used. + +Colors +====== + +So each value has at least two parts (conceptually): there is the "human-readable" value, like for example the integer 1. Then there is the "color" or user annotated type, and that's generally a symbol like ``MyAmazingInteger64``. So when we talk about a value in Stroscot, we are talking about the human-readable value plus the color - ``MyAmazingInteger64 1``. Stroscot has no way to talk about the human-readable value alone, because it would be ambiguous; every value in Stroscot has a color. Would we be talking about the ``1`` stored as the standard library's color (written as just ``1``)? Or about a more specific library's colored ``1``? The question is somewhat pedantic, because we can generally easily convert between values of different colors, and most of the time this conversion is done automatically. But, for purposes of the language semantics, values of different colors are not "equal", because it is easy to distinguish the color. + +Once we get into optimization, we may start using identical memory representations for values of different colors, erasing checks and conversions and so on, because checking the color at each point is slow. For example, SQLSafeString and String will most likely have the same memory representation (it's not like there's an amazingly efficient way of representing sanitized strings that's different from representing normal strings). But during optimization, we are already considering equivalence classes of programs rather than programs - simply by optimizing, we are improving the speed of the program, which is an observable property, and therefore we are changing the behavior of the program by optimizing. The reason this is allowed is because optimization considers programs with the same output value to be equivalent. + +Bare symbols +============ + +Bare symbols are perhaps the simplest value. Bare symbols are simply a sequence of characters. Vaguely like `Raku `__, an ordinary bare symbol is a Unicode word, and an extended symbol uses an escape sequence ``@"sym str"`` to allow spaces etc. 
following the syntax for a string. An extended symbol matches a corresponding ordinary symbol if that form exists, i.e. ``null == @"null"``. It is a bit tricky to write a bare symbol in proper Stroscot as most symbols are understood to be qualified. + +Term +==== + +A term is a value applied to other values, with no reduction rules applicable. + +:: + + tree (combine (leaf 1 2) (leaf 1 2)) + (++++) a b + (*) (some (weird thing)) 12 + 1 2 + +Terms subsume algebraic data types. Whereas an ADT value is a symbol applied to other values, with restrictions on the types of those values, a term is "untyped" and has no restrictions on its constructor or arguments. + +Term application also includes the other function syntax - keyword arguments, implicit arguments, and so on - so it is more than just positional application. + +Functions/Sets +============== + +Ostensibly, a function is a set consisting of tuple pairs ``(a,b)`` with the property that for each value of ``a`` there is at most one value of ``b``. But this definition is circular (a set is in turn defined as a kind of function, below), so we also hack in many special "base" cases for functions, such as lambdas, rewrite rules, and primitive symbols. + +Similarly, a set is ostensibly a function ``isElemOf : Any -> {Present|Absent}``, but it also has hacked-in special instances. + +Binders/Logical proofs +====================== + +Conceptually a binder is a structure with some number of "slots" and numbered references to these slots in the body (a value but with "holes" for the slots). In practice, the binder is specified using variables (symbols), and these variable names are preserved for debugging purposes. Nonetheless, renaming the variables should not change the meaning of the binder (alpha-equivalence). The easiest way to ensure this is to use a nameless graph representation but to preserve variable names at the site of the binder as metadata. For example, a lambda expression ``\x. x x`` is really more like ``(\. 1 1, x)``. + +The sequent calculus is used to represent binders via derivation trees built up using the various rules. We use the logic described in :ref:`Logic`. + +* Jumbo: The Jumbo break rule contains a set of values and a function from those values to a derivation tree. It also specifies a target in each derivation for each side formula. The Jumbo build rule takes a value and a list of left/right derivation trees and combines them, specifying a target for each tree. +* Exponentials: Promotion specifies a target for each formula in the sequent. Dereliction has one target, Weakening has no target, and Contraction has n (n>2) targets. +* Identity: This is a unique value. +* Quantifiers: This has quantifier build/break, a bit similar to promotion. No-ops at the untyped level. +* Use/def: We must have a global map from identifiers to definitions, and use nodes then refer to these identifiers. +* Syntax definition (equality, sets): These don't even have build/break duality. Again no-ops at the untyped level. + +These rules are not included in values of the sequent calculus: + +* Cut: Since cut-reduction can eliminate all cuts, a proof tree containing cut is a reducible expression. +* Exchange: This is represented by the target pointers in the derivation tree. + +Rewriting system +================ + +A rewriting system consists of a set of rewrite rules. A (conditional) rewrite rule has the form ``l -> r | C1, ..., Cn`` where ``l`` and ``r`` are both (open) values, i.e. some set of symbols is declared as pattern variables in ``l``/``r``.
The conditions are combined with ``and``, and take the form of predicates ``Pi(x1, ..., xm, ->)``, where the ``xi`` are the pattern variables of ``l`` and ``r``, and ``->`` is the rewrite relation of the system. Example predicates are type predicates ``a : B``, and ``a`` joins with, rewrites to, or is convertible to ``b``. diff --git a/docs/Reference/Logic.rst b/docs/Reference/Logic.rst index 4b8b60b..affa057 100644 --- a/docs/Reference/Logic.rst +++ b/docs/Reference/Logic.rst @@ -15,18 +15,18 @@ Jumbo connectives \newcommand{\with}{\mathbin{\mathrm{\&}}} \newcommand{\par}{\mathbin{\mathrm{⅋}}} \newcommand{\multimapboth}{\mathbin{\mathrm{⧟}}} - \newcommand{\bang}{{\mathrm{!}}} - \newcommand{\whim}{{\mathrm{?}}} + \newcommand{\bang}{{\mathrm{!}^+}} + \newcommand{\whim}{{\mathrm{!}^-}} \newcommand{\ocin}{\mathrel{\raise{-1pt}{\mathrm{!}}\mathord{\in}}} \definecolor{mygray}{RGB}{156,156,156} \newcommand{\sk}[1]{{\color{mygray} #1}} \rule{\overrightarrow{ \Gamma, \overrightarrow{A_{i j}} \vdash \overrightarrow{B_{i k}}, \Delta }} - {\Gamma \vdash \prod \limits_{i} \left(\overrightarrow{A_i} \multimap \overrightarrow{B_i}\right), \Delta } - {\Pi_R} + {\Gamma \vdash \mathop{𝕁^-} \limits_{i} \left(\overrightarrow{A_i} \vdash \overrightarrow{B_i}\right), \Delta } + {𝕁^-_R} & \rule{\overrightarrow{ \sk{\Gamma_j} \vdash A_{i j}, \sk{\Delta_j} } \quad \overrightarrow{ \sk{\Theta_k}, B_{i k} \vdash \sk{\Lambda_k} }} - {\sk{\overrightarrow{\Gamma}}, \sk{\vec \Theta}, \prod \limits_{i} \left(\overrightarrow{A_i} \multimap \overrightarrow{B_i}\right) \vdash \sk{\overrightarrow{\Delta}}, \sk{\vec\Lambda}} - {\Pi_{i} {}_{L}} + {\sk{\overrightarrow{\Gamma}}, \sk{\vec \Theta}, \mathop{𝕁^-} \limits_{i} \left(\overrightarrow{A_i} \vdash \overrightarrow{B_i}\right) \vdash \sk{\overrightarrow{\Delta}}, \sk{\vec\Lambda}} + {𝕁^-_{L} {}_{i}} \end{array} @@ -34,19 +34,19 @@ Jumbo connectives :nowrap: \begin{array}{cc} - \rule{\overrightarrow{ \sk{\Gamma_k}, B_{i k} \vdash \sk{\Delta_k} } \quad \overrightarrow{ \sk{\Theta_j} \vdash A_{i j}, \sk{\Lambda_j} } } - {\sk{\overrightarrow{\Gamma}}, \sk{\overrightarrow{\Theta}} \vdash \sum \limits_{i} \left( \overrightarrow{A_i} - \overrightarrow{B_i} \right), \sk{\overrightarrow{\Delta}}, \sk{\overrightarrow{\Lambda}}} - {\Sigma_{i} {}_{R}} + \rule{\overrightarrow{ \sk{\Theta_j} \vdash A_{i j}, \sk{\Lambda_j} } \quad \overrightarrow{ \sk{\Gamma_k}, B_{i k} \vdash \sk{\Delta_k} }} + {\sk{\overrightarrow{\Gamma}}, \sk{\overrightarrow{\Theta}} \vdash \mathop{𝕁^+} \limits_{i} \left( \overrightarrow{A_i} \vdash \overrightarrow{B_i} \right), \sk{\overrightarrow{\Delta}}, \sk{\overrightarrow{\Lambda}}} + {𝕁^+_{R} {}_{i}} & \rule{\overrightarrow{ \Gamma, \overrightarrow{A_{i j}} \vdash \overrightarrow{B_{i k}}, \Delta } } - {\Gamma, \sum \limits_{i} \left ( \overrightarrow{A_i} - \overrightarrow{B_i} \right ) \vdash \Delta } - {\Sigma_L} + {\Gamma, \mathop{𝕁^+} \limits_{i} \left ( \overrightarrow{A_i} \vdash \overrightarrow{B_i} \right ) \vdash \Delta } + {𝕁^+_L} \end{array} Common connectives ------------------ -All of the standard operators 01⊕⊗⊤⊥&⅋⊸⧟ in linear logic can be expressed using :math:`\Sigma` and :math:`\Pi`. We use our notation for them. +All of the standard operators 01⊕⊗⊤⊥&⅋⊸⧟ in linear logic can be expressed using :math:`𝕁^+` and :math:`𝕁^-`. We use our notation for them. .. 
list-table:: :header-rows: 1 @@ -60,67 +60,67 @@ All of the standard operators 01⊕⊗⊤⊥&⅋⊸⧟ in linear logic can be ex * - :math:`F` - :math:`0` - False (Zero) - - :math:`\Sigma []` + - :math:`𝕁^+ []` * - :math:`1` - :math:`1` - One - - :math:`\Sigma [(\#s,[] - [])]` + - :math:`𝕁^+ [(\#s,[] - [])]` * - :math:`A \lor B` - :math:`A \oplus B` - Plus (coproduct, or) - - :math:`\Sigma [(\#l,[A] - []),(\#r,[B] - [])]` + - :math:`𝕁^+ [(\#l,[A] - []),(\#r,[B] - [])]` * - :math:`A \otimes B` - :math:`A \otimes B` - Times (tensor product) - - :math:`\Sigma [(\#s,[A,B] - [])]` + - :math:`𝕁^+ [(\#s,[A,B] - [])]` * - :math:`A^{\otimes n}` - - `Tensor power `__ - - :math:`\Sigma [(\#s,\overbrace{[A,\ldots,A]}^n - [])]` + - :math:`𝕁^+ [(\#s,\overbrace{[A,\ldots,A]}^n - [])]` * - :math:`\smash{\stackrel{+}{\neg}} A` - :math:`A^{\bot}` - Positive Negation - - :math:`\Sigma [(\#s,[] - [A])]` + - :math:`𝕁^+ [(\#s,[] - [A])]` * - :math:`{↑}A` - :math:`A` - Up shift - - :math:`\Sigma [(\#s,[A] - [])]` + - :math:`𝕁^+ [(\#s,[A] - [])]` * - :math:`T` - :math:`\top` - True (Top) - - :math:`\Pi []` + - :math:`𝕁^- []` * - :math:`\bot` - :math:`\bot` - Bottom (contradiction) - - :math:`\Pi [(\#s,[] \multimap [])]` + - :math:`𝕁^- [(\#s,[] \multimap [])]` * - :math:`A \land B` - :math:`A \with B` - With (product, and) - - :math:`\Pi [(\#l,[] \multimap [A]),(\#r,[] \multimap [B])]` + - :math:`𝕁^- [(\#l,[] \multimap [A]),(\#r,[] \multimap [B])]` * - :math:`A \par B` - :math:`A \par B` - Par ("unless", classical or, parallel product, dual of tensor) - - :math:`\Pi [(\#s,[] \multimap [A,B])]` + - :math:`𝕁^- [(\#s,[] \multimap [A,B])]` * - :math:`A^{\par n}` - - Par power - - :math:`\Pi [(\#s,[] \multimap \overbrace{[A,\ldots,A]}^n)]` + - :math:`𝕁^- [(\#s,[] \multimap \overbrace{[A,\ldots,A]}^n)]` * - :math:`A \to B` - :math:`A \multimap B` - Lollipop (implication, internal hom) - - :math:`\Pi [(\#f,[A] \multimap [B])]` + - :math:`𝕁^- [(\#f,[A] \multimap [B])]` * - :math:`A \leftrightarrow B` - :math:`A \multimapboth B` - Equivalence - - :math:`\Pi [(\#l,[A] \multimap [B]),(\#r,[B] \multimap [A])]` + - :math:`𝕁^- [(\#l,[A] \multimap [B]),(\#r,[B] \multimap [A])]` * - :math:`\smash{\stackrel{-}{\neg}} A` - :math:`A^{\bot}` - Negative Negation - - :math:`\Pi [(\#s,[A] \multimap [])]` + - :math:`𝕁^- [(\#s,[A] \multimap [])]` * - :math:`{↓}A` - :math:`A` - Down shift - - :math:`\Pi [(\#s,[] \multimap [A])]` + - :math:`𝕁^- [(\#s,[] \multimap [A])]` The negations and shifts have identical derivation rules for each polarity, so we write :math:`\neg A` and :math:`\smash{\updownarrow}A` unless there is a need for the distinction. @@ -177,18 +177,18 @@ We can also write some types common from programming: - Type * - :math:`\text{Bool}` - Booleans - - :math:`\Sigma [(\#F,[]-[]),(\#T,[]-[])]` + - :math:`𝕁^+ [(\#F,[]-[]),(\#T,[]-[])]` * - :math:`\text{Int}` - 32-bit integers - - :math:`\Sigma [(\#{-2}^{31},[]-[]),\ldots,(\#0,[]-[]),\ldots,(\#2^{31}-1,[]-[])]` + - :math:`𝕁^+ [(\#{-2}^{31},[]-[]),\ldots,(\#0,[]-[]),\ldots,(\#2^{31}-1,[]-[])]` * - :math:`L_A` - Linked list of A - - :math:`\Sigma [(\text{#nil},[]-[]),(\text{#cons},[A,L_A]-[])]` + - :math:`𝕁^+ [(\text{#nil},[]-[]),(\text{#cons},[A,L_A]-[])]` * - :math:`Arr_A` - Arbitrarily-sized tuple of A - - :math:`\Sigma [(\text{#0},[]-[]),(\text{#1},[A]-[]),(\text{#2},[A,A]-[]),\ldots]` + - :math:`𝕁^+ [(\text{#0},[]-[]),(\text{#1},[A]-[]),(\text{#2},[A,A]-[]),\ldots]` -In general :math:`\Sigma` can represent any algebraic data type. 
+In general :math:`𝕁^+` can represent any algebraic data type. Exponentials ============ @@ -200,9 +200,9 @@ Promotion :nowrap: \begin{array}{cc} - \rule{\overrightarrow{\bang \Gamma_i } \vdash A, \overrightarrow{\whim\Delta_i} }{\overrightarrow{\bang \Gamma_i } \vdash \bang A, \overrightarrow{\whim\Delta_i}}{\bang} + \rule{\overrightarrow{\bang \Gamma_i } \vdash A, \overrightarrow{\whim\Delta_i} }{\overrightarrow{\bang \Gamma_i } \vdash \bang A, \overrightarrow{\whim\Delta_i}}{\bang_R} & - \rule{\overrightarrow{\bang \Gamma_i } , A\vdash \overrightarrow{\whim\Delta_i} }{\overrightarrow{\bang \Gamma_i }, \whim A \vdash \overrightarrow{\whim\Delta_i}}{\whim} + \rule{\overrightarrow{\bang \Gamma_i } , A\vdash \overrightarrow{\whim\Delta_i} }{\overrightarrow{\bang \Gamma_i }, \whim A \vdash \overrightarrow{\whim\Delta_i}}{\whim_L} \end{array} @@ -341,7 +341,7 @@ Substitution Quantifiers ----------- -For these the variable :math:`x` must have no free occurrence in :math:`\Gamma` or :math:`\Delta`. +For these the variable :math:`x` must have no free occurrence in :math:`\Gamma` or :math:`\Delta`. In code we simply refer to quantifiers :math:`Q` and write :math:`Q^+ = \forall, Q^- = \exists`. .. math:: :nowrap: diff --git a/docs/Reference/Values.rst b/docs/Reference/Values.rst index 1c424f8..95d0990 100644 --- a/docs/Reference/Values.rst +++ b/docs/Reference/Values.rst @@ -63,14 +63,9 @@ A term is a symbol applied to other values. Note that if there is an appropriate syntax rule the second example could also be written as ``a ++++ b``, but it would be an equivalent value. -Terms subsume algebraic data types, since an ADT value is a symbol applied to other values. An ADT is a "free" term that has no reduction rules defined. +Terms subsume algebraic data types, since an ADT value is also a symbol applied to other values. An ADT is a "free" term that has no reduction rules defined, or conversely a term is an "untyped" ADT data constructor with no restrictions on its arguments. -Terms also include the other function syntax, keyword arguments and so on. - -Infinite terms --------------- - -Sometimes it is useful to deal with terms that are solutions to a system of equations, like ``let x=cons 1 x in x``. These are also values. For terms with no reduction rules, there is a way to compute the (unique) minimal graph representation, where loops in the graph represent infinite cyclic repetitions. There are also infinitely reducible terms, e.g. ``let fib x y = cons x (fib y (x+y)) in fib 0 1`` requires infinitely many reductions to reduce to an infinite normal form. +Term application also includes the other function syntax - keyword arguments, implicit arguments, and so on. This argument syntax is not usually reduced away, so is part of the term's value. Lists ====== @@ -298,6 +293,11 @@ Modules Modules are also first class, they are discussed in their own page. +Infinite values +=============== + +Sometimes it is useful to deal with values that are solutions to a system of equations, like ``let x=cons 1 x in x``. These are also values. For terms with no reduction rules, there is a way to compute the (unique) minimal graph representation, where loops in the graph represent infinite cyclic repetitions. There are also infinitely reducible expressions, e.g. ``let fib x y = cons x (fib y (x+y)) in fib 0 1`` requires infinitely many reductions to reduce to an infinite normal form (infinite list value). + Rewriting system ================
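As a concrete illustration of the conditional rewrite rule form ``l -> r | C1, ..., Cn``, here is a minimal sketch in hypothetical Stroscot-style surface syntax. The names ``abs``, ``negate``, ``insert``, ``cons``, ``Int``, and ``List Int`` are illustrative assumptions rather than finalized library definitions, and whether a comparison like ``x >= 0`` may appear directly as a condition is likewise an assumption.

::

   abs x -> x | x >= 0
   abs x -> negate x | x < 0
   insert x xs -> cons x xs | x : Int, xs : List Int

Here ``x`` and ``xs`` are pattern variables; a rule fires only when every condition after the ``|`` holds, such as the type predicate ``x : Int``.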