ddlt is a Lua module implementing a generic lexer to help write parsers. It includes a tokenizer capable of recognizing C++, BASIC, and Pascal comments, identifiers, and number and string literals. A template engine is also included to ease the development of transpilers.
ddlt is inspired by DDLParser and my previous work with serialization for Insomniac.
The tokenizer recognizes:

- C++
  - Line comments from `//` to the end of the line.
  - Block comments from `/*` to `*/`. Nested comments are not supported.
  - C preprocessor directives, from `#` (as long as it's the first non-space character in the line) to the end of the line.
  - Identifiers in the form `[A-Za-z_][A-Za-z_0-9]*`.
  - Numbers in the form:
    - `0x[0-9A-Fa-f]+` as hexadecimal literals
    - `0b[01]+` as binary literals
    - `[1-9][0-9]*` as decimal literals
    - `0[0-7]+` as octal literals
    - `[0-9]*\.[0-9]+([Ee][+-]?[0-9]+)?` as float literals
    - `[0-9]+\.[0-9]*([Ee][+-]?[0-9]+)?` as float literals
    - `[0-9]+[Ee][+-]?[0-9]+` as float literals
    - Integer literals can be suffixed with `u`, `ul`, `ull`, `l`, `lu`, `ll`, or `llu`, in either lower or upper case
    - Float literals can be suffixed with `f` or `l`, in either lower or upper case
    - All numbers can have the digit separator `'` at any place, except at the beginning of a digit sequence or at the end of the literal
  - Strings, including the following escape sequences:
    - `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v`, `\\`, `\'`, `\"`, `\?`
    - `\x` followed by at least one hexadecimal digit
    - `\u` followed by exactly four hexadecimal digits
    - `\U` followed by exactly eight hexadecimal digits
    - `\` followed by at least one octal digit
  - String literals other than the narrow multibyte:
    - `L"..."` wide strings
    - `u8"..."` UTF-8 encoded strings
    - `u"..."` UTF-16 encoded strings
    - `U"..."` UTF-32 encoded strings
    - `R"..."` raw strings
    - `LR"..."` and `RL"..."` raw wide strings
    - `uR"..."` and `Ru"..."` raw UTF-16 encoded strings
    - `UR"..."` and `RU"..."` raw UTF-32 encoded strings
  - Character and multicharacter literals:
    - `'...'` character
    - `L'...'` wide character
    - `u8'...'` UTF-8 encoded character
    - `u'...'` UTF-16 encoded character
    - `U'...'` UTF-32 encoded character
  - Number, string, and character literals can have user-defined suffixes in the form `_[A-Za-z_0-9]*`
- BASIC
  - Line comments from `'` to the end of the line.
  - Line comments from `REM`, independent of case, to the end of the line.
  - Identifiers in the form `[A-Za-z_][A-Za-z_0-9]*`.
  - Numbers in the form:
    - `&[Hh][0-9A-Fa-f]+` as hexadecimal literals
    - `[0-9]+` as decimal literals
    - `&[Oo][0-7]+` as octal literals
    - `&[Bb][01]+` as binary literals
    - `[0-9]*\.[0-9]+([Ee][+-]?[0-9]+)?` as float literals
    - `[0-9]+\.[0-9]*([Ee][+-]?[0-9]+)?` as float literals
    - `[0-9]+[Ee][+-]?[0-9]+` as float literals
    - Integer literals can be suffixed with `%`, `&`, `s`, `us`, `i`, `ui`, `l`, or `ul`, in either lower or upper case
    - Float literals can be suffixed with `@`, `!`, `#`, `f`, `r`, or `d`, in either lower or upper case
  - Strings, where `""` inside the string is interpreted as a single `"` character.
- Pascal
  - Line comments from `//` to the end of the line.
  - Block comments from `(*` to `*)`. Nested comments are not supported.
  - Block comments from `{` to `}`. Nested comments are not supported.
  - Directives from `{$` to `}`, or from `(*$` to `*)`.
  - Numbers in the form:
    - `$[0-9A-Fa-f]+` as hexadecimal literals
    - `[0-9]+` as decimal literals
    - `&[0-7]+` as octal literals
    - `%[01]+` as binary literals
    - `[0-9]+\.[0-9]+` as float literals
    - `[0-9]+[Ee][+-]?[0-9]+` as float literals
    - `[0-9]+\.[0-9]+[Ee][+-]?[0-9]+` as float literals
  - Strings, where `#[0-9]+` can appear anywhere outside the single quotes to denote the character corresponding to the given number, e.g. `#65'B'` is equivalent to `'AB'`.

The tokenizer can also recognize and return freeform blocks, delimited by user-defined markers, which can contain arbitrary content.
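For instance, here is a minimal sketch of a lexer configured with one pair of freeform delimiters (the source string and symbol list are made up for illustration):

```lua
local ddlt = require 'ddlt'

-- Everything between [{ and }] comes back as a single <freeform> token.
local lexer = ddlt.newLexer{
  source = 'code = [{ anything at all }]',
  file = 'sketch.ddl',
  language = 'cpp',
  symbols = {'='},
  freeform = {{'[{', '}]'}}
}

local la = {}

repeat
  assert(lexer:next(la))
  print(la.token, la.lexeme)
until la.token == '<eof>'
```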
`make` should do the job. It will generate a shared object that can be required in Lua code.
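A quick smoke test after building, assuming the resulting shared object is reachable through Lua's `package.cpath`:

```lua
-- Check that the module loads and exposes the documented entry points.
local ddlt = require 'ddlt'
assert(type(ddlt.newLexer) == 'function')
assert(type(ddlt.newTemplate) == 'function')
print('ddlt is ready')
```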
If you have a Lua file with
```lua
local ddlt = require 'ddlt'

local parse = function(file)
  local inp = assert(io.open(file, 'rb'))
  local source = inp:read('*a')
  inp:close()

  local lexer = ddlt.newLexer{
    source = source,
    file = file,
    language = 'cpp',
    symbols = {'{', '}', ',', ';', '='},
    keywords = {'enum', 'struct', 'string', 'int', 'float'},
    freeform = {{'[{', '}]'}}
  }

  local tokens = {}
  local max = 0

  repeat
    local la = {}
    assert(lexer:next(la))
    la.lexeme = la.lexeme:gsub('\n', '\\n')
    tokens[#tokens + 1] = la
    max = math.max(max, #la.token)
  until la.token == '<eof>'

  tokens.max = max
  return tokens
end

local template = [[
/*! local tkfmt = '%-' .. args.max .. 's' */
/*! for i = 1, #args do */
/*! local la = args[i] */
line = /*= string.format('%3d', la.line) */ token = /*= string.format(tkfmt, la.token) */ lexeme = /*= la.lexeme */
/*! end */
]]

if #arg ~= 1 then
  error('missing input file\n')
end

local res = {}
local tokens = parse(arg[1])
local templ = assert(ddlt.newTemplate(template, '/*', '*/'))
templ(tokens, function(line, out) res[#res + 1] = out end)

res = table.concat(res):gsub('\n+', '\n')
io.write(res)
```
and an input file with
```
// The weapons available in the game
enum Weapons {
  kFist,
  kChainsaw,
  kPistol,
  kShotgun,
  kChaingun,
  kRocketLauncher,
  kPlasmaGun,
  kBFG9000
};

/* The player */
struct Hero {
  string name = "John \"Hero\" Doe";
  int health = 100;
  int armour = 0x0;
  float speed = 14.3;
  isAlive = [{
    [{
      // Nested freeform block
    }]
    return health > 0;
  }]
};
```
the result of running the parser will be
```
$ lua test.lua test.ddl
line =   1 token = <linecomment>  lexeme = // The weapons available in the game\n
line =   2 token = enum           lexeme = enum
line =   2 token = <id>           lexeme = Weapons
line =   2 token = {              lexeme = {
line =   3 token = <id>           lexeme = kFist
line =   3 token = ,              lexeme = ,
line =   4 token = <id>           lexeme = kChainsaw
line =   4 token = ,              lexeme = ,
line =   5 token = <id>           lexeme = kPistol
line =   5 token = ,              lexeme = ,
line =   6 token = <id>           lexeme = kShotgun
line =   6 token = ,              lexeme = ,
line =   7 token = <id>           lexeme = kChaingun
line =   7 token = ,              lexeme = ,
line =   8 token = <id>           lexeme = kRocketLauncher
line =   8 token = ,              lexeme = ,
line =   9 token = <id>           lexeme = kPlasmaGun
line =   9 token = ,              lexeme = ,
line =  10 token = <id>           lexeme = kBFG9000
line =  11 token = }              lexeme = }
line =  11 token = ;              lexeme = ;
line =  13 token = <blockcomment> lexeme = /* The player */
line =  14 token = struct         lexeme = struct
line =  14 token = <id>           lexeme = Hero
line =  14 token = {              lexeme = {
line =  15 token = string         lexeme = string
line =  15 token = <id>           lexeme = name
line =  15 token = =              lexeme = =
line =  15 token = <string>       lexeme = "John \"Hero\" Doe"
line =  15 token = ;              lexeme = ;
line =  16 token = int            lexeme = int
line =  16 token = <id>           lexeme = health
line =  16 token = =              lexeme = =
line =  16 token = <decimal>      lexeme = 100
line =  16 token = ;              lexeme = ;
line =  17 token = int            lexeme = int
line =  17 token = <id>           lexeme = armour
line =  17 token = =              lexeme = =
line =  17 token = <hexadecimal>  lexeme = 0x0
line =  17 token = ;              lexeme = ;
line =  18 token = float          lexeme = float
line =  18 token = <id>           lexeme = speed
line =  18 token = =              lexeme = =
line =  18 token = <float>        lexeme = 14.3
line =  18 token = ;              lexeme = ;
line =  19 token = <id>           lexeme = isAlive
line =  19 token = =              lexeme = =
line =  19 token = <freeform>     lexeme = [{\n    [{\n      // Nested freeform block\n    }]\n    return health > 0;\n  }]
line =  25 token = }              lexeme = }
line =  25 token = ;              lexeme = ;
line =  25 token = <eof>          lexeme = <eof>
```
See the `examples` and `tests` folders for unit tests, a simple tokenizer, and a simple Finite State Machine compiler written with ddlt.
Your parser can require ddlt to access functions to tokenize input source code and to create templates to generate code, as well as some functions to help deal with the file system.
- `absolute = realpath(path)`: returns the absolute path for the given path.
- `dir, name, ext = split(path)`: splits a path into its constituents: dir, file name, and extension.
- `path = join(dir, name, ext)`: joins the path constituents, forming a whole path. Any part can be nil, and will not appear in the resulting path.
- `entries = scandir(path)`: returns a table with all the entries in the specified path.
- `info = stat(path)`: returns a table with information about the object at path, as returned by stat, containing `size`, `atime`, `mtime`, `ctime`, `sock`, `link`, `file`, `block`, `dir`, `char`, and `fifo`.
- `lexer = newLexer(options)`: returns a new tokenizer (see below).
- `template = newTemplate(code, open_tag, close_tag, name)`: returns a function that, when called, will execute the template (see below).
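As a sketch of how the path helpers compose (whether `ext` includes the leading dot is an assumption here, so treat the commented values as illustrative):

```lua
local ddlt = require 'ddlt'

-- Split a path into its constituents and put it back together.
local dir, name, ext = ddlt.split('/home/user/parser.ddl')
local path = ddlt.join(dir, name, ext)  -- the original path again

-- Any part can be nil; this drops the extension from the result.
local noext = ddlt.join(dir, name, nil)

print(path, noext)
```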
`newLexer` returns a tokenizer for the given source code. It accepts a table with the following fields:

- `source`: a string with the entire source code that will be tokenized.
- `file`: a string with the name of the object used to create the source code (usually the file name from where the source code was read); it's used in error messages.
- `symbols`: an array of valid symbols.
- `language`: a string containing the language used to parse identifiers, string and number literals, and comments. Supported languages are `'cpp'` for C++, `'bas'` for BASIC, and `'pas'` for Pascal.
- `freeform`: an array where each element is a table with two elements, the opening and closing delimiters used to recognize freeform blocks.

Optionally, the table can have these fields:

- `startline`: the starting line for the source code, useful for creating a lexer from a freeform block. Defaults to 1 if not given.
- `keywords`: an array of valid keywords, which will then be returned instead of the generic `<id>` token.
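A sketch of how `startline` helps when re-lexing the contents of a freeform block so error messages keep pointing at the original file; `blockLexeme` and `blockLine` are hypothetical values captured from a previous `<freeform>` token:

```lua
-- blockLexeme: contents of a freeform block (delimiters possibly stripped),
-- blockLine: the line where that block started in the original file.
local sub = ddlt.newLexer{
  source = blockLexeme,
  file = 'test.ddl',
  startline = blockLine,
  language = 'cpp',
  symbols = {';', '>'},
  keywords = {'return'}
}
```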
The resulting object has only one method, `next`. It takes a table where the information about the next token is stored:

- `token`: a string describing the lookahead. It can be:
  - `<id>` when the lookahead is an identifier
  - `<float>` when it's a floating point literal
  - `<binary>` when it's a binary literal
  - `<octal>` when it's an octal literal
  - `<decimal>` when it's a decimal literal
  - `<hexadecimal>` when it's a hexadecimal literal
  - `<string>` when it's a string literal
  - `<eof>` when there are no more tokens in the source code
  - `<linecomment>` and `<blockcomment>` when it's a line or a block comment
  - `<linedirective>` and `<blockdirective>` when it's a C preprocessor or Pascal directive
  - `<freeform>` when it's a freeform block as configured in `newLexer`
  - A symbol, as identified by the `symbols` array
  - A keyword, as identified by the `keywords` array, when provided
- `lexeme`: a string with the value of the token as found in the source code
- `line`: the line number where the token is in the source code
- `index`: the 1-based index of the current lexeme, counted from the start of the source code

`next` will also return the same table passed to it as an argument if successful. In case of errors, it will return `nil`, plus a string describing the error. The error message is always in the format `<file>:<line>: message`, which is the standard way to describe errors in compilers.
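Since `next` follows the usual Lua convention of returning `nil` plus a message on failure, it composes with `assert`, or the error can be handled by hand; `lexer` here is a lexer created as above, and the message in the comment is made up:

```lua
local la = {}
local res, err = lexer:next(la)

if not res then
  -- err looks like 'test.ddl:15: malformed number' (hypothetical message)
  io.stderr:write(err, '\n')
  os.exit(1)
end
```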
Line and block comments and directives, being returned by the tokenizer, allow for interesting things like copying preprocessor directives to the output or processing them as they appear. If they are not wanted, remove them from the token stream in your parser's `match` method, e.g.:
```lua
local lexer = newLexer{
  -- ...
}

local la = {}

local parser = {
  -- ...
  match = function(self, token)
    if token and token ~= la.token then
      error(string.format('%u: %s expected', la.line, token))
    end

    -- Skip comment tokens so the parser never sees them.
    repeat
      lexer:next(la)
    until la.token ~= '<linecomment>' and la.token ~= '<blockcomment>'
  end,
  -- ...
}
```
Nested freeform blocks are allowed, so it's easy to, for example, process C++ class declarations in header files to generate code that creates Lua bindings for them.
Templates can be used to make it easier to generate code. The `newTemplate` function accepts three or four arguments:

```lua
newTemplate(code, open_tag, close_tag, name)
```

- `code`: the template source code
- `open_tag`: the opening tag that delimits special template instructions
- `close_tag`: the closing tag
- `name`: an optional template name, which is used in error messages; `'template'` is used if this argument is not provided
There are two template tags: one to emit content to the output, and another to execute arbitrary Lua code. To emit content, use the open tag followed by `=`. To execute code, use the open tag followed by `!`.

As an example, if you use `/*` and `*/` as delimiters:

- `/*= ... */` causes `...` to be generated in the output
- `/*! ... */` causes `...` to be executed as Lua code
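A tiny self-contained sketch using both tags with these delimiters; it writes one greeting per element of `args`:

```lua
local ddlt = require 'ddlt'

local template = assert(ddlt.newTemplate(
  '/*! for i = 1, #args do */Hello, /*= args[i] */!\n/*! end */',
  '/*', '*/'
))

-- emit receives the template line and the produced code; here we just print it.
template({'Mario', 'Luigi'}, function(line, out) io.write(out) end)
```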
The return value of `newTemplate` is a Lua function that will run the template when executed. This returned function accepts two arguments: `args`, which is used to send arbitrary data to the template, and `emit`, a function which receives the template line number and the code produced by running the template (i.e. `emit(line, code)`).
Example:
```lua
-- 'source' holds the template's source code, and 'path' the file it came from.
local template, code = ddlt.newTemplate(source, '#', '#', path)

local output = {}
local last_line = 0

local function emit(line, code)
  -- Emit a #line directive whenever the template line changes, so errors in
  -- the generated code can be traced back to the template.
  if line ~= last_line then
    if #output ~= 0 and output[#output]:sub(-1, -1) ~= '\n' then
      output[#output + 1] = '\n'
    end

    output[#output + 1] = string.format('#line %d "%s"\n', line, path)
    last_line = line
  end

  output[#output + 1] = code
end

template(args, emit)
print(table.concat(output, ''))
```
- 5.1.1
  - Don't generate empty `switch` statements in `fsmc.lua`
- 5.1.0
  - Added the `index` field to the table returned by `next`
- 5.0.0
  - Now all languages correctly detect a symbol starting with a dot in the middle of a number literal
- 4.0.1
  - Fixed the Pascal number lexer when a number is followed by the range symbol `..`
  - Fixed two apostrophes in Pascal strings denoting a single apostrophe
  - Fixed keywords in BASIC and Pascal not being case-insensitive
- 4.0.0
  - Major rewrite:
    - New code style
    - Simplified code
    - New error generation using `lua_error` to simplify the code path
    - Use `const` throughout the code
    - Use C99
    - Removed usage of LuaSrcDiet
  - Fixes to the number lexers
- 3.0.0
  - Added the template line as a parameter to the `emit` function
  - `newTemplate` now returns the compiled template code as a second return value
- 2.8.2
  - Added `stateName(State)` to the FSM generated by `fsmc.lua`
  - Fixed building the transitions argument list while generating code in `fsmc.lua`
- 2.8.1
  - Added an optional starting line to use with the source code
  - Added a Finite State Machine compiler to the `examples` folder
- 2.8.0
  - Breaking change: the `freeform` field of the lexer options table can now specify multiple delimiters, i.e. `{{'[{', '}]'}, {'(', ')'}}`
- 2.7.0
  - Fixed not erroring on unterminated strings
  - Added the digit separator to C++ numbers
  - Better handling of suffixes
  - Added user-defined suffixes to C++ number and string literals
  - Run `boot.lua` through LuaSrcDiet to save space
- 2.6.0
  - Added binary number literals to C++
  - Simpler code to deal with directives
  - Differentiate line and block directives
  - Removed debug code from `lexer.c`
- 2.5.0
  - Added Pascal directives
  - Paved the way to allow for custom directives
- 2.4.1
  - Added a correct rockspec and bumped up the version
  - Fixed the older rockspecs
- 2.4.0
  - Removed the `isSymbol` function that identified symbols; now the `newLexer` options must contain just the `symbols` array
  - Removed the `maxSymbolLength` field from the lexer options
  - Some code improvements
- 2.3.3
  - Fixed an issue where the lexer would only return a symbol if all its prefixes were also valid symbols
- 2.3.2
  - Better code to check the end of a raw C++ string
- 2.3.1
  - Fixed line comments and C preprocessor directives at the end of the file without a trailing newline
- 2.3.0
  - Added support for C preprocessor directives
- 2.2.1
  - Fixed raw strings not accepting a `)`
- 2.2.0
  - Added support for C++ character and multicharacter literals
- 2.1.0
  - Added support for the wide, UTF-8, UTF-16, and UTF-32 string literal prefixes, and for all raw string prefixes
- 2.0.0
  - Made ddlt a module
- 1.3.1
  - Allow nested freeform blocks
- 1.3.0
  - Comments are now returned by the tokenizer
- 1.2.0
  - Added support for Pascal
- 1.1.0
  - Removed square brackets from keywords
- 1.0.0
  - First proper release