In the final section of this two-part tutorial we will tackle the creation of our actual parser, which checks whether the input tokens matched by `flex` are emitted in a syntactically correct order against our `ptuc` specification. We will use (GNU) `bison` in order to create our parser; this implementation was tested with `bison` versions >= 3.0.4. Although `bison` originates from `yacc`, it is a superset of its features, so I don't know if the code here is compatible with `yacc` -- I would assume that it isn't by default, but I have not tested against it.

As with `flex`, `bison` has a similar layout, which is the following:
%{
user code
%}
bison declarations
%%
bison grammar rules
%%
more user code
The user code is copied verbatim into the resulting `.c` file; this might include custom libraries, constants, includes and so on. Most includes and function prototype declarations go in the top user-code section while their actual definitions go in the bottom user-code section.
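To make the layout concrete, here is a minimal, hypothetical `bison` file that follows the skeleton above; the `NUM` token and the printed messages are made up for illustration, and `yylex` would normally be supplied by `flex`:

%{
#include <stdio.h>
/* prototypes for the functions bison expects us to provide */
int yylex(void);
void yyerror(const char *msg);
%}
/* bison declarations */
%token NUM
%%
/* grammar rules: accept any sequence of NUM tokens */
input:
%empty
| input NUM {printf("matched a NUM token\n");}
;
%%
/* more user code: the error reporting callback */
void
yyerror(const char *msg)
{fprintf(stderr, "parse error: %s\n", msg);}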
`bison` tokens and types are allowed to have an accompanying value; that value can be used for emitting text or can be treated as a return value in order to do a specific task later on. If we need this functionality we have to tell `bison` the type of that particular tag by declaring it inside a special `union`, like this:
%union
{
char* tagA;
double tagB;
}
This `union` tells `bison` that when we use a token or a type that is accompanied by `tagA`, we can use its value like a normal `C` string; on the other hand, if it's accompanied by `tagB` then that value will be treated as a `double` instead. It has to be noted that, since this is a `union` and not a `struct`, the values share the same memory rather than occupying separate segments as `struct` fields would. We can have as many different tags as we want inside the union, and both tokens and types can have the same tag.
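Since the shared-storage point can be surprising, here is a tiny standalone `C` sketch (separate from `bison` itself; the member names mirror the union above) that demonstrates it:

#include <stdio.h>

union semantic_value
{
    char* tagA;
    double tagB;
};

int main(void)
{
    union semantic_value v;
    v.tagB = 3.14;     /* write through tagB... */
    v.tagA = "hello";  /* ...tagA now reuses the very same bytes */
    /* only the member written last holds a meaningful value */
    printf("%s\n", v.tagA);
    return 0;
}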
As we previously said in the `flex` part, the tokens must first be defined in our parser so that we can then generate the `.tab.h` include which we use in `flex`. Tokens are also called terminal symbols, because we cannot break them into simpler entities. Token declarations must be inside `bison`'s declaration section. Additionally, tokens can have "tags", which basically tell `bison` that this token can be accompanied by a particular value; hence a token definition has the following syntax:
%token <tagA> TOK_NAME_A
%token <tagB> TOK_NAME_B
For convenience, you can also group tokens that have the same tag inside the same declaration, like so:
%token <tagA> TOK_NAME_A TOK_NAME_C TOK_NAME_D
%token <tagB> TOK_NAME_B
The tags `tagA` and `tagB` must be defined inside the union described in the previous section. So, using the union shown there, inside rules we can use `tagA` as a string and `tagB` as a `double`.
`bison` types are used to indicate that a `bison` state can have a return value, much like tokens. The same rules apply, but types have to be declared using a different directive, like so:
%type <tagA> TYPE_NAME_A
%type <tagB> TYPE_NAME_B
Again, as with tokens, you can group types that have the same tag inside the same declaration, like so:
%type <tagA> TYPE_NAME_A
%type <tagC> TYPE_NAME_B TYPE_NAME_C TYPE_NAME_F
The tags are again defined inside the union, as is the case with the tags used by tokens -- basically the same rules apply here as well.
In `bison` we create a grammar which evaluates whether the tokens that `flex` generated are emitted in a syntactically correct order. In order to do that we will have to take a good look at our language specification and break it down into building blocks that we can then express using `bison` rules. In `ptuc`, abstractly, we have a program which can have modules, declarations and a body, ending with a dot. This skeleton already gives us an intuition of how to express our language in a rule. Concretely, the rule for our whole program is the following:
program:
incl_mods program_decl decls body KW_DOT
;
We see that the grammar rule `program` is satisfied if and only if `incl_mods`, `program_decl`, `decls`, `body` and `KW_DOT` are satisfied in that specific order. In `bison` a rule is matched from left to right and all of its sub-rules must be satisfied as well. So for this particular example the matching order is the following:
incl_mods
program_decl
decls
body
KW_DOT
These can be tokens or other grammar rules that have their own constraints, which, as said previously, have to be satisfied as well -- think of it like a pre-order tree traversal. Grammar rules in general have the following syntax:
rule_name:
caseA
| caseB
.
.
| caseN
;
A rule starts with its unique name, then a colon (`:`), followed by a number of cases which are separated by a pipe (`|`); the last case must be followed by a semicolon (`;`). Also, rules don't have to be separated by line breaks, so this would be perfectly legal as well:
rule_name: caseA | caseB | ... | caseN;
Additionally, grammar rules can have return values and (indirectly) take arguments. In order for a rule to return a value of type `tag`, that tag has to be declared in the stack-type `union` and the `%type` directive has to be used to inform `bison` that we expect that particular rule to return a value of type `tag`. With that in mind, let's go ahead and show how to create a simple rule.
A simple rule is one that is comprised only of terminal symbols... so an example of a simple rule would be one that describes all of the scalar values in `ptuc`; let's go ahead and implement it. By our language specification (here) we see that the scalar values are comprised of the following:

- positive constant integers (token: `POSINT`)
- real numbers (token: `REAL`)
So a rule that would match these is the following:
scalar_vals:
POSINT {$$ = $1;}
| REAL {$$ = $1;}
;
A lot is happening here, but the main thing to note is that if `flex` returns to `bison` either a `POSINT` or a `REAL` token this rule will be matched; this "or" is indicated by the pipe (`|`) separating each case. The other important thing is that this particular rule returns a value; this is indicated by the assignment of `$1` to `$$`. `$$` is a symbol that indicates the return value of that rule (if any), while `$1` indicates the value that accompanies the first rule or token of the matched case. Let's look at another example.
// union entry
%union
{
int add_tag;
}
// token entry
%token <add_tag> POSINT REAL
// type entry
%type <add_tag> pos_real_add
// actual rule
pos_real_add:
POSINT REAL {$$ = $1 + $2;}
;
Assuming that both `POSINT` and `REAL` are accompanied by numeric values, the above is perfectly valid. Please be careful: this rule is only matched if `POSINT` is followed by a `REAL` token -- not the other way around. So if `flex` returns a `POSINT` and then a `REAL`, this rule will be matched and the resulting value will be the addition of the two.
We talked above about how to construct really basic rules; now we will see how to create more complex rules -- which (normally) will comprise most of your real-world grammars. I personally separate these rules into two main categories, which are:
- composition rules
- lists (or recursive rules)
Composition rules are basically rules that are composed from one or more rules (complex or simple). We have already seen one such rule previously, which was:
program:
incl_mods program_decl decls body KW_DOT
;
This is a composition rule, as it has a case which is comprised of four complex rules and one terminal symbol (`KW_DOT`). Generally speaking, a complex rule is one that has at least one case which is composed of at least one non-terminal symbol.
Lists are a special composition rule type, which is very interesting due to its property of letting us do recursive matches. In order to understand this a bit better, let's look at the definition of one such rule:
/* identifiers (left recursion) */
ident_list:
IDENT
| ident_list KW_COMMA IDENT
;
Usually lists have two (2) cases: one that is responsible for the recursion and one that handles the singular (terminal) case of that rule. In the above example the recursion happens when we match the second case. That particular case allows us to match the terminal symbols `KW_COMMA` and `IDENT` as well as the rule itself -- which essentially means that we can reduce the terminal symbols and re-enter the same rule until we match the first case, which ends the recursion.
It has to be noted that there are two types of recursion, left and right; the rule we just showed is an example of left recursion. The identical rule using right recursion would be the following:
/* identifiers (right recursion) */
ident_list:
IDENT
| IDENT KW_COMMA ident_list
;
Although you have the option to use right instead of left recursion, due to how `bison` works internally you should avoid it and opt to use only left recursion. That's because with right recursion, in order to parse that rule `bison` has to shift onto its stack all of the tokens that might be a potential match for that rule in the current context before it can apply the rule even once (i.e. perform a reduce operation, which "eats up" the stack).
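To make the difference concrete, here is a sketch of the stack activity when parsing the input `a, b, c` under each style (using the rules above):

/* left recursion: a reduce fires after every IDENT,
   so the stack never grows past three symbols */
shift IDENT                  -> reduce to ident_list
shift KW_COMMA, shift IDENT  -> reduce to ident_list
shift KW_COMMA, shift IDENT  -> reduce to ident_list

/* right recursion: no reduce is possible until the
   whole input has been shifted onto the stack */
shift IDENT KW_COMMA IDENT KW_COMMA IDENT   (five symbols deep)
reduce the innermost ident_list, then unwind outwards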
When parsing tokens `bison` tries to match them against one of the grammar rules that can potentially be applied in the current context. To do that `bison` supports two (2) basic operations, shift and reduce; shift means that tokens are pushed onto the stack while reduce means that tokens are consumed from the stack. Depending on how you have created your rules you might have introduced ambiguities in the grammar; this means that in at least one situation `bison` does not know precisely which single rule to apply. This creates a conflict, which `bison` (thankfully) reports at compile time as a warning. There can be two (2) different types of conflicts:
- shift/reduce
- reduce/reduce
A shift/reduce conflict happens when `bison` does not know which operation to perform, shift or reduce. An example of such a rule would be the following:
/* identifiers */
ident_list:
IDENT
| IDENT KW_COMMA
| ident_list KW_COMMA IDENT
;
The introduction of the second case creates an ambiguity in our grammar, which in turn creates a shift/reduce conflict. In order to better understand that, we have to explain a bit more about how `bison` parses the tokens and selects whether to shift or reduce. As `bison` parses the tokens it does not immediately reduce, even if a rule is matched; instead, `bison` always keeps one token ahead in a special variable called the lookahead token. The lookahead token is stored in `yychar` while (if any) its semantic value and its location are stored in `yylval` and `yylloc` respectively.
Now in our case, when we have parsed the `IDENT` token and `KW_COMMA` is the lookahead token, we can either use the second case and reduce, or use the third case and shift, as we can match the first two parts of the third case. The latter would happen if we matched `IDENT` as `ident_list`, reduced it using the first case and shifted the result, then matched the `KW_COMMA` and expected an `IDENT` in order to complete the rule. Normally shift/reduce conflicts should be resolved, and this can be done by either specifying precedence (which will be discussed later) or altering your grammar.
A reduce/reduce conflict is more serious than the previous one, as `bison` knows that it has to perform a reduce operation but has more than one way of doing so -- this is very bad and indicates a serious problem in the grammar. An example of such a rule would be the following:
/* identifiers */
ident_list:
%empty
| IDENT
| ident_list KW_COMMA IDENT
;
/* data-types using no cast */
cdata:
%empty
| KW_BOOLEAN
| KW_CHAR
| KW_INTEGER
| KW_REAL
;
/* combine them */
combs:
%empty
| combs ident_list
| combs cdata
;
Now in the above segment we have a serious issue; let's assume that we have to parse an empty input. If we try to match it using the above rules we can see that it can be reduced in multiple ways; thus `bison` does not know reliably which rule to use in order to reduce, hence the conflicts. As with shift/reduce conflicts, this can be resolved by altering the grammar itself or using precedence rules, which will be discussed later on.
In some grammars it is normal to have a small number of shift/reduce conflicts (but never reduce/reduce conflicts); in order to indicate that this is expected and suppress the warnings generated by `bison` upon compilation, one can use the `%expect` directive. This is placed inside the `bison` declaration section and takes as an argument the precise number of shift/reduce conflicts that we are expecting, like so:
%expect n
If the number of shift/reduce conflicts is not exactly equal to `n` (fewer or more), or if we have any reduce/reduce conflicts, an error is thrown and compilation is stopped.
Precedence is a really important aspect of your grammar; that is... if you want to do something meaningful with it you are bound to be affected by it. Let's first show an example; suppose that we have to add three numbers `a`, `b` and `c` (`a + b + c`). Now also suppose that we have the following grammar rule for addition:
add_op:
num `+` num
;
This rule expects two numbers with a plus sign (`+`) between them; but upon parsing the input we have the following:
num1 -> `+` -> num2 -> `+` -> num3
This would create a shift/reduce conflict, as `bison` does not know how to precisely parse the input -- there is more than one way of parsing the tokens received. We can either parse the input as:
(num1 `+` num2) `+` num3
Which would result in the following operations from `bison`:

(num1 + num2) match
- reduce using add_op
- shift result
(res1_2 + num3) match
- reduce using add_op
- shift result
On the other hand, we could also parse it in this (perfectly legal) way:
num1 `+` (num2 `+` num3)
Which would result in the following operations from `bison`:

(num2 + num3) match
- reduce using add_op
- shift result
(num1 + res2_3) match
- reduce using add_op
- shift result
The first example parses the expression using left operator associativity, while the second shows how we would parse it using right associativity. The actual problem arises when we have more than one operator to process in sequence, as `bison` does not know whether it should reduce or shift; by default, `bison` elects to shift instead of reduce.
There are four (4) precedence types in `bison`, three (3) of which declare both precedence as well as associativity, while the last (as its name suggests) declares only precedence. The complete list is the following:

- left (`%left`): indicates that this operator has left associativity (e.g. `(a + b) + c` is preferred); the syntax is: `%left symbols`
- right (`%right`): indicates that this operator has right associativity (e.g. `a + (b + c)` is preferred); the syntax is: `%right symbols`
- nonassoc (`%nonassoc`): indicates that this operator cannot be seen in sequence, which is considered a syntax error if encountered (e.g. `a + b + c` would throw an error); the syntax is: `%nonassoc symbols`
- precedence (`%precedence`): indicates just precedence, not associativity; the syntax is: `%precedence symbols`
In the previous cases precedence did not actually affect the result, but there are cases where it does; one such example is the following expression:

a * b + c

Here we don't have sequential operations using the same operator but different ones; again, if no precedence is set for the multiplication (`*`) and addition (`+`) operators, `bison` would not be certain which operation to perform first, `a * b` or `b + c`, and a shift/reduce conflict would again occur. Of course we can easily see that doing `(a * b) + c` is not the same as `a * (b + c)`. This is solved by defining the precedence of each operator relative to the others. This actually took me quite a while to figure out, as I did not spot it right away when reading the documentation.
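As a generic sketch of the fix (with made-up token names `NUM`, `OP_PLUS` and `OP_MUL`, not the `ptuc` ones): declaring the multiplication group after the addition group gives it higher precedence, so `bison` reduces `a * b` before touching the `+`.

/* declaration section: the group declared later binds tighter */
%left OP_PLUS
%left OP_MUL
/* rules section */
expr:
NUM
| expr OP_PLUS expr
| expr OP_MUL expr
;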
When declaring precedence groups, a group declared on a later line has greater precedence than the previous ones. It's also good practice to group operators of equal precedence in the same declaration. As a final example, the complete precedence rule list for `ptuc` follows.
%left KW_SEMICOLON
/* Class 8 prec. group */
%left KW_OP_OR KW_OR
/* Class 7 prec. group */
%left KW_OP_AND KW_AND
/* Class 6 prec. group */
%left KW_EQ KW_DIFF KW_LESS_EQ KW_LESS KW_GREATER_EQ KW_GREATER
/* Class 5 prec. group */
%left KW_OP_PLUS KW_OP_MINUS
/* Class 4 prec. group */
%left KW_OP_MUL KW_OP_DIV KW_DIV KW_MOD
/* Class 3 prec. group (for casting) */
%right TYPE_CAST_PREC
/* Class 2 prec. group */
%right UNARY_PREC
/* Class 1 prec. group (highest precedence) */
%right KW_NOT KW_OP_NOT
/* dangling else lookalikes */
%precedence IF_THEN
%precedence KW_ELSE
Again, sharp readers will notice something in the above code that I have yet to explain; and they'd be right, as I left that for last! Notice that there is no token defined for `IF_THEN` or `TYPE_CAST_PREC`; it turns out that one can declare precedence for arbitrary symbols and use the `%prec` directive to enforce that precedence inside a rule, like so:

add_op:
num `+` num %prec IF_THEN
;

This modification makes the `add_op` rule take on the precedence (and associativity) of `IF_THEN`, or of any other symbol used after `%prec`. This can also be used to enforce precedence in order to resolve conflicts in our grammar; using our previous example that created a shift/reduce conflict:
/* in decl. section */
%precedence lower
%precedence higher
/* identifiers */
ident_list:
IDENT
| IDENT KW_COMMA %prec higher
| ident_list KW_COMMA IDENT %prec lower
;
Since `higher` is declared after `lower` it has higher precedence; thus, by using the `%prec` directive we enforce that precedence in our rules, solving the ambiguity previously present in our grammar by always favoring the second case. This technique can be used for resolving reduce/reduce conflicts as well. The reasoning behind the precedence specification for the non-token symbols `IF_THEN` and `TYPE_CAST_PREC` will be explained when we tackle our `ptuc` grammar implementation.
This is a very important topic, as if you want to produce an AST (Abstract Syntax Tree) you'd probably allocate some memory to do that; in languages like `C` or `C++`, if you dynamically allocate memory you have to free it yourself; `bison` has no reasonable obligation to clean up your mess -- that's your job. Unfortunately, most parsers I've seen that allocate memory assume that, since the allocation happens inside `bison` itself, somehow `bison` (or `flex`) will take care of it; unfortunately that's not the case. Let's look at an example that would produce a definite memory leak in our program. We will use the rule we created above using the `POSINT` and `REAL` tokens; inside `flex`, the following actions run when they are matched:
{SNUMBER} {
yylval.crepr = strdup(yytext);
return POSINT;
}
{REAL} {
yylval.crepr = strdup(yytext);
return REAL;
}
The line `yylval.crepr = strdup(yytext);` means that we copy the value of `yytext` into `yylval.crepr` (which is a tag in our `bison` `%union { ... }`). This means that both of these tokens are accompanied by a dynamically allocated value stored in their tag. Hence, in the following rule:
// our union decl.
%union
{
char* crepr;
}
// later on...
pos_real_add:
POSINT REAL {$$ = strtol($1, NULL, 10) + strtol($2, NULL, 10);}
;
`$1` and `$2` contain the values of `POSINT` and `REAL` respectively. The rule does its job (after first converting them to actual numbers using the `strtol` function), which is to add the two and return their result by assigning it to `$$` -- there is a catch, however. After the rule finishes (returns the `$$`), the memory for `$1` and `$2` is not freed at any point, even though after the calls to `strtol` they are not used anymore; thus we have a definite memory leak. The correct way to handle dynamically allocated symbols or tags is to free them as soon as we have no use for them -- which in this case is after the calls to `strtol`; hence a correct implementation that does not create any leaks would be the following:
// our different union
%union
{
char* crepr;
double val;
}
// token and types
%token <crepr> POSINT REAL
%type <val> pos_real_add
// later on.
pos_real_add:
POSINT REAL {$$ = strtol($1, NULL, 10) + strtol($2, NULL, 10); free($1); free($2);}
;
Now let's look at yet another example, where handling these `free`'s isn't that obvious.
// assume that above we had: %type <crepr> pos_int_num and %type <val> pos_add
pos_int_num:
POSINT {$$ = $1; free($1);}
;
pos_add:
POSINT `+' POSINT {$$ = strtol($1, NULL, 10) + strtol($3, NULL, 10); free($1); free($3);}
;
Here we have two rules: one that recognizes `POSINT`'s and one that simply performs an addition if the input is a `POSINT` followed by a plus sign (`+`) and another `POSINT`. We said that we should free up the values right after we are finished with them, so naturally one would say that in the `pos_int_num` rule we are done with the value of `POSINT`, so we should free it. That's incorrect and would most likely cause a segmentation fault; this is because `$$` points to that particular string, since we assign `$$` to be equal to `$1` -- and `$1` is a pointer to the string. Thus the value of `$$` for each `POSINT` is the actual value that `$1` and `$3` will have in the `pos_add` rule; so if we free it there, when `strtol` tries to use it, it will most likely fail -- not to mention that `free` will attempt to free an already freed location. The correct way of handling the above scenario would be the one below:
// assume that above we had: %type <crepr> pos_int_num and %type <val> pos_add
pos_int_num:
POSINT {$$ = $1;}
;
pos_add:
POSINT `+' POSINT {$$ = strtol($1, NULL, 10) + strtol($3, NULL, 10); free($1); free($3);}
;
Sharp readers will immediately spot that the above segments, regardless of the tips provided, would still be quite prone to double-free or invalid-free errors -- and they would be perfectly right. This is why, when dealing with such issues, I have constructed a wrapper function to take care of them -- but it requires the user to obey a few simple rules. Let's illustrate the function first and then explain how it actually works.
/* handy clean-up function */
void
tf(char *s)
{if(s != NULL && strcmp(s, "") != 0) {free(s);}}
This is simple but very effective, because we can call it freely on any symbol that we might want to free, even those that have already been released. This is done through the `if` that checks whether `s` is `NULL` or equal to `""`. The first check is straightforward, as it tests whether `s` is a valid pointer, while the second one is a bit weird at first; why check for that particular value, you might ask? It's simple: we use that value to indicate that we have an empty string or a symbol that might not have a value inside that rule. Thus the rule is that we always expect `s` to be a pointer that has a value of `NULL` or `""` at any given point where we don't want `free` to be called on that particular symbol. It's pretty neat and works quite well in practice (and doesn't clutter the codebase either!).
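To illustrate the convention, here is a small hypothetical usage sketch (the string value is made up):

char *sym = strdup("some_ident");  /* a real heap-allocated semantic value */
tf(sym);   /* freed */
tf(NULL);  /* ignored: the symbol carries no value */
tf("");    /* ignored: our empty-string sentinel */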
Another important thing (that is neglected quite often, I am afraid) is how to handle things when something goes bad during parsing -- and oh boy, in compilers isn't that usual. Thankfully, `bison` has the ability to clean up its own stuff when something does go wrong -- but what about our dynamically allocated symbols? Thankfully, `bison` has a facility that can help us free up our own resources by calling a destructor on each of the discarded symbols -- but what does `bison` consider a discarded symbol? The following list is mostly an extract from `bison`'s manual, which defines what `bison` considers discarded symbols.
Valid discarded symbols are the symbols that fall into at least one of the following categories:

- stacked symbols popped during the first phase of error recovery
- incoming terminals during the second phase of error recovery
- the current lookahead and the entire stack (except the current right-hand side symbols) in case the parser is fail-fast
- the current lookahead and the entire stack (including the current right-hand side symbols) when the `C++` parser catches an exception in `parse`
- the start symbol, when the parser succeeds
Additionally, `bison` only calls the destructors for user-defined symbols. If a destructor is declared for a tag-less type or token whose destructor expects a semantic value to be present, a warning notifying the user of that fact should pop up during compilation. One such example is when you use `<*>` as a tag, which indicates that all of your tokens and types are expected to return a value. Let's first look at the destructor syntax:
%destructor { /* C code */ } symbol_list
The destructor is placed in the declaration section of the `bison` file, and between the braces (`{`, `}`) any valid `C` or `C++` code can be entered, followed by a list of symbols that this destructor should be executed for. So this would be a perfectly legal destructor for the `crepr` tag:
%destructor { tf($$); } <crepr>
Destructors can be declared on a per-tag basis or globally; that means we can assign different destructors to different tags. That's useful because some tags might have different types declared for them inside the `union`, hence a different way of freeing them might be required. For our needs and purposes the above destructor will suffice, but for reference here are a couple more valid examples:
%union
{
void *aptr;
char* crepr;
}
/* destructor specific for <crepr> tag */
%destructor { tf($$); } <crepr>
/* destructor specific for <aptr> tag */
%destructor { if($$ != NULL) {free($$);} } <aptr>
%destructor { /* any tag destructor */ } <*>
%destructor { /* tagless symbol destructor */ } <>
Notice the tag-less destructor, which is called on every symbol that is present. This might raise the question of how `bison` selects which destructor to call, since more than one might apply. Should `bison` detect that there is a tag-specific destructor for a discarded symbol, it will call that one and ignore the more generic ones; so in the case of discarding a symbol that has a semantic value of `<crepr>`, only the destructor for `<crepr>` will be called.
If we encounter a lexer (`flex`) or parser (`bison`) error under default conditions, our parser will terminate after seeing the first error -- such parsers are called fail-fast parsers, but nobody actually wants them... (except maybe lazy students). This means that if our input has more than one mistake we would only detect the first and exit! This process would get quite tedious, as repeated compilations would be required in order to detect and fix all the errors, instead of reporting all the errors in our source on the first parse.

Thankfully, `bison` has error recovery capabilities! They are also quite simple to define and explain, but require a lot of skill, practice and a hellish insight in order to be used correctly. Say for example that we have the following rule:
program_decl:
KW_PROGRAM IDENT KW_SEMICOLON
;
What if an error was encountered inside that rule? Say, for example, we had a valid `KW_PROGRAM` token as well as an `IDENT` but no semicolon (`;`); without error recovery capabilities our parser would terminate, as we previously said.
In order to allow `bison` to recover from a syntax error, we have to create a rule that recognizes a special `bison` token called `error`. It is defined by default and there is no need to define it explicitly. Thus the above rule would be modified as follows in order to allow for error recovery:
program_decl:
KW_PROGRAM IDENT KW_SEMICOLON
| error KW_SEMICOLON
;
The above addition does a couple of things: firstly, if we detect an error we will immediately switch to the second (error handling) case, and secondly we will ignore and discard everything encountered up to the next semicolon (`;`). This switch happens by discarding both the values of the current semantic tags in the current context (if any) as well as all the tokens on the `bison` stack, until we reach a point where the rule which contains the `error` token is acceptable -- in our case this happens when we encounter the next semicolon (`;`). Of course, for all of the discarded symbols the appropriate destructors will be run as well, if any.
Additionally, after encountering an error it's likely that this will create many more consecutive errors; to avoid this console spam, `bison` suppresses error messages until three (3) consecutive tokens have been parsed and shifted successfully. If you don't like (or want) this default behavior, you can put `yyerrok` inside the error rule like so:
program_decl:
KW_PROGRAM IDENT KW_SEMICOLON
| error KW_SEMICOLON { yyerrok; }
;
Later on, when defining our `ptuc` grammar, we will give insights regarding the places where we elected to put our error recovering states.
In this section I will briefly describe the `ptuc` grammar rules and how they are structured, while also touching in greater detail on some implementation-related topics that I find quite intriguing. As a general rule you should first understand the grammar you want to create rules for -- that's really important (`ptuc` spec. ref. here); if you "get it", then translating those rules into a working grammar should be quite easy. This section will follow the natural structure of the grammar, defining small primitives and using them to build up our grammar. Also, please be aware that we only copy values when we have to (e.g. to change the printed format using the `template` function); otherwise we just pass the pointer of the already allocated value to the return value (`$$`). If in doubt, look at the complete source (ptucc_parser.y).
Primitives are the building blocks that we will use in order to create more complex rules; in this category I included the following language elements:
- strings
- scalar/boolean values
- data-types
- identifiers
- expressions
- statements
The first three are simple rules to implement as well as understand; things start to get a little bit tricky when dealing with statements and expressions. Most rules return a value, and in this grammar we only have one tag, `<crepr>`, which is of type `char *`; so all of the return values are assumed to be of that type.
Strings use a very simple rule that just groups the terminal symbols for the two types of strings `ptuc` supports. The only catch is that when we have a single-quoted string we change the single quotes to double quotes using the `string_ptuc2c` function defined in `cgen.c`.
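I won't reproduce the real `string_ptuc2c` here (it lives in `cgen.c`); purely as an illustration, a minimal sketch of the idea -- assuming the string still carries its surrounding quotes -- could look like this:

/* sketch: swap the surrounding single quotes for double
   quotes, in place, and hand the string back */
char *
string_ptuc2c(char *s)
{
size_t len;
if(s == NULL || (len = strlen(s)) < 2)
{return s;}
if(s[0] == '\'' && s[len-1] == '\'')
{s[0] = '"'; s[len-1] = '"';}
return s;
}

The rule itself is then the following: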
/* string values */
string_vals:
STR_LIT {$$ = $1;}
| STRING {$$ = string_ptuc2c($1);}
;
The scalar values rule is again very simple and similar to strings:
scalar_vals:
POSINT {$$ = $1;}
| REAL {$$ = $1;}
;
Boolean values are handled the same way as well, but here we don't just pass the accompanying value to the output; rather, we use `strdup` to copy the string `"true"` or `"false"` to the output.
bool_vals:
KW_BOOL_TRUE {$$ = strdup("true");}
| KW_BOOL_FALSE {$$ = strdup("false");}
;
Then both of these are joined into a complex rule that can describe both in one go:
/* scalar/bool values */
lit_vals:
scalar_vals {$$ = $1;}
| bool_vals {$$ = $1;}
;
The data-types here are just a collection of the available ones `ptuc` has, excluding custom types.
/* data-types using no cast */
cdata:
KW_BOOLEAN {$$ = "bool";}
| KW_CHAR {$$ = "char";}
| KW_INTEGER {$$ = "int";}
| KW_REAL {$$ = "double";}
;
Here we compose a data-type rule that includes all of the default types and can also accept any custom type, which is treated as an identifier (`IDENT`).
/* data-types that might use cast to type */
cdata_with_type:
cdata {$$ = strdup($1);}
| IDENT {$$ = $1;}
;
Notice that in the case of `cdata` we have to copy the string, due to the fact that the strings were statically allocated inside the `cdata` rule.
In `ptuc` we support type casting, that is, we can cast one type into another; this is done in the same way as it's done in `C`. Additionally we also support infinite nesting, so we allow `((((type)))) var_to_cast` as well as `(type) var_to_cast`; they are treated as equivalent. To allow this we use recursion: the first case recursively strips away the surplus parenthesis pairs up to the point where we reach one pair and one type. The `template` function here is used to return a formatted string and performs a dynamic allocation, thus we have to free up the resources using `tf`.
/* expression that we use to allow casting with inf. nested pars */
type_cast:
KW_LPAR type_cast KW_RPAR
{$$ = template("(%s)", $2); tf($2);}
| KW_LPAR cdata_with_type KW_RPAR
{$$ = template("(%s)", $2); tf($2);}
;
Previously we talked about identifiers, so here is the rule that implements them; it basically joins the terminal case with a second case that allows us to recursively parse more than one identifier separated by commas (`,`):
/* identifiers */
ident_list:
IDENT {$$ = $1;}
| ident_list KW_COMMA IDENT
{$$ = template("%s, %s", $1, $3); tf($1); tf($3);}
;
Another important identifier rule is the one we have to use for arrays, as they are followed by a number of bracket pairs equal to their dimension; so a 3-dimensional array like `array[2][3][4]` is followed by three bracket pairs. In our rule this is represented by the `brackets_list` rule, which is also shown below.
/* this is to allow ident[index] scheme */
ident_with_bracket:
IDENT {$$ = $1;}
| IDENT brackets_list
{$$ = template("%s%s", $1, $2); tf($1); tf($2);}
;
/*
handle variable array size, eat up brackets in a
padding fashion e.g. [] -> [][] -> [][][] ...
*/
brackets_list:
KW_LBRA POSINT KW_RBRA
{$$ = template("[%s]", $2); tf($2);}
| brackets_list KW_LBRA POSINT KW_RBRA
{$$ = template("%s[%s]", $1, $3); tf($1); tf($3);}
;
Inside `expression` we group the two kinds of expressions we have: one-side expressions such as `!ident` or `+1`, and two-side expressions such as `a + b`, `b % c` etc. We intuitively composed them into the rule that follows:
expression:
one_side_exp {$$ = $1;}
| two_side_exp {$$ = $1;}
;
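The `one_side_exp` and `two_side_exp` rules themselves are straight translations of the spec. (the full set lives in ptucc_parser.y); a hedged sketch of their shape, showing only a couple of cases each and reusing token names we have already met, could be:

/* sketch: only a couple of cases from each rule are shown */
one_side_exp:
KW_OP_NOT basic_exp %prec UNARY_PREC
{$$ = template("!%s", $2); tf($2);}
| KW_OP_PLUS basic_exp %prec UNARY_PREC
{$$ = template("+%s", $2); tf($2);}
;
two_side_exp:
basic_exp KW_OP_PLUS basic_exp
{$$ = template("%s + %s", $1, $3); tf($1); tf($3);}
| basic_exp KW_MOD basic_exp
{$$ = template("%s %% %s", $1, $3); tf($1); tf($3);}
;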
Additionally, we also consider strings, literal values, function calls and so on to be expressions... so another rule was created to accommodate that:
basic_exp:
type_cast exp_join %prec TYPE_CAST_PREC
{$$ = template("%s %s", $1, $2); tf($1); tf($2); }
| KW_LPAR basic_exp KW_RPAR {$$ = template("(%s)", $2); tf($2);}
| proc_call
{$$ = $1;}
| lit_vals
{$$ = $1;}
| string_vals
{$$ = $1;}
| expression
{$$ = $1;}
;
The important part is the use of precedence here: in the first case, where we match a type-cast followed by an expression, the `%prec` directive enforces the precedence and associativity of `TYPE_CAST_PREC`, overriding the precedence `bison` would otherwise infer; this eliminates a potential ambiguity in our grammar, since if we did not enforce that precedence `bison` would not be certain whether to shift using the first case or reduce using the second case.
The final step is to describe the `exp_join` rule, which is the following:
exp_join:
ident_with_bracket {$$ = $1;}
| basic_exp {$$ = $1;}
;
This is a clever way to pack the bracket-followed (array) identifiers into expressions without causing any ambiguities in our grammar.
Statements form the backbone of our logic; they contain most of the useful building blocks, which are really not all that interesting in their implementation -- just a straight-up translation of the spec.'s rules into tokens. All of them are then composed into the complex rule that follows:
statement:
common_stmt {$$ = $1;}
| proc_call {$$ = template("%s;\n", $1); tf($1);}
| while_stmt {$$ = $1;}
| for_stmt {$$ = $1;}
| if_stmt {$$ = $1;}
| label_stmt {$$ = $1;}
| ret_stmt {$$ = $1;}
| body {$$ = $1;}
;
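Most of the sub-rules above are plain translations of the spec.; as one illustrative example, here is a hedged sketch of how `if_stmt` might use the `IF_THEN` precedence symbol we declared earlier to resolve the dangling-else ambiguity (`KW_IF` and `KW_THEN` are assumed token names; the real rule lives in ptucc_parser.y):

if_stmt:
KW_IF expression KW_THEN statement %prec IF_THEN
{$$ = template("if(%s) %s", $2, $4); tf($2); tf($4);}
| KW_IF expression KW_THEN statement KW_ELSE statement
{$$ = template("if(%s) %s else %s", $2, $4, $6); tf($2); tf($4); tf($6);}
;

Since `KW_ELSE` is declared after `IF_THEN` it has higher precedence, so when the lookahead is `KW_ELSE` `bison` shifts it instead of reducing the else-less case -- each `else` binds to the nearest `if`.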
Then we can define the following two rules, which allow us to parse a long series of statements using recursion while also handling empty input correctly.
statements:
{$$ = "";}
| statement_list { $$ = $1; }
;
statement_list:
statement
| statement_list KW_SEMICOLON statement
{ $$ = template("%s%s", $1, $3); tf($1); tf($3); }
;
Our command enclosure (body) is comprised of zero or more statements; these include assignments, function/procedure calls and so on. The rule is simple, as the statements are expanded into their respective rules. The `body` rule is the following:
body:
KW_BEGIN statements KW_END
{$$ = template("{%s}", $2); tf($2);}
| KW_BEGIN error KW_END
{$$ = "";}
;
This is handy, as we can use it in all places where we want to have multiple statements, like inside `if` and `while` statements, function/procedure definitions, modules and so on. There is a catch though: in functions we have to use a different body rule, as they need to support the `result` keyword; this happens by using the same structure, albeit with the special statements rule we introduced for functions. The resulting rule follows.
func_body:
KW_BEGIN func_stmts KW_END
{$$ = template("{%s}", $2); tf($2);}
| KW_BEGIN error KW_END
{$$ = "";}
;
Modules (as we previously said in our lexer part, here) are stitched into the input, so `bison` has no idea that we are reading from another file; it just sees a token sequence. Per our spec., inside modules we only allow declarations of variables, functions and so on; `incl_mod` makes sure that this happens, and if not, an error is thrown. The `incl_mods` rule allows us to parse multiple modules while also handling empty input correctly.
incl_mods:
{$$ = "";}
| incl_mods incl_mod KW_SEMICOLON
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
;
incl_mod:
KW_MODULE IDENT incl_mods KW_BEGIN decls KW_END KW_DOT
{
$$ = template("// included module %s\n%s\n%s", $2, $3, $5);
tf($2); tf($3); tf($5);
}
| error KW_SEMICOLON {$$ = "";}
;
This is a very simple rule; it just checks that the program header is aligned with the rules in our spec. -- if not, the second case is matched and an error is reported.
program_decl:
KW_PROGRAM IDENT KW_SEMICOLON { $$ = $2; }
| error KW_SEMICOLON {$$ = "";}
;
This rule allows us to declare variables, functions/procedures and custom data-types; we can declare none or many of each, in any order -- gotta love recursive rules :).
/* flexible decls allow in any order variable, type, function decls */
decls:
/* in case of no decls */
{$$ = "";}
| decls error KW_SEMICOLON {$$ = $1;}
| decls type_decl
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
| decls var_decl
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
| decls func_decl
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
| decls proc_decl
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
;
Variables are parsed using a recursive rule again, but, as with the function/procedure arguments that we will see below, if we have multiple variables chained to one data-type we cannot transparently convert from `ptuc` to `C`; thus we created two helper functions, `ident_to_pointer` and `ident_to_array`, which are used to solve this problem. All of these use a similar technique and I'll only explain one of them, `ident_to_pointer`. In `ptuc` we can have the following variable declaration:
var
a: integer;
b, c, d: array of char;
Which would translate to C
as follows:
int a;
char *b, *c, *d;
But if we look at the rule for parsing the variables, due to the order of expansion the value of the data-type is not available upon identifier expansion, so it cannot be padded in. The rule is the following:
/* variables */
var_decl:
KW_VAR var_decl_list
{ $$ = template("%s\n", $2); tf($2); }
;
var_decl_list:
var_decl_single {$$ = $1;}
| var_decl_list var_decl_single
{$$ = template("%s\n%s", $1, $2); tf($1); tf($2);}
;
Hence the logic behind `ident_to_pointer` is to pad the pointer onto each identifier that needs it, like so:
/* pointerize identifiers */
char *
ident_to_pointer(char *s) {
if(s == NULL || strcmp(s, "") == 0)
{return "";}
/* find the number of variables */
uint32_t vars = find_counts(s, ',');
size_t buf_len = strlen(s)+(4*vars)+1;
char *cur_tok = strtok(s, ","),
*new_buf = calloc(buf_len, sizeof(char));
if(new_buf == NULL)
{return NULL;}
while(cur_tok != NULL) {
strcat(new_buf, " *");
strcat(new_buf, cur_tok);
cur_tok = strtok(NULL, ",");
if(cur_tok != NULL)
{strcat(new_buf, ",");}
}
return new_buf;
}
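For completeness, here's a hedged sketch of where a `var_decl_single`-style rule might invoke it for the `array of` case (`KW_ARRAY` and `KW_OF` are assumed token names; the actual rule is in ptucc_parser.y):

var_decl_single:
ident_list KW_COLON KW_ARRAY KW_OF cdata KW_SEMICOLON
{
char *ptrs = ident_to_pointer($1);
if(ptrs == NULL)
{yyerror("No memory to pad pointers to ident");}
else {
/* e.g. "b, c, d" with "char" becomes roughly "char *b, *c, *d;" */
$$ = template("%s%s;\n", $5, ptrs);
tf($1); tf(ptrs);
}
}
;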
Procedures are like our main `body`, albeit they take arguments; we again use the `template` function in order to translate `ptuc` syntax into `C` as per our spec.
/* proc. decl. */
proc_decl:
KW_PROCEDURE IDENT
KW_LPAR type_only_arguments KW_RPAR KW_SEMICOLON
decls body KW_SEMICOLON
{
$$ = template("void %s(%s) {\n%s\n%s}\n",
$2, $4, $7, $8);
tf($2); tf($4); tf($7); tf($8);
}
;
Interestingly enough, there is a catch here; due to the way arguments are structured in `ptuc` there is no elegant way of performing the conversion using just `bison`'s magic. Let's illustrate this with an example; imagine that we had the following `ptuc` procedures:
procedure evalA(f: real; v: integer);
begin
(* proc. body *)
end
procedure evalB(f, c: real; v, l: integer);
begin
(* proc. body *)
end
Both are perfectly valid; but `C` requires each variable that is part of a function's argument list to carry its own type, so the above procedures would translate into `C` like this:
void evalA(double f, int v) {
// proc. body
}
void evalB(double f, double c, int v, int l) {
// proc. body
}
The following rule correctly and reliably parses the `ptuc` arguments but fails to translate them into `C` transparently; that's because `ident_list` does not know the value of `cdata_with_type` at the time of expansion, and inside the `type_only_arguments` rule we only have the completely parsed strings from both `ident_list` and `cdata_with_type`. This becomes an issue when we have more than one variable assigned to the same data-type (like in `evalB`), so in order to remedy this we created `ident_to_cdata`, which does exactly that... it combines the identifiers with their type and returns the correct equivalent `C` syntax.
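The real `ident_to_cdata` lives in `cgen.c`; a simplified sketch of the idea -- assuming, as with `ident_to_pointer`, a comma-separated identifier list and the same `find_counts` helper -- might look as follows:

/* pad the data-type before each identifier, e.g.
   ("f, c", "double") -> "double f, double c" */
char *
ident_to_cdata(char *ids, char *type)
{
if(ids == NULL || type == NULL)
{return NULL;}
/* worst case the type (plus spacing) precedes every identifier */
size_t buf_len = strlen(ids)+((find_counts(ids, ',')+1)*(strlen(type)+2))+1;
char *cur_tok = strtok(ids, ","),
*new_buf = calloc(buf_len, sizeof(char));
if(new_buf == NULL)
{return NULL;}
while(cur_tok != NULL) {
while(*cur_tok == ' ')
{cur_tok++;} /* skip the space that follows each comma */
strcat(new_buf, type);
strcat(new_buf, " ");
strcat(new_buf, cur_tok);
cur_tok = strtok(NULL, ",");
if(cur_tok != NULL)
{strcat(new_buf, ", ");}
}
return new_buf;
}

The rule that uses it is the following: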
/* type only arguments */
type_only_arguments:
{$$ = "";}
| ident_list KW_COLON cdata_with_type
{
char *s = ident_to_cdata($1, $3);
if(s == NULL) {
tf($3); yyerror("No memory to pad cdata to ident");
} else
{$$ = s; tf($1); tf($3);}
}
| type_only_arguments KW_SEMICOLON ident_list KW_COLON cdata_with_type
{
char *s = ident_to_cdata($3, $5);
if(s == NULL) {
tf($3); tf($5); yyerror("No memory to pad cdata to ident");
} else {
$$ = template("%s, %s", $1, s);
tf($1); tf($3); tf($5); tf(s);
}
}
;
Functions are like procedures, with one exception -- they have a return value; as per our spec., this requires us to also have a special variable inside our function called `result`, to which we assign our resulting value. In the end we can either return the `result` value or one of our own. The `func_body` rule is altered to accommodate that fact, as it uses a modified statements rule.
/* function-decl. */
func_decl:
KW_FUNCTION IDENT
KW_LPAR type_only_arguments KW_RPAR
KW_COLON cdata_with_type KW_SEMICOLON
decls func_body KW_SEMICOLON
{
$$ = template("%s %s(%s) {%s result;\n%s\n%s\nreturn result;}\n",
$7, $2, $4, $7, $9, $10);
tf($2); tf($4); tf($7); tf($9); tf($10);
}
;
Here is a simple summary of the `bison` special commands and a brief description of what they do -- a cheat-sheet, if you'd like to call it that.
- `%union`: indicates the available tags and their types.
- `%token`: declares a token -- a numerical value that indicates a specific input was matched; these usually originate from `flex`.
- `%type`: is used to indicate that a `bison` state returns a value.
- `%right`: used to indicate right associativity (e.g. a + b + c -> a + (b + c)).
- `%left`: used to indicate left associativity (e.g. a + b + c -> (a + b) + c).
- `%nonassoc`: used to indicate that the operators are not associative; using them in sequence would indicate a syntax error.
- `%start`: used to indicate our starting rule (or you could call it a state as well).
- `%expect`: used to indicate that we expect our grammar to have a specific number of shift/reduce conflicts (not reduce/reduce conflicts). If the numbers differ, a compilation error is thrown (e.g. with `%expect n`, if the shift/reduce count is not exactly `n` we fail to compile).
- `%pure_parser`: used to indicate that we want our parser to be reentrant; this means that it will not use global variables to communicate with `flex`. This option is useful when we want to create a thread-safe parser.
- `%raw`: this is to enable token compatibility with `yacc`; `yacc` starts numbering multi-character tokens sequentially from 257 while `bison` normally starts from 3. Single-character tokens in `yacc` use their ANSI value.
- `%no_lines`: don't generate the `#line` pre-processor directives that inform the `C` compiler of the line numbering scheme. Avoid this option, as the `#line` directives map your source code to the resulting `.c` file produced by `bison`; this helps debugging a lot.
- `%token_table`: generates an array of token names in the parser file. The generated array is named `yytname` and each token name is located at `yytname[i]`, where `i` is the number `bison` has assigned to that token. Along with the table, `bison` generates some macros that provide useful information; these are:
  - `YYNTOKENS`: the number of defined tokens plus one (for the `EOF`).
  - `YYNNTS`: the number of non-terminal symbols.
  - `YYNRULES`: the number of grammar rules.
  - `YYNSTATES`: the number of (generated) parser states.
- `%define`: allows us to tweak some parser parameters by defining configuration properties (more here); the most common would be `%define parse.error verbose`, which enables more verbose syntax error output.
I had a blast creating this project and writing this guide. Due to my recent dabblings I don't get to play much with these tools so it was a nice escape. Hopefully you learned something through all of this ;).