- How To Compile Code
- ABOUT LEX
- ABOUT YACC
- If you don't have a lex or yacc compiler, first install them using the following commands:
sudo apt-get install flex
sudo apt-get install byacc
- Then compile the yacc file first and the lex file second, using these commands:
yacc -d filename.y
lex filename.l
gcc lex.yy.c y.tab.c -o lex_yacc
./lex_yacc
- The '-d' flag tells yacc to also generate the y.tab.h header file.
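- For example, assuming a hypothetical pair of source files calc.y and calc.l (the names here are just for illustration), the whole build-and-run sequence looks like this:
yacc -d calc.y
lex calc.l
gcc lex.yy.c y.tab.c -o calc
./calc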
- Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The code written by Lex recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.
- Although Lex is often used as a front-end to a parser, it has been designed such that it can be used stand-alone. Used in this fashion, Lex makes for a simple but very powerful text processing tool.
- Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called "host languages". Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations.
- At present, the only supported host language is C.
A lex program has the following basic structure:
%{
C declarations and includes
%}
Lex macro definitions and directives
%%
Lex Specification
in the form of pattern/action statements like this:
keyword { my_c_code(yytext); }
%%
C language program (the rest)
However, the only mandatory part is the first %%.
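- To make this structure concrete, here is a minimal sketch of a complete lex program that counts words on its input (the pattern, variable name, and output format are all illustrative, not from any standard):
%{
#include <stdio.h>
int word_count = 0;              /* how many words we have seen so far */
%}
%%
[a-zA-Z]+   { word_count++; }    /* each run of letters counts as one word */
.|\n        ;                    /* discard everything else */
%%
int main(void) {
    yylex();                     /* scan all of the standard input */
    printf("%d words\n", word_count);
    return 0;
}
int yywrap(void) { return 1; }   /* no further input after end-of-file */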
- Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected.
          +-------+
Source -> |  Lex  | -> yylex
          +-------+

          +-------+
Input  -> | yylex | -> Output
          +-------+

      An overview of Lex
- For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.
%%
[ \t]+$ ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule.
- This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates one or more ...; and the $ indicates end of line, as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:
%%
[ \t]+$ ;
[ \t]+  printf(" ");
- The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.
- Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces.
           lexical        grammar
            rules          rules
              |              |
              v              v
         +---------+    +---------+
         |   Lex   |    |  Yacc   |
         +---------+    +---------+
              |              |
              v              v
         +---------+    +---------+
Input -> |  yylex  | -> | yyparse | -> Parsed input
         +---------+    +---------+

             Lex with Yacc
- Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.
- So far, we have been using lex in a "stand-alone" mode, and linking in the (hitherto mysterious) libl library using LDLIBS=-ll.
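- (That is, assuming the lex source is in rip-url.l, a command like make rip-url LDLIBS=-ll builds the program through make's implicit rules.)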
- In fact, lex does not generate a complete program. Lex generates a single function,
(int) yylex()
and some associated global variables.
- yylex() reads the input indicated by the global variable (FILE*) yyin.
- yyin defaults to the standard input.
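- If you want yylex() to read from somewhere else, simply point yyin at another stream before calling yylex(). For example (the filename here is hypothetical):
yyin = fopen("input.txt", "r");   /* scan this file instead of stdin */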
- When lex reaches the end of the file it is reading, it calls a function
(int) yywrap()
If yywrap() returns non-zero, yylex() returns a zero value. If yywrap() returns zero, yylex() keeps scanning, from where it left off, with whatever input is available on yyin. This is only useful if yywrap() has changed yyin to provide for additional input.
- The library libl (or libfl for flex) provides two functions which are needed to complete our stand-alone lex program:
main()
...which simply calls yylex()
yywrap()
...which always returns non-zero.
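- In other words, linking with libl is roughly equivalent to supplying these two definitions yourself (a sketch of the library's behaviour, not its literal source):
int main() {
    yylex();                     /* scan until end of input */
    return 0;
}
int yywrap() { return 1; }       /* always report "no more input" */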
- Let's rewrite rip-url such that we do not need the standard libl, and add a few more features along the way.
%{
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <errno.h>
int file_num;
int file_num_max;
char **files;
%}
%%
(ftp|http):\/\/[^ \n<>"]* printf("%s\n",yytext);
.|\n ;
%%
int main(int argc, char *argv[]) {
    file_num = 1;
    file_num_max = argc;
    files = argv;
    if ( argc > 1 ) {
        /* open the first file named on the command line */
        if ( (yyin = fopen(argv[file_num],"r")) == 0 ) {
            perror(argv[file_num]);
            exit(1);
        }
    }
    while ( yylex() )
        ;
    return 0;
}

int yywrap() {
    fclose(yyin);
    if ( ++file_num < file_num_max ) {
        /* open the next command-line file and keep scanning */
        if ( (yyin = fopen(files[file_num],"r")) == 0 ) {
            perror(files[file_num]);
            exit(1);
        }
        return 0;
    } else {
        return 1;   /* no files left: tell yylex() to stop */
    }
}
- We now have a function main() which opens the first file (if specified) and calls yylex(). When yylex() has finished with the first file, it calls yywrap(), which opens the next file, and yylex() continues. When yywrap() has exhausted all the command-line arguments, it returns 1, and yylex() returns with the value 0 (though we don't use the return value).
- Moreover, since we have now provided both main() and yywrap(), we no longer need the libl library, and we can compile rip-url using simply:
make rip-url
- Notice that the libl library (lex library) is required only if we do not provide main() or yywrap(). The libl library is not required for any of the text-processing - this is all done by the lex-generated C-code.
- Although none of our examples have so far done so, it is valid to execute a return() statement within a lex rule.
- yylex() is of type int, so a non-zero integer value would normally be returned. Returning zero would be ambiguous, because zero is what yylex() returns when it encounters an end-of-file and yywrap() returns non-zero.
- After yylex() has returned, it is possible to call it again and again, and the scanner will continue exactly where it left off each time. If any start-condition was in force when the return() was executed, it will still apply when yylex() is called again.
- This aspect of yylex() plays a key role when lex is being used as a front-end to a parser, such as yacc. When writing a stand-alone lex program, it is generally not required to have a return() statement within a lex rule.
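- For example, a lexer that feeds a yacc parser might contain rules like these (the token names are hypothetical and would normally be defined in y.tab.h):
[0-9]+      { return INT; }      /* hand the parser an INT token */
exec        { return EXEC; }     /* hand the parser a keyword token */
[ \t\n]+    ;                    /* whitespace produces no token at all */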
- Yacc provides a general tool for imposing structure on the input to a computer program. The Yacc user prepares a specification of the input process; this includes rules describing the input structure, code to be invoked when these rules are recognized, and a low-level routine to do the basic input.
- Yacc then generates a function to control the input process. This function, called a parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. These tokens are organized according to the input structure rules, called grammar rules; when one of these rules has been recognized, then user code supplied for this rule, an action, is invoked; actions have the ability to return values and make use of the values of other actions.
- Yacc is written in a portable dialect of C[1] and the actions, and output subroutine, are in C as well. Moreover, many of the syntactic conventions of Yacc follow C.
- A yacc specification is structured along the same lines as a Lex specification.
%{
/* C declarations and includes */
%}
/* Yacc token and type declarations */
%%
/* Yacc Specification
in the form of grammar rules like this:
*/
symbol : symbols tokens
{ $$ = my_c_code($1); }
;
%%
/* C language program (the rest) */
- The Yacc Specification rules are the place where you "glue" together the various tokens that lex has conveniently provided to you.
- Each grammar rule defines a symbol in terms of:
- other symbols
- tokens (or terminal symbols) which come from the lexer.
- Each rule can have an associated action, which is executed after all the component symbols of the rule have been parsed. Actions are basically C-program statements surrounded by curly braces.
- Yacc rules define what is a legal sequence of tokens in our specification language. In our case, let's look at the rule for a simple, executable menu-command:
menu_item : LABEL EXEC
;
- This rule defines a non-terminal symbol, menu_item, in terms of the two tokens LABEL and EXEC. Tokens are also known as "terminal symbols", because the parser does not need to expand them any further. Conversely, menu_item is a "non-terminal symbol" because it can be expanded into LABEL and EXEC.
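- Because a rule may also refer to non-terminal symbols, including itself, grammars are usually built up recursively. A whole menu, for instance, could be sketched as a sequence of menu_item symbols (this rule is illustrative, not from our specification):
menu : menu_item
     | menu menu_item
     ;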
- You may notice that people use UPPER CASE for terminal symbols (tokens), and lower-case for non-terminal symbols. This is not a strict requirement of yacc, but just a convention that has been established. We will follow this convention throughout our discussion.
- Actions within yacc rules take the form:
menu_item : LABEL EXEC '\n'
{ C-program statements }
;
- So far, we have only considered the tokens LABEL and EXEC as single-valued integers which are passed from the lexer to the parser. What we really need is access to the text-strings associated with these tokens (i.e., their semantic value).
- We could do this using a global variable (like token_txt in our spam-checking program), except that yacc executes the action after it has read all the tokens up to that point.
- Consider the MENU keyword, in our case. Yacc has to check whether it is followed by another string or a newline, before it can decide whether it is being used to introduce a sub-menu within the same file, or an external menu-file.
- In any case, yacc provides a formal method for dealing with the semantic value of tokens. It begins with the lexer. Every time the lexer returns a value, it should also set the external variable yylval to the value of the token. Yacc will then retain the association between the token and the corresponding value of yylval.
- In order to accommodate a variety of different token-types, yylval is declared as a union of different types.
- Token types are declared in yacc using the yacc declaration %union, like this:
%union {
char *str;
int num;
}
- This defines yylval as being a union of the types (char *) and (int). This is a classical C-program union, so any number of types may be defined, and the union may even contain struct types, etc. For now, we'll just have these two types.
- We also need to tell yacc which type is associated with which token. This is done by modifying our %token declarations to include a type, like this:
%token <str> LABEL
%token <str> EXEC
%token <num> INT
We do not need to modify any other %token declarations, because they are all for keywords, which do not require any associated value.
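- Once the tokens have types, their values become available inside rule actions through the positional names $1, $2, and so on. For example (the printf() action here is purely illustrative):
menu_item : LABEL EXEC
        { printf("item: label=%s exec=%s\n", $1, $2); }
    ;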
Now we need to modify the lexer. This is done by including the line:
yylval.str = strdup(yytext);
just before the lexer returns the LABEL and EXEC tokens. We'll also include the line:
yylval.num = atoi(yytext);
just before the lexer returns the INT token.
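Putting this together, the relevant lex rules might look something like this (the patterns themselves are illustrative sketches):
[A-Za-z]+   { yylval.str = strdup(yytext); return LABEL; }  /* strdup() needs <string.h> */
[0-9]+      { yylval.num = atoi(yytext); return INT; }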
- Now that we understand the hows and whys of yacc rules, we can try our hand at writing a basic parser. Before we can successfully compile it, there are a few more things we'll need:
- A main() function that calls our parser, yyparse()
- A yyerror() function
- A yywrap() function (for the lexer).
- Debugging output (optional? :-))
- Yacc generates yyparse()
- Yacc generates a single function called yyparse(). This function requires no parameters, and returns either 0 on success, or 1 on failure. "Failure" in the case of the parser means "if it encounters a syntax error".
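- A minimal main() is therefore little more than a wrapper around that return value, along these lines (a sketch, assuming <stdio.h> is included in the declarations section):
int main(void) {
    if ( yyparse() == 0 )
        printf("parse succeeded\n");
    else
        printf("parse failed\n");
    return 0;
}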
The yyerror() function is called when yacc encounters invalid syntax. yyerror() is passed a single string (char *) argument. Unfortunately, this string usually just says "parse error", so on its own, it's pretty useless. Error recovery is an important topic which we will cover in more detail later on. For now, we just want a basic yyerror() function like this:
int yyerror(char *err) {
    fprintf(stderr, "%s\n", err);
    return 0;
}
See the section "Interface" in the Bison documentation for more information.