BinPAC

BinPAC is a high-level language for describing protocol parsers; the binpac compiler translates these descriptions into C++ code. It is currently maintained and distributed with the Zeek Network Security Monitor distribution, but the generated parsers can also be used by programs other than Zeek.

Download

You can find the latest BinPAC release for download at https://www.zeek.org/download.

BinPAC's git repository is located at https://github.com/zeek/binpac

This document describes BinPAC 0.61.0-23. See the CHANGES file for version history.

Prerequisites

BinPAC relies on the following libraries and tools, which need to be installed before you begin:

  • Flex (Fast Lexical Analyzer)

    Flex is already installed on most systems, so with luck you can skip having to install it yourself.

  • Bison (GNU Parser Generator)

    Bison is also already installed on many systems.

  • CMake 2.8.12 or greater

    CMake is a cross-platform, open-source build system, typically not installed by default. See http://www.cmake.org for more information regarding CMake and the installation steps below for how to use it to build this distribution. By default, CMake generates native Makefiles that depend on GNU Make.

Installation

To build and install into /usr/local:

./configure
cd build
make
make install

This will perform an out-of-source build into the build directory using the default build options and then install the binpac binary into /usr/local/bin.

You can specify a different installation directory with:

./configure --prefix=<dir>

Run ./configure --help for more options.

Glossary and Convention

To make this document easier to read, it uses the following terms and conventions:

  • PAC grammar - .pac file written by user.
  • PAC source - _pac.cc file generated by binpac
  • PAC header - _pac.h file generated by binpac
  • Analyzer - Protocol decoder generated by compiling PAC grammar
  • Field - a member of a record
  • Primary field - member of a record as direct result of parsing
  • Derivative field - member of a record evaluated through post processing

BinPAC Language Reference

BinPAC language consists of:

  • analyzer
  • type - a data-structure-like definition describing a parsing unit. Types can build on each other to form more complex types, similar to yacc productions.
  • flow - "flow" defines how data will be fed into the analyzer and the top level parsing unit.
  • Keywords
  • Built-in macros

Defining an analyzer

There are two components to an analyzer definition: the top level context and the connection definition.

Context Definition

Each analyzer requires a top level context defined by the following syntax:

analyzer <ContextName> withcontext {
... context members ...
}

Typically, the top-level context contains pointers to the top-level analyzer (the connection) and to the flow, as shown below:

analyzer HTTP withcontext {
   connection : HTTP_analyzer;
   flow     : HTTP_flow;
};

Connection Definition

A "connection" defines the entry point into the analyzer. It consists of two "flow" definitions, an "upflow" and a "downflow".

connection <AnalyzerName>(optional parameter) {
 upflow = <UpflowConstructor>;
 downflow = <DownflowConstructor>;
}

Example:

connection HTTP_analyzer {
   upflow = HTTP_flow (true);
   downflow = HTTP_flow (false);
};

type

A "type" is the basic building block of a binpac-generated parser and describes the structure of a byte segment. Each non-primitive "type" generates a C++ class that can independently parse the structure it describes.

Syntax:

type <typeName>{(<optional type parameter(s)>)} = <compositor or primitive class>{
  cases or members declaration.
} <optional attribute(s)>;

Example:

PAC grammar:

type myType = record {
   data:uint8;
};

PAC header:

class myType{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const  { return data_; }
protected:
   uint8 data_;
};

Primitives

A primitive type can be treated like a #define in C: it is embedded into the types that reference it and does not generate any parsing code of its own. Available primitive types are:

  • int8
  • int16
  • int32
  • uint8
  • uint16
  • uint32
  • Regular expression ( type HTTP_URI = RE/[[:alnum:][:punct:]]+/; )
  • bytestring

Examples:

type foo = record { x: number; };

is equivalent to:

type foo = record { x: uint8[3]; };

(Note: this behavior may change in future versions of binpac.)

record

A "record" composes primitive types and other records into a new "type". The new "type" can in turn be used as part of a parent type or directly for parsing.

Example:

type SMB_body = record {
   word_count  : uint8;
   parameter_words : uint16[word_count];
   byte_count  : uint16;
};
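To make the dependency between fields concrete, here is a minimal hand-written C++ sketch of what parsing such a record involves. It is not binpac output: the names SmbBody, read_u16_le and parse_smb_body are hypothetical, and little-endian byte order is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical hand-written counterpart of SMB_body; not binpac output.
struct SmbBody {
    std::uint8_t word_count = 0;
    std::vector<std::uint16_t> parameter_words;
    std::uint16_t byte_count = 0;
};

// Reads a little-endian 16-bit value (byte order is an assumption of this sketch).
inline std::uint16_t read_u16_le(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Parses one SmbBody from [begin, end) and returns the number of bytes consumed.
// Throws std::out_of_range if the input is too short.
inline std::size_t parse_smb_body(const std::uint8_t* begin,
                                  const std::uint8_t* end, SmbBody& out) {
    const std::size_t avail = static_cast<std::size_t>(end - begin);
    if (avail < 1) throw std::out_of_range("missing word_count");
    out.word_count = begin[0];
    // The array length depends on a field parsed earlier in the same record.
    const std::size_t needed = 1 + 2 * static_cast<std::size_t>(out.word_count) + 2;
    if (avail < needed) throw std::out_of_range("truncated SMB_body");
    out.parameter_words.clear();
    const std::uint8_t* p = begin + 1;
    for (std::size_t i = 0; i < out.word_count; ++i, p += 2)
        out.parameter_words.push_back(read_u16_le(p));
    out.byte_count = read_u16_le(p);
    return needed;
}
```

The point of the sketch is that the size of parameter_words is only known after word_count has been read, which is why the fields are evaluated in declaration order.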

case

The "case" compositor allows switching between different parsing methods.

type SMB_string(unicode: bool, offset: int) = case unicode of {
   true  -> u: SMB_unicode_string(offset);
   false -> a: SMB_ascii_string;
};

A "case" supports an optional "default" label that matches when none of the other labels do. If no field follows a given label, the user can specify an arbitrary field name with the "empty" type, as in the following example.

type HTTP_Message(expect_body: ExpectBody) = record {
       headers:     HTTP_Headers;
       body_or_not: case expect_body of {
               BODY_NOT_EXPECTED -> none: empty;
               default           -> body: HTTP_Body(expect_body);
       };
};

Note that only one field is allowed after a given label. If multiple fields are to be specified, they should be packed in another "record" type first. The other usages of case are described later.

array

A type can be defined as a sequence of elements of a single type. By default, an array type keeps parsing array elements in an unbounded loop. An array size can be specified to fix the number of elements, or &until can be used to end parsing once a condition holds:

# This will match exactly 10 elements
type HTTP_Headers = HTTP_Header [10];

# This will match until the condition is met
type HTTP_Headers = HTTP_Header [] &until(/*Some condition*/);

An array can also be used directly inside a "record". For example:

type DNS_message = record {
 header:      DNS_header;
 question:    DNS_question(this)[header.qdcount];
 answer:      DNS_rr(this, DNS_ANSWER)[header.ancount];
 authority:   DNS_rr(this, DNS_AUTHORITY)[header.nscount];
 additional:  DNS_rr(this, DNS_ADDITIONAL)[header.arcount];
} &byteorder = bigendian, &exportsourcedata;
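As an illustration of what &byteorder = bigendian implies for the counts used as array sizes above, the following hand-written C++ sketch reads the four section counts from a DNS header. It is not binpac output; DnsCounts, read_u16_be and read_dns_counts are hypothetical names, and the offsets follow the fixed 12-byte header layout of RFC 1035.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads a big-endian (network order) 16-bit value.
inline std::uint16_t read_u16_be(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Section counts from the fixed 12-byte DNS header (RFC 1035, section 4.1.1).
struct DnsCounts {
    std::uint16_t qdcount, ancount, nscount, arcount;
};

// Fills `out` and returns true, or returns false if fewer than 12 bytes are available.
inline bool read_dns_counts(const std::uint8_t* data, std::size_t len, DnsCounts& out) {
    if (len < 12) return false;
    out.qdcount = read_u16_be(data + 4);   // bytes 0-3 hold the id and flags
    out.ancount = read_u16_be(data + 6);
    out.nscount = read_u16_be(data + 8);
    out.arcount = read_u16_be(data + 10);
    return true;
}
```

Each count then bounds the number of elements parsed for the corresponding array field.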

flow

A "flow" defines how data is fed into the analyzer. It also maintains custom state information declared with %member. A flow is configured by specifying the type of its data unit.

Syntax:

flow <Flow name>(<optional attribute>) {
  <flowunit|datagram> = <top level data unit> withcontext (<context constructor parameter>);
};

When a "flow" is added to the top-level analyzer context, it enables the use of &oneline and &length in "record" types. The flow buffers data while there is not enough to evaluate the record and dispatches the data for evaluation once the threshold is reached.

flowunit

When flowunit is used, the analyzer uses a flow buffer to handle incremental input and to support &oneline/&length. For further details, see Buffering.

flowunit = HTTP_PDU(is_orig) withcontext (analyzer, this);

datagram

In contrast to flowunit, declaring the data unit as datagram opts out of the flow buffer. This results in faster parsing but provides no incremental input or buffering support.

datagram = HTTP_PDU(is_orig) withcontext (analyzer, this);

Byte Ordering and Alignment

Byte Ordering

Byte Alignment

type RPC_Opaque = record {
   length: uint32;
   data:   uint8[length];
   pad:    padding align 4;    # pad to 4-byte boundary
};
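The arithmetic behind "pad to 4-byte boundary" can be sketched in a few lines of C++. This is an illustration only: padding_for and rpc_opaque_size are hypothetical helpers, and the sketch assumes the alignment is measured from the start of the record.

```cpp
#include <cassert>
#include <cstddef>

// Number of pad bytes needed to advance `offset` to the next multiple of
// `alignment`. `alignment` must be non-zero.
constexpr std::size_t padding_for(std::size_t offset, std::size_t alignment) {
    return (alignment - offset % alignment) % alignment;
}

// Total size of an RPC_Opaque-like record: a 4-byte length field, the payload,
// then padding up to a 4-byte boundary (measured from the record start).
constexpr std::size_t rpc_opaque_size(std::size_t length) {
    return 4 + length + padding_for(4 + length, 4);
}

static_assert(padding_for(8, 4) == 0, "aligned offsets need no padding");
static_assert(rpc_opaque_size(5) == 12, "4 + 5 + 3 pad bytes");
```

The outer modulo keeps the result at zero when the offset is already aligned, instead of adding a full extra block.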

Functions

Users can define functions in binpac. A function can be declared in one of three ways:

PAC with embedded body

A PAC-style function prototype with the body embedded using %{ %}:

function print_stuff(value :const_bytestring):bool
%{
   printf("Value [%s]\n", std_str(value).c_str());
   return true;
%}

PAC with PAC-case body

A PAC-style function with a case body. This form of declaration is useful because it can be extended later with casefunc:

function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
   case prog of {
       default -> RPC_SERVICE_UNKNOWN;
   };

Inlined by %code

A function can be written entirely in C++ by using %code:

%code{
EnumRPCService RPC_Service(const RPC_Call* call)
   {
   return call ? call->service() : RPC_SERVICE_UNKNOWN;
   }
%}

Extending

PAC code can be extended by using "refine". This is useful for code reuse and for splitting functionality for parallel development.

Extending record

A record can be extended with additional attributes by using "refine typeattr". A typical use is to add &let in order to separate protocol parsing from protocol analysis.

refine typeattr HTTP_RequestLine += &let {
   process_request: bool =
       process_func(method, uri, version);
};

Extending type case

refine casetype RPC_Params += {
   RPC_SERVICE_PORTMAP -> portmap: PortmapParams(call);
};

Extending function case

A function declared as a PAC case can be extended by adding additional cases to the switch.

refine casefunc RPC_BuildCallVal += {
   RPC_SERVICE_PORTMAP ->
       PortmapBuildCallVal(call, call.params.portmap);
};

Extending connection

A connection can be extended to add functions and members. Example:

refine connection RPC_Conn += {
   function ProcessPortmapReply(results: PortmapResults): bool
       %{
       %}
};

State Management

State is maintained by extending the parsing class with derivative fields. State lasts until the top-level parsing unit (flowunit/datagram) is destroyed.

Keywords

Source code embedding

C++ code can be embedded within the .pac file using the following directives. The code is copied into the final generated code.

  • %header{...%}

    Code to be inserted in binpac generated header file.

  • %code{...%}

    Code to be inserted at the beginning of binpac generated C++ file.

  • %member{...%}

    Add additional member(s) to connection (?) and flow class.

  • %init{...%}

    Code to be inserted in flow constructor.

  • %cleanup{...%}

    Code to be inserted in flow destructor.

Embedded pac primitive

  • ${
  • $set{
  • $type{
  • $typeof{
  • $const_def{

Condition checking

&until

"&until" is used in conjunction with an array declaration. It specifies the exit condition for array parsing.

type HTTP_Headers = HTTP_Header[] &until($input.length() == 0);

&requires

Process data dependencies before evaluating a field.

Example: typically, a derivative field is evaluated after the primary fields. Here, however, "&requires" is used to force evaluation of length before msg_body.

type RPC_Message = record {
   xid:        uint32;
   msg_type:   uint32;
   msg_body:   case msg_type of {
       RPC_CALL    -> call:    RPC_Call(this);
       RPC_REPLY   -> reply:   RPC_Reply(this);
   } &requires(length);
} &let {
   length = sourcedata.length();   # length of the RPC_Message
} &byteorder = bigendian, &exportsourcedata, &refcount;

&if

Evaluate field only if condition is met.

type DNS_label(msg: DNS_message) = record {
   length:     uint8;
   data:       case label_type of {
       0 ->    label:  bytestring &length = length;
       3 ->    ptr_lo: uint8;
   };
} &let {
   label_type: uint8   = length >> 6;
   last: bool      = (length == 0) || (label_type == 3);
   ptr: DNS_name(msg)
       withinput $context.flow.get_pointer(msg.sourcedata,
           ((length & 0x3f) << 8) | ptr_lo)
       &if(label_type == 3);
   clear_pointer_set: bool = $context.flow.reset_pointer_set()
       &if(last);
};
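The bit manipulation in the &let block above can be followed more easily in plain C++. The sketch below mirrors the label_type, last and pointer expressions; LabelInfo and decode_label are hypothetical names and this is not code generated by binpac.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the &let expressions of DNS_label; hypothetical helper, not binpac output.
struct LabelInfo {
    std::uint8_t label_type;   // top two bits of the length byte
    bool last;                 // end of name: zero length or a compression pointer
    bool has_pointer;          // corresponds to &if(label_type == 3)
    std::uint16_t pointer;     // only meaningful when has_pointer is true
};

inline LabelInfo decode_label(std::uint8_t length, std::uint8_t ptr_lo) {
    LabelInfo info{};
    info.label_type = static_cast<std::uint8_t>(length >> 6);
    info.last = (length == 0) || (info.label_type == 3);
    info.has_pointer = (info.label_type == 3);
    // The pointer is evaluated only when the condition holds, like a field with &if.
    if (info.has_pointer)
        info.pointer = static_cast<std::uint16_t>(((length & 0x3f) << 8) | ptr_lo);
    return info;
}
```

For example, the byte pair 0xC0 0x0C has both top bits set, so it is a pointer, and the remaining 14 bits give the offset 12.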

case

There are two uses of the "case" keyword.

  • As part of a record field. In this scenario, it allows alternative methods of parsing a field. Example:

    type RPC_Reply(msg: RPC_Message) = record {
      stat:       uint32;
      reply:      case stat of {
          MSG_ACCEPTED -> areply:  RPC_AcceptedReply(call);
          MSG_DENIED   -> rreply:  RPC_RejectedReply(call);
      };
    } &let {
      call: RPC_Call = $context.connection.FindCall(msg.xid);
      success: bool = (stat == MSG_ACCEPTED && areply.stat == SUCCESS);
    };
  • As a function definition. Example:

    function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
        case prog of {
                default -> RPC_SERVICE_UNKNOWN;
        };

Note that one can "refine" both types of cases:

refine casefunc RPC_Service += {
       100000  -> RPC_SERVICE_PORTMAP;
};
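Conceptually, the case function together with its refinement behaves like a single switch statement. The C++ sketch below shows that combined behavior; it is an illustration rather than the code binpac emits, and the enum definition is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum; in a real grammar this would be declared elsewhere.
enum EnumRPCService { RPC_SERVICE_UNKNOWN, RPC_SERVICE_PORTMAP };

// Combined effect of the original case function and the refinement.
inline EnumRPCService RPC_Service(std::uint32_t prog, std::uint32_t /*vers*/) {
    switch (prog) {
        case 100000: return RPC_SERVICE_PORTMAP;   // label added by "refine casefunc"
        default:     return RPC_SERVICE_UNKNOWN;   // label from the original definition
    }
}
```

Each further refinement would contribute another label to the same switch, which is what makes the case form convenient for splitting a protocol across files.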

Built-in macros

$input

This macro refers to the data that was passed into the ParseBuffer function. When $input is used, binpac generates a const_bytestring that holds the start and end pointers of the input.

PAC grammar:

&until($input.length()==0);

PAC source:

const_bytestring t_val__elem_input(t_begin_of_data, t_end_of_data);
if (  ( t_val__elem_input.length() == 0 )  )

$element

$element provides access to an entry of an array type. It can be used in the following ways.

  • Current element. Checks the value of the most recently parsed entry. The check is executed each time an entry is parsed. Example:

    type SMB_ascii_string       = uint8[] &until($element == 0);
  • Current element's field. Example:

    type DNS_label(msg: DNS_message) = record {
       length:     uint8;
       data:       case label_type of {
           0 ->    label:  bytestring &length = length;
           3 ->    ptr_lo: uint8;
       };
    } &let {
       label_type: uint8 = length >> 6;
       last:       bool  = (length == 0) || (label_type == 3);
    };
    type DNS_name(msg: DNS_message) = record {
       labels:     DNS_label(msg)[] &until($element.last);
    };
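The first usage, uint8[] &until($element == 0), corresponds to a loop that tests each element right after parsing it. The following C++ sketch shows that loop by hand; parse_until_zero is a hypothetical name, and keeping the terminating element in the result is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Parses elements from [begin, end) until one equals 0. The condition is
// tested after each element has been parsed; the terminator is kept in the
// result (an assumption of this sketch). `consumed` receives the byte count.
inline std::vector<std::uint8_t> parse_until_zero(const std::uint8_t* begin,
                                                  const std::uint8_t* end,
                                                  std::size_t& consumed) {
    std::vector<std::uint8_t> elems;
    for (const std::uint8_t* p = begin; p != end; ++p) {
        elems.push_back(*p);              // parse the element first
        if (*p == 0) {                    // then evaluate the $element condition
            consumed = static_cast<std::size_t>(p - begin) + 1;
            return elems;
        }
    }
    throw std::out_of_range("input ended before the terminating element");
}
```

The second usage works the same way, except that the condition reads a field of the element (such as last) instead of the element's value.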

$context

This macro refers to the analyzer context class (the Context<Name> class is generated from analyzer <Name> withcontext {}). Using this macro, users can access the "flow" object and the "analyzer" object.

Other keywords

&transient

Do not create a copy of the bytestring.

type MIME_Line = record {
   line:   bytestring &restofdata &transient;
} &oneline;

&let

Adds a derivative field to a record.

type ncp_request(length: uint32) = record {
   data        : uint8[length];
} &let {
   function    = length > 0 ? data[0] : 0;
   subfunction = length > 1 ? data[1] : 0;
};

let

Declares global value. If the user does not specify a type, the compiler will assume the "int" type.

PAC grammar:

let myValue:uint8=10;

PAC source:

uint8 const myValue = 10;

PAC header:

extern uint8 const myValue;

&restofdata

Grab the rest of the data available in the FlowBuffer.

PAC grammar:

onebyte: uint8;
value: bytestring &restofdata &transient;

PAC source:

// Parse "onebyte"
onebyte_ = *((uint8 const *) (t_begin_of_data));
// Parse "value"
int t_value_string_length;
t_value_string_length = (t_end_of_data) - ((t_begin_of_data + 1));
int t_value__size;
t_value__size = t_value_string_length;
value_.init((t_begin_of_data + 1), t_value_string_length);

&length

&length can appear in two different contexts: as a property of a field or as a property of a record. Example of &length as a field property:

protocol    : bytestring &length = 4;

translates into:

const_byteptr t_end_of_data = t_begin_of_data + 4;
int t_protocol_string_length;
t_protocol_string_length = 4;
int t_protocol__size;
t_protocol__size = t_protocol_string_length;
protocol_.init(t_begin_of_data, t_protocol_string_length);

&check

This was originally intended to implement the behavior now provided by the "&enforce" attribute, which supersedes it. It has always been a no-op and will remain one, so that grammars using it do not suddenly and unintentionally break.

&enforce

Check a condition and raise an exception if it is not met.
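The check-then-throw pattern can be sketched in C++ as follows. This is an illustration only: EnforceViolation, enforce and checked_version are hypothetical names, and the exception type actually raised by binpac-generated code may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical exception type for this sketch.
struct EnforceViolation : std::runtime_error {
    explicit EnforceViolation(const std::string& where)
        : std::runtime_error("enforce violation: " + where) {}
};

// Throws if the condition attached to a field does not hold.
inline void enforce(bool condition, const std::string& where) {
    if (!condition) throw EnforceViolation(where);
}

// Example rule (hypothetical): accept a version field only if it equals 2.
inline std::uint8_t checked_version(std::uint8_t raw) {
    enforce(raw == 2, "version");
    return raw;
}
```

Raising an exception stops parsing of the current unit at the first field whose condition fails, rather than letting later fields be interpreted from data that is already known to be invalid.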

&chunked and $chunk

When parsing a long field with variable length, "&chunked" can be used to improve performance. However, chunked fields are not buffered across packets. Data for the chunk in the current packet can be accessed using "$chunk".

&exportsourcedata

The data matched by a particular type can be retained by using "&exportsourcedata".

.pac file

type myType = record {
   data:uint8;
} &exportsourcedata;

_pac.h

class myType
{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const    { return data_; }
   const_bytestring const & sourcedata() const { return sourcedata_; }
protected:
   uint8 data_;
   const_bytestring sourcedata_;
};

_pac.cc

sourcedata_ = const_bytestring(t_begin_of_data, t_end_of_data);
sourcedata_.set_end(t_begin_of_data + 1);

Source data can be used within the type that matches it or in the parent type.

type myParentType (child:myType) = record {
    somedata:uint8;
} &let{
   do_something:bool = print_stuff(child.sourcedata);
};

translates into

do_something_ = print_stuff(child()->sourcedata());

&refcount

withinput

Parsing Methodology

Buffering

binpac supports incremental input to deal with packet fragmentation. This is done through the FlowBuffer class and by maintaining buffering/parsing states.

FlowBuffer Class

FlowBuffer provides two modes of buffering: line and frame. Line mode is useful for parsing line-based protocols like HTTP. Frame mode is best for fixed-length messages. The buffering mode can be switched during parsing, transparently to the grammar writer.

At compile time, binpac calculates the number of bytes required to evaluate each field. At run time, data is buffered in the FlowBuffer until there is enough to evaluate the "record". To optimize buffering, if the FlowBuffer has enough data to evaluate on the first NewData, it only marks the start and end pointers instead of copying.

  • void NewMessage();
    • Advances the orig_data_begin_ pointer depending on the current mode_: by 1 or 2 characters in LINE_MODE, by frame_length_ in FRAME_MODE, and not at all in UNKNOWN_MODE (the default mode).
    • Set buffer_n_ to 0
    • Reset message_complete_
  • void NewLine();
    • Reset frame_length_ and chunked_, set mode_ to LINE_MODE
  • void NewFrame(int frame_length, bool chunked_);
  • void GrowFrame(int new_frame_length);
  • void AppendToBuffer(const_byteptr data, int len);
    • Reallocate buffer_ to add new data then copy data
  • void ExpandBuffer(int length);
    • Reallocate buffer_ to new size if new size is bigger than current size.
    • Set minimum size to 512 (optimization?)
  • void MarkOrCopyLine();
    • Search the current input for the end of line (CR/LF/CRLF, depending on the line break mode). If it is found, append the data to the buffer if one has already been created, or just mark it (set frame_length_) if not, to minimize copying. If the end of line is not found, append the partial data up to the end of input to the buffer, creating the buffer if necessary.
  • const_byteptr begin()/end()
    • Returns buffer_ and buffer_n_ if a buffer exists, otherwise orig_data_begin_ and orig_data_begin_ + frame_length_.
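The mark-or-copy idea described above can be sketched with a much smaller class. The code below is a simplified stand-in written for this document, not the real FlowBuffer: the class name FrameBuffer, its interface, and the decision to ignore input beyond one frame are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for frame-mode buffering; not the real FlowBuffer.
class FrameBuffer {
public:
    explicit FrameBuffer(std::size_t frame_length) : frame_length_(frame_length) {}

    // Feeds input; returns true once a full frame is available via begin()/end().
    // Input beyond the current frame is ignored in this sketch.
    bool NewData(const std::uint8_t* data, std::size_t len) {
        if (buffer_.empty() && len >= frame_length_) {
            mark_ = data;   // whole frame in one chunk: mark it, do not copy
            return true;
        }
        // Partial frame: fall back to copying into the internal buffer.
        const std::size_t want = frame_length_ - buffer_.size();
        const std::size_t take = len < want ? len : want;
        buffer_.insert(buffer_.end(), data, data + take);
        return buffer_.size() == frame_length_;
    }

    bool copied() const { return !buffer_.empty(); }

    // Valid only after NewData() has returned true.
    const std::uint8_t* begin() const { return copied() ? buffer_.data() : mark_; }
    const std::uint8_t* end() const { return begin() + frame_length_; }

private:
    std::size_t frame_length_;
    std::vector<std::uint8_t> buffer_;
    const std::uint8_t* mark_ = nullptr;
};
```

As with begin()/end() in the list above, callers see one contiguous range either way; only the fragmented case pays for a copy.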

Parsing States

  • buffering_state_ - each parsing class contains a flag indicating whether enough data has been buffered to evaluate the next block.
  • parsing_state_ - each parsing class that consists of multiple parsing data units (lines/frames) has this flag to indicate the parsing stage. Each time new data comes in, the parsing function is invoked and switches on parsing_state_ to determine which sub-parser to use next.
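A parser that switches on a state variable each time data arrives might look like the sketch below. It is hand-written for illustration and does not reproduce binpac's generated code: the message format (a 1-byte length followed by that many body bytes) and the class LengthPrefixedParser are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-stage message: a 1-byte length followed by that many body bytes.
class LengthPrefixedParser {
public:
    // Consumes input incrementally; returns true once a message is complete.
    bool NewData(const std::uint8_t* data, std::size_t len) {
        for (std::size_t i = 0; i < len && parsing_state_ != DONE; ++i) {
            switch (parsing_state_) {
                case HEADER:   // first sub-parser: the length byte
                    remaining_ = data[i];
                    parsing_state_ = (remaining_ == 0) ? DONE : BODY;
                    break;
                case BODY:     // second sub-parser: the body bytes
                    body_.push_back(data[i]);
                    if (--remaining_ == 0) parsing_state_ = DONE;
                    break;
                case DONE:
                    break;
            }
        }
        return parsing_state_ == DONE;
    }

    const std::vector<std::uint8_t>& body() const { return body_; }

private:
    enum State { HEADER, BODY, DONE };
    State parsing_state_ = HEADER;   // survives between NewData() calls
    std::size_t remaining_ = 0;
    std::vector<std::uint8_t> body_;
};
```

Because the state is a member, a message split across several NewData() calls resumes exactly where the previous call stopped.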

Regular Expression

Evaluation Order

Running Binpac-generated Analyzer Standalone

To run binpac-generated code independently of Zeek, the regex library must be substituted. Below is one way of doing it, using the following three header files.

RE.h

/*Dummy file to replace Zeek's file*/
#include "binpac_pcre.h"
#include "bro_dummy.h"

bro_dummy.h

#ifndef BRO_DUMMY
#define BRO_DUMMY
#define DEBUG_MSG(x...)  fprintf(stderr, x)
/*Dummy to link; this function is supposed to be provided by Zeek*/
double network_time();
#endif

binpac_pcre.h

#ifndef bro_pcre_h
#define bro_pcre_h
#include <stdio.h>
#include <assert.h>
#include <string>
using namespace std;
// TODO: use configure to figure out the location of pcre.h
#include "pcre.h"
class RE_Matcher {
public:
   RE_Matcher(const char* pat){
       pattern_ = "^";
       pattern_ += "(";
       pattern_ += pat;
       pattern_ += ")";
       pcre_   = NULL;
       pextra_ = NULL;
   }
   ~RE_Matcher() {
       if (pcre_) {
           pcre_free(pcre_);
       }
   }
   int Compile() {
       const char *err = NULL;
       int erroffset = 0;
       pcre_ = pcre_compile(pattern_.c_str(),
                                    0,  // options,
                                    &err,
                                    &erroffset,
                                    NULL);
       if (pcre_ == NULL) {
           fprintf(stderr,
                   "Error in RE_Matcher::Compile(): %d:%s\n",
                   erroffset, err);
           return 0;
       }
       return 1;
   }

   int MatchPrefix (const char* s, int n){
       const char *err=NULL;
       assert(pcre_);
       const int MAX_NUM_OFFSETS = 30;
       int offsets[MAX_NUM_OFFSETS];
       int ret = pcre_exec(pcre_,
                                   pextra_,  // pcre_extra
                                   //NULL,  // pcre_extra
                                   s, n,
                                   0,     // offset
                                   0,     // options
                                   offsets,
                                   MAX_NUM_OFFSETS);
       if (ret < 0) {
           return -1;
       }
       assert(offsets[0] == 0);
       return offsets[1];
   }
protected:
   pcre *pcre_;
   pcre_extra *pextra_;
   string pattern_;
};
#endif

main.cc

In your main source, add this dummy stub.

/*Dummy to link; this function is supposed to be provided by Zeek*/
double network_time(){
   return 0;
}

Q & A

  • Does &oneline only work when "flow" is used?

    Yes. binpac uses the flowunit definition in "flow" to figure out which types require buffering. For those that do, the parse function is:

    bool ParseBuffer(flow_buffer_t t_flow_buffer, ContextHTTP * t_context);

    And the code of flow_buffer_t provides the functionality of buffering up to one line. That's why &oneline is only active when "flow" is used and the type requires buffering.

    In certain cases we would want to use &oneline even if the type does not require buffering, binpac currently does not provide such functionality.

  • How would incremental input work in the case of regex?

    A regex should not take incremental input. (The binpac compiler will complain when that happens.) It should always appear below some type that has either &length=... or &oneline.

  • What is the role of Context<Name> class (generated by analyzer <Name> withcontext)?
  • What is the difference between using "withcontext" and not using "withcontext"?

    withcontext should always be there. It's fine to have an empty context.

  • Elaborate on $context and how it is related to "withcontext".

    A "context" parameter is passed to every type. It provides a vehicle to pass something to every type without adding a parameter to every type. In that sense, it's optional. It exists for convenience.

  • Example usage of composite type array.

    Please see HTTP_Headers in http-protocol.pac in the Zeek source code.

  • Clarification on "connection" keyword (binpac paper).
  • Need a new way to attach additional hook code to each class besides &let.
  • &transient: how is this different from declaring an anonymous field? Currently it doesn't seem to do much.

    type HTTP_Header = record {
        name:   HTTP_HEADER_NAME &transient;
        :       HTTP_WS;
        value:  bytestring &restofdata &transient;
    } &oneline;
    // Parse "name"
    int t_name_string_length;
    t_name_string_length =
        HTTP_HEADER_NAME_re_011.MatchPrefix(
            t_begin_of_data,
            t_end_of_data - t_begin_of_data);
    if ( t_name_string_length < 0 )
        {
        throw ExceptionStringMismatch( "./http-protocol.pac:96",
             "|([^: \\t]+:)",
             string((const char *) (t_begin_of_data), (const char *) t_end_of_data).c_str()
             );
        }
    int t_name__size;
    t_name__size = t_name_string_length;
    name_.init(t_begin_of_data, t_name_string_length);
  • Detail on the globals ($context, $element, $input...etc)
  • How does BinPAC work with dynamic protocol detection?

    Well, you can use the code in DNS-binpac.cc as a reference. First, create a pointer to the connection. (See the example in DNS-binpac.cc)

    interp = new binpac::DNS::DNS_Conn(this);

    Pass the data received from "DeliverPacket" or "DeliverStream" to "interp->NewData()". (Again, see the example in DNS-binpac.cc)

    void DNS_UDP_Analyzer_binpac::DeliverPacket(int len, const u_char* data, bool orig, int seq, const IP_Hdr* ip, int caplen)
        {
        Analyzer::DeliverPacket(len, data, orig, seq, ip, caplen);
        interp->NewData(orig, data, data + len);
        }
  • Explanation of &withinput
  • Difference between using flow and not using flow (binpac generates Parse method instead of ParseBuffer)
  • &check currently working?
  • Difference between flowunit and datagram, datagram and &oneline, &length?
  • Go over TODO list in binpac release
  • How would input get handled/buffered when the length is not known (chunked)?
  • More features: multi-byte characters (UTF-16, UTF-32, etc.)?

TODO List

New Features

  • Provide a method to match simple ASCII text.
  • Allow the use of fixed-length arrays in addition to vectors.

Bugs

Small clean-ups

  • Remove anonymous field bytestring assignment.
  • Redundant overflow checking/more efficient fixed length text copying.

Warning/Errors

Things that the compiler should flag at code generation time:

  • Give a warning when &transient is used on a non-bytestring.
  • Give a warning when &oneline or &length is used and flowunit is not.
  • Give a warning when more than one "connection" is defined.