berty


A clean, safe and flexible implementation of BERT, a data-structure format inspired by Erlang ETF.

This project is in active development, and should not be used in production yet.

Features

Primary features:

  • High-level implementation of ETF in pure Erlang
  • Atom protection and limitation
  • Fine-grained filtering based on type
  • Callback function or MFA
  • Fallback to the binary_to_term function on demand
  • Drop terms on demand
  • Term size limitation
  • Custom options per term
  • Property-based testing
  • BERT parser subset
  • Term depth protection
  • Fully documented
  • 90%+ test coverage
  • 100% compatible with standard ETF
  • 100% compatible with BERT

Secondary features:

  • Global or fine grained statistics
  • Profiling and benchmarking facilities
  • Logging facilities
  • Tracing facilities
  • ETF path
  • ETF schema
  • Custom parser subset based on behaviors
  • ETF as stream of data
  • Usage example with ETF, BERT and/or custom parser
  • Low level optimization (optimized module with merl)

Usage

Berty was created as an easy replacement for the binary_to_term/1 and binary_to_term/2 built-in functions. In fact, the implementation is transparent in many cases. The big idea is to protect your system from the outside world, in particular from atom and memory exhaustion.

% encode an atom to ETF
Atom = term_to_binary(test).

% by default, atoms are decoded as binaries
{ok, <<"test">>}
  = berty:decode(Atom).

% different methods can be used to deal with atoms.
{ok, test}
  = berty:decode(Atom, #{ atoms => {create, 0.2, warning} }).

% Other terms are supported
Terms = term_to_binary([{ok,1.0,"test",<<>>}]),
{ok, [{ok,1.0,"test",<<>>}]}
  = berty:decode(Terms).

More features are available, for example dropping terms or defining custom callbacks.

Lists = term_to_binary([1024,<<>>,"test"]).

% let's drop all integers
{ok, [<<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => drop
                         , small_integer_ext => drop
                         }).

% let's create a custom callback
Callback = fun
  (_Term, Rest) ->
    {ok, doh, Rest}
end.
{ok, [doh, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback}
                         , small_integer_ext => {callback, Callback}
                         }).

% let's create another one.
Callback2 = fun
  (Term, Rest) when 1024 =:= Term ->
    logger:warning("catch term ~p", [1024]),
    {ok, Term, Rest};
  (Term, Rest) -> {ok, Term, Rest}
end.

{ok, [1024, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback2}
                         , small_integer_ext => {callback, Callback2}
                         }).

These are simple examples; more features exist and will be added. Here are the most important functions:

  • berty:decode/1: standard BERT decoder with default options
  • berty:decode/2: standard BERT decoder with custom options
  • berty:decode/3: custom decoder with custom options
  • berty:encode/1: standard BERT encoder with default options
  • berty:encode/2: standard BERT encoder with custom options
  • berty:encode/3: custom encoder with custom options
  • berty:binary_to_term/1: wrapper around binary_to_term/1
  • berty:term_to_binary/1: wrapper around term_to_binary/1
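
The examples above only show decoding. Here is a symmetric sketch for encoding, assuming berty:encode/1 follows the same {ok, Result} convention as berty:decode/1 (the exact return shape may differ):

% assuming encode/1 returns {ok, Binary} like decode/1 returns {ok, Term}
{ok, Bin} = berty:encode([{ok, 1.0, "test", <<>>}]),
{ok, [{ok, 1.0, "test", <<>>}]} = berty:decode(Bin).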

Build

rebar3 compile
rebar3 shell

Test

rebar3 as test eunit
rebar3 as test shell

FAQ

Why create another BERT implementation?

Mainly because of atom management. binary_to_term/1 is not safe: if unknown data comes from an untrusted source, it is quite easy to kill a node by overflowing the atom table of that node, and probably a whole cluster if the data is shared.

% first erlang shell
file:write_file("atom1", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1,1_000_000) ])).
% second erlang shell
file:write_file("atom2", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1_000_000,2_000_000) ])).

Now restore those 2 files on another node.

% third erlang shell
f(D), {ok, D} = file:read_file("atom1"), binary_to_term(D).
f(D), {ok, D} = file:read_file("atom2"), binary_to_term(D).
no more index entries in atom_tab (max=1048576)

Crash dump is being written to: erl_crash.dump...done

Doh. The Erlang VM crashed. We can fix that in many different ways; here are a few examples:

  • avoid the binary_to_term/1 and term_to_binary/1 functions, and instead create our own parser based on the ETF specification. When terms are deserialized, atoms can be (1) converted to existing atoms only, (2) converted to binaries or lists, or (3) simply dropped or replaced with something that alerts the VM that this part of the data is dangerous.

  • keep our own local atom table containing all deserialized atoms. A soft/hard limit can be set.

Oh? Really? Is it serious?

In fact, a simple solution already exists: the safe option of binary_to_term/2. It protects you from creating previously unknown atoms, but how many projects actually use it?

It is quite possible that most of those call sites never receive untrusted data, but it could be the case. In situations where unknown data comes in, erlang:binary_to_term/1 and even erlang:binary_to_term/2 should be avoided or used carefully.
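
For reference, here is a minimal sketch of what the safe option changes, assuming the atom oops does not already exist on the decoding node:

% hand-crafted ATOM_EXT payload for the atom 'oops',
% so the atom is not created locally beforehand
Payload = <<131, 100, 0, 4, "oops">>.

% with the safe option, decoding fails instead of
% creating a previously unknown atom
binary_to_term(Payload, [safe]).
% ** exception error: bad argument

% without options, the atom is silently added to the atom table
oops = binary_to_term(Payload).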

Why am I not aware of that?

A few articles[1][2] have been written in the past to explain these problems. On my side, if I were in charge of fixing this issue, I would probably do it in two steps.

As a first step, I would probably add a workaround to the atom creation functions, with a soft and a hard limit. When the soft limit is reached, warnings are logged saying so, but new atoms can still be created. When the hard limit is reached, atoms can't be created anymore, and exceptions are raised instead of crashing the host (see the sketch after the list below).

As a second step, I would probably create a flexible interface to deal with atoms and split the problem in two:

  1. create a fixed atom store containing only atoms from source code (the Erlang release and the project); this one can't grow.

  2. create a second atom store containing atoms created dynamically at runtime; this one can grow.
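
A minimal sketch of the first step, using only existing BEAM introspection (erlang:system_info(atom_count)); the limits and the atom_guard/to_atom names are made up for the example:

-module(atom_guard).
-export([to_atom/1]).

%% hypothetical limits, far below the default atom table size (1,048,576)
-define(SOFT_LIMIT, 100_000).
-define(HARD_LIMIT, 500_000).

to_atom(Binary) when is_binary(Binary) ->
    try
        %% reuse an existing atom whenever possible:
        %% this never grows the atom table
        binary_to_existing_atom(Binary, utf8)
    catch
        error:badarg ->
            case erlang:system_info(atom_count) of
                N when N >= ?HARD_LIMIT ->
                    %% hard limit reached: raise instead of crashing the node
                    erlang:error({atom_table_full, N});
                N when N >= ?SOFT_LIMIT ->
                    %% soft limit reached: warn, but still create the atom
                    logger:warning("atom soft limit reached (~p atoms)", [N]),
                    binary_to_atom(Binary, utf8);
                _ ->
                    binary_to_atom(Binary, utf8)
            end
    end.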

What worries me is Mnesia. What could happen if someone injects more than 2M unwanted atoms into Mnesia or DETS? How would the cluster behave? And how do you fix that if it's critical?

Unfortunately, I think it would totally break atom performance, but it could be an interesting project for learning how the Erlang BEAM works under the hood.

Are atoms the only issue there?

Well, it depends. If you are receiving a (very) long string or list of terms, it has a direct impact on memory and will eventually lead to memory exhaustion:

% size of the list should be checked
% if not, memory exhaustion can happen
[ $1 || _ <- lists:seq(0,160_000_000) ].
% eheap_alloc: Cannot allocate 3936326656 bytes of memory (of type "heap").
% Crash dump is being written to: erl_crash.dump...

The same behavior can be triggered using binaries:

% big binaries can crash the BEAM
binary_to_term(<<131, 111, 4294967294:32/unsigned-integer, 0:8/integer, 255:8, 0:4294967280/unsigned-integer>>).
% binary_alloc: Cannot allocate 4294967293 bytes of memory (of type "binary").
% Crash dump is being written to: erl_crash.dump...

Generating ETF payloads with very large big integers also has an impact on CPUs; the following calls can create a DoS if executed by many processes at once.

% big payload, high cpu usage, no crash.
% size of the big integer must be checked
% size: 2**18-1, binary byte size: 262_150 (~262kB)
_ = binary_to_term(<<131, 111, 262_143:32/unsigned-integer, 0:8/integer, 255:2_097_144/unsigned-integer>>).

% size: 2**19-1, binary byte size: 524_294 (~524kB)
_ = binary_to_term(<<131, 111, 524_287:32/unsigned-integer, 0:8/integer, 255:4_194_296/unsigned-integer>>).

% size: 2**20-1, binary byte size: 1_048_582 (~1MB)
_ = binary_to_term(<<131, 111, 1_048_575:32/unsigned-integer, 0:8/integer, 255:8_388_600/unsigned-integer>>).

Creating a long node name can crash the VM during startup, because the node name is an atom (encoded as an atom_ext term) and atoms are limited to 255 characters. If the node name is longer than that, the VM crashes.

erl -sname $(pwgen -A0 252 1)
# Crash dump is being written to: erl_crash.dump...done

erl -name $(pwgen -A0 246 1)@localhost
# Crash dump is being written to: erl_crash.dump...done

It's highly probable that other terms can have a deadly impact on a node or a cluster.
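
Until a high-level parser like berty is in place, a minimal defensive sketch (the size cap and function name are made up) combines a byte-size limit with the safe option:

% hypothetical guard: cap payload size, then decode with safe;
% note that the cap does not cover compressed terms, which can
% expand far beyond their encoded size
decode_untrusted(Bin) when is_binary(Bin), byte_size(Bin) > 65536 ->
    {error, too_large};
decode_untrusted(Bin) when is_binary(Bin) ->
    try binary_to_term(Bin, [safe]) of
        Term -> {ok, Term}
    catch
        error:badarg -> {error, unsafe_or_invalid}
    end.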

How to fix the root cause?

The root cause lies in atoms; at least one paper[3] talked about it. Fixing the garbage collection issue could help a lot, but if that is not possible for many reasons, using a high-level implementation of ETF with some way to control what kind of data comes in might be an "okayish" solution.

The "Let it crash" philosophy is quite nice when developing high-level applications interacting in a safe place, but it can't be applied where uncontrolled data comes in. Some functions, like binary_to_term/1, must be avoided at all costs.

What about ETF schema?

This answer is a draft, a sandbox to design an Erlang ETF Schema feature.

It might be great to have a syntax to create ETF schemas, a bit like protobuf[4], JSON Schema[5], XML[6] (with XSLT[7]) or ASN.1[8]. In fact, when I started to look for something around this feature, I also found the UBF[9] project from Joe Armstrong.

schema1() ->
  integer().

schema2() ->
  tuple([[atom(ok), integer()]
        ,[atom(error), string(1024)]]).

% fun ({ok, X}) when is_integer(X) -> true;
%     ({error, X}) when is_list(X) andalso length(X) =< 1024 -> is_string(X);
%     (_) -> false.

Here is the final representation:

[{tuple, [{atom, [ok]}, {integer, []}]}
,{tuple, [{atom, [error]}, {string, [1024]}]}
]
% or
[[tuple, [2]]
,[atom, [ok,error]]
,[integer, []]
,[string, [1024]]
].
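
To make the draft concrete, here is a minimal, hypothetical interpreter for the first representation above (the valid/2 and match/2 names are made up):

%% check a term against a list of alternative schemas
valid(Term, Schemas) when is_list(Schemas) ->
    lists:any(fun(Schema) -> match(Term, Schema) end, Schemas).

match(T, {tuple, Schemas}) when is_tuple(T),
                                tuple_size(T) =:= length(Schemas) ->
    Pairs = lists:zip(tuple_to_list(T), Schemas),
    lists:all(fun({E, S}) -> match(E, S) end, Pairs);
match(A, {atom, Allowed}) when is_atom(A) ->
    lists:member(A, Allowed);
match(I, {integer, []}) when is_integer(I) ->
    true;
match(S, {string, [Max]}) when is_list(S) ->
    length(S) =< Max andalso io_lib:printable_list(S);
match(_, _) ->
    false.

% true  = valid({ok, 1}, [{tuple, [{atom, [ok]}, {integer, []}]}
%                        ,{tuple, [{atom, [error]}, {string, [1024]}]}]).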

What about an ETF path feature?

Another feature like XPath or JSONPath is needed as well; an easy and comprehensible syntax has to be created. I would like to include:

  1. pattern matching
% how to create an etf path?
% first example
% ETF = #{ key => #{ key2 => { ok, "test"} } }.
"test" = path(ETF, "#key#key2{ok,@}")

% second example
% ETF = [{ok, "test"}, {error, badarg}, {ok, "data"}].
[{ok, "test"},{ok, "data"}] = path(ETF, "[{ok,_}]")
% or
[]{ok,_}

% third example
% ETF = {ok, #{ <<"data">> => [<<"test">>] }}.
[<<"test">>] = path(ETF, "{ok,@}#!data").

Nothing to add?

When I wrote Serialization series — Do you speak Erlang ETF or BERT? (part 1) in 2017, someone told me to check another project called jem.js and to read the Replacing JSON when talking to Erlang (archive) blog post. What's funny here is this:

handle_post(Req, State) ->
  {ok, Body, Req1} = cowboy_req:body(Req),
  Decoded = erlang:binary_to_term(Body),
  Reply = do_whatever(Decoded),
  {erlang:term_to_binary(Reply), Req1, State}.

Yes, "Faster and more efficient", but can destroy your whole platform in few second. Don't do that. Please. Unfortunately, inaka.net seems to be down, it would have been funny to play with that.

Is there a "risk analysis" for each terms somewhere?

Probably, but I did not find much on the subject. Here is a short summary of each term, whether it is safe or not, and the associated risk(s).

Term                  Code  Safe?  Risk(s)
ATOM_CACHE_REF        82    no     atom exhaustion
ATOM_EXT              100   no     atom exhaustion
ATOM_UTF8_EXT         118   no     atom exhaustion
BINARY_EXT            109   maybe  dynamic binary length (32 bits)
BIT_BINARY_EXT        77    maybe  dynamic bitstring length (32 bits)
EXPORT_EXT            113   no     atom exhaustion
FLOAT_EXT             99    yes    31 bytes fixed-length float
FUN_EXT               117   no     atom exhaustion
INTEGER_EXT           98    yes    4 bytes fixed length
LARGE_BIG_EXT         111   maybe  dynamic integer length (32 bits)
LARGE_TUPLE_EXT       105   maybe  dynamic tuple length (32 bits)
LIST_EXT              108   maybe  dynamic list length (32 bits)
LOCAL_EXT             121   yes    atom exhaustion
MAP_EXT               116   maybe  dynamic pair length (32 bits)
NEWER_REFERENCE_EXT   90    no     memory exhaustion
NEW_FLOAT_EXT         70    yes    8 bytes fixed-length float
NEW_FUN_EXT           112   no     atom exhaustion
NEW_PID_EXT           88    no     atom exhaustion
NEW_PORT_EXT          89    no     atom exhaustion
NEW_REFERENCE_EXT     114   maybe  dynamic reference length (16 bits)
NIL_EXT               106   yes    fixed length
PID_EXT               103   no     atom exhaustion
PORT_EXT              102   no     atom exhaustion
REFERENCE_EXT         101   no     atom exhaustion
SMALL_ATOM_EXT        115   no     atom exhaustion
SMALL_ATOM_UTF8_EXT   119   no     atom exhaustion
SMALL_BIG_EXT         110   maybe  dynamic integer length (8 bits)
SMALL_INTEGER_EXT     97    yes    1 byte fixed length
SMALL_TUPLE_EXT       104   maybe  dynamic tuple length (8 bits)
STRING_EXT            107   maybe  dynamic string length (16 bits)
V4_PORT_EXT           120   no     atom exhaustion


Footnotes

  1. https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/atom_exhaustion.html

  2. https://paraxial.io/blog/atom-dos

  3. Atom garbage collection by Thomas Lindgren, https://dl.acm.org/doi/10.1145/1088361.1088369

  4. https://protobuf.dev/overview/

  5. https://json-schema.org/

  6. https://en.wikipedia.org/wiki/XML

  7. https://en.wikipedia.org/wiki/XSLT

  8. https://en.wikipedia.org/wiki/ASN.1

  9. https://ubf.github.io/ubf/ubf-user-guide.en.html