berty


A clean, safe and flexible implementation of BERT, a data-structure format inspired by Erlang ETF.

This project is in active development, and should not be used in production yet.

Features

Primary features:

  • High-level implementation of ETF in pure Erlang
  • Atom protection and limitation
  • Fine-grained filtering based on type
  • Callback function or MFA
  • Fallback to the binary_to_term function on demand
  • Drop terms on demand
  • Term size limitation
  • Custom options per term
  • Property-based testing
  • BERT parser subset
  • Term depth protection
  • Fully documented
  • 90%+ test coverage
  • 100% compatible with standard ETF
  • 100% compatible with BERT

Secondary features:

  • Global or fine grained statistics
  • Profiling and benchmarking facilities
  • Logging facilities
  • Tracing facilities
  • ETF path
  • ETF schema
  • Custom parser subset based on behaviors
  • ETF as stream of data
  • Usage example with ETF, BERT and/or custom parser
  • Low level optimization (optimized module with merl)

Usage

Berty was created as an easy replacement for the binary_to_term/1 and binary_to_term/2 built-in functions. In fact, the implementation is transparent in many cases. The big idea is to protect your system from the outside world, in particular from atom and memory exhaustion.

% encode an atom to ETF
Atom = term_to_binary(test).

% by default, atoms are decoded as binaries
{ok, <<"test">>}
  = berty:decode(Atom).

% different methods can be used to deal with atoms.
{ok, test}
  = berty:decode(Atom, #{ atoms => {create, 0.2, warning} }).

% Other terms are supported
Terms = term_to_binary([{ok,1.0,"test",<<>>}]),
{ok, [{ok,1.0,"test",<<>>}]}
  = berty:decode(Terms).

More features are available, for example dropping terms or defining custom callbacks.

Lists = term_to_binary([1024,<<>>,"test"]).

% let's drop all integers
{ok, [<<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => drop
                         , small_integer_ext => drop
                         }).

% let's create a custom callback
Callback = fun
  (_Term, Rest) ->
    {ok, doh, Rest}
end.
{ok, [doh, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback}
                         , small_integer_ext => {callback, Callback}
                         }).

% let's create another one.
Callback2 = fun
  (Term, Rest) when 1024 =:= Term ->
    logger:warning("catch term ~p", [1024]),
    {ok, Term, Rest};
  (Term, Rest) -> {ok, Term, Rest}
end.

{ok, [1024, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback2}
                         , small_integer_ext => {callback, Callback2}
                         }).

These are simple examples; more features exist and will be added. Here are the most important functions:

  • berty:decode/1: standard BERT decoder with default options
  • berty:decode/2: standard BERT decoder with custom options
  • berty:decode/3: custom decoder with custom options
  • berty:encode/1: standard BERT encoder with default options
  • berty:encode/2: standard BERT encoder with custom options
  • berty:encode/3: custom encoder with custom options
  • berty:binary_to_term/1: wrapper around binary_to_term/1
  • berty:term_to_binary/1: wrapper around term_to_binary/1
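
The examples above only show decoding. Here is a symmetric sketch for encoding, assuming berty:encode/1 follows the same {ok, Result} convention as berty:decode/1 (the exact return shape may differ):

% assuming encode/1 returns {ok, Binary} like decode/1 returns {ok, Term}
{ok, Bin} = berty:encode([{ok, 1.0, "test", <<>>}]),
{ok, [{ok, 1.0, "test", <<>>}]} = berty:decode(Bin).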

Build

rebar3 compile
rebar3 shell

Test

rebar3 as test eunit
rebar3 as test shell

FAQ

Why create another BERT implementation?

Mainly because of atom management. binary_to_term/1 is not safe: if unknown data comes from an untrusted source, it is quite easy to kill a node by overflowing the atom table of that node, and probably a whole cluster if the data is shared.

% first erlang shell
file:write_file("atom1", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1,1_000_000) ])).
% second erlang shell
file:write_file("atom2", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1_000_000,2_000_000) ])).

Now restore those 2 files on another node.

% third erlang shell
f(D), {ok, D} = file:read_file("atom1"), binary_to_term(D).
f(D), {ok, D} = file:read_file("atom2"), binary_to_term(D).
no more index entries in atom_tab (max=1048576)

Crash dump is being written to: erl_crash.dump...done

Doh. The Erlang VM crashed. We can fix that in many different ways; here are a few examples:

  • avoid the binary_to_term/1 and term_to_binary/1 functions, and instead create our own parser based on the ETF specification. When terms are deserialized, atoms can be (1) converted to existing atoms only, (2) converted to binaries or lists, or (3) simply dropped or replaced with something that alerts the VM that this part of the data is dangerous.

  • keep our own local atom table containing all deserialized atoms. A soft/hard limit can be set.

Oh? Really? Is it serious?

In fact, a simple solution already exists: the safe option of binary_to_term/2. It protects you from creating previously unknown atoms, but how many projects actually use it?

It is quite possible that most of those call sites never receive untrusted data, but it could be the case. In situations where unknown data comes in, erlang:binary_to_term/1 and even erlang:binary_to_term/2 should be avoided or used carefully.
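
For reference, here is a minimal sketch of what the safe option changes, assuming the atom oops does not already exist on the decoding node:

% hand-crafted ATOM_EXT payload for the atom 'oops',
% so the atom is not created locally beforehand
Payload = <<131, 100, 0, 4, "oops">>.

% with the safe option, decoding fails instead of
% creating a previously unknown atom
binary_to_term(Payload, [safe]).
% ** exception error: bad argument

% without options, the atom is silently added to the atom table
oops = binary_to_term(Payload).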

Why am I not aware of that?

A few articles[1][2] have been written in the past to explain these problems. On my side, if I were in charge of fixing this issue, I would probably do it in two steps.

As a first step, I would probably add a workaround to the atom creation functions, with a soft and a hard limit. When the soft limit is reached, warnings are logged saying so, but new atoms can still be created. When the hard limit is reached, atoms can't be created anymore, and exceptions are raised instead of crashing the host (see the sketch after the list below).

As a second step, I would probably create a flexible interface to deal with atoms and split the problem in two:

  1. create a fixed atom store containing only atoms from source code (the Erlang release and the project); this one can't grow.

  2. create a second atom store containing atoms created dynamically at runtime; this one can grow.
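
A minimal sketch of the first step, using only existing BEAM introspection (erlang:system_info(atom_count)); the limits and the atom_guard/to_atom names are made up for the example:

-module(atom_guard).
-export([to_atom/1]).

%% hypothetical limits, far below the default atom table size (1,048,576)
-define(SOFT_LIMIT, 100_000).
-define(HARD_LIMIT, 500_000).

to_atom(Binary) when is_binary(Binary) ->
    try
        %% reuse an existing atom whenever possible:
        %% this never grows the atom table
        binary_to_existing_atom(Binary, utf8)
    catch
        error:badarg ->
            case erlang:system_info(atom_count) of
                N when N >= ?HARD_LIMIT ->
                    %% hard limit reached: raise instead of crashing the node
                    erlang:error({atom_table_full, N});
                N when N >= ?SOFT_LIMIT ->
                    %% soft limit reached: warn, but still create the atom
                    logger:warning("atom soft limit reached (~p atoms)", [N]),
                    binary_to_atom(Binary, utf8);
                _ ->
                    binary_to_atom(Binary, utf8)
            end
    end.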

What worries me is Mnesia. What could happen if someone injects more than 2M unwanted atoms into Mnesia or DETS? How would the cluster behave? And how do you fix that if it's critical?

Unfortunately, I think it would totally break atom performance, but it could be an interesting project for learning how the Erlang BEAM works under the hood.

Are atoms the only issue there?

Well, it depends. If you are receiving a (very) long string or list of terms, it has a direct impact on memory and will eventually lead to memory exhaustion:

% size of the list should be checked
% if not, memory exhaustion can happen
[ $1 || _ <- lists:seq(0,160_000_000) ].
% eheap_alloc: Cannot allocate 3936326656 bytes of memory (of type "heap").
% Crash dump is being written to: erl_crash.dump...

The same behavior can be triggered using binaries:

% big binaries can crash the BEAM
binary_to_term(<<131, 111, 4294967294:32/unsigned-integer, 0:8/integer, 255:8, 0:4294967280/unsigned-integer>>).
% binary_alloc: Cannot allocate 4294967293 bytes of memory (of type "binary").
% Crash dump is being written to: erl_crash.dump...

Generating ETF payloads with very large big integers also has an impact on CPUs; the following calls can create a DoS if executed by many processes at once.

% big payload, high cpu usage, no crash.
% size of the big integer must be checked
% size: 2**18-1, binary byte size: 262_150 (~262kB)
_ = binary_to_term(<<131, 111, 262_143:32/unsigned-integer, 0:8/integer, 255:2_097_144/unsigned-integer>>).

% size: 2**19-1, binary byte size: 524_294 (~524kB)
_ = binary_to_term(<<131, 111, 524_287:32/unsigned-integer, 0:8/integer, 255:4_194_296/unsigned-integer>>).

% size: 2**20-1, binary byte size: 1_048_582 (~1MB)
_ = binary_to_term(<<131, 111, 1_048_575:32/unsigned-integer, 0:8/integer, 255:8_388_600/unsigned-integer>>).

Creating a long node name can crash the VM during startup, because the node name is an atom (encoded as an atom_ext term) and atoms are limited to 255 characters. If the node name is longer than that, the VM crashes.

erl -sname $(pwgen -A0 252 1)
# Crash dump is being written to: erl_crash.dump...done

erl -name $(pwgen -A0 246 1)@localhost
# Crash dump is being written to: erl_crash.dump...done

It's highly probable that other terms can have a deadly impact on a node or a cluster.
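
Until a high-level parser like berty is in place, a minimal defensive sketch (the size cap and function name are made up) combines a byte-size limit with the safe option:

% hypothetical guard: cap payload size, then decode with safe;
% note that the cap does not cover compressed terms, which can
% expand far beyond their encoded size
decode_untrusted(Bin) when is_binary(Bin), byte_size(Bin) > 65536 ->
    {error, too_large};
decode_untrusted(Bin) when is_binary(Bin) ->
    try binary_to_term(Bin, [safe]) of
        Term -> {ok, Term}
    catch
        error:badarg -> {error, unsafe_or_invalid}
    end.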

How to fix the root cause?

The root cause lies in atoms; at least one paper[3] talked about it. Fixing the garbage collection issue could help a lot, but if that is not possible for many reasons, using a high-level implementation of ETF with some way to control what kind of data comes in might be an "okayish" solution.

The "Let it crash" philosophy is quite nice when developing high-level applications interacting in a safe place, but it can't be applied where uncontrolled data comes in. Some functions, like binary_to_term/1, must be avoided at all costs.

What about ETF schema?

This answer is a draft, a sandbox to design an Erlang ETF Schema feature.

It might be great to have a syntax to create ETF schemas, a bit like protobuf[4], JSON Schema[5], XML[6] (with XSLT[7]) or ASN.1[8]. In fact, when I started to look for something around this feature, I also found the UBF[9] project from Joe Armstrong.

schema1() ->
  integer().

schema2() ->
  tuple([[atom(ok), integer()]
        ,[atom(error), string(1024)]]).

% fun ({ok, X}) when is_integer(X) -> true;
%     ({error, X}) when is_list(X) andalso length(X) =< 1024 -> is_string(X);
%     (_) -> false.

Here is the final representation:

[{tuple, [{atom, [ok]}, {integer, []}]}
,{tuple, [{atom, [error]}, {string, [1024]}]}
]
% or
[[tuple, [2]]
,[atom, [ok,error]]
,[integer, []]
,[string, [1024]]
].
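
To make the draft concrete, here is a minimal, hypothetical interpreter for the first representation above (the valid/2 and match/2 names are made up):

%% check a term against a list of alternative schemas
valid(Term, Schemas) when is_list(Schemas) ->
    lists:any(fun(Schema) -> match(Term, Schema) end, Schemas).

match(T, {tuple, Schemas}) when is_tuple(T),
                                tuple_size(T) =:= length(Schemas) ->
    Pairs = lists:zip(tuple_to_list(T), Schemas),
    lists:all(fun({E, S}) -> match(E, S) end, Pairs);
match(A, {atom, Allowed}) when is_atom(A) ->
    lists:member(A, Allowed);
match(I, {integer, []}) when is_integer(I) ->
    true;
match(S, {string, [Max]}) when is_list(S) ->
    length(S) =< Max andalso io_lib:printable_list(S);
match(_, _) ->
    false.

% true  = valid({ok, 1}, [{tuple, [{atom, [ok]}, {integer, []}]}
%                        ,{tuple, [{atom, [error]}, {string, [1024]}]}]).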

What about an ETF path feature?

Another feature like XPath or JSONPath is needed as well; an easy and comprehensible syntax has to be created. I would like to include:

  1. pattern matching
% how to create an etf path?
% first example
% ETF = #{ key => #{ key2 => { ok, "test"} } }.
"test" = path(ETF, "#key#key2{ok,@}")

% second example
% ETF = [{ok, "test"}, {error, badarg}, {ok, "data"}].
[{ok, "test"},{ok, "data"}] = path(ETF, "[{ok,_}]")
% or
[]{ok,_}

% third example
% ETF = {ok, #{ <<"data">> => [<<"test">>] }}.
[<<"test">>] = path(ETF, "{ok,@}#!data").

Nothing to add?

When I wrote Serialization series — Do you speak Erlang ETF or BERT? (part 1) in 2017, someone told me to check another project called jem.js and to read the Replacing JSON when talking to Erlang (archive) blog post. What's funny here is this:

handle_post(Req, State) ->
  {ok, Body, Req1} = cowboy_req:body(Req),
  Decoded = erlang:binary_to_term(Body),
  Reply = do_whatever(Decoded),
  {erlang:term_to_binary(Reply), Req1, State}.

Yes, "Faster and more efficient", but can destroy your whole platform in few second. Don't do that. Please. Unfortunately, inaka.net seems to be down, it would have been funny to play with that.

Is there a "risk analysis" for each terms somewhere?

Probably, but I did not find much on the subject. Here is a short summary of each term, whether it is safe or not, and the associated risk(s).

Term                  Code  Safe?  Risk(s)
ATOM_CACHE_REF        82    no     atom exhaustion
ATOM_EXT              100   no     atom exhaustion
ATOM_UTF8_EXT         118   no     atom exhaustion
BINARY_EXT            109   maybe  dynamic binary length (32 bits)
BIT_BINARY_EXT        77    maybe  dynamic bitstring length (32 bits)
EXPORT_EXT            113   no     atom exhaustion
FLOAT_EXT             99    yes    31 bytes fixed-length float
FUN_EXT               117   no     atom exhaustion
INTEGER_EXT           98    yes    4 bytes fixed length
LARGE_BIG_EXT         111   maybe  dynamic integer length (32 bits)
LARGE_TUPLE_EXT       105   maybe  dynamic tuple length (32 bits)
LIST_EXT              108   maybe  dynamic list length (32 bits)
LOCAL_EXT             121   yes    atom exhaustion
MAP_EXT               116   maybe  dynamic pair length (32 bits)
NEWER_REFERENCE_EXT   90    no     memory exhaustion
NEW_FLOAT_EXT         70    yes    8 bytes fixed-length float
NEW_FUN_EXT           112   no     atom exhaustion
NEW_PID_EXT           88    no     atom exhaustion
NEW_PORT_EXT          89    no     atom exhaustion
NEW_REFERENCE_EXT     114   maybe  dynamic reference length (16 bits)
NIL_EXT               106   yes    fixed length
PID_EXT               103   no     atom exhaustion
PORT_EXT              102   no     atom exhaustion
REFERENCE_EXT         101   no     atom exhaustion
SMALL_ATOM_EXT        115   no     atom exhaustion
SMALL_ATOM_UTF8_EXT   119   no     atom exhaustion
SMALL_BIG_EXT         110   maybe  dynamic integer length (8 bits)
SMALL_INTEGER_EXT     97    yes    1 byte fixed length
SMALL_TUPLE_EXT       104   maybe  dynamic tuple length (8 bits)
STRING_EXT            107   maybe  dynamic string length (16 bits)
V4_PORT_EXT           120   no     atom exhaustion


Footnotes

  1. https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/atom_exhaustion.html

  2. https://paraxial.io/blog/atom-dos

  3. Atom garbage collection by Thomas Lindgren, https://dl.acm.org/doi/10.1145/1088361.1088369

  4. https://protobuf.dev/overview/

  5. https://json-schema.org/

  6. https://en.wikipedia.org/wiki/XML

  7. https://en.wikipedia.org/wiki/XSLT

  8. https://en.wikipedia.org/wiki/ASN.1

  9. https://ubf.github.io/ubf/ubf-user-guide.en.html