Add types to the graph #94

ChrisCummins · 2020-08-20T15:47:32Z

This adds a fourth node type, and a fourth edge flow, both called
"type". The idea is to represent types as first-class elements in the
graph representation. This allows greater compositionality by breaking
up composite types into subcomponents, and decreases the vocabulary
size required to achieve a given coverage.

Background

Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:

node {
  type: VARIABLE
  text: "i8"
}

There are two issues with this:

* Composite types end up with long textual representations,
  e.g. "struct foo { i32 a; i32 b; ... }". Since there is an
  unbounded number of possible structs, this prevents 100% vocabulary
  coverage on any IR with structs (or other composite types).
* In the future, we will want to encode different information on data
  nodes, such as embedding literal values. Moving the type information
  out of the data node "frees up" space for something else.
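
The first issue can be illustrated with a small Python sketch. This is not ProGraML code: the struct strings and the `decompose` helper are hypothetical, chosen only to show why whole-type strings need one vocabulary entry per struct while decomposed types share a small, fixed set of entries.

```python
# Hypothetical illustration only: the type strings and helper are made up.
import re

# Three distinct struct types, as they might appear in a node "text" field.
whole_types = [
    "struct a { i32; i32 }",
    "struct b { i32; i8 }",
    "struct c { i8; i8; i32 }",
]


def decompose(type_text):
    """Split a struct string into a 'struct' token plus its member tokens."""
    members = re.findall(r"i\d+", type_text)
    return ["struct"] + members


# Whole strings: one vocabulary entry per distinct struct.
whole_vocab = set(whole_types)
# Decomposed: one vocabulary entry per distinct primitive part.
part_vocab = {token for t in whole_types for token in decompose(t)}

print(len(whole_vocab), sorted(part_vocab))  # 3 ['i32', 'i8', 'struct']
```

Every new struct adds an entry to `whole_vocab`, whereas `part_vocab` only grows when a new primitive part appears.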

Overview

This change makes types first-class elements in the graph. A "type"
node represents a type using its
"text" field, and a new "type" edge connects this type to variables or
constants of that type, e.g. a variable "int x" could be represented as:

node {
  type: VARIABLE
  text: "var"
}
node {
  type: TYPE
  text: "i32"
}
edge {
  flow: TYPE
  source: 1
}
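
As a rough sketch of how such a graph might be consumed, the Python below mirrors the example with plain dataclasses. The `Node`, `Edge`, and `type_of` names are hypothetical and are not the project's API; the sketch also assumes that omitted `source`, `target`, and `position` fields default to 0, as the examples here appear to do.

```python
# Hypothetical sketch: these classes are stand-ins, not the real graph schema.
from dataclasses import dataclass


@dataclass
class Node:
    type: str
    text: str


@dataclass
class Edge:
    flow: str
    source: int = 0    # assumed default when the field is omitted
    target: int = 0    # assumed default when the field is omitted
    position: int = 0  # assumed default when the field is omitted


# Node 0 is the variable, node 1 is its type, as in the example above.
nodes = [Node("VARIABLE", "var"), Node("TYPE", "i32")]
edges = [Edge("TYPE", source=1)]  # type node 1 -> variable node 0


def type_of(node_index):
    """Return the text of the type node with a TYPE edge into node_index."""
    for edge in edges:
        if edge.flow == "TYPE" and edge.target == node_index:
            return nodes[edge.source].text
    return None


print(type_of(0))  # i32
```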

Composite types

Types may be composed by connecting multiple type nodes using type
edges. This allows you to break down complex types into a graph of
primitive parts. The meaning of composite types will depend on the
IR being targeted; the remainder of this description covers the
process for LLVM-IR.

Pointer types

A pointer is a composite of two types:

[variable] <- [pointer] <- [pointed-type]

For example:

int32_t* instance;

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Where variables/constants of this type receive an incoming type edge
from the [pointer] node, which in turn receives an incoming type edge
from the [pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a
graph contains multiple pointer types, there will be multiple
[pointer] nodes, one for each pointed type.
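
A minimal sketch of the "one [pointer] node per unique pointer type" rule, written in Python. The `add_node` and `pointer_to` helpers and the cache dictionary are hypothetical; the actual graph builder may be structured differently.

```python
# Hypothetical sketch of pointer-node deduplication; not the real builder.
nodes = []          # (type, text) tuples, indexed by node id
type_edges = []     # (source, target) pairs with flow TYPE
pointer_cache = {}  # pointed-type node id -> pointer node id


def add_node(kind, text):
    """Append a node and return its index."""
    nodes.append((kind, text))
    return len(nodes) - 1


def pointer_to(pointee):
    """Return the pointer node for `pointee`, creating it on first use."""
    if pointee not in pointer_cache:
        pointer = add_node("TYPE", "*")
        type_edges.append((pointee, pointer))  # [pointer] <- [pointed-type]
        pointer_cache[pointee] = pointer
    return pointer_cache[pointee]


i32 = add_node("TYPE", "i32")
i8 = add_node("TYPE", "i8")

first = pointer_to(i32)
second = pointer_to(i32)  # reuses the existing i32* node
other = pointer_to(i8)    # a different pointed type gets its own node

print(first == second, first == other)  # True False
```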

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int *A() {
  int a;
  return &a;
}
EOF

Struct types

A struct is a composite type where each member is a type node which
points to the parent node. Variable/constant instances of a struct
receive an incoming type edge from the root struct node. Note that
the graph of type nodes representing a composite struct type may be
cyclical, since a struct can contain a pointer of the same type (think
of a binary tree implementation). For all other member types, a new
type node is produced. For example, a struct with two integer members
will produce two integer type nodes; they are not shared.

The type edges from member nodes to the parent struct are
positional. The position indicates the element number. E.g. for a
struct with three elements, the incoming type edges to the struct node
will have positions 0, 1, and 2.

This example struct:

struct s {
  int8_t a;
  int8_t b;
  struct s* c;
}

struct s instance;

Would be represented as:

node {
  type: TYPE
  text: "struct"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  source: 1
}
edge {
  flow: TYPE
  source: 2
  position: 1
}
edge {
  flow: TYPE
  source: 3
  position: 2
}
edge {
  flow: TYPE
  target: 3
}
edge {
  flow: TYPE
  target: 4
}
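
The positional member edges and the possible cycle can be checked with a short Python sketch. It follows the prose description above (member type nodes point to the struct node, ordered by position). The tuple layout and the `members` and `reaches_itself` helpers are hypothetical, not part of the project.

```python
# Hypothetical sketch; node indices: 0=struct, 1=i8, 2=i8, 3="*", 4=var.
texts = ["struct", "i8", "i8", "*", "var"]

# (source, target, position) for each TYPE edge.
type_edges = [
    (1, 0, 0),  # member a -> struct
    (2, 0, 1),  # member b -> struct
    (3, 0, 2),  # member c (pointer) -> struct
    (0, 3, 0),  # struct -> pointer: the pointed type is the struct itself
    (0, 4, 0),  # struct -> variable instance
]


def members(struct_index):
    """Return member type texts ordered by edge position."""
    incoming = [(pos, src) for src, dst, pos in type_edges if dst == struct_index]
    return [texts[src] for pos, src in sorted(incoming)]


def reaches_itself(start):
    """Return True if following TYPE edges from `start` leads back to it."""
    stack = [dst for src, dst, _ in type_edges if src == start]
    seen = set()
    while stack:
        node = stack.pop()
        if node == start:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dst for src, dst, _ in type_edges if src == node)
    return False


print(members(0))         # ['i8', 'i8', '*']
print(reaches_itself(0))  # True
```

The struct node lies on a cycle (struct -> pointer -> struct), while the two `i8` member nodes do not.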

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
struct S {
  char a;
  char b;
  struct S* c;
};

char A() {
  struct S s;
  return s.a;
}
EOF

Array Types

An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:

int a[10];

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "[]"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int* A() {
  int a[10];
  return a;
}
EOF

Function Pointers

A function pointer is represented by a type node that uniquely identifies
the signature of a function, i.e. its return type and parameter types. The
caveat of this is that pointers to different functions which have the same
signature will resolve to the same type node. Additionally, there is no edge
connecting a function pointer type and the instructions which belong to this
function.
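
The aliasing caveat can be sketched in a few lines of Python. The `function_type_node` helper and the signature string format used as a key are hypothetical; they only illustrate that a node keyed on the signature cannot distinguish functions that share it.

```python
# Hypothetical sketch: one type node per distinct function signature.
signature_nodes = {}  # signature text -> type node id


def function_type_node(return_type, param_types):
    """Return the node id for a signature, creating it on first use."""
    key = f"{return_type}({', '.join(param_types)})"  # made-up text format
    if key not in signature_nodes:
        signature_nodes[key] = len(signature_nodes)
    return signature_nodes[key]


# Pointers to three different functions, but only two signatures.
a = function_type_node("i32", [])    # pointer to a function returning i32
b = function_type_node("i32", [])    # a different function, same signature
c = function_type_node("float", [])  # a function returning float

print(a == b, a == c, len(signature_nodes))  # True False 2
```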

Example

This program contains two function signatures (int (void) and
float (void)), but function pointers to three different functions. This
highlights the caveat described above as the function pointers a and b
alias to the same type node:

Generated using:

br -c opt //:install && cat <<EOF | clang2graph -xc - | graph2dot
int A() {
  return 10;
}

int B() {
  return 5;
}

float C() {
  return 15;
}

int D() {
  int (*a)() = &A;
  int (*b)() = &B;
  float (*c)() = &C;
  return (*a)() + (*b)() + (*c)();
}
EOF

#82

Zacharias030 · 2020-08-22T07:48:35Z

Wow, I like your trust in deep learning 😛. This representation will be the ultimate test whether programs can be understood.

First read through, I have two design questions:

why connect all normal types globally? (Int32 ...)

If the NN learns something about it, I think it will be able to learn what i32 means and use that knowledge everywhere without needing the connection. Obviously „highways“ could be useful for many problems, but I don’t think we should aim for that here. One way to stay efficient would be to not use backward edges on such types. Then the GNN stays efficient but information doesn’t cross.

I didn’t see how fn‘s uniquely identify their signatures in your example.

Again, if we forward propagate the fn types, then connecting different functions of same signature is the right thing to do since it will be much more efficient to compute.
Thinking about it, I feel that forward-propagating types into a graph that looks mostly the same is quite a light-weight change to the overall thing. I expected these changes to be extremely heavy, but I don’t think they are atm.

The most difficult thing on my mind: what is the difficult problem that we are going to need this for :) I think the way we measured success on our problems as accuracy so far was very academic...

Best we discuss in person!
Thanks in any case for the review request

ChrisCummins · 2020-08-24T10:05:55Z

Thanks for taking a look @Zacharias030!

I have no faith in deep learning. For now, I'm thinking only about what would be needed from a graph representation to enable better reasoning about types. Whether that aids the DL will need to be evaluated experimentally.

why connect all normal types globally? (Int32 ...)

My feeling is: why have multiple nodes if one will suffice? If sharing a type node creates highways that interfere with learning, that seems to me to be a failure of the learner, not a failure of the input representation.

I didn’t see how fn‘s uniquely identify their signatures in your example.

Good point, using "fn" as the text representation for function types doesn't encode the signature! I've changed the encoder to instead use the text representation of the signature (e.g. i32(...)) and updated the example above.

Again, if we forward propagate the fn types, then connecting different functions of same signature is the right thing to do since it will be much more efficient to compute.

I agree. What is currently not represented is the relation from a fn type to the instructions that implement that fn. By looking at the graph, you can see only that a function pointer is called; you can't tell where the control flow jumps to. In the general case, this cannot be statically determined, but for cases where we do know the address of the function pointer that is called, we should add call edges to/from that function.

Let's discuss the remainder on our next call :)

Cheers,
Chris

This changes the format of the LLVM-IR program graphs to store a list of unique strings, rather than LLVM-IR strings in each node. We use a graph-level "strings" feature to store a list of the original LLVM-IR strings corresponding to each graph node. This allows us to refer to the same string from multiple nodes without duplication. This breaks compatibility with the inst2vec encoder on program graphs generated prior to this commit. Signed-off-by: format 2020.06.15 <github.com/ChrisCummins/format>

This updates the llvm2graph plots to show a fifth "type graph" stage, and updates the README to describe how types are added to the graph. github.com//issues/82


ChrisCummins · 2022-07-16T00:56:25Z

Merging via #199.

ChrisCummins force-pushed the feature/82_types branch 3 times, most recently from 1c44cf9 to da1b14b Compare August 21, 2020 23:31

ChrisCummins requested a review from Zacharias030 August 21, 2020 23:32

ChrisCummins marked this pull request as ready for review August 21, 2020 23:36

ChrisCummins force-pushed the feature/82_types branch from da1b14b to 0ad4d54 Compare August 22, 2020 00:36

ChrisCummins force-pushed the feature/82_types branch from 0ad4d54 to 789b8f0 Compare August 24, 2020 10:07

ChrisCummins added 3 commits August 30, 2020 02:55

Add description of type graph to documentation.

b2ef0af


ChrisCummins force-pushed the feature/82_types branch from 789b8f0 to 9df26f8 Compare August 30, 2020 02:01

ChrisCummins mentioned this pull request Jul 15, 2022

Add types to graph (cherry-picked commit) #199

Merged

ChrisCummins closed this Jul 16, 2022

ChrisCummins deleted the feature/82_types branch July 16, 2022 00:56
