Add types to the graph #94

Closed
wants to merge 3 commits into from

Owner

@ChrisCummins ChrisCummins commented Aug 20, 2020

This adds a fourth node type and a fourth edge flow, both called
"type". The idea is to represent types as first-class elements in the
graph representation. This allows greater compositionality by breaking
up composite types into subcomponents, and decreases the vocabulary
size required to achieve a given coverage.

Background

Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:

node {
  type: VARIABLE
  text: "i8"
}

There are two issues with this:

  • Composite types end up with long textual representations,
    e.g. "struct foo { i32 a; i32 b; ... }". Since there is an
    unbounded number of possible structs, this prevents 100% vocabulary
    coverage on any IR with structs (or other composite types).

  • In the future, we will want to encode different information on data
    nodes, such as embedding literal values. Moving the type information
    out of the data node "frees up" space for something else.
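
As a rough illustration of the first issue, the Python sketch below compares the vocabulary needed when every composite type is a single string with the vocabulary needed once types are split into parts. The type strings and the toy `decompose` tokeniser are made up for this example and are not taken from the encoder.

```python
# Hypothetical type strings, as they might appear in node "text" fields.
whole_types = [
    "struct { i32, i32 }",
    "struct { i32, i8 }",
    "struct { i8, i8* }",
    "struct { i32, i8, i8* }",
]

# Vocabulary when each composite type is treated as one token.
whole_vocab = set(whole_types)


def decompose(type_text):
    """Split a composite type string into primitive parts (toy tokeniser)."""
    parts = set()
    cleaned = type_text.replace("{", " ").replace("}", " ").replace(",", " ")
    for token in cleaned.split():
        if token.endswith("*"):
            parts.add("*")      # the pointer becomes its own vocabulary entry
            token = token[:-1]
        parts.add(token)
    return parts


# Vocabulary when composite types are broken into subcomponents.
part_vocab = set()
for type_text in whole_types:
    part_vocab |= decompose(type_text)

print(len(whole_vocab), sorted(part_vocab))
```

Every new struct layout adds an entry to `whole_vocab`, while `part_vocab` only grows when a new primitive appears.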

Overview

This changes the representation to represent types as first-class
elements in the graph. A "type" node represents a type using its
"text" field, and a new "type" edge connects this type to variables or
constants of that type, e.g. a variable "int x" could be represented as:

node {
  type: VARIABLE
  text: "var"
}
node {
  type: TYPE
  text: "i32"
}
edge {
  flow: TYPE
  source: 1
}
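
The same graph can be sketched in Python. The `Node`, `Edge` and `Graph` classes here are hypothetical stand-ins for the real message types, and the sketch assumes that zero-valued fields (such as `target: 0`) are simply omitted from the listing above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    text: str


@dataclass
class Edge:
    flow: str
    source: int = 0
    target: int = 0
    position: int = 0


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, type, text):
        """Append a node and return its index."""
        self.nodes.append(Node(type, text))
        return len(self.nodes) - 1

    def add_type_edge(self, source, target, position=0):
        """Connect a type node to a node of that type."""
        self.edges.append(Edge("TYPE", source, target, position))


g = Graph()
var = g.add_node("VARIABLE", "var")   # node 0
i32 = g.add_node("TYPE", "i32")       # node 1
g.add_type_edge(i32, var)             # type edge from node 1 to node 0
```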

Composite types

Types may be composed by connecting multiple type nodes using type
edges. This allows you to break down complex types into a graph of
primitive parts. The meaning of composite types will depend on the
IR being targeted; the remainder of this description covers the
process for LLVM-IR.

Pointer types

A pointer is a composite of two types:

[variable] <- [pointer] <- [pointed-type]

For example:

int32_t* instance;

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}

Where variables/constants of this type receive an incoming type edge
from the [pointer] node, which in turn receives an incoming type edge
from the [pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a
graph contains multiple pointer types, there will be multiple
[pointer] nodes, one for each pointed type.
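
One way to get a single [pointer] node per pointed type is to cache pointer nodes by the index of the node they point to. The following Python sketch shows the idea; `add_node`, `pointer_to` and the tuple-based node and edge lists are illustrative names, not the actual implementation.

```python
nodes = []          # list of (type, text) tuples
edges = []          # list of (source, target) type edges
pointer_nodes = {}  # pointed-type node index -> pointer node index


def add_node(type, text):
    """Append a node and return its index."""
    nodes.append((type, text))
    return len(nodes) - 1


def pointer_to(pointee):
    """Return the pointer node for `pointee`, creating it on first use."""
    if pointee not in pointer_nodes:
        ptr = add_node("TYPE", "*")
        edges.append((pointee, ptr))  # [pointer] <- [pointed-type]
        pointer_nodes[pointee] = ptr
    return pointer_nodes[pointee]


i32 = add_node("TYPE", "i32")
i8 = add_node("TYPE", "i8")
p1 = pointer_to(i32)
p2 = pointer_to(i32)  # reuses the node created for p1
p3 = pointer_to(i8)   # a different pointed type gets its own node
```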

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int *A() {
  int a;
  return &a;
}
EOF

Struct types

A struct is a composite type where each member is a type node which
points to the parent node. Variable/constant instances of a struct
receive an incoming type edge from the root struct node. Note that
the graph of type nodes representing a composite struct type may be
cyclic, since a struct can contain a pointer of the same type (think
of a binary tree implementation). For all other member types, a new
type node is produced. For example, a struct with two integer members
will produce two integer type nodes; they are not shared.

The type edges from member nodes to the parent struct are
positional. The position indicates the element number. E.g. for a
struct with three elements, the incoming type edges to the struct node
will have positions 0, 1, and 2.
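
The description above can be sketched in Python for a self-referential struct with two i8 members and a pointer to itself. This follows the edge direction stated in the prose (member to parent struct). The tuple layout and the `add_node` helper are assumptions made for the example.

```python
nodes = []  # list of (type, text) tuples
edges = []  # list of (source, target, position) type edges


def add_node(type, text):
    """Append a node and return its index."""
    nodes.append((type, text))
    return len(nodes) - 1


struct = add_node("TYPE", "struct")
members = [
    add_node("TYPE", "i8"),
    add_node("TYPE", "i8"),
    add_node("TYPE", "*"),
]

# Positional edges from each member node to the parent struct node.
for position, member in enumerate(members):
    edges.append((member, struct, position))

# The pointer member points at the struct itself: [pointer] <- [pointed-type].
edges.append((struct, members[2], 0))

# An instance of the struct receives a type edge from the root struct node.
instance = add_node("VARIABLE", "var")
edges.append((struct, instance, 0))

incoming_positions = sorted(p for s, t, p in edges if t == struct)
has_cycle = (members[2], struct, 2) in edges and (struct, members[2], 0) in edges
```

The two i8 members get separate nodes, and the pointer member closes a cycle with the struct node.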

This example struct:

struct s {
  int8_t a;
  int8_t b;
  struct s* c;
}

struct s instance;

Would be represented as:

node {
  type: TYPE
  text: "struct"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "i8"
}
node {
  type: TYPE
  text: "*"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  source: 1
}
edge {
  flow: TYPE
  source: 2
  position: 1
}
edge {
  flow: TYPE
  source: 3
  position: 2
}
edge {
  flow: TYPE
  target: 3
}
edge {
  flow: TYPE
  target: 4
}

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
struct S {
  char a;
  char b;
  struct S* c;
};

char A() {
  struct S s;
  return s.a;
}
EOF

Array Types

An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:

int a[10];

Would be represented as:

node {
  type: TYPE
  text: "i32"
}
node {
  type: TYPE
  text: "[]"
}
node {
  type: VARIABLE
  text: "var"
}
edge {
  flow: TYPE
  target: 1
}
edge {
  flow: TYPE
  source: 1
  target: 2
}
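
To read the type of a variable back out of such a graph, a consumer can follow incoming type edges until it reaches a node with none. Below is a minimal Python sketch over the three nodes and two edges listed above. `describe_type` is a hypothetical helper and only handles simple chains with one incoming type edge per node.

```python
# Nodes and type edges from the array example: i32 -> [] -> var.
nodes = [("TYPE", "i32"), ("TYPE", "[]"), ("VARIABLE", "var")]
type_edges = [(0, 1), (1, 2)]  # (source, target)


def describe_type(node):
    """Follow incoming type edges from `node` back to the element type."""
    parts = []
    seen = set()
    while True:
        sources = [s for s, t in type_edges if t == node]
        if not sources or sources[0] in seen:
            break  # reached a primitive type, or looped back on a cycle
        node = sources[0]
        seen.add(node)
        parts.append(nodes[node][1])
    return parts


print(describe_type(2))  # ['[]', 'i32']
```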

Example

Generated using:

cat <<EOF | clang2graph -xc - | graph2dot
int* A() {
  int a[10];
  return a;
}
EOF

Function Pointers

A function pointer is represented by a type node that uniquely identifies
the signature of a function, i.e. its return type and parameter types. The
caveat of this is that pointers to different functions which have the same
signature will resolve to the same type node. Additionally, there is no edge
connecting a function pointer type and the instructions which belong to this
function.
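
The aliasing caveat follows directly from keying type nodes on the signature. The Python sketch below shows it; the `function_type` helper and the signature string format are assumptions chosen for illustration.

```python
nodes = []       # list of (type, text) tuples
type_nodes = {}  # signature text -> node index


def function_type(return_type, param_types):
    """Return the type node for a function signature, creating it if needed."""
    params = ", ".join(param_types) or "void"
    signature = f"{return_type} ({params})"
    if signature not in type_nodes:
        nodes.append(("TYPE", signature))
        type_nodes[signature] = len(nodes) - 1
    return type_nodes[signature]


a = function_type("int", [])    # pointer to one int (void) function
b = function_type("int", [])    # pointer to another int (void) function
c = function_type("float", [])  # pointer to a float (void) function
```

Here `a` and `b` resolve to the same node even though they would point to different functions, while `c` gets its own node.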

Example

This program contains two function signatures (int (void) and
float (void)), but function pointers to three different functions. This
highlights the caveat described above as the function pointers a and b
alias to the same type node:

Generated using:

br -c opt //:install && cat <<EOF | clang2graph -xc - | graph2dot
int A() {
  return 10;
}

int B() {
  return 5;
}

float C() {
  return 15;
}

int D() {
  int (*a)() = &A;
  int (*b)() = &B;
  float (*c)() = &C;
  return (*a)() + (*b)() + (*c)();
}
EOF

#82

@ChrisCummins ChrisCummins force-pushed the feature/82_types branch 3 times, most recently from 1c44cf9 to da1b14b Compare August 21, 2020 23:31
@ChrisCummins ChrisCummins marked this pull request as ready for review August 21, 2020 23:36
@Zacharias030
Collaborator

Zacharias030 commented Aug 22, 2020

Wow, I like your trust in deep learning 😛. This representation will be the ultimate test whether programs can be understood.

First read through, I have two design questions:

  • why connect all normal types globally? (Int32 ...)

If the NN learns something about it, I think it will be able to learn what i32 means and use that knowledge everywhere without needing the connection. Obviously "highways" could be useful for many problems, but I don’t think we should aim for that here. One way to stay efficient would be to not use backward edges on such types. Then the GNN stays efficient but information doesn’t cross.

  • I didn’t see how fn’s uniquely identify their signatures in your example.

Again, if we forward propagate the fn types, then connecting different functions of same signature is the right thing to do since it will be much more efficient to compute.
Thinking about it, I feel that forward-propagating types into a graph that looks mostly the same is quite a light-weight change to the overall thing. I expected these changes to be extremely heavy, but I don’t think they are atm.

The most difficult thing on my mind: what is the difficult problem that we are going to need this for :) I think the way we measured success on our problems as accuracy so far was very academic...

Best we discuss in person!
Thanks in any case for the review request

@ChrisCummins
Owner Author

Thanks for taking a look @Zacharias030!

I have no faith in deep learning. For now, I'm thinking only about what would be needed from a graph representation to enable better reasoning about types. Whether that aids the DL will need to be evaluated experimentally.

why connect all normal types globally? (Int32 ...)

My feeling is: why have multiple nodes if one will suffice? If sharing a type node creates highways that interfere with learning, that seems to me to be a failure of the learner, not a failure of the input representation.

I didn’t see how fn’s uniquely identify their signatures in your example.

Good point, using "fn" as the text representation for function types doesn't encode the signature! I've changed the encoder to instead use the text representation of the signature (e.g. i32(...)) and updated the example above.

Again, if we forward propagate the fn types, then connecting different functions of same signature is the right thing to do since it will be much more efficient to compute.

I agree. What is currently not represented is the relation from a fn type to the instructions that implement that fn. By looking at the graph, you can see only that a function pointer is called; you can't tell where the control flow jumps to. In the general case, this cannot be statically determined, but for cases where we do know the address of the function pointer that is called, we should add call edges to/from that function.

Let's discuss the remainder on our next call :)

Cheers,
Chris

This changes the format of the LLVM-IR program graphs to store a list
of unique strings, rather than LLVM-IR strings in each node. We use a
graph-level "strings" feature to store a list of the original LLVM-IR
strings corresponding to the graph nodes. This allows us to refer
to the same string from multiple nodes without duplication.
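
A minimal Python sketch of this kind of deduplication is shown below. `build_string_table` and the example instruction strings are hypothetical, and the real feature layout may differ.

```python
def build_string_table(node_texts):
    """Deduplicate per-node strings into a shared list plus per-node indices."""
    strings = []    # graph-level list of unique strings
    index_of = {}   # string -> position in `strings`
    node_refs = []  # for each node, an index into `strings`
    for text in node_texts:
        if text not in index_of:
            index_of[text] = len(strings)
            strings.append(text)
        node_refs.append(index_of[text])
    return strings, node_refs


strings, refs = build_string_table([
    "%1 = alloca i32",
    "store i32 0, i32* %1",
    "%1 = alloca i32",  # same text as the first node, stored only once
])
```

Each node then carries only an integer, and the original string is recovered with `strings[refs[i]]`.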

This breaks compatibility with the inst2vec encoder on program graphs
generated prior to this commit.

Signed-off-by: format 2020.06.15 <github.com/ChrisCummins/format>
This updates the llvm2graph plots to show a fifth "type graph" stage,
and updates the README to describe how types are added to the graph.

github.com//issues/82
@ChrisCummins
Owner Author

Merging via #199.

@ChrisCummins ChrisCummins deleted the feature/82_types branch July 16, 2022 00:56