Add types to the graph #94
Wow, I like your trust in deep learning 😛. This representation will be the ultimate test of whether programs can be understood. On a first read-through, I have two design questions:
If the NN learns something about it, I think it will be able to learn what i32 means and use that knowledge everywhere without needing the connection. Obviously „highways“ could be useful for many problems, but I don’t think we should aim for that here. One way to stay efficient would be to not use backward edges on such types. Then the GNN stays efficient but information doesn’t cross.
Again, if we forward propagate the fn types, then connecting different functions of the same signature is the right thing to do, since it will be much more efficient to compute. The most difficult thing on my mind: what is the difficult problem that we are going to need this for :) I think the way we have measured success on our problems so far, as accuracy, was very academic... Best we discuss in person!
Thanks for taking a look @Zacharias030! I have no faith in deep learning. For now, I'm thinking only about what would be needed from a graph representation to enable better reasoning about types. Whether that aids the DL will need to be evaluated experimentally.
My feeling is: why have multiple nodes if one will suffice? If sharing a type node creates highways that interfere with learning, that seems to me to be a failure of the learner, not a failure of the input representation.
Good point, using "fn" as the text representation for function types doesn't encode the signature! I've changed the encoder to instead use the text representation of the signature (e.g.
I agree. What is currently not represented is the relation from a fn type to the instructions that implement that fn. By looking at the graph, you can see only that a function pointer is called; you can't tell where the control flow jumps to. In the general case, this cannot be statically determined, but for cases where we do know the address of the function pointer that is called, we should add call edges to/from that function. Let's discuss the remainder on our next call :) Cheers,
This changes the format of the LLVM-IR program graphs to store a list of unique strings, rather than LLVM-IR strings in each node. We use a graph-level "strings" feature to store a list of the original LLVM-IR strings corresponding to each graph node. This allows us to refer to the same string from multiple nodes without duplication. This breaks compatibility with the inst2vec encoder on program graphs generated prior to this commit. Signed-off-by: format 2020.06.15 <github.com/ChrisCummins/format>
This updates the llvm2graph plots to show a fifth "type graph" stage, and updates the README to describe how types are added to the graph. github.com//issues/82
This adds a fourth node type, and a fourth edge flow, both called "type". The idea is to represent types as first-class elements in the graph representation. This allows greater compositionality by breaking up composite types into subcomponents, and decreases the vocabulary size required to achieve a given coverage.

Background
----------

Currently, type information is stored in the "text" field of nodes for constants and variables, e.g.:

    node { type: VARIABLE text: "i8" }

There are two issues with this:

* Composite types end up with long textual representations, e.g. "struct foo { i32 a; i32 b; ... }". Since there is an unbounded number of possible structs, this prevents 100% vocabulary coverage on any IR with structs (or other composite types).
* In the future, we will want to encode different information on data nodes, such as embedding literal values. Moving the type information out of the data node "frees up" space for something else.

Overview
--------

This changes the representation to represent types as first-class elements in the graph. A "type" node represents a type using its "text" field, and a new "type" edge connects this type to variables or constants of that type, e.g. a variable "int x" could be represented as:

    node { type: VARIABLE text: "var" }
    node { type: TYPE text: "i32" }
    edge { flow: TYPE source: 1 }

Composite types
---------------

Types may be composed by connecting multiple type nodes using type edges. This allows you to break down complex types into a graph of primitive parts. The meaning of composite types will depend on the IR being targeted; the remainder describes the process for LLVM-IR.
Pointer types
-------------

A pointer is a composite of two types:

    [variable] <- [pointer] <- [pointed-type]

For example:

    int32_t* instance;

Would be represented as:

    node { type: TYPE text: "i32" }
    node { type: TYPE text: "*" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE target: 1 }
    edge { flow: TYPE source: 1 target: 2 }

Where variables/constants of this type receive an incoming type edge from the [pointer] node, which in turn receives an incoming type edge from the [pointed-type] node.

One [pointer] node is generated for each unique pointer type. If a graph contains multiple pointer types, there will be multiple [pointer] nodes, one for each pointed type.

Struct types
------------

A struct is a composite type where each member is a type node which points to the parent node. Variable/constant instances of a struct receive an incoming type edge from the root struct node. Note that the graph of type nodes representing a composite struct type may be cyclical, since a struct can contain a pointer of the same type (think of a binary tree implementation). For all other member types, a new type node is produced. For example, a struct with two integer members will produce two integer type nodes; they are not shared.

The type edges from member nodes to the parent struct are positional. The position indicates the element number. E.g. for a struct with three elements, the incoming type edges to the struct node will have positions 0, 1, and 2.
This example struct:

    struct s {
      int8_t a;
      int8_t b;
      struct s* c;
    }
    struct s instance;

Would be represented as:

    node { type: TYPE text: "struct" }
    node { type: TYPE text: "i8" }
    node { type: TYPE text: "i8" }
    node { type: TYPE text: "*" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE target: 1 }
    edge { flow: TYPE target: 2 position: 1 }
    edge { flow: TYPE target: 3 position: 2 }
    edge { flow: TYPE source: 3 }
    edge { flow: TYPE target: 4 }

Array Types
-----------

An array is a composite type [variable] <- [array] <- [element-type]. For example, the array:

    int a[10];

Would be represented as:

    node { type: TYPE text: "i32" }
    node { type: TYPE text: "[]" }
    node { type: VARIABLE text: "var" }
    edge { flow: TYPE target: 1 }
    edge { flow: TYPE source: 1 target: 2 }

Function Pointers
-----------------

A function pointer is represented by a type node that uniquely identifies the *signature* of a function, i.e. its return type and parameter types. The caveat of this is that pointers to different functions which have the same signature will resolve to the same type node. Additionally, there is no edge connecting a function pointer type and the instructions which belong to this function.

github.com//issues/82
Merging via #199.
This adds a fourth node type, and a fourth edge flow, both called
"type". The idea is to represent types as first-class elements in the
graph representation. This allows greater compositionality by breaking
up composite types into subcomponents, and decreases the vocabulary
size required to achieve a given coverage.
Background
Currently, type information is stored in the "text" field of nodes for
constants and variables, e.g.:
There are two issues with this:
Composite types end up with long textual representations,
e.g. "struct foo { i32 a; i32 b; ... }". Since there is an
unbounded number of possible structs, this prevents 100% vocabulary
coverage on any IR with structs (or other composite types).
In the future, we will want to encode different information on data
nodes, such as embedding literal values. Moving the type information
out of the data node "frees up" space for something else.
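To illustrate the first issue, here is a minimal sketch of how vocabulary coverage behaves when node texts hold whole composite types versus primitive parts. The `coverage` helper, the vocabulary, and the example texts are all hypothetical and are not taken from the project.

```python
def coverage(vocabulary, texts):
    """Return the fraction of node texts found in a fixed vocabulary."""
    return sum(text in vocabulary for text in texts) / len(texts)


# Hypothetical fixed vocabulary of primitive parts.
vocabulary = {"var", "i32", "i8", "*", "struct"}

# Whole-type texts: every new struct definition is a new, unseen string.
whole_type_texts = [
    "i32",
    "struct foo { i32 a; i32 b; }",
    "struct bar { i8 c; }",
]

# The same types broken into primitive parts.
decomposed_texts = ["i32", "struct", "i32", "i32", "struct", "i8"]

print(coverage(vocabulary, whole_type_texts))
print(coverage(vocabulary, decomposed_texts))
```

With whole-type texts only one of the three strings is in the vocabulary, whereas every decomposed text is covered.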
Overview
This change represents types as first-class elements in the graph.
A "type" node represents a type using its "text" field, and a new
"type" edge connects this type to variables or constants of that
type, e.g. a variable "int x" could be represented as:
Composite types
Types may be composed by connecting multiple type nodes using type
edges. This allows you to break down complex types into a graph of
primitive parts. The meaning of composite types will depend on the
IR being targeted; the remainder describes the process for
LLVM-IR.
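As a rough illustration of type nodes and type edges, the following sketch builds the graph for a single `int x` variable in plain Python. The `Node`, `Edge`, and `Graph` classes are hypothetical stand-ins for the real graph messages; only the field names (`type`, `text`, `flow`, `source`, `target`, `position`) follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str  # e.g. "VARIABLE", "CONSTANT", or "TYPE"
    text: str


@dataclass
class Edge:
    flow: str
    source: int
    target: int
    position: int = 0


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_node(self, type, text):
        """Append a node and return its index."""
        self.nodes.append(Node(type, text))
        return len(self.nodes) - 1

    def add_type_edge(self, source, target, position=0):
        """Connect a type node to the node that has that type."""
        self.edges.append(Edge("TYPE", source, target, position))


graph = Graph()
var = graph.add_node("VARIABLE", "var")  # the variable x
i32 = graph.add_node("TYPE", "i32")      # its type
graph.add_type_edge(i32, var)            # type edge: i32 -> x
```

Composite types can then be expressed with further `add_type_edge` calls between type nodes.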
Pointer types
A pointer is a composite of two types: [variable] <- [pointer] <- [pointed-type].
For example:
Would be represented as:
Where variables/constants of this type receive an incoming type edge
from the [pointer] node, which in turn receives an incoming type edge
from the [pointed-type] node.
One [pointer] node is generated for each unique pointer type. If a
graph contains multiple pointer types, there will be multiple
[pointer] nodes, one for each pointed type.
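The "one [pointer] node per pointed type" rule can be sketched with a small cache keyed on the pointed-type node. This is a sketch under assumptions: the helper names are hypothetical, and it assumes that the primitive type nodes (`i32`, `i8`) are themselves shared between variables.

```python
nodes = []       # (type, text) tuples
edges = []       # (source, target) type edges
pointer_of = {}  # pointed-type node index -> [pointer] node index


def add_node(kind, text):
    """Append a node and return its index."""
    nodes.append((kind, text))
    return len(nodes) - 1


def pointer_to(pointee):
    """Return the [pointer] node for a pointed type, creating it on first use."""
    if pointee not in pointer_of:
        ptr = add_node("TYPE", "*")
        edges.append((pointee, ptr))  # [pointer] <- [pointed-type]
        pointer_of[pointee] = ptr
    return pointer_of[pointee]


i32 = add_node("TYPE", "i32")
i8 = add_node("TYPE", "i8")
x = add_node("VARIABLE", "var")  # int32_t* x
y = add_node("VARIABLE", "var")  # int32_t* y
z = add_node("VARIABLE", "var")  # int8_t* z

for variable, pointee in [(x, i32), (y, i32), (z, i8)]:
    edges.append((pointer_to(pointee), variable))  # [variable] <- [pointer]

pointer_nodes = [i for i, (_, text) in enumerate(nodes) if text == "*"]
```

Here `x` and `y` share a single [pointer] node, while `z` receives its own because it points to a different type.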
Struct types
A struct is a composite type where each member is a type node which
points to the parent node. Variable/constant instances of a struct
receive an incoming type edge from the root struct node. Note that
the graph of type nodes representing a composite struct type may be
cyclical, since a struct can contain a pointer of the same type (think
of a binary tree implementation). For all other member types, a new
type node is produced. For example, a struct with two integer members
will produce two integer type nodes; they are not shared.
The type edges from member nodes to the parent struct are
positional. The position indicates the element number. E.g. for a
struct with three elements, the incoming type edges to the struct node
will have positions 0, 1, and 2.
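The positional member edges and the possible cycle can be sketched as follows, for a struct with two `int8_t` members and a pointer to itself. Edge directions follow one reading of the prose above (members point to the parent struct, the pointed type points to its [pointer] node); the helper names and tuple layout are hypothetical.

```python
nodes = []  # (type, text) tuples
edges = []  # (source, target, position) type edges


def add_node(kind, text):
    """Append a node and return its index."""
    nodes.append((kind, text))
    return len(nodes) - 1


struct = add_node("TYPE", "struct")
a = add_node("TYPE", "i8")  # member 0
b = add_node("TYPE", "i8")  # member 1, a separate node: members are not shared
c = add_node("TYPE", "*")   # member 2, a pointer to the struct itself
instance = add_node("VARIABLE", "var")

# Member nodes point to the parent struct; position is the element number.
for position, member in enumerate([a, b, c]):
    edges.append((member, struct, position))

edges.append((struct, c, 0))         # the pointer's pointed type is the struct
edges.append((struct, instance, 0))  # the variable is an instance of the struct


def reachable(start, goal):
    """Return True if goal can be reached from start along type edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        for source, target, _ in edges:
            if source == node and target not in seen:
                if target == goal:
                    return True
                seen.add(target)
                stack.append(target)
    return False


member_positions = sorted(p for _, target, p in edges if target == struct)
is_cyclic = reachable(struct, struct)
```

The struct reaches itself through its pointer member, so a traversal over type edges needs a visited set to terminate.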
This example struct:
Would be represented as:
Array Types
An array is a composite type [variable] <- [array] <- [element-type].
For example, the array:
Would be represented as:
Function Pointers
A function pointer is represented by a type node that uniquely identifies
the signature of a function, i.e. its return type and parameter types. The
caveat of this is that pointers to different functions which have the same
signature will resolve to the same type node. Additionally, there is no edge
connecting a function pointer type and the instructions which belong to this
function.
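The aliasing caveat can be sketched by keying function pointer type nodes on a signature string. The signature text format used here (`"i32 ()"`) and the `function_pointer_type` helper are assumptions made for illustration only; the encoder's actual text representation may differ.

```python
nodes = []               # (type, text) tuples
signature_to_node = {}   # signature text -> type node index


def function_pointer_type(return_type, param_types):
    """Return the type node for a signature, creating it on first use."""
    # Hypothetical signature format, chosen only for this sketch.
    signature = f"{return_type} ({', '.join(param_types)})"
    if signature not in signature_to_node:
        nodes.append(("TYPE", signature))
        signature_to_node[signature] = len(nodes) - 1
    return signature_to_node[signature]


# Pointers to three different functions, but only two distinct signatures.
a = function_pointer_type("i32", [])
b = function_pointer_type("i32", [])
c = function_pointer_type("float", [])
```

Because `a` and `b` have the same signature, they resolve to the same type node, and nothing in the graph distinguishes the functions they point to.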
Example
This program contains two function signatures (int (void) and float (void)),
but function pointers to three different functions. This highlights the
caveat described above, as the function pointers a and b alias to the
same type node:
#82