Skip to content
akeefer edited this page May 8, 2011 · 5 revisions

Handling graphs of objects in an ORM framework is inherently a fairly nasty problem. To make it a little more concrete, consider a simple bug-tracking system that has an Issue table, and concrete Tasks that can be attached to an Issue. If I create an Issue, then add Tasks to it, when does the Task get committed? Does updating the Issue commit any added Tasks? Do the Tasks need to be updated themselves after their fk is set? Does adding the Task immediately do a commit? What about if the Issue isn't yet committed? What if there are several objects I want to commit transactionally? If someone creates a Task, sets the issue on it, and commits, how do I make sure that Issue.Tasks picks that up? There are all sorts of problems inherent there.

After a couple of hours of going around in circles talking with Carson, I think we settled on the idea that we probably want to keep the ActiveRecord-style update model, but that we might want to some day combine it with a graph-based API to more easily manipulate complex graphs of objects. To make that concrete, here are the methods that I believe need to have explicitly defined semantics. Here's what I think their semantics should be for the standard API (it's almost what they already are, but not quite)

  • issue.update() - issue an INSERT or UPDATE on the Issue table, as appropriate. Normally, this only updates the Issue table. However, if there are Tasks that have been added to the Issue prior to its insertion into the database, issue.update() will also issue an INSERT on the task, effectively as if task.update() was called directly.
  • issue.Tasks.add(task) - Will set the task.issue_id column and immediately issue an INSERT or UPDATE on the task, provided that the Issue is already in the database. Note that the UPDATE statement will only update the issue_id column, and not update any other columns on the Task table. If the Issue is not already in the DB, this will be delayed until the Issue is committed. If the task was previously attached to a different issue, this will cause it to be removed from the previous Issue array.
  • task.Issue = issue - Sets the issue_id column on the task, but does not immediately update the database. Calls to issue.Tasks after this, however, should return this Task as well.
  • issue.Tasks.remove(task) - Will null out the task.issue_id column and immediately issue an UPDATE statement on the Task table, which will only update the issue_id column. If issue is not already committed, this will merely null out the column.
  • task.Issue = null - Will set the task.issue_id column to null, but will not immediately issue an INSERT or UPDATE. This will, however, immediately be reflected in the issue.Tasks array of the previously-attached Issue.
  • task.Issue = someOtherIssue - This will update task.issue_id, but will not immediately issue an INSERT or UPDATE. It will remove it from the prior Issue's Tasks array in-memory, and add it to someOtherIssue's
  • task.delete() - Will issue the DELETE SQL statement, and will immediately remove it from any Issue's array in-memory

That's the "standard" API, which should be well-explained. It would be nice if "add" and "remove" on issue.Tasks could use different names that are more explicit about their immediate database affect, but I think that's impractical, as it would be detrimental to the API. One could also argue that issue.Tasks.add(task) should throw if issue has not been committed to the database; I'm in principal okay with that API as well, as long as it's spelled out. The delayed-update approach is more "magic," but seems likely to do what the user wants more of the time. Another question to iron out is around whether or not the issue.Tasks.add(task) method should update all fields on the task, or just the issue_id column, as described above. Updating just the issue_id column seems like the logical approach, but leads to an inconsistency in that if the Task is newly-created, it will be INSERTed with all its columns, but if it already exists only issue_id will be changed.

An alternative approach would be to have issue.Tasks.add(task) require an explicit update on either the Issue or the Task. The argument in favor of requiring an explicit update to the Issue would be that it sort of unifies the situation where the Issue is not yet inserted, since in both cases the Task would only be updated at the time issue.update() was called. However, even that seems a little sketchy, because in one case the Task would be updated and in the other the entire Task would be inserted. Requiring that task.update() be called was rejected because it requires you to always remember to trap and store Task objects, which makes something like issue.Tasks.add(new Task()) silently fail by dropping data on the floor. That seems incredibly sub-optimal. So that's how we got to the conclusions around the current API as the default: it's not 100% consistent, but the rules are simple, you have control over them, and they can be combined with an explicit Transaction object to more tightly control transactionality.

The caching question gets a little harder. How do we make sure that issue.Tasks sees a remove of a Task when the Task has been removed? That question gets into how issue.Tasks is implemented, which means we need to understand the caching behavior. The most naive approach is to always load the tasks from the database, but that option has two problems. First of all, it means that different object instances are returned each time, which can quickly get confusing. Ideally, you want one DB row to result in one pointer, unless you explicitly request different behavior. Secondly, it's simply pretty expensive to do that. (The same holds with foreign key resolution). One option is to cache the fks and arrays on the objects themselves; that solves some of the problems, but it's not at all foolproof. If we just call task.Issue = null, how do we get a handle to the prior Issue object to alter its array, if the array has already been loaded and cached? The answer is likely that we don't, which can lead to some nasty surprises and data inconsistencies. We could work around that with a method on the objects to dump their array and fk caches . . . but then when they're reloaded, you still end up with object reference issues (i.e. multiple versions of the same thing flying around in memory). One option would simply be to disallow the explicit setting of foreign keys that are used to form arrays, and require the array operation to always be used; the fk setters could at least be deprecated. Then we could always ensure that the Issue object had a chance to update its internal state whenever a Task was added or removed. There are legitimate use cases for setting those back-pointers explicitly (if, for performance reasons, you don't want the Issue to be loaded in, and you just want to set the id), and there's also an API question there around which fks really are used to form arrays and which ones are more one-to-ones or even just contained-object pointers.

At this point, I think my preferred approach is to:

  • Advise that the array add and remove versions are used
  • Cache (by default) within the object itself
  • Provide an explicit object-graph-management API if desired

One other potential option would be to implicitly cache at the session level; I'm not sure if I like that option, however. It might be something to consider, potentially separate from the graph-management API, however.

The EntityGraphContext (we probably need a better term, but I want to avoid "bundle") idea is that you could have something similar to a transaction context, but which is really an object-graph-management context. So you could write code like:

using(mydb.GraphContext) { some code }

And within the using statement, all code would implicitly run through the EntityGraphContext. Within that context, the update() methods would be no-ops, add() and remove() methods would have deferred execution, array and fk references would be cached within the context and return consistent pointers, and the entire context would have to be committed explicitly by calling context.commit(), which would then perform all INSERT/UPDATE/DELETE operations in the appropriate order (presumably the safe order in the DB, if we can find one) inside a single transaction. The EntityGraphContext and the Transaction context would have to be mutually exclusive; there's really no reasonable way to combine them with any sort of sane semantics, since the Transaction context is likely to be used in a context where a read follows and update, and that read should see the updated data, which is impossible (effectively) unless that data has already been committed.

The API would thus be an optional add-on for more complicated cases, but would have an explicit, user-controlled scope rather than some magic scope attached to something like the session, or based on some configuration file.

It's possible we could build just caching in this fashion as well, such that array and fk references would be cached within this context . . . but it could get complicated, and having three different contexts (caching, object graph, and transaction) might be pretty rough.

Open Questions

  • Should issue.Tasks.add(task) fail fast if issue isn't already in the database? I'm starting to think that perhaps it should.
  • Should issue.Tasks.add(task) fail fast if task is already attached to another issue? I'm thinking maybe it should.
  • Should task.Issue appear in the API as writable? I'm going to argue no: the array addition version should always be allowed.
Clone this wiki locally