Eugene's Blog

I can't believe it's blog!

Unification for JS

Unification is a very interesting programming tool. Originated from logical programming (its the foundation of Prolog) and used in functional programming (e.g., Haskell) it helps to compare objects for equality, identify known patterns, and reassemble results differently.

Wikipedia gives a somewhat complex definition of unification, but many people see it as an extended equivalence comparison, a pattern matching tool, and some even find parallels with XPath queries, CSS, and even jQuery, all operating on regular objects. See it for yourself.

While JavaScript provides a notion of equality, it lacks structural matching. Fortunately it is easy to write one ourselves. It is in the same category as cloning and copying objects.

Unification is not a new concept. There are several open source libraries available, I will talk about heya-unify, which implements a symmetric first-order unification algorithm custom-tailored for JavaScript. I authored this module some time ago, and used successfully in my projects.

In order to try examples with node.js install it with npm:

1
npm install heya-unify

It can be used in browsers with an AMD loader/builder like require.js without modifications. The project itself is available on GitHub: heya-unify.

Structural matching

The problem: JavaScript doesn’t readily provide tools to match objects. For example, these two simple objects are not equal:

1
2
3
var a = [42], b = [42];
a == b;  // false
a === b; // false

The same goes for {}:

1
2
3
var c = {luck: 7}, d = {luck: 7};
c == d;  // false
c === d; // false

Why? Because internally JavaScript doesn’t try to match structures of those objects, but opts to compare their internal memory addresses. The latter is easier and faster, but frequently is not what required.

We can think of unification as a way to compare those objects structurally down to primitive types:

1
2
3
4
var unify = require("heya-unify"); // for node.js

unify(a, b); // truthy
unify(c, d); // truthy

unify() accepts two mandatory arguments (objects to unify, can be passed in any order), and an optional environment argument (more on that later). Usually unification objects are JSON-like, but custom objects can be used as well. For example, out of box unify() can deal with arrays, dates, and regular expressions.

When unify() checks for structural equivalence, it pays attention to objects being exact. It means that objects should have the same number of enumerable properties with the same names, and equivalent values. This behavior can be controlled by programmer (discussed below).

While comparing two data structures for exactness is useful, it is rarely needed in real programs. Yet there is one super-important use case for such unification: unit tests. If our function should produce a complex data structure, and we know the result, or its shape, we can unify them to see, if we got what was expected. That’s why heya-unify is extensively used in heya-unit, which will be a subject of another blog post.

Variables

Sometimes we don’t know an exact value of a certain component, and want to figure it out. In this case we can supply a variable instead of an actual value. During matching, this variable will be assigned a value, which we can query back later.

Our next example will “unpack” 2D coordinates from a JSON-like object:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
var variable = unify.variable; // to declutter our example

var x = variable(),
    y = variable();

var pattern = {
        position: {
            x: x, // instead of actual values, we specify variables
            y: y  // to receive those values
        }
    };

var env = unify(pattern, {position: {x: 2, y: 3}});
if(env){
    console.log("position:", x.get(env), y.get(env));
    // it will print: "position: 2 3"
}

Note how natural and simple it was to define a pattern.

This simple example shows important points:

  • unify() returns an environment object (details are below).
    • If a unification was unsuccessful, null will be returned.
  • Variables make sense only in a context of environment, which is required to understand their state.
    • Variables can be reused with different environments, no need to re-create them for every comparison.
    • This feature improves performance and memory characteristics, and allows to create patterns once statically, and reuse them multiple times.
  • Assigning multiple variables as a result of one unification can be viewed as a transaction. It is all-or-nothing type of deal.

Variables can point to any valid JavaScript object from primitive values to complex objects. Such objects can contain other variables, or even the variable itself creating a recursive structure.

Manual code

In order to understand better what unify() did for us, let’s compare it with an equivalent manual code, which is simple enough:

1
2
var x = data.position.x,
    y = data.position.y;

Well, it is not really equivalent: what if data is not an object? Easy to fix:

1
2
3
4
5
6
var x, y;

if(data && typeof data == "object"){
    x = data.position.x;
    y = data.position.y;
}

Still, what if it has no position sub-object? Another fix:

1
2
3
4
5
6
7
var x, y;

if(data && typeof data == "object" &&
        data.position && typeof data.position == "object"){
    x = data.position.x;
    y = data.position.y;
}

But what if x or y are missing?

1
2
3
4
5
6
7
8
var x, y;

if(data && typeof data == "object" &&
        data.position && typeof data.position == "object" &&
        "x" in data.position && "y" in data.position){
    x = data.position.x;
    y = data.position.y;
}

Now it is functionally the same, if we forget that unify() will ensure that data has nothing but position in it, and position has exactly two members: x and y. Actually this behavior of unify() can be controlled. More on that later.

As you can see even for a minimal case our manual code grows like a snowball. Adding more literals, while trivial for unify(), will make manual checks even more complex, harder to understand, and to maintain. What this complexity buys us? Performance. For example, the manual code above does not contain any function calls. Still, it can benefit from caching some sub-objects in temporary variables for a fast access. Introducing such variables will make this snippet even more “industrial”, and cryptic.

Stepping back and looking at our code, some programmers make an interesting point: the whole pattern matching with variables can be reinterpreted as a set of XPath-like queries bunched together, and executed at the same time on a per-object basis with variables serving as named results of an individual query.

More on variables

Unification variables behave differently from plain vanilla JavaScript variables. First of all, they can be in two states:

  1. Unbound, when nothing was assigned yet.
    • This is the initial state of a variable in a newly created environment.
    • Only unbound variables can be assigned by unifying them with values.
    • Transition from unbound to bound is a one-time deal.
  2. Bound, when a value was assigned to this variable.
    • Once bound, a variable’s state cannot be changed:
      • It cannot be made unbound.
      • New value cannot be assigned.
    • Using a bound variable is exactly the same as using its value.

Example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// let's find data items with x === y

var pattern = {
        position: {
            x: x,
            y: x  // we use the same variable twice
        }
    };

var env = unify(pattern, {position: {x: 2, y: 3}});
!env; // env is falsy (== null)

env = unify(pattern, {position: {x: 5, y: 5}});
env;  // env is truthy
console.log("our variable is ", x.bound(env) ? "bound" : "unbound"); // bound
console.log("both x and y are", x.get(env)); // 5

During the unification process an unbound variable become bound, if it is unified against any value. As soon as it is bound, its value is transparently substituted where the variable is encountered.

From the example above we see that, if an environment was returned:

  • bound(env) method returns a bound status of a variable in a given environment.
  • get(env) method returns a value for a bound variable.

While it is not all, now we are ready for some serious heavy lifting.

Advanced pattern matching

Many algorithms, which work on dynamic structures, are based on a simple pattern recognition, where we separate parts of it using a pattern, and create another object using recognized parts, or modify existing objects.

Let’s use red-black trees as our example. Obviously implementing the whole package is not suitable as a simple example — typical straightforward JavaScript implementations usually take hundreds of lines of terse code, so we will implement just one step (insert case #4):

You can see this image in all its gory details here: Red-black tree insert case 4.

We will assume that a node is represented as a following structure:

1
2
3
4
5
6
var node = {
        color:  "red",       // "red" or "black"
        left:   leftChild,   // a child node
        right:  rightChild,  // a child node
        parent: parentNode   // a parent node
    };

In real life it will hold a key, and possibly a value, and may have some additional properties.

To simplify our example, we will assume that a node configuration is exactly like in the picture, while in reality there is a certain freedom of how nodes are connected, e.g., a parent node can be left or right child of its parent.

First let’s build a pattern using variable names from the picture:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
var c1 = variable(), c2 = variable(), c3 = variable(),
    n  = variable(), p  = variable(), u  = variable(),
    g  = variable(), gg = variable();

var pattern4 = { // node
        color: "red",
        left:  c2,
        right: c3,
        parent: { // parent
            color: "red",
            left:  c1,
            right: n, // points to node itself
            parent: { // grandparent
                color: "black",
                left:  p, // points to parent
                right: u,  // points to uncle
                parent: gg // a great grandparent
            }
        }
    };

As we can see, our pattern is a current node’s view of a tree fragment shown in a picture as N. Note that we don’t verify that node’s parent has the node as a child assuming that our tree is already well-formed. But this verification is trivial to add.

Now let’s match it, and reassemble:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
var env = unify(pattern4, node);

if(env){
    // let's modify the tree
    var parent = p.get(env),
        grandparent = g.get(env);
    // rearrange pointers like in the Wikipedia article
    grandparent.left = node;
    node.parent = grandparent;
    node.left = parent;
    parent.parent = node;
    parent.left = c1.get(env);
    if(parent.left){ parent.left.parent = parent; }
    parent.right = c2.get(env);
    if(parent.right){ parent.right.parent = parent; }
    node.right = c3.get(env);
    if(node.right){ node.right.parent = node; }
}

If we compare this code with a C code in the Wikipedia article, three things are obvious:

  • Our pattern is an object literal that we want to see, rather than a bunch of if statements spread over the whole program. At the very least the object literal is easier to debug, and modify.
  • Our “assemble” part is basically the same.
    • The biggest difference is that we already have all required values separated, and available as bound variables.
  • We have some technical variables, which are declared, but never used, because we don’t need their value: node n, uncle u, and great-grandparent gg. Is it wasteful? Yes. In real applications we should use anonymous variables, or “open” objects. Both are described below.

In the C snippet you will notice that they have a reduced/incomplete set of tests to check if a node fits the case. That’s OK. Keep in mind that the step #4 is called conditionally from the previous step, which is called conditionally from its predecessor, and so on. Effectively all necessary checks are implemented, but they are distributed across many subroutines. As you can see it is not a problem for the unification.

Anonymous variable

While testing for structural equivalence we may disregard values like in the example above. Usually we want to test that a certain member is present without caring for its actual value.

This case is common enough to have a special facility for that — anonymous variable:

1
2
3
4
5
6
7
8
9
10
11
12
var x = variable(),
    any = unify.any; // anonymous variable is a singleton

var pattern = {
        position: {
            x: x,
            y: any
        }
    };

var env = unify(pattern, {position: {x: 2, y: 3}});
console.log("x is", x.get(env)); // x is 2

The anonymous variable can be unified with anything, its value is never stored, and cannot be retrieved. Consequently it does not implement an API of a regular variable. It can be used multiple times always succeeding regardless of its previous history.

Named variables

Internally variables are identified by a unique name, which is available as a name property. It is possible to specify it directly:

1
2
3
4
5
6
var pattern = {
        position: {
            x: variable("x"),
            y: variable("x")  // the same name is reused
        }
    };

In the example above we defined two different variable instances, but because they both named x, it will be treated as one variable similar to our example above, when we wanted to match only positions with the same x and y coordinates.

Sometimes after a successful match variables are not bound. It can happen in two cases:

  • When a variable didn’t participate in a match at all, e.g., it was not included.
  • When a variable was unified with another unbound variable. It is called “aliasing”.
    • Aliased variables are bound in one go: if one variable is bound, all its aliases are bound as well.

unify() uses a symmetric algorithm to unify its parameters. First, second, or both parameters can contain variables allowing a mutual matching, if desired.

Example:

1
2
3
4
5
6
var a = variable("a"), b = variable("b");
var env = unify(a, b);
console.log(a.bound(env)); // falsy
console.log(b.bound(env)); // falsy
console.log(a.alias("b", env)); // truthy
console.log(b.alias("a", env)); // truthy

Now we know the whole public API — in addition to methods bound(env) and get(env), variables implement one more public method, and a property:

  • alias(name, env) method returns true, if it is aliased with a specified named variable.
  • name returns its name as a string.

Environments

It is possible to do several unifications in a row using the same environment. In order to do that, an environment object should be passed as a third parameter of unify():

1
2
3
4
5
var a = variable("a"), b = variable("b");
var env = unify(a, b);
env = unify(a, 1, env);  // notice that we passed in the existing environment
console.log(a.get(env)); // 1
console.log(b.get(env)); // 1

It is possible to chain as many unify() calls as needed, but there is one important restriction: if unify() returned null it means that the current environment can be in a corrupted state. Specifically it may have aliased or bound some variables, but not others. It is advisable to abort the whole chain, and disregard the environment.

“Open” objects

JavaScript uses relatively complex data structures as foundational blocks, e.g., compare a named tuple common in logical and functional languages with a hash-based dynamic objects. Such objects make adding more properties a breeze, and we don’t want to restrict JavaScript’s expressiveness, while transplanting an abstract unification concept there. In many cases we don’t want to compare objects precisely, we want to compare only certain properties ignoring the rest.

One simple way to achieve that is to mark an object as “open”:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
var open = unify.open;

var x = variable(), y = variable();

var pattern = open({
        position: {
            x: x,
            y: y
        }
    });

var env = unify(pattern, {value: 42, position: {x: 2, y: 3}});
console.log("x is", x.get(env)); // x is 2
console.log("y is", y.get(env)); // y is 3

Please note that open() marks its parameter as “open” leaving possible children unchanged: in the example above, data should be an object that contains position property. Potentially data may have other properties as well, which will be ignored. But position property itself should be an object with precisely two enumerable properties: x and y. Anything else will lead to a mismatch.

If we want position to be an “open” pattern as well, we should mark it as “open” too:

1
2
3
4
5
6
var pattern = open({
        position: open({
            x: x,
            y: y
        })
    });

The same concept can be applied to arrays. In this case “open” array matches only the beginning of an array ignoring the rest:

1
2
3
4
5
6
7
var x = variable(), y = variable();

var pattern = open([x, y]);

var env = unify(pattern, [1, 2, 3, 4, 5]);
console.log("x is", x.get(env)); // x is 1
console.log("y is", y.get(env)); // y is 2

More JavaScript specifics

unify() knows how to compare other standard objects common in JavaScript: dates, and regular expressions.

It handles all primitive types by using === to compare them for equality. Of special note:

  • Exception: NaN is unified only with another NaN bypassing === check.
  • Just like JavaScript, unify() treats +0 and -0 as 0 ignoring a sign.
1
2
3
4
5
6
7
8
NaN === NaN;        // false
unify(NaN, NaN);    // truthy

+0 === -0;          // true
unify(+0, -0);      // truthy
1/+0 === 1/-0;      // false
1/+0 === Infinity;  // true
1/-0 === -Infinity; // true

Summary

Major takeaways for those who read this far:

  • Unification is an important tool for pattern matching, which can greatly simplify software development:
    • Patterns are represented as object literals in an expected shape making the whole process of specifying what we want natural, and obvious.
    • Using variables we can match a pattern, and save its important parts for future inspections.
    • We can check for repeating patterns by reusing variables.
  • We can execute several unifications in a row reusing already known bindings of variables by reusing environments.
  • We can specify “open” object patterns to match only important properties.

So far our data structures were essentially JSON-like: primitive values, naked objects, arrays. How about custom objects, which may define their own algorithms of unification? An upcoming post will cover several important topics including how to deal with custom objects. It will introduce a set of powerful helpers, which are provided by heya-unify.

Unification posts

Update on 6/5/2014: all installments will be posted in this list as soon as they go online:

Thank you for your help and suggestions!

Attributions

This post uses image by Prabhu B Doss under Creative Commons License.

Comments