Friday, June 16, 2017

Co-hygiene and quantum gravity

[l'Universo] è scritto in lingua matematica
([The Universe] is written in the language of mathematics)
— Galileo Galilei, Il Saggiatore (The Assayer), 1623.

Here's another installment in my ongoing exploration of exotic ways to structure a theory of basic physics.  In our last exciting episode, I backtraced a baffling structural similarity between term-rewriting calculi and basic physics to a term-rewriting property I dubbed co-hygiene.  This time, I'll consider what this particular vein of theory would imply about the big-picture structure of a theory of physics.  For starters, I'll suggest it would imply, if fruitful, that quantum gravity is likely to be ultimately unfruitful and, moreover, quantum mechanics ought to be less foundational than it has been taken to be.  The post continues on from there much further than, candidly, I had expected it to; by the end of this installment my immediate focus will be distinctly shifting toward relativity.

To be perfectly clear:  I am not suggesting anyone should stop pursuing quantum gravity, nor anything else for that matter.  I want to expand the range of theories explored, not contract it.  I broadly diagnose basic physics as having fallen into a fundamental rut of thinking, that is, assuming something deeply structural about the subject that ought not to be assumed; and since my indirect evidence for this diagnosis doesn't tell me what that deep structural assumption is, I want to devise a range of mind-bendingly different ways to structure theories of physics, to reduce the likelihood that any structural choice would be made through mere failure to imagine an alternative.

The structural similarity I've been pursuing analogizes between, on one side, the contrast of pure function-application with side-effect-ful operations in term-rewriting calculi; and on the other side, the contrast of gravity with the other fundamental forces in physics.  Gravity corresponds to pure function-application, and the other fundamental forces correspond to side-effects.  In the earlier co-hygiene post I considered what this analogy might imply about nondeterminism in physics, and I'd thought my next post in the series would be about whether or not it's even mathematically possible to derive the quantum variety of nondeterminism from the sort of physical structure indicated.  Just lately, though, I've realized there may be more to draw from the analogy by considering first what it implies about non-locality, folding in nondeterminism later.  The starting point is the observation that if quantum non-locality ("spooky action at a distance") is part of the analog to side-effects, then gravity should be outside the entanglement framework, implying both that quantum gravity would be a non-starter, and that quantum mechanics, which is routinely interpreted to act directly from the foundation of reality by shaping the spectrum of alternative versions of the entire universe, would have to be happening at a less fundamental level than the one where gravity differs from the other forces.

On my way to new material here, I'll start with material mostly revisited from the earlier post, where it was mixed in with a great deal of other material; here it will be more concentrated, with a different emphasis and perhaps some extra elements leading to additional inferences.  As for the earlier material that isn't revisited here — I'm very glad it's there.  This is, deliberately, paradigm-bending stuff, where different parts don't belong to the same conceptual framework and can't easily be held in the mind all at once; so if I hadn't written down all that intermediate thinking at the time, with its nuances and tangents, I don't think I could recapture it all later.  I'll continue here my policy of capturing the journey, with its intermediate thoughts and their nuances and tangents.

Until I started describing λ-calculus here in earnest, it hadn't registered on me that it would be a major section of the post.  Turns out, though, my perception of λ-calculus has been profoundly transformed by the infusion of perspective from physics; so I found myself going back to revisit basic principles that I would have skipped lightly over twenty years ago, and perhaps even two years ago.  It remains to be seen whether developments later in this post will sufficiently alter my perspective to provoke yet another recasting of λ-calculus in some future post.

Contents
Side-effects
Variables
Side-effect-ful variables
Quantum scope
Geometry and network
Cosmic structure
Side-effects

There were three main notions of computability in the 1930s, later proved equi-powerful, an equivalence that underpins the Church-Turing thesis:  general recursive functions, λ-calculus, and Turing machines (due respectively to Jacques Herbrand and Kurt Gödel, to Alonzo Church, and to Alan Turing).  General recursive functions are broadly equational in style, λ-calculus is stylistically more applicative; both are purely functional.  Turing machines, on the other hand, are explicitly imperative.  Gödel apparently lacked confidence in the purely functional approaches as notions of mechanical calculability, though Church was more confident, until the purely functional approaches were proven equivalent to Turing machines; which to me makes sense as a matter of concreteness.  (There's some discussion of the history in a paper by Solomon Feferman; pdf.)

This mismatch between abstract elegance and concrete straightforwardness was an early obstacle, in the 1960s, to applying λ-calculus to programming-language semantics.  Gordon Plotkin found a schematic solution strategy for the mismatch in his 1975 paper "Call-by-name, call-by-value and the λ-calculus" (pdf); one sets up two formal systems, one a calculus with abstract elegance akin to λ-calculus, the other an operational semantics with concrete clarity akin to Turing machines, then proves well-behavedness theorems for the calculus and correspondence theorems between the calculus and operational semantics.  The well-behavedness of the calculus allows us to reason conveniently about program behavior, while the concreteness of the operational semantics allows us to be certain we are really reasoning about what we intend to.  For the whole arrangement to work, we need to find a calculus that is fully well-behaved while matching the behavior of the operational semantics we want so that the correspondence theorems can be established.

Plotkin's 1975 paper modified λ-calculus to match the behavior of eager argument evaluation; he devised a call-by-value λv-calculus, with all the requisite theorems.  The behavior was, however, still purely functional, i.e., without side-effects.  Traditional mathematics doesn't incorporate side-effects.  There was (if you think about it) no need for traditional mathematics to explicitly incorporate side-effects, because the practice of traditional mathematics was already awash in side-effects.  Mutable state:  mathematicians wrote down what they were doing; and they changed their own mental state and each others'.  Non-local control-flow (aka "goto"s):  mathematicians made intuitive leaps, and the measure of proof was understandability by other sapient mathematicians rather than conformance to some purely hierarchical ordering.  The formulae themselves didn't contain side-effects because they didn't have to.  Computer programs, though, have to explicitly encompass all these contextual factors that the mathematician implicitly provided to traditional mathematics.  Programs are usually side-effect-ful.

In the 1980s Matthias Felleisen devised λ-like calculi to capture side-effect-ful behavior.  At the time, though, he didn't quite manage the entire suite of theorems that Plotkin's paradigm had called for.  Somewhere, something had to be compromised.  In the first published form of Felleisen's calculi, he slightly weakened the well-behavedness theorems for the calculus.  In another published variant he achieved full elegance for the calculus but slightly weakened the correspondence theorems between the calculus and the operational semantics.  In yet another published variant he slightly modified the behavior — in operational semantics as well as calculus — to something he was able to reconcile without compromising the strength of the various theorems.  This, then, is where I came into the picture:  given Felleisen's solution and a fresh perspective (each generation knows a little less about what can't be done than the generation before), I thought I saw a way to capture the unmodified side-effect-ful behavior without weakening any of the theorems.  Eventually I seized an opportunity to explore the insight, when I was writing my dissertation on a nearby topic.  To explain where my approach fits in, I need to go back and pick up another thread:  the treatment of variables in λ-calculus.

Variables

Alonzo Church also apparently seized an opportunity to explore an insight when doing research on a nearby topic.  The main line of his research was to see if one could banish the paradoxes of classical logic by developing a formal logic that weakens reductio ad absurdum — instead of eliminating the law of the excluded middle, which was a favored approach to the problem.  But when he published the logic, in 1932, he mentioned reductio ad absurdum in the first paragraph and then spent the next several paragraphs ranting about the evils of unbound variables.  One gathers he wanted everything to be perfectly clear, and unbound variables offended his sense of philosophical precision.  His logic had just one possible semantics for a variable, namely, a parameter to be supplied to a function; he avoided the need for any alternative notions of universally or existentially quantified variables, by the (imho quite lovely) device of using higher-order functions for quantification.  That is (since I've brought it up), existential quantifier Σ applied to function F would produce a proposition ΣF meaning that there is some true proposition FX, and universal quantifier Π applied to F, proposition ΠF meaning that every proposition FX is true.  In essence, he showed that these quantifiers are orthogonal to variable-binding; leaving him with only a single variable-binding device, which, for some reason lost to history, he called "λ".

λ-calculus is formally a term-rewriting calculus; a set of terms together with a set of rules for rewriting a term to produce another term.  The two basic well-behavedness properties that a term-rewriting calculus generally ought to have are compatibility and Church-Rosser-ness.  Compatibility says that if a term can be rewritten when it's a standalone term, it can also be rewritten when it's a subterm of a larger term.  Church-Rosser-ness says that if a term can be rewritten in two different ways, then the difference between the two results can always be eliminated by some further rewriting.  Church-Rosser-ness is another way of saying that rewriting can be thought of as a directed process toward an answer, which is characteristic of calculi.  Philosophically, one might be tempted to ask why the various paths of rewriting ought to reconverge later, but this puzzlement follows from thinking of the terms as the underlying reality.  If the terms merely describe the reality, and the rewriting lets us reason about its development, then the term syntax is just a way for us to separately describe different parts of the reality, and compatibility and Church-Rosser-ness are just statements about our ability (via this system) to reason separately about different aspects of the development at different parts of the reality without distorting our eventual conclusion about where the whole development is going.  From that perspective, Church-Rosser-ness is about separability, and convergence is just the form in which the separability appears in the calculus.

The syntax of λ-calculus — which particularly clearly illustrates these principles — is

T   ::=   x | (TT) | (λx.T)  .
That is, a term is either a variable; or a combination, specifying that a function is applied to an operand; or a λ-expression, defining a function of one parameter.  The T in (λx.T) is the body of the function, x its parameter, and free occurrences of x in T are bound by this λ.  An occurrence of x in T is free if it doesn't occur inside a smaller context (λx.[ ]) within T.  This connection between a λ and the variable instances it binds is structural.  Here, for example, is a term involving variables x, y, and z, annotated with pointers to a particular binding λ and its variable instances:
((λx.((λy.((λx.(xz))(xy)))(xz)))(xy))  .
  ^^                 ^     ^
The x instance in the trailing (xy) is not bound by this λ since it is outside the binding expression.  The x instance in the innermost (xz) is not bound by this λ either, since it is captured by the inner λx within the body.  I suggest that the three marked elements — binder and two bound instances — should be thought of together as the syntactic representation of a deeper, distributed entity that connects distant elements of the term.
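
Purely as an aid to intuition, here is how this term structure, and the free/bound distinction, might be rendered in Haskell.  This is my own sketch, not part of any of the formal systems under discussion; the names Term and freeVars are mine.

import Data.Set (Set)
import qualified Data.Set as Set

-- Syntax trees of λ-calculus:  T ::= x | (TT) | (λx.T)
data Term = Var String        -- a variable instance
          | App Term Term     -- a combination (function applied to operand)
          | Lam String Term   -- a λ-expression binding its parameter
  deriving (Eq, Show)

-- Free variables of a term: instances not bound by any enclosing λ.
freeVars :: Term -> Set String
freeVars (Var x)   = Set.singleton x
freeVars (App t u) = Set.union (freeVars t) (freeVars u)
freeVars (Lam x t) = Set.delete x (freeVars t)

Applied to the whole annotated term above, freeVars yields {x, y, z}:  x and y from the trailing (xy), and z because no λz ever occurs.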

There is just one rewriting rule — one of the fascinations of this calculus, that just one rule suffices for all computation — called the β-rule:

((λx.T1)T2)   →   T1[x ← T2]   .
The left-hand side of this rule is the redex pattern (redex short for reducible expression); it specifies a local pattern in the syntax tree of the term.  Here the redex pattern is that some particular parent node in the syntax tree is a combination whose left-hand child is a λ-expression.  Remember, this rewriting relation is compatible, so the parent node doesn't have to be the root of the entire tree.  It's important that this local pattern in the syntax tree includes a variable binder λ, thus engaging not only a local region of the syntax tree, but also a specific distributed structure in the network of non-local connections across the tree.  Following my earlier post, I'll call the syntax tree the "geometry" of the term, and the totality of the non-local connections its "network topology".
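
A small worked example of my own may help fix both properties at once; it uses only identity functions, so the substitutions are intuitively obvious even though substitution isn't formally defined until below.  The term ((λx.x)((λy.y)z)) contains two redexes, one a proper subterm of the other; compatibility licenses rewriting either one first, and Church-Rosser-ness guarantees the two orders reconverge:

((λx.x)((λy.y)z))   →   ((λx.x)z)   →   z      (inner redex first)
((λx.x)((λy.y)z))   →   ((λy.y)z)   →   z      (outer redex first)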

The right-hand side of the rule specifies replacement by substituting the operand T2 for the parameter x everywhere it occurs free in the body T1; but there's a catch.  One might, naively, imagine that this would be recursively defined as

x[x ← T]   =   T
x1[x2 ← T]   =   x1   if x1 isn't x2

(T1 T2)[x ← T]   =   (T1[x ← T] T2[x ← T])

(λx.T1)[x ← T2]   =   (λx.T1)
(λx1.T1)[x2 ← T2]   =   (λx1.T1[x2 ← T2])   if x1 isn't x2.
This definition just descends the syntax tree substituting for the variable, and stops if it hits a λ that binds the same variable; very straightforward, and only a little tedious.  Except that it doesn't work.  Most of it does; but there's a subtle error in the rule for descending through a λ that binds a different variable,
(λx1.T1)[x2 ← T2]   =   (λx1.T1[x2 ← T2])   if x1 isn't x2.
The trouble is, what if T1 contains a free occurrence of x2 and, at the same time, T2 contains a free instance of x1?  Then, before the substitution, that free instance of x1 was part of some larger distributed structure; that is, it was bound by some λ further up in the syntax tree; but after the substitution, following this naive definition of substitution, a copy of T2 is embedded within T1 with an instance of x1 that has been cut off from the larger distributed structure and instead bound by the inner λx1, essentially altering the sense of syntactic template T2.  The inner λx1 is then said to capture the free x1 in T2, and the resulting loss of integrity of the meaning of T2 is called bad hygiene (or, a hygiene violation).  For example,
((λy.(λx.y))x)   ⇒β   (λx.y)[y ← x]
but under the naive definition of substitution, this would be (λx.x), because of the coincidence that the x we're substituting for y happens to have the same name as the bound variable of this inner λ.  If the inner variable had been named anything else (other than y) there would have been no problem.  The "right" answer here is a term of the form (λz.x), where any variable name could be used instead of z as long as it isn't "x" or "y".  The standard solution is to introduce a rule for renaming bound variables (called α-renaming), and restrict the substitution rule to require that hygiene be arranged beforehand.  That is,
(λx1.T)   →   (λx2.T[x1 ← x2])   where x2 doesn't occur free in T

(λx1.T1)[x2 ← T2]   =   (λx1.T1[x2 ← T2])   if x1 isn't x2 and doesn't occur free in T2.
Here again, this may be puzzling if one thinks of the syntax as the underlying reality.  If the distributed structures of the network topology are the reality, which the syntax merely describes, then α-renaming is merely an artifact of the means of description; indeed, the variable-names themselves are merely an artifact of the means of description.
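
To make the hygiene machinery concrete, here is a capture-avoiding substitution in the same Haskell sketch style as before.  Again this is my own illustration; the crude freshName generator stands in for whatever α-renaming discipline one prefers.

import Data.Set (Set)
import qualified Data.Set as Set

data Term = Var String | App Term Term | Lam String Term
  deriving (Eq, Show)

freeVars :: Term -> Set String
freeVars (Var x)   = Set.singleton x
freeVars (App t u) = Set.union (freeVars t) (freeVars u)
freeVars (Lam x t) = Set.delete x (freeVars t)

-- T1[x <- T2], α-renaming on demand so no free variable of T2 is captured.
subst :: Term -> String -> Term -> Term
subst (Var y) x t2
  | y == x    = t2                           -- x[x <- T] = T
  | otherwise = Var y                        -- y[x <- T] = y
subst (App t u) x t2 = App (subst t x t2) (subst u x t2)
subst (Lam y t) x t2
  | y == x    = Lam y t                      -- λ rebinds x: stop descending
  | Set.member y (freeVars t2) =             -- λy would capture a free y in T2
      let avoid = Set.insert x (Set.union (freeVars t) (freeVars t2))
          y'    = freshName y avoid
      in  Lam y' (subst (subst t y (Var y')) x t2)   -- α-rename, then descend
  | otherwise = Lam y (subst t x t2)         -- safe to descend

-- Prime the name until it clashes with nothing we need to protect.
freshName :: String -> Set String -> String
freshName y avoid
  | Set.member y avoid = freshName (y ++ "'") avoid
  | otherwise          = y

-- The β-rule applied at the root of a term, when the redex pattern matches.
beta :: Term -> Maybe Term
beta (App (Lam x t1) t2) = Just (subst t1 x t2)
beta _                   = Nothing

On the example above, beta (App (Lam "y" (Lam "x" (Var "y"))) (Var "x")) yields Just (Lam "x'" (Var "x")), an instance of the (λz.x) family of "right" answers.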

Side-effect-ful variables

Suppose we want to capture classical side-effect-ful behavior, unmodified, without weakening any of the theorems of Plotkin's paradigm.  Side-effects are by nature distributed across the term, and would therefore seem to belong naturally to its network topology.  In Felleisen's basic calculus, retaining the classical behavior and requiring the full correspondence theorems, side-effect-ful operations create syntactic markers that then "bubble up" through the syntax tree till they reach the top of the term, from which the global consequence of the side-effect is enacted by a whole-term-rewriting rule — thus violating compatibility, since the culminating rule is by nature applied to the whole term rather than to a subterm.  This strategy seems, in retrospect, to be somewhat limited by an (understandable) inclination to conform to the style of variable handling in λ-calculus, whose sole binding device is tied to function application at a specific location in the geometry.  Alternatively (as I seized the opportunity to explore in my dissertation), one can avoid the non-compatible whole-term rules by making the syntactic marker, which bubbles up through the term, a variable-binder.  These side-effect-ful bindings are no longer strongly tied to a particular location in the geometry; they float, potentially to the top of the term, or may linger further down in the tree if the side-effect happens to only affect a limited region of the geometry.  But the full classical behavior (in the cases Felleisen addressed) is captured, and Plotkin's entire suite of theorems are supported.

The calculus in which I implemented this side-effect strategy (along with some other things, that were the actual point of the dissertation but don't apparently matter here) is called vau-calculus.

Recall that the β-rule of λ-calculus applies to a redex pattern at a specific location in the geometry, and requires a binder to occur there so that it can also tie in to a specific element of the network topology.  The same is true of the side-effect-ful rules of the calculus I constructed:  a redex pattern occurs at a specific location in the geometry with a local tie-in to the network topology.  There may then be a substitutive operation on the right-hand side of the rule, which uses the associated element of the network topology to propagate side-effect-ful consequences back down the syntax tree to the entire encompassed subterm.  There is a qualitative difference, though, between the traditional substitution of the β-rule and the substitutions of the side-effect-ful operations.  A traditional substitution T1[x ← T2] may attach new T2 subtrees at certain leaves of the T1 syntax tree (free instances of x in T1), but does not disturb any of the pre-existing tree structure of T1.  Consequently, the only effect of the β-rule on the pre-existing geometry is the rearrangement it does within the redex pattern.  This is symmetric to the hygiene property, which assures (by active intervention if necessary, via α-renaming) that the only effect of the β-rule on the pre-existing network topology is what it does to the variable element whose binding is within the redex pattern.  I've therefore called this geometry-non-disturbance property co-hygiene.  As long as β-substitution is the only variable substitution used, co-hygiene is an easily overlooked property of the β-rule since, unlike hygiene, it does not require any active intervention to maintain.

The substitutions used by the side-effect-ful rewriting operations go to the same α-renaming lengths as the β-rule to assure hygiene.  However, the side-effect-ful substitutions are non-co-hygienic.  This might, arguably, be used as a technical definition of side-effects, which cause distributed changes to the pre-existing geometry of the term.

Quantum scope

Because co-hygiene is about not perturbing pre-existing geometry, it seems reasonable that co-hygienic rewriting operations should be more in harmony with the geometry than non-co-hygienic rewriting operations.  Thus, β-rewriting should be more in harmony with the geometry of the term than the side-effect-ful operations; which, subjectively, does appear to be the case.  (The property that first drew my attention to all this was that α-renaming, which is geometrically neutral, is a special case of β-substitution, whereas the side-effect-ful substitutions are structurally disparate from α-renaming.)

And gravity is more in harmony with the geometry of spacetime than are the other fundamental forces; witness general relativity.

Hence my speculation, by analogy, that one might usefully structure a theory of basic physics such that gravity is co-hygienic while the other fundamental forces are non-co-hygienic.

One implication of this line of speculation (as I noted in the earlier post) would be fruitlessness of efforts to unify the other fundamental forces with gravity by integrating them into the geometry of spacetime.  If the other forces are non-co-hygienic, their non-affinity with geometry is structural, and trying to treat them in a more gravity-like way would be like trying to treat side-effect-ful behavior as structurally akin to function-application in λ-calculus — which I have long reckoned was the structural miscue that prevented Felleisen's calculus from supporting the full set of well-behavedness theorems.

On further consideration, though, something more may be suggested; even as the other forces might not integrate into the geometry of spacetime, gravity might not integrate into the infrastructure of quantum mechanics.  All this has to do with the network topology, a non-local infrastructure that exists even in pure λ-calculus, but which in the side-effect-ful vau-calculus achieves what one might be tempted to call "spooky action at a distance".  Suppose that quantum entanglement is part of this non-co-hygienic aspect of the theory.  (Perhaps quantum entanglement would be the whole of the non-co-hygienic aspect, or, as I discussed in the earlier post, perhaps there would be other, non-quantum non-locality with interesting consequences at cosmological scale; then again, one might wonder if quantum entanglement would itself have consequences at cosmological scale that we have failed to anticipate because the math is beyond us.)  It would follow that gravity would not exhibit quantum entanglement.  On one hand, this would imply that quantum gravity should not work well as a natural unification strategy.  On the other hand, to make this approach work, something rather drastic must happen to the underpinnings of quantum mechanics, both philosophical and technical.

We understand quantum mechanics as describing the shape of a spectrum of different possible realities; from a technical perspective that is what quantum mechanics describes, even if one doesn't accept it as a philosophical interpretation (and many do accept that interpretation, if only on grounds of Occam's Razor that there's no reason to suppose philosophically some other foundation than is supported technically).  But shaped spectra of alternative versions of the entire universe seem reminiscent of whole-term rewriting in Felleisen's calculus — which was, notably, a consequence of a structural design choice in the calculus that actually weakened the internal symmetry of the system.  The alternative strategy of vau-calculus both had a more uniform infrastructure and avoided the non-compatible whole-term rewriting rules.  An analogous theory of basic physics ought to account for quantum entanglement without requiring wholesale branching of alternative universes.  Put another way, if gravity isn't included in quantum entanglement, and therefore has to diverge from the other forces at a level more basic than the level where quantum entanglement arises, then the level at which quantum entanglement arises cannot be the most basic level.

Just because quantum structure would not be at the deepest level of physics, does not at all suggest that what lies beneath it must be remotely classical.  Quantum mechanics is mathematically a sort of lens that distorts whatever classical system is passed through it; taking the Schrödinger equation as demonstrative,

iℏ ∂Ψ/∂t   =   Ĥ Ψ ,
the classical system is contained in the Hamiltonian function Ĥ, which is plugged into the equation to produce a suitable spectrum of alternatives.  Hence my description of the quantum equation itself as basic.  But, following the vau-calculus analogy, it seems some sort of internal non-locality ought to be basic, as it follows from the existence of the network topology; looking at vau-calculus, even the β-rule fully engages the network topology, though co-hygienically.

Geometry and network

The above insights on the physical theory itself are mostly negative, indicating what this sort of theory of physics would not be like, what characteristics of conventional quantum math it would not have.  What sort of structure would it have?

I'm not looking for detailed math, just yet, but the overall shape into which the details would be cast.  Some detailed math will be needed, before things go much further, to demonstrate that the proposed approach is capable of generating predictions sufficiently consistent with quantum mechanics, keeping in mind the well-known no-go result of Bell's Theorem.  I'm aware of the need; the question, though, is not whether Bell's Theorem can be sidestepped — of course it can, like any other no-go theorem, by blatantly violating one of its premises — but whether it can be sidestepped by a certain kind of theory.  So the structure of the theory is part of the possibility question, and needs to be settled before we can ask the question properly.

In fact, one of my concerns for this sort of theory is that it might have too many ways to get around Bell's Theorem.  Occam's Razor would not look favorably on a theory with redundant Bell-avoidance devices.

Let's now set aside locality for a moment, and consider nondeterminism.  Bell's Theorem calls (in combination with some experimental results that are, somewhat inevitably, argued over) for chronological nondeterminism, that is, nondeterminism relative to the time evolution of the physical system.  One might, speculatively, be able to approximate that sort of nondeterminism arbitrarily well, in a fundamentally non-local theory, by exploiting the assumption that the physical system under consideration is trivially small relative to the whole cosmos.  We might be able to draw on interactions with distant elements of the cosmos to provide a more-or-less "endless" supply of pseudo-randomness.  I considered this possibility in the earlier post on co-hygiene, and it is an interesting theoretical question whether (or, at the very least, how) a theory of this sort could in fact generate the sort of quantum probability distribution that, according to Bell's Theorem, cannot be generated by a chronologically deterministic local theory.  The sort of theory I'm describing, however, is merely a way to provide a local illusion of nondeterminism in a non-local theory with global determinism — and when we're talking chronology, it is difficult even to define global determinism (because, thanks to relativity, "time" is tricky to define even locally; made even trickier since we're now contemplating a theory lacking the sort of continuity that relativity relies upon; and is likely impossible to define globally, thanks to relativity's deep locality).  It's also no longer clear why one should expect chronological determinism at all.

A more straightforward solution, seemingly therefore favored by Occam's Razor, is to give up on chronological determinism and instead acquire mathematical determinism, by the arguably "obvious" strategy of supposing that the whole of spacetime evolves deterministically along an orthogonal dimension, converting unknown initial conditions (initial in the orthogonal dimension) into chronological nondeterminism.  I demonstrated the principle of this approach in an earlier post.  It is a bit over-powered, though; a mathematically deterministic theory of this sort — moreover, a mathematically deterministic and mathematically local theory of this sort — can readily generate not only a quantum probability distribution of the sort considered by Bell's Theorem, but, on the face of it, any probability distribution you like.  This sort of excessive power would seem rather disfavored by Occam's Razor.

The approach does, however, seem well-suited to a co-hygiene-directed theory.  Church-Rosser-ness implies that term rewriting should be treated as reasoning rather than directly as chronological evolution, which seemingly puts term rewriting on a dimension orthogonal to spacetime.  The earlier co-hygiene post noted that calculi, which converge to an answer via Church-Rosser-ness, contrast with grammars, which are also term-rewriting systems but exist for the purpose of diverging and are thus naturally allied with mathematical nondeterminism whereas calculi naturally ally with mathematical determinism.  So our desire to exploit the calculus/physics analogy, together with our desire for abstract separability of parts, seems to favor this use of a rewriting dimension orthogonal to spacetime.

A puzzle then arises about the notion of mathematical locality.  When the rewriting relation, through this orthogonal dimension (which I used to call "meta-time", though now that we're associating it with reasoning some other name is wanted), changes spacetime, there's no need for the change to be non-local.  We can apparently generate any sort of physical laws, quantum or otherwise, without the need for more than strictly local rewrite rules; so, again by Occam's Razor, why would we need to suppose a whole elaborate non-local "network topology"?  A strictly local rewriting rule sounds much simpler.

Consider, though, what we mean by locality.  Both nondeterminism and locality must be understood relative to a dimension of change, thus "chronological nondeterminism"; but to be thorough in defining locality we also need a notion of what it means for two elements of a system state to be near each other.  "Yes, yes," you may say, "but we have an obvious notion of nearness, provided by the geometry of spacetime."  Perhaps; but then again, we're now deep enough in the infrastructure that we might expect the geometry of spacetime to emerge from something deeper.  So, what is the essence of the geometry/network distinction in vau-calculus?

A λ-calculus term is a syntax tree — a graph, made up of nodes connected to each other by edges that, in this case, define the potential function-application relationships.  That is, the whole purpose of the context-free syntax is to define where the interactions — the redex patterns for applying the β-rule — are.  One might plausibly say much the same for the geometry of spacetime re gravity, i.e., location in spacetime defines the potential gravitational interactions.  The spacetime geometry is not, evidently, hierarchical like that of λ-calculus terms; that hierarchy is apparently a part of the function-application concept.  Without the hierarchy, there is no obvious opportunity for a direct physical analog to the property of compatibility in term-rewriting calculi.

The network topology, i.e., the variables, provides another set of connections between nodes of the graph.  These groups of connections are less uniform, and the variations between them do not participate in the redex patterns, but are merely tangential to the redex patterns, thus cuing the engagement of a variable structure in a rewriting transformation.  In vau-calculi the variable is always engaged in the redex through its binding, but this is done for compatibility; by guaranteeing that all the variable instances occur below the binding in the syntax tree, the rewriting transformation can be limited to that branch of the tree.  Indeed, only the λ bindings really have a fixed place in the geometry, dictated by the role of the variable in the syntactically located function application; side-effect-ful bindings float rather freely, and their movement through the tree really makes no difference to the function-application structure as long as they stay far enough up in the tree to encompass all their matching variable instances.  If not for the convenience of tying these bindings onto the tree, one might represent them as partly or entirely separate from the tree (depending on which kind of side-effect one is considering), tethered to the tree mostly by the connections to the bound variable instances.  The redex pattern, embedded within the geometry, would presumably be at a variable instance.  Arranging for Church-Rosser-ness would, one supposes, be rather more challenging without compatibility.

Interestingly, btw, of the two classes of side-effects considered by vau-calculus (and by Felleisen), this separation of bindings from the syntax tree is more complete for sequential-state side-effects than for sequential-control side-effects — and sequential control is much more simply handled in vau-calculus than is sequential state.  I'm still wondering if there's some abstract principle here that could relate to the differences between various non-gravitational forces in physics, such as the simplicity of Maxwell's equations for electromagnetism.

This notion of a binding node for a variable hovering outside the geometry, tethered more-or-less-loosely to it by connections to variable instances, has a certain vague similarity to the aggressive non-locality of quantum wave functions.  The form of the wave function would, perhaps, be determined by a mix of the nature of the connections to the geometry together with some sort of blurring effect resulting from a poor choice of representing structures; the hope would be that a better choice of representation would afford a more focused description.

I've now identified, for vau-calculus, three structural differences between the geometry and the network.

  • The geometry contains the redex patterns (with perhaps some exotic exceptions).
  • The geometric topology is much simpler and more uniform than the network topology.
  • The network topology is treated hygienically by all rewriting transformations, whereas the geometry is treated co-hygienically only by one class of rewriting transformations (β).
But which of these three do we expect to carry over to physics?

The three major classes of rewriting operations in vau-calculus — function application, sequential control, and sequential state — all involve some information in the term that directs the rewrite and therefore belongs in the redex pattern.  All three classes of operations involve distributing information to all the instances of the engaged variable.  But, the three classes differ in how closely this directing information is tied to the geometry.

For function application, the directing information is entirely contained in the geometry, the redex pattern of the β-rule, ((λx.T1)T2).  The only information about the variable not contained within that purely geometric redex pattern is the locations of the bound instances.

For sequential control, the variable binder is a catch expression, and the bound variable instances are throw expressions that send a value up to the matching catch.  (I examined this case in detail in an earlier post.)  The directing information contained in the variable, beyond the locations of the bound instances, would seem to be the location of the catch; but in fact the catch can move, floating upward in the syntax tree, though moving the catch involves a non-co-hygienic substitutive transformation — in fact, the only non-co-hygienic transformation for sequential control.  So the directing information is still partly tied to the syntactic structure (and this tie is somehow related to the non-co-hygiene).  The catch-throw device is explicitly hierarchical, which would not carry over directly to physics; but this may be only a consequence of its relation to the function-application structure, which does carry over (in the broad sense of spacetime geometry).  There may yet be more to make of a side analogy between vau-calculus catch-throw and Maxwell's Equations.

For sequential state, the directing information is a full-blown environment, a mapping from symbols to values, with arbitrarily extensive information content and very little relation to geometric location.  The calculus rewrite makes limited use of the syntactic hierarchy to coordinate time ordering of assignments — not so much inherently hierarchical as inherently tied to the time sequencing of function applications, which itself happens to be hierarchical — but this geometric connection is even weaker than for catch-throw, and its linkage to time ordering is more apparent.  In correspondence with the weaker geometric ties, the supporting rewrite rules are much more complicated, as they moderate passage of information into and out of the mapping repository.

"Time ordering" here really does refer to time in broadly the same sense that it would arise in physics, not to rewriting order as such.  That is, it is the chronological ordering of events in the programming language described by the rewriting system, analogous to the chronological ordering of events described by a theory of physics.  Order of rewriting is in part related to described chronology, although details of the relationship would likely be quite different for physics where it's to do with relativity.  This distinction is confusing even in term-rewriting PL semantics, where PL time is strictly classical; one might argue that confusion between rewriting, which is essentially reasoning, and evaluation, which is the PL process reasoned about, resulted in the unfortunately misleading "theory of fexprs is trivial" result which I have discussed here previously.

It's an interesting insight that, while part of the use of syntactic hierarchy in sequential control/state — and even in function application, really — is about compatibility, which afaics does not at all carry over to physics, their remaining use of syntactic hierarchy is really about coordination of time sequencing, which does occur in physics in the form of relativity.  Admittedly, in this sort of speculative exploration of possible theories for physics, I find the prospect of tinkering with the infrastructure of quantum mechanics not nearly as daunting as tinkering with the infrastructure of relativity.

At any rate, the fact that vau-calculus puts the redex pattern (almost always) entirely within a localized area of the syntax, would seem to be more a statement about the way the information is represented than about the geometry/network balance.  That is, vau-calculus represents the entire state of the system by a syntactic term, so each item of information has to be given a specific location in the term, even if that location is chosen somewhat arbitrarily.  It is then convenient, for time ordering, to require that all the information needed for a transformation should get together in a particular area of the term.  Quantum mechanics may suffer from a similar problem, in a more advanced form, as some of the information in a wave function may be less tied to the geometry than the equations (e.g. the Schrödinger equation) depict it.  What really makes things messy is devices that are related to the geometry but less tightly so than the primary, co-hygienic device.  Perhaps that is the ultimate trade-off, with differently structured devices becoming more loosely coupled to the geometry and proportionately less co-hygienic.

All of which has followed from considering the first of three geometry/network asymmetries:  that redex patterns are mostly contained in the geometry rather than the network.  The other two asymmetries noted were  (1) that the geometric structure is simple and uniform while the network structure is not, and  (2) that the network is protected from perturbation while the geometry is not — i.e., the operations are all hygienic (protecting the network) but not all are co-hygienic (protecting the geometry).  Non-co-hygiene complicates things only moderately, because the perturbations are to the simple, uniform part of the system configuration; all of the operations are hygienic, so they don't perturb the complicated, nonuniform part of the configuration.  Which is fortunate for mathematical treatment; if the perturbations were to the messy stuff, it seems we mightn't be able to cope mathematically at all.  So these two asymmetries go together.  In my more cynical moments, this seems like wishful thinking; why should the physical world be so cooperative?  However, perhaps they should be properly understood as two aspects of a single effect, itself a kind of separability, the same view I've recommended for Church-Rosser-ness; in fact, Church-Rosser-ness may be another aspect of the same whole.  The essential point is that we are able to usefully consider individual parts of the cosmos even though they're all interconnected, because there are limits on how aggressively the interconnectedness is exercised.  The "geometry" is the simple, uniform way of decomposing the whole into parts, and "hygiene" is an assertion that this decomposition suffices to keep things tractable.  It's still fair to question why the cosmos should be separable in this way, and even to try to build a theory of physics in which the separation breaks down; but there may be some reassurance, re Occam's Razor, in the thought that these two asymmetries (simplicity/uniformity, and hygiene) are two aspects of a single serendipitous effect, rather than two independently serendipitous effects.

Cosmic structure

Most of these threads are pointing toward a rewriting relation along a dimension orthogonal to spacetime, though we're lacking a good name for it atm (I tend to want to name things early in the development process, though I'm open to change if a better name comes along).

One thread, mentioned above, that seems at least partly indifferent to the rewriting question, is that of changes in the character of quantum mechanics at cosmological scale.  This relates to the notion of decoherence.  It was recognized early in the conceptualization of quantum mechanics that a very small entangled quantum system would tend to interact with the rest of the universe and thereby lose its entanglement and, ultimately, become more classical.  We can only handle the quantum math for very small physical systems; in fact, rather insanely small physical systems.  Intuitively, what if this tendency of entanglement to evaporate when interacting with the rest of the universe ceases to be valid when the size of the physical system is sufficiently nontrivial compared to the size of the whole universe?  In traditional quantum mechanics, decoherence appears to be an all-or-nothing proposition, a strict dichotomy tied to the concept of observation.  If something else is going on at large scales, either it is an unanticipated implication of the math-that-we-can't-do, or it is an aspect of the physics that our quantum math doesn't include because the phenomena that would cause us to confront this aspect are many orders of magnitude outside anything we could possibly apply the quantum math to.  It's tantalizing that this conjures both the problem of observation, and the possibility that quantum mechanics may be (like Newtonian mechanics) only an approximation that's very good within its realm of application.

The persistently awkward interplay of the continuous and discrete is a theme I've visited before.  Relativity appears to have too stiff a dose of continuity in it, creating a self-reference problem even in the non-quantum case (iirc Einstein had doubts on this point before convincing himself the math of general relativity could be made to work); and when non-local effects are introduced for the quantum case, continuity becomes overconstraining.  Quantum gravity efforts suffer from a self-reference problem on steroids (non-renormalizable infinities).  The Big Picture perspective here is that non-locality and discontinuity go together because a continuum — as simple and uniform as it is possible to be — is always going to be perceived as geometry.

The non-local network in vau-calculus appears to be inherently discrete, based on completely arbitrary point-to-point connections defined by location of variable instances, with no obvious way to set up any remotely similar continuous arrangement.  Moreover, the means I've described for deriving nondeterminism from the network connections (on which I went into some detail in the earlier post) exploits the potential for chaotic scrambling of discrete point-to-point connections by following successions of links hopscotching from point to point.  While the geometry might seem more amenable to continuity, a truly continuous geometry doesn't seem consistent with point-to-point network connections, either, as one would then have the prospect of an infinitely dense tangle of network connections to randomly unrelated remote points, a sort of probability-density field that seems likely to wash out the randomness advantages of the strategy and less likely to be mathematically useful; so the whole rewriting strategy appears discrete in both the geometry and network aspects of its configuration as well as in the discrete rewriting steps themselves.

The rewriting approach may suffer from too stiff a dose of discreteness, as it seems to force a concrete choice of basic structures.  Quantum mechanics is foundationally flexible on the choice of elementary particles; the mathematical infrastructure (e.g. the Schrödinger equation) makes no commitment on the matter at all, leaving it to the Hamiltonian Ĥ.  Particles are devised comparatively freely, as with such entities as phonons and holes.  Possibly the rewriting structure one chooses will afford comparable flexibility, but it's not at all obvious that one could expect this level of versatile refactoring from a thoroughly discrete system.  Keeping in mind this likely shortfall of flexibility, it's not immediately clear what the basic elements should be.  Even if one adopts, say, the standard model, it's unclear how that choice of observable particles would correspond to concrete elements in a discrete spacetime-rewriting system (in one "metaclassical" scenario I've considered, spacetime events are particle-like entities tracing out one-dimensional curves as spacetime evolves across an orthogonal dimension); and it is by no means certain that the observable elements ought to follow the standard model, either.  As I write this there is, part of the time, a cat sitting on the sofa next to me.  It's perfectly clear to me that this is the correct way to view the situation, even though on even moderately closer examination the boundaries of the cat may be ambiguous, e.g. at what point an individual strand of fur ceases to be part of the cat.  By the time we get down to the scale where quantum mechanics comes into play and refactoring of particles becomes feasible, though, is it even certain that those particles are "really" there?  (Hilaire Belloc cast aspersions on the reality of a microbe merely because it couldn't be seen without the technological intervention of a microscope; how much more skepticism is recommended when we need a gigantic particle accelerator?)

Re the structural implications of quasiparticles (such as holes), note that such entities are approximations introduced to describe the behavior of vastly complicated systems underneath.  A speculation that naturally springs to mind is, could the underlying "elementary" particles be themselves approximations resulting from complicated systems at a vastly smaller scale; which would seem problematic in conventional physics since quantum mechanics is apparently inclined to stop at Planck scale.  However, the variety of non-locality I've been exploring in this thread may offer a solution:  by maintaining network connections from an individual "elementary" particle to remote, and rather arbitrarily scrambled, elements of the cosmos, one could effectively make the entire cosmos (or at least significant parts of it) serve as the vastly complicated system underlying the particle.

It is, btw, also not certain what we should expect as the destination of a spacetime-rewriting relation.  An obvious choice, sufficient for a proof-of-concept theory (previous post), is to require that spacetime reach a stable state, from which either no rewriting is possible or further rewriting leaves the system state unchanged.  Is that the only way to derive a final state of spacetime?  No.  Whatever other options might be devised, one that comes to mind is some form of cycle, repeating a closed set of states of spacetime, perhaps giving rise to a set of states that would manifest in more conventional quantum math as a standing wave.  Speculatively, different particles might differ from each other by the sort of cyclic pattern they settle into, determining a finite — or perhaps infinite — set of possible "elementary particles".  (Side speculation:  How do we choose an initial state for spacetime?  Perhaps quantum probability distributions are themselves stable in the sense that, while most initial probability distributions produce a different final distribution, a quantum distribution produces itself.)

Granting that the calculus/physics analogy naturally suggests some sort of physical theory based on a discrete rewriting system, I've had recurring doubts over whether the rewriting ought to be in the direction of time — an intuitively natural option — or, as discussed, in a direction orthogonal to spacetime.  At this point, though, we've accumulated several reasons to prefer rewriting orthogonal to spacetime.

Church-Rosser-ness.  CR-ness is about ability to reason separately about the implications of different parts of the system, without having to worry about which reasoning to do first.  The formal property is that whatever order one takes these locally-driven inferences in ("locally-driven" being a sort of weak locality), it's always possible to make later inferences that reach a common conclusion by either path.  This makes it implausible to think of these inference steps as if they were chronological evolution.

Bell's Theorem.  The theorem says, essentially, the probability distributions of quantum mechanics can't be generated by a conventionally deterministic local theory.  Could it be done by a non-local rewriting theory evolving deterministically forward in time?  My guess would be, probably it could (at least for classical time); but I suspect it'd be rather artificial, whereas my sense of the orthogonal-dimension rewriting approach (from my aforementioned proof-of-concept) is that it ought to work out neatly.

Relativity.  Uses an intensively continuous mathematical infrastructure to construct a relative notion of time.  It would be rather awkward to set an intensively discrete rewriting relation on top of this relative notion of time; the intensively discrete rewriting really wants to be at a deeper level of reality than any continuous relativistic infrastructure, rather than built on top of it (just as we've placed it at a deeper level than quantum entanglement), with apparent continuity arising from statistical averaging over the discrete foundations.  Once rewriting is below relativity, there is no clear definition of a "chronological" direction for rewriting; so rewriting orthogonal to spacetime is a natural device from which to derive relativistic structure.  Relativity is, however, a quintessentially local theory, which ought to be naturally favored by a predominantly local rewriting relation in the orthogonal dimension.  Deriving relativistic structure from an orthogonal rewriting relation with a simple causal structure also defuses the self-reference problems that have lingered about gravity.

It's rather heartening to see this feature of the theory (rewriting orthogonal to spacetime) — or really any feature of a theory — drawing support from considerations in both quantum mechanics and relativity.

The next phase of exploring this branch of theory — working from these clues to the sort of structure such a theory ought to have — seems likely to study how the shape of a spacetime-orthogonal rewriting system determines the shape of spacetime.  My sense atm is that one would probably want particular attention to how the system might give rise to a relativity-like structure, with an eye toward what role, if any, a non-local network might play in the system.  Worth keeping in mind:  the β-rule's use of network topology, though co-hygienic, is at the core of what function application does and, at the same time, inspired my suggestion to simulate nondeterminism through repeatedly rescrambled network connections; likewise, the evidence (variously touched on above) on the possible character of different kinds of generalized non-co-hygienic operations.

Monday, August 29, 2016

Interpreted programming languages

Last night I drifted off while reading a Lisp book.
xkcd 224.

It's finally registered on me that much of the past half century of misunderstandings and confusions about the semantics of Lisp, of quotation, and, yes, of fexprs, can be accounted for by failure to recognize there is a theoretical difference between an interpreted programming language and a compiled programming language.  If we fail to take this difference into account, our mathematical technology for studying compiled programming languages will fail when applied to interpreted languages, leading to loss of coherence in language designs and a tendency to blame the language rather than the theory.

Technically, a compiler translates the program into an executable form to be run thereafter, while an interpreter figures out what the program says to do and just does it immediately.  Compilation allows higher-performance execution, because the compiler takes care of reasoning about source-code before execution, usually including how to optimize the translation for whatever resources are prioritized (time, space, peripherals).  It's easy to suppose this is all there is to it; what the computer does is an obvious focus for attention.  One might then suppose that interpretation is a sort of cut-rate alternative to compilation, a quick-and-dirty way to implement a language if you don't care about performance.  I think this misses some crucial point about interpretation, some insight to be found not in the computer, but in the mind of the programmer.  I don't understand that crucial insight clearly enough — yet — to focus a whole blog post on it; but meanwhile, there's this theoretical distinction between the two strategies which also isn't to be found in the computer's-eye view, and which I do understand enough about to focus this blog post on.

It's not safe to say the language is the same regardless of which way it's processed, because the language design and the processing strategy aren't really independent.  In principle a given language might be processed either way; but the two strategies provide different conceptual frameworks for thinking about the language, lending themselves to different language design choices, with different purposes — for which different mathematical properties are of interest and different mathematical techniques are effective.  This is a situation where formal treatment has to be considered in the context of intent.  (I previously blogged about another case where this happens, formal grammars and term-rewriting calculi, which are both term-rewriting systems but have nearly diametrically opposite purposes; over thar.)

I was set onto the topic of this post recently by some questions I was asked about Robert Muller's M-Lisp.  My dissertation mentions Muller's work only lightly, because Muller's work and mine are so far apart.  However, because they seem to start from the same place yet lead in such different directions, one naturally wants to know why, and I've struck on a way to make sense of it:  starting from the ink-blot of Lisp, Muller and I both looked to find a nearby design with greater elegance — and we ended up with vastly different languages because Muller's search space was shaped by a conceptual framework of compilation while mine was shaped by a conceptual framework of interpretation.  I will point out, below, where his path and mine part company, and remark briefly on how this divergence affected his destination.

The mathematical technology involved here is something I looked at from a lower-level perspective in an earlier post.  From a higher-level perspective, it turns out that the technology itself can be used for both kinds of languages, but certain details in the way it is usually applied only work with compiled languages and, when applied to interpreted languages, result in the trivialization of theory noted in Wand's classic paper, "The Theory of Fexprs is Trivial".

Contents
M-language
Compilation
Universal program
S-language
Half a century's worth of misunderstandings and confusions
M-language

I'll explore this through a toy programming language which I'll then modify, starting with something moderately similar to what McCarthy originally described for Lisp (before it got unexpectedly implemented).

This is a compiled programming language, without imperative features, similar to λ-calculus, for manipulating data structures that are nested trees of atomic symbols.  The syntax of this language has two kinds of expressions:  S-expressions (S for Symbolic), which don't specify any computation but merely specify constant data structures — trees of atomic symbols — and M-expressions (M for Meta), which specify computations that manipulate these data structures.

S-expressions, the constants of the language, take five forms:

S   ::=   s | () | #t | #f | (S . S)   .
That is, an S-expression is either a symbol, an empty list, true, false, or a pair (whose elements are called, following Lisp tradition, its car and cdr).  A symbol name is a sequence of one or more upper-case letters.  (There should be no need, given the light usage in this post, for any elaborate convention to clarify the difference between a symbol name and a nonterminal such as S.)

I'll assume the usual shorthand for lists, (S ...) ≡ (S . (...)), so for example (FOO BAR QUUX) ≡ (FOO . (BAR . (QUUX . ()))); but I won't complicate the formal grammar with this shorthand since it doesn't impact the abstract syntax.
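For concreteness, here's a minimal sketch, in Scheme, of a recognizer for this grammar; the name sexpr? is mine, and Scheme's native symbols, booleans, (), and pairs stand in for the five forms (the upper-case naming convention isn't enforced):

  ; Does s conform to the S-expression grammar above?
  (define (sexpr? s)
    (or (symbol? s)              ; s
        (null? s)                ; ()
        (boolean? s)             ; #t | #f
        (and (pair? s)           ; (S . S)
             (sexpr? (car s))
             (sexpr? (cdr s)))))

  ; (sexpr? '(FOO BAR QUUX))  =>  #t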

The forms of M-expressions start out looking exactly like λ-calculus, then add on several other compound forms and, of course, S-expressions which are constants:

M   ::=   x | [λx.M] | [M M] | [if M M M] | [car M] | [cdr M] | [cons M M] | [eq? M M] | S   .
The first form is a variable, represented here by nonterminal x.  A variable name will be a sequence of one or more lower-case letters.  Upper- versus lower-case letters is how McCarthy distinguished between symbols and variables in his original description of Lisp.

Variables, unlike symbols, are not constants; rather, variables are part of the computational infrastructure of the language, and any variable might stand for an arbitrary computation M.

The second form constructs a unary function, via λ; the third applies a unary function to a single argument.  if expects its first argument to be boolean, and returns its second if the first is true, third if the first is false.  car and cdr expect their argument to be a pair and extract its first and second element respectively.  cons constructs a pair.  eq? expects its two arguments to be S-expressions, and returns #t if they're identical, #f if they're different.

Compound unary functions, constructed by λ, are almost first-class.  They can be returned as the values of expressions, and they can be assigned to variables; but as the grammar is set out, they cannot appear as elements of data structures.  A pair expression is built up from two S-expressions, and a compound unary function is not an S-expression.  McCarthy's original description of Lisp defines S-expressions this way; his stated purpose was only to manipulate trees of symbols.  Trained as I am in the much later Lisp culture of first-class functions and minimal constraints, it felt unnatural to follow this narrower definition of S-expressions; but in the end I had to define it so.  I'm trying to reproduce the essential factors that caused Lisp to come out the way it did, and, strange to tell, everything might have come out differently if not for the curtailed definition of S-expressions.

(One might ask, couldn't we indirectly construct a pair with a function in it using cons?  Such a pair would be a term that can't be represented directly by any source expression.  This point likely doesn't matter directly to how things turned out; but fwiw, I suspect McCarthy didn't have in mind to allow that, no.  It's entirely possible he also hadn't really thought about it yet at the preliminary stage of design where this detail had its impact on the future of Lisp.  It's the sort of design issue one often discovers by playing around with a prototype — and in this case, playing around with a prototype is how things got out of hand; more on that later.)

Compilation

Somehow we want to specify the meanings of programs in our programming language.  Over the decades, a number of techniques for formal PL semantics have been entertained.  One of the first things tried was to set up a term-rewriting machine modeled roughly on λ-calculus, that would perform small finite steps until it reached a final state; that was Peter Landin's notion, when Christopher Strachey set him onto the problem in the 1960s.  It took some years to get the kinks out of that approach, and meanwhile other techniques were tried — such as denotational semantics and meta-circular evaluation — but setting up a term-rewriting calculus has been quite a popular technique since the major technical obstacles to it were overcome.  Of the three major computational models tied together in the 1930s by the Church-Turing thesis, two of them were based on term-rewriting:  Turing machines, which were the convincing model, the one that lent additional credibility to the others when they were all proven equivalent; and λ-calculus, which had a mathematical elegance that the nuts-and-bolts Turing-machine model lacked.  The modern "small-step" rewriting approach to semantics (as opposed to "big-step", where one deduces how to go in a single step from start to finish) does a credible job of combining the strengths of Turing machines and λ-calculus; I've a preference for it myself, and it's the strategy I'll use here.  I described the technique in somewhat more depth in my previous post on this material.

Small-step semantics applies easily to this toy language because every intermediate computational state of the system is naturally represented by a source-code expression.  That is, there is no obvious need to go beyond the source-code grammar we've already written.  Some features that have been omitted from the toy language would make it more difficult to limit all computational states to source expressions; generally these would be "stateful" features, such as input/output or a mutable store.  Landin used a rewriting system whose terms (computational states) were not source-code expressions.  One might ask whether there are any language features that would make it impossible to limit computational states to source expressions, and the answer is essentially yes, there are — features related not to statefulness, but to interpretation.  For now, though, we'll assume that all terms are source expressions.

We can define the semantics of the language in reasonably few rewriting schemata.

[[λx.M1] M2]   →   M1[x ← M2]
[if #t M1 M2]   →   M1
[if #f M1 M2]   →   M2
[car (S1 . S2)]   →   S1
[cdr (S1 . S2)]   →   S2
[cons S1 S2]   →   (S1 . S2)
[eq? S S]   →   #t
[eq? S1 S2]   →   #f     if the Sk are different.

Rewriting relation → is the compatible closure of these schemata; that is, for any context C and terms Mk, if M1 → M2 then C[M1] → C[M2].  Relation → is also Church-Rosser:  although a given term M may be rewritable in more than one way, any resulting difference can always be eliminated by later rewriting.  That is, the reflexive transitive closure →* has the diamond property:  if M1 →* M2 and M1 →* M3, then there exists M4 such that M2 →* M4 and M3 →* M4.
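To make the schemata concrete, here's a rough sketch in Scheme of a function contracting a top-level redex.  The representation is entirely my own choice, not anything canonical:  hypothetical tags var/lam/app/if3/kar/kdr/kons/eq2 for the M-expression forms, S-expression constants wrapped as (quo S), and a naive substitution that assumes all bound variables are distinct (no capture-avoiding renaming):

  ; Naive substitution m[x <- n]; assumes bound variables distinct.
  (define (subst m x n)
    (case (car m)
      ((var) (if (eq? (cadr m) x) n m))
      ((lam) (if (eq? (cadr m) x)
                 m
                 (list 'lam (cadr m) (subst (caddr m) x n))))
      ((quo) m)                                  ; constants: no variables inside
      (else (cons (car m)
                  (map (lambda (m2) (subst m2 x n)) (cdr m))))))

  ; Contract a top-level redex per the schemata, or return #f.
  (define (contract m)
    (case (car m)
      ((app)  (let ((f (cadr m)))                ; [[λx.M1] M2] -> M1[x <- M2]
                (and (eq? (car f) 'lam)
                     (subst (caddr f) (cadr f) (caddr m)))))
      ((if3)  (cond ((equal? (cadr m) '(quo #t)) (caddr m))
                    ((equal? (cadr m) '(quo #f)) (cadddr m))
                    (else #f)))
      ((kar)  (let ((a (cadr m)))                ; [car (S1 . S2)] -> S1
                (and (eq? (car a) 'quo) (pair? (cadr a))
                     (list 'quo (car (cadr a))))))
      ((kdr)  (let ((a (cadr m)))                ; [cdr (S1 . S2)] -> S2
                (and (eq? (car a) 'quo) (pair? (cadr a))
                     (list 'quo (cdr (cadr a))))))
      ((kons) (let ((a (cadr m)) (d (caddr m)))  ; [cons S1 S2] -> (S1 . S2)
                (and (eq? (car a) 'quo) (eq? (car d) 'quo)
                     (list 'quo (cons (cadr a) (cadr d))))))
      ((eq2)  (let ((a (cadr m)) (d (caddr m)))  ; [eq? S1 S2] -> #t | #f
                (and (eq? (car a) 'quo) (eq? (car d) 'quo)
                     (list 'quo (equal? (cadr a) (cadr d))))))
      (else #f)))

  ; (contract '(app (lam x (kons (var x) (quo ()))) (quo FOO)))
  ;   =>  (kons (quo FOO) (quo ()))
  ; (contract '(kons (quo FOO) (quo ())))  =>  (quo (FOO))

The compatible closure then allows this contraction to be applied beneath any context C.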

Formal equality of terms, =, is the symmetric closure of →* (thus, the reflexive symmetric transitive compatible closure of the schemata, which is to say, the least congruence containing the schemata).

Another important relation is operational equivalence, ≅.  Two terms are operationally equivalent just if replacing either by the other in any possible context preserves the observable result of the computation:  M1 ≅ M2 iff for every context C and S-expression S,  C[M1] ↦* S  iff  C[M2] ↦* S.  (Fwiw, relation ↦ is what the computation actually does, versus → which is anything the rewriting calculus could do; → is compatible and Church-Rosser and therefore nice to work with mathematically, but ↦ is deterministic and therefore we can be sure it does what we meant it to.  Another way of putting it is that → has the mathematical character of λ-calculus while ↦ has the practical character of Turing machines.)

Operational equivalence is exactly what must be guaranteed in order for an optimizing compiler to safely perform a local source-to-source transformation:  as long as the two terms are operationally equivalent, the compiler can replace one with the other in any context.  The rewriting calculus is operationally sound if formal equality implies operational equivalence; then the rewriting calculus can supply proofs of operational equivalence for use in optimizing compilation.

Before moving on, two other points of interest about operational equivalence:

  • Operational equivalence of S-expressions is trivial; that is, S1 and S2 are operationally equivalent only if they are identical.  This follows immediately by plugging the trivial context into the definition of operational equivalence, C[Sk] ≡ Sk.  Thus, in every non-trivial operational equivalence M1 ≅ M2, at least one of the Mk is not an S-expression.

  • All terms in the calculus — all M-expressions — are source expressions.  If there were any terms in the calculus that were not source expressions, they would be irrelevant to a source-to-source optimizer; but if such non-source terms could be usefully understood as terms in an intermediate language used by the compiler, an optimizer might still be able to make use of them and their formal equalities.

Universal program

McCarthy's Lisp language was still in its infancy when the project took an uncontrollable turn in a radically different direction than McCarthy had envisioned going with it.  Here's what happened.

A standard exercise in theory of computation is to construct a universal Turing machine, which can take as input an encoding of an arbitrary Turing machine T and an input w to T, and simulate what T would do given input w.  This is an extremely tedious exercise; the input to a Turing machine looks nothing like the control device of a Turing machine, so the encoding is highly intrusive, and the control device of the universal machine is something of an unholy mess.  McCarthy set out to lend his new programming language mathematical street cred by showing that not only could it simulate itself with a universal program, but the encoding would be much more lucid and the logic simpler in contrast to the unholy mess of the universal Turing machine.

The first step of this plan was to describe an encoding of programs in the form of data structures that could be used as input to a program.  That is to say, an encoding of M-expressions as S-expressions.  Much of this is a very straightforward homomorphism, recursively mapping the non-data M-expression structure onto corresponding S-expressions; for our toy language, encoding φ would have

  • φ(x) = symbol s formed by changing the letters of its name from lower-case to upper-case. Thus, φ(foo) = FOO.
  • φ([λx.M]) = (LAMBDA φ(x) φ(M)).
  • φ([M1 M2]) = (φ(M1) φ(M2)).
  • φ([if M1 M2 M3]) = (IF φ(M1) φ(M2) φ(M3)).
  • φ([car M]) = (CAR φ(M)).
  • φ([cdr M]) = (CDR φ(M)).
  • φ([cons M1 M2]) = (CONS φ(M1) φ(M2)).
  • φ([eq? M1 M2]) = (EQ φ(M1) φ(M2)).

(This encoding ignores the small detail that certain symbol names used in the encoding — LAMBDA, IF, CAR, CDR, CONS, EQ — must not also be used as variable names, if the encoding is to behave correctly.  McCarthy seems not to have been fussed about this detail, and nor should we be.)

For a proper encoding, though, S-expressions have to be encoded in a way that unambiguously distinguishes them from the encodings of other M-expressions.  McCarthy's solution was

  • φ(S) = (QUOTE S).

Now, in some ways this is quite a good solution.  It has the virtue of simplicity, cutting the Gordian knot.  It preserves the readability of the encoded S-expression, which supports McCarthy's desire for a lucid encoding.  The main objection one could raise is that it isn't homomorphic; that is, φ((S1 . S2)) is not built up from φ(S1) and φ(S2).
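In sketch form, over the same hypothetical tagged representation used earlier, the whole encoding is only a few lines of Scheme (string-upcase as in R7RS's (scheme char); again, the tag names are mine):

  ; phi: tagged M-expression -> S-expression, per the clauses above.
  (define (encode m)
    (case (car m)
      ((var)  (string->symbol
               (string-upcase (symbol->string (cadr m)))))
      ((lam)  (list 'LAMBDA (encode (list 'var (cadr m)))
                    (encode (caddr m))))
      ((app)  (list (encode (cadr m)) (encode (caddr m))))
      ((if3)  (list 'IF (encode (cadr m)) (encode (caddr m))
                    (encode (cadddr m))))
      ((kar)  (list 'CAR (encode (cadr m))))
      ((kdr)  (list 'CDR (encode (cadr m))))
      ((kons) (list 'CONS (encode (cadr m)) (encode (caddr m))))
      ((eq2)  (list 'EQ (encode (cadr m)) (encode (caddr m))))
      ((quo)  (list 'QUOTE (cadr m)))))    ; the non-homomorphic case

  ; (encode '(app (lam x (app (var x) (var y))) (quo FOO)))
  ;   =>  ((LAMBDA X (X Y)) (QUOTE FOO))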

As McCarthy later recounted, they had expected to have plenty of time to refine the language design before it could be implemented.  (The FORTRAN compiler, after all, had been a massive undertaking.)  Meanwhile, to experiment with the language they began hand-implementing particular functions.  The flaw in this plan was that, because McCarthy had been so successful in demonstrating a universal Lisp function eval with simple logic, it wasn't difficult to hand-implement eval; and, because he had been so successful in making the encoding lucid, this instantly produced a highly usable Lisp interpreter.  The sudden implementation precipitated a user community and substantial commitment to specifics of what had been a preliminary language design.

All this might have turned out differently if the preliminary design had allowed first-class functions to be elements in pairs.  A function has to be encoded homomorphically, which would require a homomorphic encoding of pairs, perhaps φ((M1 . M2)) = (CONS φ(M1) φ(M2)); once we allow arbitrary M-expressions within the pair syntax, (M1 . M2), that syntax itself becomes a pair constructor and there's really no need for a separate cons operator in the M-language; then CONS can encode the one constructor.  One might then reasonably restrict QUOTE to base cases; further, let (), #t, and #f encode themselves, leaving only the case of encoding symbols, and rename QUOTE to SYMBOL.  The encoding would then be fully homomorphic — but the encodings of large constant data structures would become unreadable.  For example, fairly readable constant structure

((LAMBDA X (X Y)) FOO)
would encode through φ as
(CONS (CONS (SYMBOL LAMBDA) (CONS (SYMBOL X) (CONS (CONS (SYMBOL X) (CONS (SYMBOL Y) ())) ()))) (CONS (SYMBOL FOO) ()))   .
That didn't happen, though.
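Fwiw, here's a sketch of that hypothetical homomorphic treatment of S-expressions, which would replace the QUOTE case of the encoding; the specifics are again my own:

  ; Homomorphic encoding of S-expressions: () #t #f self-encode,
  ; symbols go through SYMBOL, pairs through CONS.
  (define (encode-s s)
    (cond ((or (null? s) (boolean? s)) s)
          ((symbol? s) (list 'SYMBOL s))
          (else (list 'CONS (encode-s (car s))
                            (encode-s (cdr s))))))

  ; (encode-s '((LAMBDA X (X Y)) FOO)) reproduces the unreadable
  ; encoding displayed just above.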

The homomorphic, non-QUOTE encoding would naturally tend to produce a universal function with no practical potential for meta-programming.  In theory, one could use the non-homomorphic QUOTE encoding and still not offer any native meta-programming power.  However, the QUOTE-based encoding means there are data structures lying around that look exactly like working executable code except that they happen to be QUOTE‍d.  In practice, the psychology of the notation makes it rather inevitable that various facilities in the language would allow a blurring of the line between data and code.  Lisp, I was told when first taught the language in the 1980s, treats code as data.  Sic:  I was not told Lisp treats data as code, but that it treats code as data.

In other words, Lisp had accidentally become an interpreted language; a profoundly different beast than the compiled language McCarthy had set out to create, and one whose character naturally suggests a whole different set of features that would not have occurred to someone designing a compiled language in 1960.  There were, of course, some blunders along the way (dynamic scope is an especially famous one, and I would rate the abandonment of fexprs in favor of macros as another of similar magnitude); but in retrospect I see all that as part of exploring a whole new design space of interpreted features.  Except that over the past three decades or so the Lisp community seems to have somewhat lost track of its interpreted roots; but I'll get back to that.

Of interest:

  • In S-expression Lisp, all source expressions are S-expressions.  It is no less true now than before that an operational equivalence M1 ≅ M2 can only be nontrivial if at least one of the Mk is not an S-expression; but now, if the Mk are source expressions, we can be absolutely certain that they are both S-expressions.  So if the operational equivalence relation is restricted to source expressions, it's trivial.  This isn't disastrous; it just means that, in order to have nontrivial theory, we are going to have to have some terms that are not source expressions (as Landin did, though for a different reason); and if we choose to compile the language, we won't be allowed to call our optimizations "local source-to-source" (any given optimization could be one or the other, but not both at once).

  • This is the fork in the road where Muller and I went our separate ways.  Muller's M-Lisp, taking the compiled-language view, supposes that S-expressions are encoded homomorphically, resulting in a baseline language with no native meta-programming power.  He then considers how to add some meta-programming power to the resulting language.  However, practical meta-programming requires the programmer be able to write lucid code that can also be manipulated as data; and the homomorphic encoding isn't lucid.  So instead, meta-programming in Muller's extended M-Lisp uses general M-expressions directly (rather than their encodings).  If an M-expression turns out to be wanted as data, it then gets belatedly encoded — with the drawback that the M-expression can't be rewritten until the rewriting schemata can tell it won't be needed as data.  This causes difficulties with operational equivalence of general M-expressions; in effect, as the burden of meta-programming is shifted from S-expressions to general M-expressions, it carries along with it the operational-equivalence difficulties that had been limited to S-expressions.

S-language

McCarthy hadn't finished the details of M-expressions, so S-expression Lisp wasn't altogether an encoding of anything; it was itself, leaving its implementors rather free to invent it as they went along.  Blurring the boundary between quoted data and unquoted code provided meta-programming facilities that hadn't been available in compiled languages (essentially, I suggest, a sort of flexibility we enjoy in natural languages).  In addition to QUOTE itself (which has a rather fraught history entangled with first-class functions and dynamic scope, cf. §3.3.2 of my dissertation), from very early on the language had fexprs, which are like LAMBDA-constructed functions except that they treat their operands as data — as if the operands had been QUOTE‍d — rather than evaluating them as code (which may later be done, if desired, explicitly by the fexpr using EVAL).  In 1963, macros were added — not the mere template-substitution macros found in various other languages, but macros that treat their operands as data and perform an arbitrary computation to generate a data structure as output, which is then interpreted as code at the point of call.

But how exactly do we specify the meanings of programs in this interpreted S-expression language?  We could resort to the meta-circular evaluator technique; this is a pretty natural strategy since an evaluator is exactly what we have as our primary definition of the language.  That approach, though, is difficult to work with mathematically, and in particular doesn't lend itself to proofs of operational equivalence.  If we try to construct a rewriting system the way we did before, we immediately run into the glaring practical problem that the same representation is used for executable code, which we want to have nontrivial theory, and passive data which necessarily has perfectly trivial theory.  That is, as noted earlier, all source expressions are S-expressions and operational equivalence of S-expressions is trivial.  It's possible to elaborate in vivid detail the theoretical train wreck that results from naive application of the usual rewriting semantics strategy to Lisp with quotation (or, worse, fexprs); but this seems to be mainly of interest if one is trying to prove that something can't be done.  I'm interested in what can be done.

If what you want is nontrivial theory that could in principle be used to guide optimizations, this is not difficult to arrange (once you know how; cf. my past discussion of profundity index).  As mentioned, all nontrivial operational equivalences must have at least one of the two terms not a source expression (S-expression), therefore we need some terms that aren't source expressions; and our particular difficulty here is having no way to mark a source expression unmistakably as code; so, introduce a primitive context that says "evaluate this source expression".  The new context only helps with operational equivalence if it's immune to QUOTE, and no source expression is immune to QUOTE, so that's yet another way to see that the new context must form a term that isn't a source expression.

Whereas the syntax of the compiled M-language had two kinds of terms — constant S-expressions and computational M-expressions — the syntax of the interpreted S-language will have three kinds of terms.  There are, again, the "constant" terms, the S-expressions, which are now exactly the source expressions.  There are the "computational" terms, which are needed for the actual work of computation; these are collectively shaped something like λ-calculus.  We expect a big part of the equational strength of the rewriting calculus to reside in these computational terms, roughly the same equational strength as λ-calculus itself, and therefore of course those terms have to be entirely separate from the source expressions which can't have nontrivial equational theory.  And then there are the "interpretation" terms, the ones that orchestrate the gradual conversion of source expressions into computational expressions.  The code-marking terms are of this sort.  The rewriting rules involving these "interpretation" terms will amount to an algorithm for interpreting source code.

This neat division of terms into three groups won't really be as crisp as I've just made it sound.  Interpretation is by nature a gradual process whose coordination seeps into other parts of the grammar.  Some non-interpretation terms will carry along environment information, in order to make it available for later use.  This blurring of boundaries is perhaps another part of the flexibility that (I'll again suggest) makes interpreted languages more similar to natural languages.

I'll use nonterminal T for arbitrary terms.  Here are the interpretation forms.

T   ::=   [eval T T] | ⟨wrap T⟩ | e   .
Form [eval T1 T2] is a term that stands for evaluating a term T1 in an environment T2.  This immediately allows us to distinguish between statements such as
S1 ≅ S2
[eval S1 e0] ≅ [eval S2 e0]
∀e, [eval S1 e] ≅ [eval S2 e]   .
The first proposition is the excessively strong statement that S-expressions Sk are operationally equivalent — interchangeable in any context — which can only be true if the Sk are identically the same S-expression.  The second proposition says that evaluating S1 in environment e0 is operationally equivalent to evaluating S2 in environment e0 — that is, for all contexts C and all S-expressions S, C[[eval S1 e0]] ↦* S  iff  C[[eval S2 e0]] ↦* S.  The third proposition says that evaluating S1 in any environment e is operationally equivalent to evaluating S2 in the same e — which is what we would ordinarily have meant, in a compiled language, if we said that two executable code (as opposed to data) expressions Sk were operationally equivalent.

The second form, ⟨wrap T⟩, is a wrapper placed around a function T, that induces evaluation of the operand passed to the function.  If T is used without such a wrapper (and presuming T isn't already a wrapped function), it acts directly on its unevaluated operand — that is, T is a fexpr.

The third form, e, is simply an environment.  An environment is a series of symbol-value bindings, ⟪sk←Tk⟫; there's no need to go into gory detail here (though I did say more in a previous post).

The computational forms are, as mentioned, similar to λ-calculus with some environments carried along.

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩   .
Here we have a variable, a combination, and two kinds of function.  Form ⟨λx.T⟩ is a function that substitutes its operand for x in its body T.  Variant ⟨εx.T⟩ substitutes its dynamic environment for x in its body T.

Form [combine T1 T2 T3] is a term that stands for combining a function T1 with an operand T2 in a dynamic environment T3.  The dynamic environment is the set of bindings in force at the point where the function is called; as opposed to the static environment, the set of bindings in force at the point where the function is constructed.  Static environments are built into the bodies of functions by the function constructor, so they don't show up in the grammar.  For example, [eval (LAMBDA X FOO) e0] would evaluate to a function with static environment e0, of the form ⟨wrap ⟨λx.[eval FOO ⟪...⟫]⟩⟩ with the contents of e0 embedded somewhere in the ⟪...⟫.

Putting it all together,

T   ::=   x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩ | [eval T T] | ⟨wrap T⟩ | e | S   .

The rewriting schemata naturally fall into two groups, those for internal computation and those for source-code interpretation.  (There are of course no schemata associated with the third group of syntactic forms, the syntactic forms for passive data, because passive.)  The computation schemata closely resemble λ-calculus, except with the second form of function used to capture the dynamic environment (which fexprs sometimes need).

[combine ⟨λx.T1⟩ T2 T3]   →   T1[x ← T2]
[combine ⟨εx.T1⟩ T2 T3]   →   [combine T1[x ← T3] T2 T3]   .
The interpretation schemata look very much like the dispatching logic of a Lisp interpreter.
[eval d T]   →   d
        if d is an empty list, boolean, λ-function, ε-function, or environment
[eval s e]   →   lookup(s,e)     if symbol s is bound in environment e
[eval (T1 . T2) T3]   →   [combine [eval T1 T3] T2 T3]
[combine ⟨wrap T1⟩ T2 T3]   →   [combine T1 [eval T2 T3] T3]   .
(There would also be some atomic constants representing primitive first-class functions and reserved operators such as if, and schemata specifying what they do.)
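Those interpretation schemata really do look like an interpreter; as a sanity check, here's a minimal big-step rendering in Scheme.  The conventions are mine, not canon:  environments are association lists, a wrapped function is tagged with a vector (so it can never be mistaken for an S-expression, which is always a symbol, boolean, (), or pair), a bare Scheme procedure taking its unevaluated operand list and dynamic environment plays the role of a fexpr, and wrap is taken to map evaluation across the operand list:

  ; [eval t e]: symbols look up, pairs combine, all else self-evaluates.
  (define (ev t e)
    (cond ((symbol? t) (cdr (assq t e)))             ; lookup(s, e)
          ((pair? t) (comb (ev (car t) e) (cdr t) e))
          (else t)))                                 ; (), booleans, non-source terms

  ; [combine f opnds e]: a wrap induces operand evaluation;
  ; a bare procedure is a fexpr, acting on its operands as data.
  (define (comb f opnds e)
    (if (and (vector? f) (eq? (vector-ref f 0) 'wrap))
        (comb (vector-ref f 1)
              (map (lambda (t) (ev t e)) opnds)
              e)
        (f opnds e)))

  ; A tiny ground environment.  QUOTE is nothing but a fexpr;
  ; LAMBDA is a fexpr that builds a wrapped function closed over
  ; the environment in force where it was evaluated (static scope).
  (define ground
    (list (cons 'QUOTE (lambda (opnds dyn) (car opnds)))
          (cons 'LAMBDA
                (lambda (opnds dyn)
                  (vector 'wrap
                          (lambda (args dyn2)
                            (ev (cadr opnds)
                                (cons (cons (car opnds) (car args))
                                      dyn))))))))

  ; (ev '((LAMBDA X X) (QUOTE FOO)) ground)  =>  FOO

Note how LAMBDA comes out wrapped, anticipating the point below that ordinary Lisp functions are fexprs with operand evaluation induced around them.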

Half a century's worth of misunderstandings and confusions

As I remarked earlier, Lisp as we know it might not have happened — at least, not when and where it did — if McCarthy had thought to allow first-class functions to occur in pairs.  The thing is, though, I don't think it's all that much of an "accident".  He didn't think to allow first-class functions to occur in pairs, and perhaps the reason we're likely to think to allow them today is that our thinking has been shaped by decades of the free-wheeling attitude fostered by the language that Lisp became because he didn't think to then.  The actual sequence of events seems less unlikely than one might first suppose.

Researchers trying to set up semantics for Lisp have been led astray, persistently over the decades, by the fact that the primary Lisp constructor of first-class functions is called LAMBDA.  Its behavior is not that of calculus λ, exactly because it's entangled with the process of interpreting Lisp source code.  This becomes apparent when contemplating rewriting calculi for Lisp of the sort I've constructed above (and have discussed before on this blog):  When you evaluate a LAMBDA expression you get a wrapped function, one that explicitly evaluates its operand and then passes the result to a computational function; that is, passes the result to a fexpr.  Scan that:  ordinary Lisp functions do not correspond directly to calculus λ-abstractions, but fexprs do correspond directly to calculus λ-abstractions.  In its relation to Lisp, λ-calculus is a formal calculus of fexprs.

Much consternation has also been devoted to the perceived theoretical difficulty presented by Lisp's quotation operator (and presented in more extreme form by fexprs), because it presents a particular context that can distinguish any two S-expressions placed into it:  (QUOTE S1) and (QUOTE S2) are observably distinct whenever the Sk are distinct from each other.  Yet, this observation only makes sense in a compiled programming language.  Back in the day, it would have been an unremarkable observation that Lisp only has syntax for data structures, no syntax at all for control.  Two syntactically distinct Lisp source expressions are operationally non-equivalent even without any quotation or fexpr context, because they don't represent programs at all; they're just passive data structures.  The context that makes a source expression code rather than data is patently not in the source; it's in what program you send the source expression to.  Conventional small-step operational semantics presumes the decision to compile, along with a trivial interpretive mapping between source expressions and internal computational terms (so the interpretive mapping doesn't have to appear explicitly in the rewriting schemata).  It is true that, without any such constructs as quotation or fexprs, there would be no reason to treat the language as interpreted rather than compiled; but once you've crossed that Rubicon, the particular constructs like quotation or fexprs are mere fragments of, and can be distractions from, the main theoretical challenge of defining the semantics of an interpreted language.

The evolution of Lisp features has itself been a long process of learning how best to realize the flexibility offered by interpreted language.  Fexprs were envisioned just about from the very beginning — 1960 — but were sabotaged by dynamic scope, a misfeature that resulted from early confusion over how to handle symbol bindings in an interpreter.  Macros were introduced in 1963, and unlike fexprs they lend themselves to preprocessing at compile time if one chooses to use a compiler; but macros really ought to be much less mathematically elegant than fexprs... except that in the presence of dynamic scope, fexprs are virulently unstable.  Then there was the matter of first-class functions; that's an area where Lisp ought to have had a huge advantage; but first-class functions don't really come into their own without static scope (The Art of the Interpreter noted this) — and first-class functions also present a difficulty for compilers (which is why procedures in ALGOL were second-class).  The upshot was that after Lisp 1.5, when Lisp splintered into multiple dialects, first-class functions went into eclipse until they reemerged in the mid-1970s when Scheme introduced static scope into the language.  Fexprs held on longer but, ironically, were finally rejected by the Lisp community in 1980 — just a little ahead of the mainstream adoption of Scheme's static-scope innovation.  So for the next twenty years and more, Lisp had static scope and first-class functions, but macros and no fexprs.  Meanwhile, EVAL — key workhorse of meta-linguistic flexibility — was expunged from the new generation of mainstream Lisps and has had great difficulty getting back in.

The latter half of Lisp's history has been colored by a long-term trend in programming language design as a whole.  I've alluded to this several times above.  I have no specific sources to suggest for this; it's visible in the broad sweep of what languages were created, what research was done, and I've sensed it through my interactions with the Lisp community over the past thirty years.  When I learned Lisp in the mid-1980s, it was from the Revised Maclisp Manual, Saturday Evening Edition (which I can see a few feet away on my bookshelf as I write this, proof that manuals can be well-written).  Maclisp was a product of the mentality of the 1970s.  Scheme too was a product of that mentality.  And what comes through to me now, looking back, isn't the differences between those languages (different though they are), but that those people knew, gut deep, that Lisp is an interpreted language — philosophically, regardless of the technical details of the language processing software.  The classic paper I cited above for the relationship between first-class functions and static scope was one of the "lambda" papers associated with the development of Scheme: "The Art of the Interpreter".  The classic textbook — the Wizard Book — that emerged from the Scheme design is Structure and Interpretation of Computer Programs.

But then things changed.  Compilation had sometimes intruded into Lisp design, yes (with unfortunate results, as I've mentioned), but the intrusion became more systematic.  Amongst Scheme's other achievements it had provided improved compilation techniques, a positive development but which also encouraged greater focus on the challenges of compilation.  We refined our mathematical technology for language semantics of compiled languages, we devised complex type systems for use with compiled languages, more and more we designed our languages to fit these technologies — and as Lisp didn't fit, more and more we tweaked our Lisp dialects to try to make them fit.  Of course some of the indigenous features of Lisp couldn't fit, because the mathematical tools were fundamentally incompatible with them (no pun intended).  And somewhere along the line, somehow, we forgot, perhaps not entirely but enough, that Lisp is interpreted.  Second-class syntax has lately been treated more and more as if it were a primary part of the language, rather than a distraction from the core design.  Whatever merits such languages have, wholeheartedly embracing the interpretation design stance is not among them.

I'm a believer in trying more rather than less.  I don't begrudge anyone their opportunity to follow the design path that speaks to them; but not all those paths speak to me.  Second-class syntax doesn't speak to me, nor recasting Lisp into a compiled language.  I'm interested in compiling Lisp, but want the language design to direct those efforts rather than the other way around.  To me, the potential of interpretation beckons; the exciting things we've already found on that path suggest to me there's more to find further along, and the only way to know is to follow the path and see.  To do that, it seems to me we have to recognize that it is a distinct path, the distinct philosophy of interpretation; and, in company with that, we need to hone our mathematical technology for interpreted languages.

These are your father's parentheses
Elegant weapons
for a more... civilized age.
xkcd 297.