grammar book

Thursday, April 08, 2004

Revisit

Last time we encountered a problem with matching nodes on an AST with XPath. What we were trying to do is match certain patterns of nodes in the AST so that we could expand them as macro uses. Problem was, the grammar we were representing in our AST wasn't rich enough to match against. We want to be able to match against types, like 'type name', 'variable', 'statement', 'block', 'expression', and so on... but our grammar had only punctionuation: 'identifier', 'brace', 'parenthesis', 'bracket', 'semicolon', 'comma', etc. The grammar was essentially just punctuation, with a minimal amount of structure representing the 'matching punctuation' rules.

The thing is this. We know what sort of constructs can appear in a java program. But we don't know what sort of things can appear in the macro version. So we don't know how much, or what sort, of structure we to put in the AST. Do we represent:

try { } catchall (Exception e) { }

<ident>try</ident> <brace></brace> <ident>catchall</ident> <paren><ident>Exception</ident><ident>e</ident></paren> <brace></brace>

<statement> <ident><keyword>try</keyword></ident> <statement> <block></block> </statement> <ident><keyword>catchall</keyword></ident> <paren> <declaration> <ident><type>Exception</type></ident> <ident>e</ident> </declaration> </paren> <statement> <block></block> </statement> </statement>

or any of the other countless ways of structuring that sort of thing.

It seems to me that part of the problem is that we want to match elements based on what type it represents. For instance, in the second example above, we have a <block> inside a <statement> element as a way to represent the fact that a 'block' is a kind of 'statement'. We need this sort of type hierarchy information because we want to match against it -- we want to make rules that match both 'statement' and 'block' nodes. {ref WSDL}

Having type hierarchy matching would only partially help things. We still need to know what type hierarchies to use.. Do we do like the second example above, or

We can do stuff like the above -- naming the parts of the AST based on the grammar production it comes from -- for syntax constructs we know, but what about non-standard constructs that appear when we're writing macros. For instance,

try { } catchall (OperationNotSupported, OutOfMemory e) { }

How do we represent the stuff in the parentheses? How do we represent the parentheses themselves? Do we represent it as an <expression> holding several <ident>s and a <comma>? Or a <paren> with an <ident>, <comma> and a <declaration>?

Hrmm.. maybe we can generalize the basic semantic grammar to include "extension points" for macros. Well, by extension points I mean, bits of the grammar are so general that the can include regular sun java or new constructs. Consider declarations. We can have declarations of the following sorts:

package: package com.third_bit.jp; import: import com.sdfg.sdfg.Sdfg; class: public class Foo extends Bar implements Quux, Baaz {} interface: public interface Quux extends Qux {} field: public int fooSize; method: public void setFooSize(int fooSize) {}

We could generalize all of this with the following productions:

where <mod>, <type>, and <name> are all identifiers.

This idea of generalizing the grammar seems basically like what we were doing before by only representing punctionation. This is just more of a compromise which allows some of the annotating and typing we need, while still allowing new constructs.

But again, how general to go? As the grammar becomes more abstract (closer to just punctuation) it becomes more expressive but harder to match against in an XSL. Maybe 'harder' isn't the right word. What I mean is that making the grammar more abstract means that the macro expansion in the XSL seems to become less expressive. (Right? Because if, say, I only have 'ident' to match against, I can't tell whether it's a 'type' or a 'variable'... ).

I'm finding it hard to decide how abstract to make the Abstract Syntax Tree. That is, what do we actually need in terms of expressivity?

JSE parses the source using a basic punctuation grammar and then does all of it's matching in the code. From what I can see (though, I find the code very difficult to understand) they in-effect have a grammar which allows one to match against expressions, statements, blocks, type names, or generally identifiers, and punctuation (much like the first example AST representation).

Something seems off about all all of this. We're trying out macro expansions using XSL because we're thinking along the lines of representing source code as as XML. So far that has meant representing the AST of the source in XML. But now we've ended up putting little more than representing a punctuation. What have we really gained?

Maybe I'm going about this the wrong way. I've been thinking that the XML should represent the source code in such a general way that we don't need to know what parts are macro and what parts aren't.

What if we represent everything in the AST as it is. That is, we use an AST that represents macros and general syntax explicitly. For instance,

<sun:if> <test>/*some expression*/</test> <body> <pipitone:tryall> <foo-block> /*try block*/ </foo-block> <bar-catch> <type-list> <item>/*item1*/</item> <item>/*item2*/</item> </type-list> <exception-name>/*exception name*/</exception-name> </bar-catch> <quux-block>/*catch all block*/</quux-block> </pipitone:tryall> </body> </sun:if>

which represents:

if (/some expression/) { try { /*try block*/ } catchall (/*item1*/ /*item2*/ /*exception name*/) { /*catch all block*/ } }

So it's clear from the XML that the 'if' stuff belongs to sun java syntax, and the 'tryall' stuff belongs to the pipitone java syntax extension.

How did I render the above XML into that java code? Because I have a source renderer for sun:if and pipitone:tryall syntactic components in my head. I could have just as easily rendered it in pure sun java as:

if (/some expression/) { try { /*try block*/ } catch (/*item1*/ /*exception name*/) { /*catch all block*/ } catch (/*item2*/ /*exception name*/) { /*catch all block*/ } }

I could have also rendered it as XML. It's pretty clear to me that this sort of thing would now be a synch to do with XSL.

So what I'm saying is that now the job of determing which bits of the code represent which bits of syntax is up to the plain text source code parser (as it should be, right?) rather than the macro template expander. Maybe bits of syntax could be imported like any other type:

import syntax com.third_bit.pipitone.syntax.tryall;

And then if there was any ambiguity in parsing the source code, it could be dealt with like normal ambiguity in type usage. For instance, if I'm using a org.w3c.Document and a org.apache.lucene.document.Document in the same class I'm forced to use the fully qualified type name everywhere... So I could do this with syntax as well. Imagine that I wanted my 'tryall' macro to use the keyword 'catch' instead of 'catchall', but that might conflict with Sun's Java syntax. So:

import syntax com.third_bit.pipitone.syntax.try; ... sometime later in the code ... try { } com.third_bit.pipitone.catch (Exception1, Exception2 e) { } ... try { } java.lang.syntax.catch (Exception e) { }

might be a way to disambiguate this.

Under this scheme source code as XML is represented so that it is much more strongly connected to the syntax it represents. It can be validated against a schema much more specifically, etcetera etcetera. Hrm..

# posted 9:11 AM.

grammar book

Thursday, April 08, 2004

Revisit

0 Comments:

Links

archives