$Id: NameClassAlgebra.docbook,v 1.1 2001-06-22 16:28:26-07 bear Exp bear $
Table of Contents
Arithmetic of name class (NC) might be easily understood if you imagine a grid. Any name class, no matter how big its definition is, can be expressed in a grid. For example, the following name class
<choice> <nsName ns="fooNS"/> <name ns="barNS">zoo</name> <difference ns="zigNS"> <nsName /> <name>guf</name> </difference> </choice>
can be expressed as:
The columns indicates local names, and the rows indicates namespace URIs. This grid can be read in the obvious way: if the name is (barNS,zoo), then this name is accepted by this NC.
The key concept is '*', the wildcard. It is the wild card, and this entry is used only when there is no row/column for the name being tested. For example, if you are testing (fooNS,abcdef), then you should look up the cell of (fooNS,*). Therefore this name should be rejected.
Note that we don't use cells like (*,zoo) or (*,guf). Those cells are delegated to (*,*) cell. So if you are testing (abcdefNS,zoo), then you should check the (*,*) cell. This asymmetry is due to the fact that we have <nsName/>, which matches anything as long as its namespace URI is something, whereas we don't have a primitive which matches anything as long as its local name is something.
If you are really familiar with NC arithmetic, then probably you'll find that the grid notation is too verbose. Yes, you are right. You can express the name class in much more concise way. But nevertheless this grid notation is easy to understand.
It is very easy and straightforward to convert a NC written in RELAX NG syntax to the grid notation. Here is the detail of the algorithm written in C-like language:
Grid calcGrid( NameClass nc ) { // step.1 visit all <name> element in the given nc // and collect its content. Set localNames; for( all element e in nc ) if(e.tagName()=="name") localNames.add( e.bodyText() ); // step.2 visit all "ns" attributes and collect them. // we're assuming that every element has the "ns" attribute. Set URIs; for( all element e in nc ) URIs.add( e.getAttribute("ns") ); // step.3 add the wildcard localNames.add("*"); URIs.add("*"); // step.4 now fill the grid. Grid g; for( all uri in URIs ) for( all ln in localNames ) g[uri,ln] = isValidName( nc, uri,ln ); return g; }
In this algorithm, the wildcard is denoted by a special string "*". Any string will serve as the wildcard as long as it is not a valid name for an element/attribute name.
Also, this algorithm assumes that there exists a function "isValidName", which takes a name class, a namespace URI, and a local name, then return a boolean that indicates whether the specified (uri,local) pair is accepted by the name class. This algorithm is fairly straightforward and therefore it's not described in this article.
It is also possible to convert a grid notation to the equivalent RELAX NG syntax. The following algorithm is relatively simple, and produces RELAX NG patterns of (IMO) acceptable quality.
XML calcXML( Grid g ) { XML result; for( all uri in g.rows ) { if( uri=='*' ) continue; // ignore the wildcard XML temp; for( all ln in g.cols ) { if( ln=='*' ) continue; // ignore the wildcard if(g[uri,*]!=g[uri,ln]) temp = <choice> temp <name ns="uri"> ln </name> </choice> } if( g[uri,*]!=g[*,*] ) temp = <difference> <nsName ns="uri"/> temp </difference>; result = <choice> result temp <choice>; } if( g[*,*]==Y ) result = <not> result </not>; } }
Since many RELAX NG representations are possible for one grid, many different algorithms are possible; this is just an example.
Conceptually, a name class can be thought as a set of allowed (URI,local) pairs.
Name class is closed to the boolean operation. Given name classes NC1 and NC2, it is always possible to produce
the negation of NC, in which a name is accepted if and only if it is NOT accepted by NC.
<not>NC</not>
the intersection of NC1 and NC2, in which a name is accepted if and only if it is accepted by both NC1 and NC2.
<difference> NC1 <not>NC2</not></difference>
the union of NC1 and NC2, in which a name is accepted if and only if it is accepted by NC1 or NC2 (or both).
<choice> NC1 NC2</choice>
The easiest way to perform name class arithmetic is to use RELAX NG syntax, as described above.
Or you can do the same operations by using grids. The algorithm will be like this:
Grid transduceGrid( Grid a, Grid b ) { Grid result; // step.1 create a grid by merging cols and rows // of the two grids. result.rows.addAll( a.rows ); result.rows.addAll( b.rows ); result.cols.addAll( a.cols ); result.cols.addAll( b.cols ); // step.2 fill the grid. for( all uri in result.rows ) { for( all ln in result.cols ) { // do the operation you want result[uri,ln] = op( a.lookup(uri,ln), b.lookup(uri,ln) ); } } }
If the function "op" is defined as follows, then you'll get the intersection.
boolean op( boolean a, boolean b ) { return a&&b; }
To get the union, the "op" function should be:
boolean op( boolean a, boolean b ) { return a||b; }
Negation can be done simply by flipping all the cells of the grid.
If you perform those operations by using grids, then the resulting grid is always bigger than ( or equal to) the operand grids. This means that if you perform operations many times, then you'll probably get a very big grid.
But sometimes the grid becomes unnecessarily redundant. Consider the following grid for example:
Guess how redundant it is!
First, you might notice that the row of "zigNS" is unnecessary because every cell of the row "zigNS" is equal to the cell (*,*). Such rows can be removed from the grid anytime.
The row of "barNS" is unnecessary by the same reasoning. So actually we can remove two rows at the same time. But for the purpose of better explanation, let us keep "barNS" untouched.
Now look at the "zoo" column. This column is the same as the "*" column. columns like this can also be removed from the grid.
By removing the "zoo" column and the "barNS" row, now we get the minimized grid.
guf | * | |
fooNS | N | Y |
* | - | N |
To summarize, the grid can be shrinked by applying the following two rules.
Sometimes you may want to know whether a particular name class is empty or not. (A name class is said to be empty when it accepts nothing.)
The grid notation is useful for this purpose. A name class is empty if and only if there is no 'Y' in its grid.
By generalizing the above problem, let's say you want to decide whether a name class can accept infinite number of names. For example, the following name class accepts 3 names.
<choice> <name>name1</name> <name>name2</name> <name>name3</name> </choice>
But sometimes a name class can be so complex that you can't tell from its RELAX NG syntax.
<difference> <choice> <anyName/> <nsName ns="fooNS" /> </choice> <difference> <anyName/> <name>abc</name> </difference> </difference>
The grid is also useful for this purpose: First convert the pattern to the grid, then count 'Y' in the "*" column. If you have any 'Y' in that column, then that name class can accept infinite number of names.
If the grid has no 'Y' in that column, then count 'Y' in other cells. That number tells you how many names are accepted by the name class.