Sebastian J. Mielke

Describing discontinuous constituents with LCFRS

2016-10-21

Claim: Natural language utterances follow a certain structure [citation needed]. Since most people like to believe that this claim holds true (i.e., that there is syntax), the interesting question becomes: what does this “structure” look like and how can a computer work with it?

Prologue: nouns and their friends

We know that we can group words into categories like “nouns”, “verbs”, or “adjectives”. For a given sentence like “I write a blog” we end up with these part-of-speech tags (POS tags): I/PRP write/VB a/DT blog/NN. PRP marks pronouns, VB verbs, DT determiners (like “the” and “a”), and NN common nouns. But as humans we see that there is more going on in this sentence, namely that the verb “write” seems to connect two parts: the pronoun “I” and then our object: “a blog”. So it looks like tagging single words won't be enough; we will want to mark specific phrases as such. [1]

Act 1: Phrases, constituency trees and context-free grammars

We will call the phrase “a blog” a noun phrase (NP) and draw it like this: (NP (DT a) (NN blog)). We call both the individual words and the phrases they make up (like this NP) constituents of a sentence.

Syntax studies of English quickly show that there is more to these constituents than we've seen: a noun phrase, for example, might be constituted by a single word (e.g., “I”) or multiple words (like the phrase “a blog”). Moreover, constituents can easily be composed of more constituents and can even be nested! Let's add an adjective (JJ) in there and see how we could represent that new NP (both options are technically possible, which one you prefer depends on your linguistic taste): (NP (DT a) (NP (JJ new) (NN blog))) or (NP (DT a) (JJ new) (NN blog))

Now we can interpret a full sentence as a hierarchical constituency tree as well: (S (NP (PRP I)) (VP (VB write) (NP (DT a) (NN blog))))

We also notice how both “I” and “a blog” are noun phrases! Let's try to write down that knowledge. We want to say that a NP can be derived into these two phrases:

NP → … → I
NP → … → a blog

However, now we have to talk about what happens in the “…” part: a noun phrase isn't immediately constituted of “a” and “blog”; it consists of a determiner DT and a common noun NN! These then of course can be said to consist of “a” and “blog”, respectively.

If we look at all these immediate constituency relations, we can read eight so-called rules off our tree:

S → NP VP (1)
NP → PRP (2)
PRP → I (3)
VP → VB NP (4)
VB → write (5)
NP → DT NN (6)
DT → a (7)
NN → blog (8)

Note how we deconstructed the knowledge that an NP can eventually be derived to, for example, “a blog” into three rules (6, 7, and 8), each of which descends exactly one layer in the tree:

NP
⇒ (6) (NP DT NN)
⇒ (7) (NP (DT a) NN)
⇒ (8) (NP (DT a) (NN blog))

We can see that, given this set of rules, we can derive not only this tree but also one that reads “a blog write I” (which is pretty nonsensical, alright). And if we were to add a new rule

NP → NN,

we could also derive “I write blog” and “blog write I”. Try deriving these sentences yourself. You see, the more rules we add, the more flexible we are in our generation process.
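If you like code, we can sketch this little grammar in Python and enumerate everything it derives (a minimal sketch; the dict encoding and the function name are my own, not any standard API):

```python
# A toy CFG as a dict: non-terminal -> list of right-hand sides.
# Anything that isn't a key is a terminal (an English word).
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["PRP"], ["DT", "NN"]],
    "VP":  [["VB", "NP"]],
    "PRP": [["I"]],
    "VB":  [["write"]],
    "DT":  [["a"]],
    "NN":  [["blog"]],
}

def expand(symbol):
    """Return the set of strings derivable from `symbol`
    (fine here because this toy grammar has no recursion)."""
    if symbol not in RULES:
        return {symbol}                      # a terminal derives itself
    sentences = set()
    for rhs in RULES[symbol]:
        # glue the sub-results together in rule order -- all a CFG can do
        partial = [""]
        for child in rhs:
            partial = [(left + " " + part).strip()
                       for left in partial for part in expand(child)]
        sentences.update(partial)
    return sentences

print(sorted(expand("S")))
# contains "I write a blog" -- but also "a blog write I"

RULES["NP"].append(["NN"])                   # the new rule NP -> NN
# now "I write blog" and "blog write I" become derivable too
```

Adding the single rule NP → NN really does unlock both new sentences, because the generator recombines everything freely.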

The set of rules we now played with is the foundation of a context-free grammar (CFG). It is called context-free, since each little expansion (each application of a rule) doesn't depend on any other parts of the tree – you can see that the left-hand side of each rule only contains exactly one expandable thing.

We call these expandable things non-terminals and all the things we can't expand any further terminals. In this case the terminals are English words and the non-terminals are either POS tags or syntactic categories (our phrase labels like NP and VP).

Act 2: Discontinuous constituents

Now let's consider sentences like “What should I do” or “That much I believe”. What is happening here (structurally)? How would you represent that?

When we as humans want to see what's going on here, we might try to undo the operations that made these sentences weird (wh-movement and fronting, respectively) and then try to look at our phrases:

(SQ (NP (PRP I)) (MD should) (VP (VB do) (WP What)))
(S (NP (PRP I)) (VP (VB believe) (NP (DT That) (NN much))))
“I should do what” and “I believe that much” are easy to analyze with our new tools. “do what” and “believe that much” are just good ol' verb phrases (VP). Easy enough.

Sadly, in our given (seemingly scrambled) sentences these phrases are discontinuous, meaning they don't really form one coherent phrase but are “all over the place”. Drawn in our known representation, this is what our clever “analysis” would result in:

(SQ (VP (WP What) (VB do)) (MD should) (NP (PRP I)))
(S (VP (NP (DT That) (NN much)) (VB believe)) (NP (PRP I)))
Here the brackets show the constituents, but the words keep their surface order – “What should I do” and “That much I believe” – so when drawn as trees, the VP's branches cross over the rest. Disregard the questionable choice of phrase labels and look at the nasty crossing branches. [2]

Unfazed and naïve, we try to read some CFG rules off the first sentence:

SQ → VP MD NP
VP → WP VB
WP → What
VB → do
MD → should
NP → PRP
PRP → I

Hopefully you'll be raising your hand now asking what the hell this is supposed to be. If you didn't, watch these rules crash and burn when we try to derive the original tree (omitting some steps to save some space):

(SQ VP MD NP)

⇒ (SQ (VP WP VB) MD (NP PRP))

⇒ (SQ (VP (WP What) (VB do)) (MD should) (NP (PRP I)))

That's not our sentence! Of course not, how was the grammar supposed to know that it was supposed to split up the VP and place the first part (the “What”) at the beginning, but the second part (the “do”) at the end? The bad news is: context-free grammars can't do this, they can just glue everything together in the order that's specified in the rule – and that's it. No component-juggling.
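We can make this failure concrete with a tiny Python sketch (my own encoding, not any standard one): since every non-terminal here has exactly one rule, the grammar derives exactly one string – and it's the wrong one.

```python
# The rules we just read off, as a dict: non-terminal -> right-hand side.
# Every non-terminal has exactly one rule, so derivation is deterministic.
CFG = {
    "SQ":  ["VP", "MD", "NP"],
    "VP":  ["WP", "VB"],
    "WP":  ["What"],
    "VB":  ["do"],
    "MD":  ["should"],
    "NP":  ["PRP"],
    "PRP": ["I"],
}

def yield_of(symbol):
    """Derive `symbol` all the way down, gluing children in rule order --
    the only composition a CFG rule has available."""
    if symbol not in CFG:
        return symbol                        # a terminal: an English word
    return " ".join(yield_of(child) for child in CFG[symbol])

print(yield_of("SQ"))   # -> "What do should I", not "What should I do"
```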

The good news is: there are grammar formalisms more powerful than context-free grammars. Of course, more powerful formalisms with more “intelligence” are also more expensive for the computer (don't worry, it's not only hard for you), so we want to become just a little bit more powerful.[3]

Act 3: Linear context-free rewriting systems

Linear context-free rewriting systems (LCFRS; Vijay-Shanker, Weir, and Joshi, 1987) find a nice tradeoff between expressivity and “computability”. Let's not worry about the intimidating name for now (after the CFGs you saw LCFRSs only have two more letters) and jump right back into our example tree.

Since this is not supposed to be a thorough theory paper, we'll just treat LCFRSs like the grammars we've already seen. Still, make sure you sit comfortably before laying eyes on our new rules:

SQ → <x1.1 x2.1 x3.1 x1.2> (VP MD NP)
VP → <x1.1, x2.1> (WP VB)
WP → <What> ( )
VB → <do> ( )
MD → <should> ( )
NP → <x1.1> (PRP)
PRP → <I> ( )

As you can see, all in all we still have the same structure: rules, which expand our non-terminals on the left-hand side into more complex right-hand sides. Only now there is something weird going on on that right-hand side.

To understand what's happening there, let's take a step back: we need a quick recap and some new terminology. On the left of the following picture we have the parse tree (we've seen that one all the time) and on the right the derivation of the sentence from our previous CFG example, represented as a tree (which I'm just going to creatively name the derivation tree):

(S (NP (PRP I)) (VP (VB write) (NP (DT a) (NN blog))))

(S → NP VP
  (NP → PRP
    (PRP → I))
  (VP → VB NP
    (VB → write)
    (NP → DT NN
      (DT → a)
      (NN → blog))))
Where the parse tree has nodes labeled with non-terminals, the derivation tree labels them with rules – the rules that were used to expand some non-terminal at exactly this position in the parse tree. [4]

So far, we have read these trees top-down: we start with some initial non-terminal like S and expand it into non-terminals and terminals, again and again. But we can also read this derivation bottom-up! The rules at the leaves generate words of the English language, e.g., PRP generates “I”. PRP then passes the current work-in-progress sentence (“I”) upwards to the next rule, namely “NP → PRP”. This rule takes everything that was passed upwards (which in this case is just “I”) and passes it on again. The next stop is already the last one: the rule at the tree root, “S → NP VP”. We followed the left branch, represented by NP (which passed “I” upwards); meanwhile, the right branch (led by VP) hasn't been idle and has generated “write a blog”. Now, at the root node, we just concatenate everything that was passed up, and since there are no higher rules to pass this result to, “I write a blog” is the finished sentence!
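This bottom-up reading is easy to mimic in Python (a sketch with my own tree encoding): every node simply concatenates, in order, whatever its children pass up.

```python
# A CFG derivation tree as nested tuples: (label, children).
# A leaf with no children is a word of the sentence.
tree = ("S", [
    ("NP", [("PRP", [("I", [])])]),
    ("VP", [
        ("VB", [("write", [])]),
        ("NP", [("DT", [("a", [])]),
                ("NN", [("blog", [])])]),
    ]),
])

def evaluate(node):
    """Bottom-up: a leaf passes its word upwards; an inner node glues
    together, in rule order, whatever its children passed up --
    in-order concatenation is all a CFG rule can do."""
    label, children = node
    if not children:
        return label
    return " ".join(evaluate(child) for child in children)

print(evaluate(tree))   # -> "I write a blog"
```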

So we can say that CFGs just glue everything you've generated together to form the result sentence. I already not-so-subtly hinted at the fact that they can't “juggle components” – and this is exactly what the scary-looking rules of our LCFRS can do!

Let's look at the first two rules in more detail – and translate our current cryptic notation into a picture:

SQ → <x1.1 x2.1 x3.1 x1.2> (VP MD NP)
[Picture: an SQ node with one output slot, assembled from the pieces 1.1, 2.1, 3.1, 1.2; lines connect them to the two components (1.1, 1.2) of VP, the single component (2.1) of MD, and the single component (3.1) of NP.]
VP → <x1.1, x2.1> (WP VB)
[Picture: a VP node with two separate output slots, 1.1 and 1.2, connected to the single components of WP and VB.]

What all these jiggly lines mean will make more sense when we use these rules to construct the LCFRS derivation tree of our specific example sentence “What should I do”:

[Picture: the derivation tree. At the root sits the SQ rule with its composition <x1.1 x2.1 x3.1 x1.2>; below it the VP rule <x1.1, x2.1> over “WP → What” and “VB → do”, next to “MD → should” and “NP → <x1.1> (PRP)” over “PRP → I”. Lines show “What” and “do” being passed up separately through the VP and interleaved with “should” and “I” at the root.]

Now this is something we can work with! Again, let us read this derivation tree bottom-up. The leaves first generate words like “What” and “I”. Next, if we look at the rule on the far right, it just takes the “I” that the PRP has generated and passes it upwards – just like the corresponding CFG rule we've already seen.

On the left side WP and VB passed up “What” and “do”, respectively. If this were a CFG rule, we would just glue them together to get “What do” and pass it upwards, right? Here, however, we don't glue them together – we pass both of them upwards separately!

This is the first magical thing LCFRS allow us to do: instead of propagating a single thing upwards, we can pass multiple things separately!

This pays off when we finally look at what happens in the topmost rule: we get “What” and “do” (separately!) from the VP, “should” from the MD, and “I” from the NP. Now this rule again is much fancier than the stupid CFG rules: instead of gluing things together in the order they come in, we can concatenate them in any order we like – so instead of “What do should I” we create “What should I do”.

This is the second magical thing LCFRS allow us to do: instead of only using in-order concatenation, we can choose a more flexible composition function!

A “more flexible composition function” sounds a little vague, but we actually have a clear condition: it has to be linear[5] and non-deleting, meaning that everything that gets passed up from below has to be used exactly once in the construction of the result. That means that reordering is totally fine – and so is adding new terminals (remember, words of the English language).

This means we could also build a rule like this[6]: [Picture: an NP node with one output slot, built by concatenating the new terminal “the” with the single component 1.1 of NN.]

Once you feel comfortable with these pictures, let's get back to the gory, short notation. The angle brackets in our rules represent this new composition function. So if we now read the rule “VP → <x1.1, x2.1> (WP VB)” again, what can we see? First, the general structure looks like this: <?, ?>. This tells us that this rule will pass two elements upwards. We call 2 the fan-out of the non-terminal VP and of the rule.[7] Now let's look at what's behind these question marks. The first one is just x1.1, meaning that from the first child we take the first element. The second one, x2.1, tells us to simply take the second child's first element.

And “SQ → <x1.1 x2.1 x3.1 x1.2> (VP MD NP)”? Again, we first find out what it's going to return: <?>. Looks like it's just going to return one thing – and that is the concatenation x1.1 x2.1 x3.1 x1.2!

And the new “the”-rule we just thought of? How would it look in this notation? Try to figure it out for yourself. Don't continue reading until you did. Ready? Okay: “NP → <the x1.1> (NN)”.
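The whole notation translates into code quite directly. Here is a Python sketch (the tuple encoding is my own; papers use varying notations): a composition function is a list of output components, and each component mixes new terminals with x_{i.j}-style references to the children's components.

```python
# An LCFRS composition function: one list per output component.
# An entry is either a terminal string or a pair (i, j), meaning x_{i.j}:
# "the j-th component passed up by the i-th child" (1-based, as in the text).
SQ_COMP  = [[(1, 1), (2, 1), (3, 1), (1, 2)]]  # SQ -> <x1.1 x2.1 x3.1 x1.2> (VP MD NP)
VP_COMP  = [[(1, 1)], [(2, 1)]]                # VP -> <x1.1, x2.1> (WP VB)
THE_COMP = [["the", (1, 1)]]                   # NP -> <the x1.1> (NN)

def compose(components, child_values):
    """Apply a composition function to what the children passed up.
    `child_values[i-1]` is the tuple of strings child i produced;
    the result is the tuple this node passes upwards (its fan-out)."""
    result = []
    for component in components:
        parts = [item if isinstance(item, str)
                 else child_values[item[0] - 1][item[1] - 1]
                 for item in component]
        result.append(" ".join(parts))
    return tuple(result)

vp = compose(VP_COMP, [("What",), ("do",)])    # -> ("What", "do"): fan-out 2!
sq = compose(SQ_COMP, [vp, ("should",), ("I",)])
print(sq)                                      # -> ("What should I do",)
print(compose(THE_COMP, [("blog",)]))          # -> ("the blog",)
```

Note how the VP really does pass up two separate strings, and how the SQ rule then interleaves them with the other children's material – exactly the component-juggling a CFG cannot do.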

I don't know how you feel, but if you got all this and you're still with us, you can really pat yourself on the back!

Epilogue: Now what?

If this post only got you excited about LCFRSs and now you want to know more, there's good and bad news. The good news is: people have been writing about LCFRS for quite some time now, so there is lots of material out there (for example my bachelor's thesis). The bad news is: none of it seems to be as colorful and fun as I tried to make this post (especially not my bachelor's thesis) – and it often seems like everyone is using a different notation. That's why I believe it's important to understand the intuition behind this powerful formalism first.

If you ever do decide to check out said thesis, you will see that this post is pretty much just the motivation and introduction to my thesis, the real work being a clean mathematical formulation of LCFRS, their extraction from a treebank (how do we get this grammar from lots of example trees) and binarization (how do we make sure all rules only have two children to allow faster processing).

Finally, if you have a minute, please let me know you made it to the end of this post and what you think about it! Thank you for reading :)



  1. This is more of a semantic and task-oriented argument, rather than one that arises purely from syntactic observations. Check out Colin Phillips' nice introduction to Syntax here, if you're curious how to come up with the following ideas of constituency!

  2. Sadly (or luckily?) English does not really have many discontinuous phrases and the discontinuities we encountered here could also be questioned – other interpretations result in other trees. Example: Look at the two interpretations of the sentence “I called her up”:

    (S (NP (PRP I)) (VP (DcVB (VB called) (RP up)) (NP (PRP her))))

    vs.

    (S (NP (PRP I)) (VP (VB called) (NP (PRP her)) (RP up)))
    (We invent a new fictional label DcVB for our discontinuous verb “to call up”. In the first tree its two parts “called” and “up” cross over the object “her”; in the second tree nothing is discontinuous.)

  3. In the Chomsky hierarchy, the class of languages that is more expressive than context-free languages is called context-sensitive, but sadly this class is already so expressive and powerful that dealing with it is quite computationally expensive. What we are looking for is a class that is between these two: a class of mildly context-sensitive languages, specifically the one generated by the formalism we will see now (or equivalent ones).

  4. For CFGs it's very easy to transform one tree into the other. Think about it!

  5. You're right, this is why they're called linear context-free rewriting systems. We also already saw why we call them context-free: because their derivation process (visible in the derivation tree) is context-free – it's just the set of sentences that can be generated that is not context-free.

  6. Unlike all our previous examples, we don't give “the” a POS tag in this example, but hey, it's a free country.

  7. Perhaps you realized that this implies that every non-terminal is associated with exactly one fan-out, so if we also want to have a VP with fan-out 1… that's not possible. The easy solution is to use VP2 and VP1 as two separate non-terminals for these two cases.