Grammars and parsing

7.1. Grammars and parsing#

The syntax of a language is specified by a grammar, which is a set of grammar rules of the form

Category1 --> Category2,Category3.
Category2 --> [Terminal].

Here, CategoryX denotes a syntactic category, specifying the type of a sentence part (e.g. noun, noun phrase, etc.). The first rule states that a Category2 followed by a Category3 is a Category1. For instance, the fact that a sentence may consist of a noun phrase followed by a verb phrase is expressed by the rule

sentence --> noun_phrase,verb_phrase.

A terminal is any word which occurs in the language. The second rule above assigns a syntactic category to a word. For instance:

noun --> [bicycle].

Syntactic categories are also called non-terminals.

A grammar which specifies a tiny bit of the English language is given below. As in clausal logic, grammar rules are separated by periods.

sentence             --> noun_phrase,verb_phrase.
noun_phrase          --> proper_noun.
noun_phrase          --> article,adjective,noun.
noun_phrase          --> article,noun.
verb_phrase          --> intransitive_verb.
verb_phrase          --> transitive_verb,noun_phrase.
article              --> [the].
adjective            --> [lazy].
adjective            --> [rapid].
proper_noun          --> [achilles].
noun                 --> [turtle].
intransitive_verb    --> [sleeps].
transitive_verb      --> [beats].

/** <examples>
?- phrase(sentence,[achilles,beats,the,lazy,turtle]).
?- phrase(sentence,L),reverse(L,L).
*/

Some sentences generated by this grammar are: ‘the lazy turtle sleeps’, ‘Achilles beats the turtle’, and ‘the rapid turtle beats Achilles’. The grammatical structure of these sentences can be described by a parse tree, which is a tree containing the words of the sentence as leaves, and the syntactic categories assigned to parts of the sentence as nodes (Figure 7.1).

../../../_images/image0022.svg — Figure 7.1 Parse tree for the sentence ‘the rapid turtle beats Achilles’.#

Exercise 7.1 #

Redraw this parse tree in the manner of an SLD proof tree, where ‘resolvents’ are partially parsed sentences such as

[the],[rapid],noun,verb_phrase

and ‘clauses’ are grammar rules.

Such a parse tree can be constructed by starting with the non-terminal sentence, and repeatedly replacing non-terminals by the right-hand side of an applicable rule, until the given sentence is obtained as a sequence of terminals. This method is called top-down parsing. Alternatively, we could start with the sentence and look for parts of it which occur on the right-hand side of a rule, and replace that part of the sentence with the non-terminal on the left-hand side of the rule, until we obtain the single non-terminal sentence. This procedure is called bottom-up parsing. It should be noted that both methods require search: at any stage, several rules might be applicable.

Exercise 7.2 #

Draw the search space generated by the above grammar for a top-down parse, if grammar rules are applied to sentences from left to right. Discuss the similarities and differences with SLD-trees.

In general, grammar rules are allowed to be recursive. For instance, a noun phrase can contain several adjectives, as described by the following rules:

noun_phrase          --> article,noun_phrase2.
noun_phrase2         --> noun.
noun_phrase2         --> adjective,noun_phrase2.
article              --> [the].
adjective            --> [lazy].
adjective            --> [rapid].
noun                 --> [turtle].

/** <examples>
?- phrase(noun_phrase,[the,lazy,rapid,turtle]).
?- phrase(noun_phrase,L).
*/

This set of rules allows ‘the lazy rapid turtle’ as a noun phrase. Recursion extends the descriptive power of a grammar considerably, by allowing repetitive structures.

Grammars like the ones we have seen are called context-free grammars. This name derives from the fact that only one non-terminal is allowed on the left of a grammar rule. A grammar rule which contains several non-terminals on its left-hand side is called context-sensitive: some of those non-terminals act as a context for the others, allowing the rule to be used only when that context is present. As an example, consider a grammar which would rule out sentences like ‘the turtles sleeps’, in which the ‘plurality’ (singular, plural) of noun and verb disagree. A candidate would be:

sentence                   --> noun_phrase,plurality,verb_phrase.
noun_phrase                --> article,noun.
plurality                  --> singular.
plurality                  --> plural.
verb_phrase                --> intransitive_verb.
article                    --> [the].
noun,singular              --> [turtle],singular.
noun,plural                --> [turtles],plural.
singular,intransitive_verb --> [sleeps].
plural,intransitive_verb   --> [sleep].

In this grammar, the non-terminal plurality creates a context for the applicability of the rewrite rules for noun and intransitive verb. Procedural programming languages like Pascal are also, to some extent, context-sensitive: statements like X:=10 can only be parsed in the context created by the declaration of the variable X (if it is declared to be a Boolean, the statement is illegal). Apart from this, such programming languages are context-free: each statement can be parsed without referring to its context.

Context-sensitive grammars greatly increase the complexity of the parsing task; moreover, the grammatical structure of sentences cannot be simply described by a parse tree. In this chapter, we will restrict attention to context-free grammars, extended with some Prolog-specific features. The resulting grammars are called Definite Clause Grammars, and will be introduced in the next section.