# How to remove ambiguity

How to simplify all kinds of ambiguous grammars.

Prerequisites: the concepts of grammar, language and parse tree.

**Note: **The transformation of the grammar is the first (optional) step for the creation of a parser of every kind: LL or LR or LALR. This will solve the ambiguity for LR and LALR but for the LL parser you’ll still need to remove epsilon production, remove left recursion and remove left factorization.

Teaching tool for all kinds of parser: jsmachines.

In this post only the transformation will be treated but if you’re writing a parser this post has some insight.

As you all know a context free grammar ( CFG ) is ambiguous if there is a string in the language that can be generated by 2 or more, parse trees. A CFG is unambiguous if every string generated has exactly one parse tree.

The parser generators are able to handle *only some* type of ambiguity, in particular the ambiguity generated by the priority and associativity of the operators. An example of the associativity problem is:

Where for this simple grammar, the string *INT + INT + INT* generate 2 parse tree.

Even if this kind of ambiguity is handled by parser generator, I’m going to show you how to create an equivalent unambiguous grammar. Equivalent means that the new grammar will generate the same language as the old one. **It’s often the case that a string in the new grammar, will have a different parse tree than the one in the old grammar.**

The transformation is composed of 3 steps:

- Find where the ambiguity is.
- Decide which parse tree is the correct one.
- Create an equivalent unambiguous grammar.

## Example left associativity

The grammar in the picture above is *Exp → Exp + Exp | INT . *We find were the ambiguity is, by looking at the parse tree and based on the left associativity rule, the right tree is the left one. This is because we want to sum the expression from left to right, as 1+2+3 = (1+2+3), and since the evaluation of the result from the tree is bottom-up, the left tree represent the calculation. What we need to do is allow the first *Exp *in the production and block the second *Exp *with the introduction of a terminal. The rewritten grammar is:

*Exp → Exp + INT — —*for enforcing the left associativity.*Exp → INT — —*for producing the string*INT .*

## Example right associativity

An other example from algebra is the right associativity of the power operator: 1^2^3 = 1^(2^3) . The basic grammar is: *Exp → Exp ^ Exp | INT .*

The correct tree is the right one. The rewritten grammar is:

*Exp → INT ^ Exp — —*for enforcing the right associativity.*Exp → INT — —*for producing the string*INT .*

## Example no associativity

If we want to write a simple inequality for the *less than* operator there is no associativity since it’s allowed only a single operator. So a string like *INT < INT < INT* is not permitted. Below is the parse tree for the string in the grammar *Exp → Exp < Exp | INT *.

They are both incorrect. Since this is a toy example we could rewrite the grammar as *Exp → INT< INT *. In a normal case we need the application of the rule only once, so we need to introduce a new non terminal; for example:

*Exp → Exp’ < Exp’**Exp’ → Exp’ + INT | INT*

## Example priority (precedence)

This is the case where some symbol have the precedence above other such as multiplication above sum. From the simple grammar: *Exp → Exp * Exp | Exp + Exp | INT . *It’s possible find a string that produces 2 trees as: *INT + INT * INT *. But we all know that the product has precedence, so the correct tree is the right one.

In this case we have to introduce a new non-terminal. It’s important to remember that the tree is evaluated bottom up, so the priority in the grammar is inverted, in the first rule there is the lowest priority.

*Exp → Exp + Term | Term**Term → Term * INT | INT*

Both the operators are left associative.

## CFG for arithmetic

The grammar for writing simple expression is a classic in textbook since it show a mix of the cases above.

*Exp → Exp * Exp | Exp / Exp | Exp + Exp | Exp - Exp |Exp ^ Exp | (Exp) | -Exp | INT*

The order precedence from the lowest to the highest is :* + -* then ** / *then* *^ then unary minus and *()*. Also the operator power has right associativity, the unary minus has no associativity, all the others have left associativity. So the unambiguous grammar is (the names are changed for readability):

*A → A+ B| A - B | B**B → B * C | B / C | C**C → D ^ C | -D | D**D → ( A ) | INT*

## CFG for boolean

*F → F ∧ F | F ∨ F | ¬ F | ( F ) | atom*

The normal precedence order is: *∨ *then *∧ *then* ¬ *then* () . *The operators (and, or) are left associative. So the unambiguous grammar is:

*A → A ∨ B| B**B → B ∧ C | C*- C
*→ ¬ D | D* - D
*→ ( A ) | atom*

## Dangling else problem

This is the problem of determining, in a series of **if **terminating with an **else**, to which **if** the **else** is referring. The ambiguous grammar is:

*stmt → if expr then stmt | if expr then stmt else stmt | other*

The other represent the various statements of the grammar that are not relevant on this issue.

A string that generates a conflict is: *if E₁ then if E₂ then S₁ else S₂ .*

Where the *Eᵢ* represent a boolean expression and the *Sᵢ *represent a generic statement of the programming language. The two parse tree generated by the string:

In all programming languages, the first parse tree is the correct one. The idea behind it, is “Match each else with the closest unmatched then” . Following this rule we want to prevent the situation of the second tree: the statement appearing between a **then **and an **else **must be “matched”, expressed in an other way, the interior statement must not end with an unmatched (open) **then**. A matched statement is either an **if-then-else** containing matched statements or it’s any other kind of unconditional statement ( the not considered statement external to the grammar).

So in the new grammar we will have matched statements and unmatched statements. To clarify the situation, it’s possible to list all the allowed productions:

*if expr then matched_stmt**if expr then open_stmt**if expr then matched_stmt else matched_stmt**if expr then matched_stmt else open_stmt*

The productions **not** allowed are:

*if expr then open_stmt else matched_stmt**if expr then open_stmt else open_stmt*

So now we can create the rules:

*matched_stmt → if expr then matched_stmt else matched_stmt | other**open_stmt → if expr then matched_stmt |if expr then open_stmt | if expr then matched_stmt else open_stmt*

And the first rule of the grammar is the one that allow us to create both kind of statements:

*stmt → open_stmt | matched_stmt*

It’s possible to simplify the *open_stmt *rule observing that it’s possible to merge the first two production so the grammar becomes:

*stmt → open_stmt | matched_stmt**matched_stmt → if expr then matched_stmt else matched_stmt | other**open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt*

That’s all, I hope this post has been useful.