Английская Википедия:Chomsky–Schützenberger enumeration theorem

Материал из Онлайн справочника
Перейти к навигацииПерейти к поиску

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let <math>\mathbb{N}</math> denote the set of nonnegative integers. A power series over <math>\mathbb{N}</math> is an infinite series of the form

<math>f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots</math>

with coefficients <math>a_k</math> in <math>\mathbb{N}</math>. The multiplication of two formal power series <math>f</math> and <math>g</math> is defined in the expected way as the convolution of the sequences <math>a_n</math> and <math>b_n</math>:

<math>f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.</math>

In particular, we write <math>f^2 = f(x)\cdot f(x)</math>, <math>f^3 = f(x)\cdot f(x)\cdot f(x)</math>, and so on. In analogy to algebraic numbers, a power series <math>f(x)</math> is called algebraic over <math>\mathbb{Q}(x)</math>, if there exists a finite set of polynomials <math>p_0(x), p_1(x), p_2(x), \ldots , p_n(x)</math> each with rational coefficients such that

<math>p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.</math>

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If <math>L</math> is a context-free language admitting an unambiguous context-free grammar, and <math>a_k := | L \ \cap \Sigma^k |</math> is the number of words of length <math>k</math> in <math>L</math>, then <math>G(x)=\sum_{k = 0}^\infty a_k x^k</math> is a power series over <math>\mathbb{N}</math> that is algebraic over <math>\mathbb{Q}(x)</math>.

Proofs of this theorem are given by Шаблон:Harvtxt, and by Шаблон:Harvtxt.

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Шаблон:Harvtxt: the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

SM | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series Шаблон:Tmath associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write Шаблон:Tmath, Шаблон:Tmath, and Шаблон:Tmath. The equation system can be resolved after S, resulting in a single algebraic equation:

Шаблон:Tmath.

This quadratic equation has two solutions for S, one of which is the algebraic power series Шаблон:Tmath. By applying methods from complex analysis to this equation, the number <math>a_n</math> of words of length n generated by G can be estimated, as n grows large. In this case, one obtains <math>a_n \in O(2+\epsilon)^n</math> but <math>a_n \notin O(2-\epsilon)^n</math> for each <math>\epsilon>0</math>.[1]

The following example is from Шаблон:Harvtxt:<math display="block"> \left\{\begin{array} { l } { S \rightarrow X Y } \\ { T \rightarrow a T | T b T | Y c Y } \\ { Y \rightarrow Y a Y | c Y | a b T a Y Y a | X } \\ { X \rightarrow a | b | c } \end{array} \Rightarrow \left\{\begin{array}{l} s(z)=x(z) y(z) \\ t(z)=z t(z)+z t(z)^2+z y(z)^2 \\ y(z)=z y(z)^2+z y(z)+z^4 t(z) y(z)^2+x(z) \\ x(z)=3 z \end{array}\right.\right. </math>which simplifies to<math display="block"> s(z)^8-27\left(z^3-z^2\right) s(z)^5+\ldots+59049 z^{10}=0 </math>

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language <math>L_G</math> over the alphabet <math>\{a,b\}</math> consists of the words <math>a^{n_1}ba^{n_2}b\cdots a^{n_p}b</math> with <math>p\ge 1</math>, <math>n_i>0</math> for <math>i \in \{1,2,\ldots,p\}</math>, and <math>n_j \neq j</math> for some <math>j \in \{1,2,\ldots,p\}</math>.

It is comparably easy to show that the language <math>L_G</math> is context-free.Шаблон:Sfnp The harder part is to show that there does not exist an unambiguous grammar that generates <math>L_G</math>. This can be proved as follows: If <math>g_k</math> denotes the number of words of length <math>k</math> in <math>L_G</math>, then for the associated power series holds <math>G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} </math>. Using methods from complex analysis, one can prove that this function is not algebraic over <math>\mathbb{Q}(x)</math>. By the Chomsky-Schützenberger theorem, one can conclude that <math>L_G</math> does not admit an unambiguous context-free grammar.[2]

Notes

Шаблон:Reflist

References

Шаблон:Refbegin

Шаблон:Refend

Шаблон:Noam Chomsky

  1. See Шаблон:Harvtxt for a detailed exposition.
  2. See Шаблон:Harvtxt for detailed account.