jerith (jerith) wrote,
jerith
jerith

  • Mood:
  • Music:

CF grammar name generator

Be warned: this post contains much geekiness and probably won't make very much sense unless you're a programmer. It's the post I wanted to make here but couldn't find the book. I've found the book and played with the code a bit, so I actually have something to say now. Since it's of interest to a very limited audience (even more limited than the one I have) it's behind a cut.

As I mentioned previously, I have been playing with formal grammars as a tool for generating random names. I chose python as an implementation language because it is particularly suited to text processing and the dictionary data type makes generative grammars so easy to implement.

First, a word on formal grammars. While I'm not going to go into detail here (that's what textbooks are for) I will give a bit of background on formal grammars in general and context-free grammars in particular. A grammar is really nothing more than a set of rules for generating sentences from symbols. These sentences could be a human language (such as english), a programming language (such as python), a musical passage (such as John Williams' Olympic Fanfare) or a mathematical statement (such as a Fourier transform). Basically, if it can be constructed according to a set of rules, a formal grammar will describe it.

The rules basically describe a transformation from one set of symbols to another. The most general grammar (a type 0 or phrase structure grammar) has no restriction on the types of rules that may be present. The grammar I am using here, however, is a type 2 or context free (CF) grammar. As the name implies, the rules may not specify any context for the transformation. This results in a left hand side that consists of only a single symbol. The right hand side, however, is unrestricted and may contain any number of symbols. An example of a CF grammar to generate a list of names could be written as follows:

<sentence> ::= <name> | <list> and <name>
<list> ::= <name>, <list> | <name>
<name> ::= Tom | Dick | Harry | Jerith

The symbols in the angle brackets are non-terminal symbols and they represent symbols that must be expanded further by another (or the same) rule. The bars represent a choice between symbols. The first rule is taken as the starting place. Usually this would be indicated by a subscript "s", but that's difficult to do neatly in HTML and my code always starts with a hardcoded starting rule.

It is not difficult to convince yourself (by applying these rules in various ways) that sentences of the form "Tom, Dick and Harry" or "Jerith, Harry, Jerith, Tom and Tom" are described by this grammar.

Now on to the juicy stuff. I wanted to write a powerful, customisable random name generator. While I am sure grammar based name generators exist out there (it's a fairly obvious use of the techniques) I hadn't come across one back when I last used a name generator. There were approximations, where you could supply a list of beginnings, middles and ends of names, but that's a bit limiting. For example, Tolkien's Elvish has a strict set of phonology rules which describe how words may be constructed to sound Elvish. A name generator following these rules would produce a large array of Elvish names. To do the same thing by providing a three sets of name-parts would produce a much smaller and less varied set.

While it would be nice to have a parser frontend and read a grammar from a separate file, for now the grammar is hardcoded as a dictionary. Of course, since python is interpreted you don't need to recompile when you change something. By using the dictionary data type and a regular expression to find non-terminal symbols (which means you can't use "<" or ">" in your name, but that's a minor issue at present) the actual code for the name generator is fairly small:
import random
import re
import sys

# Regular expression to catch non-terminals, used frequently, so global
reNonTerminal = re.compile(r"<(\w+)>")

def nameGen(nameGrammar):
	nameStr = random.choice(nameGrammar["name"])
	matchNonTerminal = reNonTerminal.search(nameStr)
	while matchNonTerminal:
		subStr = random.choice(nameGrammar[matchNonTerminal.group(1)])
		nameStr = reNonTerminal.sub(subStr, nameStr, 1)
		matchNonTerminal = reNonTerminal.search(nameStr)
	return nameStr

All you need to do now is add a grammar in a dictionary and call the function with it. Each time nameGen() is called, a new name will be generated. Now that, while perhaps not completely trivial, is hardly difficult. Three tasks remain, however. The first is constructing a grammar to use with the generator. The second, and more tedious, is checking the grammar for errors. While most possible issues with the grammar would not confuse the generator overmuch, detecting them results in a better overall grammar and hence better name generation. The third, and by far the most interesting, is an algorithm for generating a grammar from a list of names. In this way, one could simply feed the grammar generator a tall stack of Elvish names and it would provide a general grammar for coming up with fresh, new names.

One of the grammars I used while playing with this code is shown here:
fooGrammar = {
	"name": ["<nameStart><nameMiddle0to2><nameEnd>"],
	"nameMiddle0to2": ["","<nameMiddle>", "<nameMiddle><nameMiddle>"],
	"nameStart": ["<nsCons><nmVowel>", "<nsCons><nmVowel>", "<nsCons><nmVowel>", "<nsVowel>"],
	"nameMiddle": ["<nmCons><nmVowel>"],
	"nameEnd": ["<neCons><neVowel>", "<neCons>", "<neCons>"],
	"nsCons": ["J", "M", "P", "N", "Y", "D", "F"],
	"nmCons": ["l", "m", "lm", "th", "r", "s", "ss", "p", "f", "mb", "b", "lb", "d", "lf"],
	"neCons": ["r", "n", "m", "s", "y", "l", "th", "b", "lb", "f", "lf"],
	"nsVowel": ["A", "Au", "Ei"],
	"nmVowel": ["a", "e", "i", "o", "u", "au", "oa", "ei"],
	"neVowel": ["e", "i", "a", "au"]
}

This grammar clearly shows a few hacks that improve the quality of the output somewhat. The extra <nameMiddle0to2> non-terminal restricts the length of the name. This kind of restriction could be handled far better by syntax similar to that of a regular expression in a parser frontend. Also note that <nameStart> is defined as a choice between three identical <nsCons><nmVowel> combinations and a <nsVowel> symbol to control the relative frequency of vowel or consonant beginnings. This, again, could be handled in syntax or an option could be provided to allow the parser to choose from a combination of rules rather than only a single rule. Some names generated by this grammar follow:

Eilbaulf
Alfumifau
Jin
Peibaulen
Paulf
Audussuth
Maulmeilf

While a CF grammar solves the problem adequately, there remains the possibility for a much more concise definition using a type 1 or context sensitive (CS) grammar. This would allow explicit statement of rules such as "symbol A may only occur before symbols B, C or D" instead of representing "AB", "AC" and "AD" as separate symbols. While this does not necessarily improve the flexibility of the grammar (assuming a manageable number of such contexts) it can significantly improve the representation of the grammar by removing the need to hide such rules in lists of possible symbols. The disadvantage of such a method, however, is that a much more complex generator would be required.
Tags: code
Subscribe
  • Post a new comment

    Error

    default userpic
    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 17 comments