?

Log in

CF grammar name generator - Jerith online [entries|archive|friends|userinfo]
jerith

[ website | jerith.za.net ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

CF grammar name generator [Feb. 8th, 2005|08:45 am]
jerith
[Tags|]
[Current Mood |geeky]
[Current Music |John Williams -- Olympic Fanfare]

Be warned: this post contains much geekiness and probably won't make very much sense unless you're a programmer. It's the post I wanted to make here but couldn't find the book. I've found the book and played with the code a bit, so I actually have something to say now. Since it's of interest to a very limited audience (even more limited than the one I have) it's behind a cut.

As I mentioned previously, I have been playing with formal grammars as a tool for generating random names. I chose python as an implementation language because it is particularly suited to text processing and the dictionary data type makes generative grammars so easy to implement.

First, a word on formal grammars. While I'm not going to go into detail here (that's what textbooks are for) I will give a bit of background on formal grammars in general and context-free grammars in particular. A grammar is really nothing more than a set of rules for generating sentences from symbols. These sentences could be a human language (such as english), a programming language (such as python), a musical passage (such as John Williams' Olympic Fanfare) or a mathematical statement (such as a Fourier transform). Basically, if it can be constructed according to a set of rules, a formal grammar will describe it.

The rules basically describe a transformation from one set of symbols to another. The most general grammar (a type 0 or phrase structure grammar) has no restriction on the types of rules that may be present. The grammar I am using here, however, is a type 2 or context free (CF) grammar. As the name implies, the rules may not specify any context for the transformation. This results in a left hand side that consists of only a single symbol. The right hand side, however, is unrestricted and may contain any number of symbols. An example of a CF grammar to generate a list of names could be written as follows:

<sentence> ::= <name> | <list> and <name>
<list> ::= <name>, <list> | <name>
<name> ::= Tom | Dick | Harry | Jerith

The symbols in the angle brackets are non-terminal symbols and they represent symbols that must be expanded further by another (or the same) rule. The bars represent a choice between symbols. The first rule is taken as the starting place. Usually this would be indicated by a subscript "s", but that's difficult to do neatly in HTML and my code always starts with a hardcoded starting rule.

It is not difficult to convince yourself (by applying these rules in various ways) that sentences of the form "Tom, Dick and Harry" or "Jerith, Harry, Jerith, Tom and Tom" are described by this grammar.

Now on to the juicy stuff. I wanted to write a powerful, customisable random name generator. While I am sure grammar based name generators exist out there (it's a fairly obvious use of the techniques) I hadn't come across one back when I last used a name generator. There were approximations, where you could supply a list of beginnings, middles and ends of names, but that's a bit limiting. For example, Tolkien's Elvish has a strict set of phonology rules which describe how words may be constructed to sound Elvish. A name generator following these rules would produce a large array of Elvish names. To do the same thing by providing a three sets of name-parts would produce a much smaller and less varied set.

While it would be nice to have a parser frontend and read a grammar from a separate file, for now the grammar is hardcoded as a dictionary. Of course, since python is interpreted you don't need to recompile when you change something. By using the dictionary data type and a regular expression to find non-terminal symbols (which means you can't use "<" or ">" in your name, but that's a minor issue at present) the actual code for the name generator is fairly small:
import random
import re
import sys

# Regular expression to catch non-terminals, used frequently, so global
reNonTerminal = re.compile(r"<(\w+)>")

def nameGen(nameGrammar):
	nameStr = random.choice(nameGrammar["name"])
	matchNonTerminal = reNonTerminal.search(nameStr)
	while matchNonTerminal:
		subStr = random.choice(nameGrammar[matchNonTerminal.group(1)])
		nameStr = reNonTerminal.sub(subStr, nameStr, 1)
		matchNonTerminal = reNonTerminal.search(nameStr)
	return nameStr

All you need to do now is add a grammar in a dictionary and call the function with it. Each time nameGen() is called, a new name will be generated. Now that, while perhaps not completely trivial, is hardly difficult. Three tasks remain, however. The first is constructing a grammar to use with the generator. The second, and more tedious, is checking the grammar for errors. While most possible issues with the grammar would not confuse the generator overmuch, detecting them results in a better overall grammar and hence better name generation. The third, and by far the most interesting, is an algorithm for generating a grammar from a list of names. In this way, one could simply feed the grammar generator a tall stack of Elvish names and it would provide a general grammar for coming up with fresh, new names.

One of the grammars I used while playing with this code is shown here:
fooGrammar = {
	"name": ["<nameStart><nameMiddle0to2><nameEnd>"],
	"nameMiddle0to2": ["","<nameMiddle>", "<nameMiddle><nameMiddle>"],
	"nameStart": ["<nsCons><nmVowel>", "<nsCons><nmVowel>", "<nsCons><nmVowel>", "<nsVowel>"],
	"nameMiddle": ["<nmCons><nmVowel>"],
	"nameEnd": ["<neCons><neVowel>", "<neCons>", "<neCons>"],
	"nsCons": ["J", "M", "P", "N", "Y", "D", "F"],
	"nmCons": ["l", "m", "lm", "th", "r", "s", "ss", "p", "f", "mb", "b", "lb", "d", "lf"],
	"neCons": ["r", "n", "m", "s", "y", "l", "th", "b", "lb", "f", "lf"],
	"nsVowel": ["A", "Au", "Ei"],
	"nmVowel": ["a", "e", "i", "o", "u", "au", "oa", "ei"],
	"neVowel": ["e", "i", "a", "au"]
}

This grammar clearly shows a few hacks that improve the quality of the output somewhat. The extra <nameMiddle0to2> non-terminal restricts the length of the name. This kind of restriction could be handled far better by syntax similar to that of a regular expression in a parser frontend. Also note that <nameStart> is defined as a choice between three identical <nsCons><nmVowel> combinations and a <nsVowel> symbol to control the relative frequency of vowel or consonant beginnings. This, again, could be handled in syntax or an option could be provided to allow the parser to choose from a combination of rules rather than only a single rule. Some names generated by this grammar follow:

Eilbaulf
Alfumifau
Jin
Peibaulen
Paulf
Audussuth
Maulmeilf

While a CF grammar solves the problem adequately, there remains the possibility for a much more concise definition using a type 1 or context sensitive (CS) grammar. This would allow explicit statement of rules such as "symbol A may only occur before symbols B, C or D" instead of representing "AB", "AC" and "AD" as separate symbols. While this does not necessarily improve the flexibility of the grammar (assuming a manageable number of such contexts) it can significantly improve the representation of the grammar by removing the need to hide such rules in lists of possible symbols. The disadvantage of such a method, however, is that a much more complex generator would be required.
LinkReply

Comments:
[User Picture]From: unspeakablevorn
2005-02-08 08:03 am (UTC)
Sweeeeeeet.

Vorn
(Reply) (Thread)
[User Picture]From: jerith
2005-02-08 09:02 am (UTC)
I've written some of the grammar checking code, but there's a lot of tedius stuff in there. Loop detection is particularly difficult to implement.

I'd rather mess with a CS grammar for name generation and perhaps use a CF parser for the input. :-)
(Reply) (Parent) (Thread)
[User Picture]From: tears_of_blood_
2005-02-08 12:41 pm (UTC)
*blinks* I read the first few paragraphs then gave up.
This entry was too intelligent for me:P
(Reply) (Thread)
From: (Anonymous)
2005-04-04 07:20 am (UTC)

yay!

yay jerith! you are SO smart and stuff! woo!!!
(Reply) (Thread)
[User Picture]From: arnebab
2007-07-17 01:07 pm (UTC)

fruits?

Did you continue to work on it?

I'm searching for name-generators in Python for a battle-simulator using a free roleplaying roleset as base, and if you did somewhere carry on the name-generator, it would be quite interesting for us.

If you did,. would you license it under the "GPLv2 or later"?
(Reply) (Thread)
[User Picture]From: jerith
2007-07-17 08:21 pm (UTC)

Re: fruits?

I haven't touched it since, but a more complete version is on my website at http://jerith.za.net/code/cfnamegencode.html.

I may need to update the licensing page on there, but all the code (except where otherwise noted) is available under a BSD license. This means pretty much "do what you like with it, just give me the credit when you do". If there's enough interest, I'll toss it in my svn repo and write some of the missing bits.
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-24 10:58 am (UTC)

Re: fruits?

I tested it, and it exactly suits our needs, because we can specify certain regions and names for species and similar.

It already works quite well, but it would sure be greate to have it accept language-files (language objects (or easier still: dicts) would also be fine, because we're working with a general object storing backend which stores the data as objects or dicts in human-readable yaml-files).

We're writing the battlefield simulator at http://sf.net/projects/rpg-tools-1d6

PS: Also a great Webcomic: http://xkcd.com/ - todays strip: RTFM :)
(Reply) (Parent) (Thread)
[User Picture]From: jerith
2007-07-24 12:21 pm (UTC)

Re: fruits?

Hmm, good idea. What YAML library are you using? I've got some time on my hands today, so I'll refactor it into a module and throw it in my svn repo.

On the topic of xkcd: http://syndicated.livejournal.com/xkcd_rss/profile :-)
(Reply) (Parent) (Thread)
[User Picture]From: jerith
2007-07-24 01:53 pm (UTC)

Re: fruits?

You can grab the current (and rather more usable, although still pretty dire) code from the svn repo at: http://svn.triumvirate.za.net/svn/pycfnamegen
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-24 03:57 pm (UTC)

Re: fruits?

as it looks, I can use the current code like that...
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-24 03:51 pm (UTC)

Re: fruits?

We're using PyYAML as lib, but a generalized module to extract and save data which gets called via tag-strings (i.e.: "tag:1d6.eu,2007:Ork").

Tags are simple enough to be remembered, yet reasonable unique (which means: The domain,year part makes sure that every person/entity only needs to make sure not to repeat the same name twice a year).

If you can add a structure to save the grammars in dicts and to call the generator with that grammar, I can adapt it to get the grammars from files.
(Reply) (Parent) (Thread)
[User Picture]From: jerith
2007-07-24 03:57 pm (UTC)

Re: fruits?

It's now a class that takes a grammar dict as input. You do something like the following:
import cfnamegen

fooGrammar = getGrammarDictSomehow()
foong = cfnamegen.CFNameGen(fooGrammar)
print foong.getName()
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-24 04:01 pm (UTC)

Re: fruits?

Great!

I'll check it tomorrow (I should be in sickbed today, but couldn't quite get me to be inactive the whole time :) )
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-25 08:18 am (UTC)

Re: fruits?

I just included it, and it's quite fast.

Thanks a lot!

PS: You can see what's happening at http://cia.vc/stats/project/rpg-tools-1d6
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-25 08:22 am (UTC)

Re: fruits?

Grammar-files now look like the following:

------
name:
-
nameEnd:
-
-
-
nameMiddle:
-
nameMiddle0to2:
- ''
-
-
nameStart:
-
-
-
-
neCons:
- r
- n
- m
- s
- y
- l
- th
- b
- lb
- f
- lf
neVowel:
- e
- i
- a
- au
nmCons:
- l
- m
- lm
- th
- r
- s
- ss
- p
- f
- mb
- b
- lb
- d
- lf
nmVowel:
- a
- e
- i
- o
- u
- au
- oa
- ei
nsCons:
- J
- M
- P
- N
- Y
- D
- F
nsVowel:
- A
- Au
- Ei
------

Using YAMl we could also make them look like this:
------
name: []
nameEnd: [, , ]
nameMiddle: []
nameMiddle0to2: ['', , ]
nameStart: [, , , ]
neCons: [r, n, m, s, y, l, th, b, lb, f, lf]
neVowel: [e, i, a, au]
nmCons: [l, m, lm, th, r, s, ss, p, f, mb, b, lb, d, lf]
nmVowel: [a, e, i, o, u, au, oa, ei]
nsCons: [J, M, P, N, Y, D, F]
nsVowel: [A, Au, Ei]
------

The content of both is equal when loaded with yaml, but we're using a self-written generalized framework to export and import files, and most files look best with the first setting.

maybe we'll add a switch for that later on :)
(Reply) (Parent) (Thread)
[User Picture]From: arnebab
2007-07-25 08:25 am (UTC)

Re: fruits?

sorry, livejournal cuts the tags...

should be
name:
- -nameStart--nameMiddle0to2--nameEnd-
nameEnd:
- -neCons--neVowel-
- -neCons-
- -neCons-
...

and
name: [-nameStart--nameMiddle0to2--nameEnd-]
...
etc
(Reply) (Parent) (Thread)