Macrostructure

I use the word "macrostructure" here to refer to the way the lexicon is set up such that users can enter the lexicon and find the desired headword. I do not use it to mean a physical structure of the medium of the lexicon (although in print lexicons, the layout of the bound volume is an artifact of the method of access, as is discussed below), but instead the procedural structure of how the user goes about accessing entries.

In this section I will discuss how macrostructure works essentially differently in online lexicons as compared to print lexicons.

Macrostructure & Indexing in Print Lexicons

In print lexicons, the entries are organized according to one macrostructure. For a user to find anything in the lexicon, he must learn the rules which underlie this macrostructure.

For a user to find "patchouli", for example, in a conventional English dictionary, he must understand the English conception of alphabetical order -- e.g., he must know the order of the alphabet, he must know that the sort starts from the left edge of the word, and not the right (except in rhyming dictionaries); if he is a Spanish speaker he must unlearn the Spanish convention of treating "ch" as a letter between "c" and "d"; and so on. The user must then open the volume (assuming it's a one-volume lexicon). Recognizing that page numbering starts at the left end of the volume, the user must then use the runners at the top of each page to narrow in on the "p" section, then find the beginning of it to find the "pa" words, and so on until he finds the left-justified boldface headword "patchouli".

The skills for navigating the macrostructure of conventional English lexicons are generally learned in the elementary grades, and it is basically a simple system -- one need only know the spelling of the word, and the straightforward rules for alphabetical sorting in English.

However, consider if a user wanted to find a word based on criteria other than its exact spelling. Suppose one wanted words which rhymed with "enroll", or which referred to a shade of red, or which were reflexes of the Latin etymon "capere", or which were pronounced /si:z/, or which were seven letters long, or which ended in "-ate".

In that case, the macrostructure of the conventional English lexicon is inadequate, and the user must use a dictionary with a specially adapted macrostructure (for example, a rhyming dictionary, a crossword-puzzle dictionary, or an etymological dictionary); or the user may find this information in an index in the back of the conventionally structured dictionary.

It is a basic fact about print lexicons that they have exactly one macrostructure -- no more and no fewer. A lexicon can't be devoid of macrostructure, or else it would be an unsorted wordlist, useless as a reference work. And the only way I can conceive of a print lexicon having two (or more) macrostructures is if the lexicon simply repeated its entire contents twice, once in the first macrostructure, and then again in the second (in a different order). Presumably this would be an extravagant waste of ink and paper. If one did want to have the utility of two macrostructures in a print lexicon, one would presumably reduce one of the macrostructures to an index, i.e., by replacing the full entries with references to the entries in the other macrostructure. This "reduced" macrostructure isn't a proper macrostructure anymore, since it consists of references instead of entries; it is just an index.

Indices and macrostructures are not on an equal footing. To navigate a macrostructure to get to an entry, the user is not obliged to know anything about any indices; but to use an index, he must know how to navigate the index and then how to follow up the index's references into the macrostructure; and to do that follow-up, the user must know the macrostructure's ordering rules.

For example, consider The Pinyin Chinese-English Dictionary (Jingrong 1979) as I would use it to discover the meaning of a Mandarin word. The macrostructure is that of an alphabetical sort of the Mandarin words, as represented in the Pinyin orthography (the standard Romanization for Mandarin), and headwords are in Pinyin, followed by the Chinese ideogram. However, there is a large index which indexes entries by the graphic form of their ideogram, and gives the Pinyin spelling of each. If I want to find the meaning of a Mandarin word which I know to write in Pinyin as "yù", I go right to the main part of the lexicon and find "yù" in the macrostructure. In this macrostructure I use the rules of alphabetization which, in this lexicon, are the same as for English alphabetization. If, instead, I were going to the dictionary to find the meaning of an unfamiliar ideogram, I could not look up the ideogram directly to discover its meaning, as I could with the Pinyin spelling. Instead, I have to consult the ideogram index to find out the Pinyin spelling of the ideogram, and then look that up in the macrostructure. In either of these cases, I must know alphabetical order, since I always end up in the Pinyin macrostructure. In short, the macrostructure is like Rome: all roads, or references, lead to it.
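The two routes just described can be sketched in miniature. This is only an illustration of the procedure: the entry data below is invented for the sketch, not quoted from Jingrong (1979).

```python
# Sketch of the two lookup routes into a Pinyin-ordered lexicon:
# direct alphabetical lookup, and lookup via an ideogram index that
# resolves to a Pinyin spelling. Toy data, invented for illustration.

MACROSTRUCTURE = {"yù": "jade; precious stone"}   # Pinyin-alphabetical entries
IDEOGRAM_INDEX = {"玉": "yù"}                      # ideogram -> Pinyin spelling

def by_pinyin(spelling):
    """The direct route: straight into the macrostructure."""
    return MACROSTRUCTURE.get(spelling)

def by_ideogram(glyph):
    """The indirect route: the index yields a Pinyin spelling,
    which is then looked up in the macrostructure as usual."""
    spelling = IDEOGRAM_INDEX.get(glyph)
    return by_pinyin(spelling) if spelling else None
```

Note that by_ideogram can only succeed by calling by_pinyin -- all roads lead to the one macrostructure.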

Macrostructure in Online Lexicons

A print lexicon is a fixed, physical artifact, and the macrostructure is mapped onto the storage medium of that artifact -- i.e., the start of an English language lexicon is at the physical left end of the volume, the end is at the physical right end, and the middle is physically in between.

Online lexicons do indeed exist as physical objects; their data is encoded in a material object which exists in a specific place. But the nature of digital media has made irrelevant all details of where information is stored, or in what sequence. An online lexicon is not perceived as a physical object any more than a movie or a video game is, even though all of these are stored and accessed only through physical objects. An online lexicon, like any online resource, is perceived as data presented in whatever way the interface chooses to present it -- the implication being that the user may be able to reconfigure his interface to display the entries differently. In this way, an online lexicon is essentially dynamic, whereas a print lexicon is inherently static.

The reader may find my use of "macrostructure" unusual, since in other works on lexicographic theory (e.g., Landau 1984), it refers to the designed arrangement of entries in the physical medium of the lexicon. However, I see the physical structure as being merely an artifact of the steps the user is meant to follow in getting to entries, and I instead use "macrostructure" to refer to these steps, to this plan of action; this sense happens to imply the physical structure of the print lexicon -- but it has no such implication with online lexicons, given the lack of essential physicality, as discussed above. Since this lack of physicality does not, in my experience, disorient users or keep them from learning a given online lexicon's macrostructure, I cannot help but conclude that the physical artifacts of macrostructure in print dictionaries are not an essential design feature of lexicons in general.

So, viewing macrostructure as the procedures that the user is meant to follow in getting to the entries he wants, we arrive at the basic novelty of online macrostructure: there are as many macrostructures in a given lexicon as there are search methods that the programmers and lexicographers have provided. Dodd (1989:88), in referring to "routes" (synonymous with what I call "macrostructures"), says:

"In a truly dynamic dictionary, it should be possible to gain access to an entry by means of any of the pieces of information composing it. Potential routes are thus limited only to the frontiers of what is contained in the dictionary, combined with possible manipulations or intersections of these items of data."
This is a tall order, but it is a goal that designers of online lexicons should try to meet. At every stage of the design of the lexicon, designers should ask "Is there another way I can make this lexicon searchable? Is there another way to link to the entries?" Of course, making a lexicon searchable by "any piece of information" in entries is feasible only where that information is not merely present in entries, but is also systematically coded in a form amenable to search routines. For example, in an English dictionary, if argument structures of verbs being defined are not explicitly stated, but instead are merely demonstrated in example sentences (as is often the case in English dictionaries), then it will be difficult if not impossible to write a search routine so that users can search for verbs having particular argument structures. In that case, it would probably be simpler to edit all the verbal entries in the lexicon to have an explicit formalization of their argument structure, in a form usable by a search routine.

It has to be decided on a lexicon-by-lexicon basis what aspects of the content of entries are worth encoding so as to be searchable. But in existing print lexicons, lexicographers have shown what kinds of information they consider important enough to enshrine as an aspect of the macrostructure (e.g., the spelling of a word); or important enough to consistently declare in entries (as with part-of-speech, etymology, etc.); or important enough to compile into indices. These kinds of information are exactly the kinds of information that lexicographers of online lexicons should consider making accessible as macrostructures. To wit:

Of course, in an online lexicon with a well developed and powerful search system, one should be able to compose queries consisting of various criteria from each of the above macrostructures, such that one could, for example, search a Chinese lexicon for words belonging to the literary register of Chinese, whose glyphs contain a given graphic radical, but which do not start with "b".

I emphasize that what I call "macrostructures" and what Dodd calls "routes" (see above) need not be seen merely in terms of the process of submitting a query to the online lexicon and having it return a list of matching entries. If the coding for a given entry represents it as, say, belonging to the semantic field "kinship terms", it is probably because the lexicographer expects users to formulate queries searching for kinship terms. However, in the case of lexicons which are heavily hypertextual, e.g., Lachler, McElwain, and Burke (1995), the datum that a given entry belongs to semantic field "kinship terms" is represented, when that entry is displayed, as a hyperlink to a list of all other words belonging to that semantic field. (Or, similarly, in a Chinese dictionary, this hyperlinking can be to all other words based on a given graphic radical; or in an English lexicon, to words which are cognates of a given entry, and so on with all the macrostructures discussed above.) Strictly speaking, such hyperlinking adds nothing to the content of the lexicon; however, as an interface feature, it shows users that, regardless of which macrostructure they used to get to the entry in question, that entry is similar to other entries in various other dimensions accessible through other macrostructures. For naïve users, this provides a painless way to start exploring the various macrostructures of a given online lexicon. For more advanced users, it allows for the kind of half-structured browsing that so often leads one to stumble on the kinds of correlations that are the raw material of lexical research.
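As a sketch of how such hyperlinks might fall out of the entry coding: the entries, field names, glosses, and URL scheme below are all invented for illustration, not drawn from any of the lexicons cited above.

```python
# Sketch of surfacing coded semantic-field data as hyperlinks when
# an entry is displayed. Entries, field names, and the URL scheme
# are invented for illustration.

ENTRIES = {
    "aunt":    {"gloss": "the sister of one's father or mother", "field": "kinship terms"},
    "uncle":   {"gloss": "the brother of one's father or mother", "field": "kinship terms"},
    "crimson": {"gloss": "a deep shade of red", "field": "colour terms"},
}

def field_mates(headword, entries=ENTRIES):
    """Every other entry sharing this entry's semantic field."""
    field = entries[headword]["field"]
    return sorted(hw for hw, e in entries.items()
                  if e["field"] == field and hw != headword)

def render(headword, entries=ENTRIES):
    """Display an entry with its field coding turned into a link."""
    entry = entries[headword]
    link = '<a href="/field/%s">%s</a>' % (entry["field"], entry["field"])
    return "<b>%s</b>  %s  [%s]" % (headword, entry["gloss"], link)
```

The point is that render invents nothing: the link is generated entirely from data the lexicographer has already coded into the entry.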

Fuzzy Matching & Stemming

I anticipate that the primary macrostructure for online lexicons will continue to be variations on the general theme of headword lookup, where a user enters a search key and expects to see any headwords containing that search key.

However, significant extensions to this basic "substring match" algorithm can be made. First off, "fuzzy matching" can be incorporated into the matching algorithm. That is, instead of merely looking for headwords which exactly match the user's query, the "fuzzy match" algorithm will be able to match headwords which approximately match the user's query. This feature is now used in spellcheckers to identify misspelled words and to suggest corrections. Fuzzy matching, integrated into a lexicon's lookup routines, would be able, for example, to tell a user searching an English lexicon for an entry for "perogative" that there is no such word, but that "prerogative" is likely to be what he was after. This feature is present in the Internet webster (Unknown ?1983).
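A minimal sketch of such a fallback, using an off-the-shelf approximate string matcher from the Python standard library; the headword list and the similarity cutoff are arbitrary choices for illustration:

```python
import difflib

# A toy headword list standing in for a full English lexicon.
HEADWORDS = ["prerogative", "perquisite", "propagate", "pirouette"]

def lookup(query, headwords=HEADWORDS):
    """Try an exact match first; on failure, fall back to fuzzy
    matching and return ranked suggestions instead."""
    if query in headwords:
        return query, []
    # difflib ranks candidates by a similarity ratio; the 0.6 cutoff
    # is an arbitrary threshold for this sketch.
    return None, difflib.get_close_matches(query, headwords, n=3, cutoff=0.6)
```

Here a search for "perogative" finds no exact match, but "prerogative" scores high enough on similarity to be offered as a suggestion.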

Fuzzy matching algorithms could extend from repairing spelling mistakes native speakers make, to repairing spelling mistakes common to non-natives who are likely to be using the lexicon. For example, Sherman Wilcox has included in his Multimedia Dictionary of American Sign Language (see Wilcox et al 1994) a fuzzy matching algorithm which (among other things) corrects for kinds of misperceptions of signs that non-Signers most often make. The details of the implementation of fuzzy matching algorithms depend on the language in question, as well as the kinds of errors that potential users are likely to make.

In a similar vein, lookup routines should be able to accept orthographic variance which is not objectively incorrect. For example, a user searching a German online lexicon for the word "hoeren" should be redirected to the entry for "hören" without being accused of bad spelling. A dictionary of Arabic should be able to accept vowelled or unvowelled input; a Mongolian lexicon should accept Cyrillic or Old Script input; and so on.
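One simple way to implement this, sketched here for the German umlaut case: fold both headwords and queries to a common key, so that "hoeren" and "hören" collide on the same entry. The headword list is invented for illustration.

```python
# Folding both headwords and queries to a common ASCII key so that
# digraph spellings like "hoeren" reach the entry for "hören".

FOLD = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold(word):
    """Reduce a German word to its digraph-spelled lookup key."""
    return "".join(FOLD.get(ch, ch) for ch in word.lower())

def build_index(headwords):
    """Group headwords by their folded key."""
    index = {}
    for hw in headwords:
        index.setdefault(fold(hw), []).append(hw)
    return index

def lookup(query, index):
    """Return every headword whose folded key matches the query's."""
    return index.get(fold(query), [])
```

The same design accommodates the Arabic and Mongolian cases mentioned above: the fold function simply becomes a normalization from any accepted input form to the lexicon's canonical key.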

The second significant extension to the matching algorithm is the integration of a stemmer algorithm. "Stemmer" here refers to an algorithm which can take an occurring (declined, conjugated, etc.) form of a word and return its headword form. Dodd (1989:89-90) says:

Where such a morphological analyzer might be of most worth would be in languages with initial mutations, such as Welsh, Cornish, and Breton. These mutations lead to extreme difficulty in alphabetically ordered books in those cases where the language [i.e., in deriving noncanonical forms --SMB] respells the words affected, leaving no indication of the original form. [...] It would also be of great value in languages with considerable morphological marking and numerous irregular forms. Such complexities can lead to major problems in a normal printed dictionary, obliged to be of much greater bulk than otherwise, through including the varying forms at least as cross-references to the normal headword, with no certainty of success. Some examples from Welsh are relevant: -- all of these are mutated variants of plural nouns. [...]
In other words, stemmers ("morphological analyzers" as Dodd calls them) can solve one of the most difficult problems found in lexicography -- namely, how to make lexicons of languages where morphology is not just something that happens at the end of words. Dodd mentions the solution of listing all derived forms, but in many languages this is impractical. The alternate solution involves either choosing some form as the "canonical" headword form (Zgusta 1971:120-1), or, as in Young & Morgan (1992), using roots as headwords. Whether it's a canonical occurring form or a root which ends up being the headword in the macrostructure of a given lexicon, a potentially huge amount of linguistic and metalinguistic knowledge is required of the user.

To use the Welsh example, a user wanting to find "ddeurudd" in a Welsh dictionary must be familiar with the morphological and morphophonemic processes which have been brought to bear on "ddeurudd"; he must be familiar with the analysis of "ddeurudd" as a mutated form of one word from a paradigm of words which are all differently inflected forms of a common base; and he must know that a certain word among that paradigm, "grudd", is what the dictionary in front of him uses as a headword form. If the user does not have such metalinguistic skills (which may require knowledge of extremely complex analyses of the morphology of the language) as well as knowledge of possibly arbitrary and unintuitive decisions that were made in the organization of the given dictionary, he will be unable to find words in the dictionary, even though he may be fluent as well as literate. In the particular case of Native American languages, it's generally unrealistic to expect users to possess such metalinguistic knowledge.

But a stemmer absolves the user of having to possess such knowledge. Once a stemmer has been integrated into the lookup algorithm, the user no longer has to learn to produce citation forms; he can feed any occurring form into the search box, because the stemmer will deduce the citation form and direct him to the appropriate entry.
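To make the idea concrete, here is a toy mutation-aware lookup in the spirit of the Welsh example. The reverse-mutation table covers only a few soft-mutation correspondences, and the form-to-headword table is a tiny invented fragment -- this is a sketch of the mechanism, not a serious analysis of Welsh morphology.

```python
# A toy mutation-aware lookup: unwind possible initial mutations,
# then follow a form-to-headword table into the macrostructure.

# Mutated initial -> possible unmutated radicals (illustrative subset).
UNMUTATE = {
    "dd": ["d"], "f": ["b", "m"], "g": ["c"],
    "b": ["p"], "d": ["t"], "l": ["ll"], "r": ["rh"],
}

# Occurring (unmutated) forms mapped to their dictionary headword.
FORM_TO_HEADWORD = {"deurudd": "grudd", "gruddiau": "grudd", "grudd": "grudd"}

def candidates(form):
    """Yield the form itself, then each possible unmutated radical."""
    yield form
    for length in (2, 1):              # try "dd" before single letters
        initial = form[:length]
        for radical in UNMUTATE.get(initial, []):
            yield radical + form[length:]

def lookup(form):
    for cand in candidates(form):
        if cand in FORM_TO_HEADWORD:
            return FORM_TO_HEADWORD[cand]
    return None
```

The user types the mutated form he actually encountered; the unwinding of the mutation and the mapping to the headword both happen inside the lookup routine, not in his head.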

It may not be easy to write a stemmer for a given language. It is likely to be quite difficult for languages with complex phonologies or morphophonologies (such as Yawelmani or Mingo) or difficult writing systems (such as Hebrew or Tibetan). However difficult it may be to develop smart stemmers, it is worthwhile, since it will make the lexicons usable by (and less frustrating to) people who are not fluent with the principles of what is and isn't a canonical form for the given languages.

The implementational details of stemmer algorithms are beyond the scope of this document. Anyone wanting to develop a stemmer algorithm for a particular language would profit from a reading of Sproat (1992), and especially from looking at the algorithms used in existing stemmers for languages typologically similar to the one in mind. A word of warning is necessary, though: whereas I use "stemmer" in the electronic dictionary sense of the word, to mean an algorithm that takes an occurring form (from a user's query) as input and returns its headword form, the term is also used in much of the literature on computational morphology in a different sense, to refer to algorithms which take an occurring form and return an abstract stem. The distinction is crucial in two ways: if the headwords in a given lexicon are abstract stems, the stemmer's formalization of stems needs to agree with the formalization used by the lexicographer in composing headwords. But more importantly, if the headwords in a given lexicon are not abstract stems, the algorithm needed for a stemmer (in the electronic dictionary sense) may have little or no relation whatsoever to the one needed for a stemmer (in the computational morphology sense), and in fact may be orders of magnitude more complex.

Consider, for example, the case of Navajo verbal morphology. The two largest dictionaries of Navajo, Young & Morgan (1987) and Young & Morgan (1992), both use the same analysis of verbal morphosemantics, namely, that Navajo verbs consist of a word-final monosyllabic root, which provides the core meaning of the verb form, and a prefix complex, which modulates the meaning. Roots are slightly modified versions of an underlying form, conventionally called the "stem" (Lachler 1997). For example:

adi'ní
adi' - ní
IMPERFECTIVE - thunder.DURATIVE
"Thunder is rumbling."
(where "ní" is the durative root for the stem "nih", meaning "thunder rumbling")
Where Young & Morgan (1987) and Young & Morgan (1992) differ notably is in their macrostructural treatment of verbs. In Young & Morgan (1987), verbs are made into entries based on a canonical existing form, with that form being the headword. A user looking for a definition for "adi'ní" would simply look for a headword "adi'ní". In Young & Morgan (1992), however, verbs are arranged into entries by stem, with subsections for the different prefix complexes. A definition for "adi'ní", therefore, would be under the headword "nih", in the subsection for "adi'". (For sake of simplicity, I am ignoring Young & Morgan's further analysis of the prefix complex.)

In the case of an electronic dictionary organized like Young & Morgan (1992), a stemmer that would get the user to the appropriate verbal entry would consist merely of an algorithm to identify the last syllable of the user's query, account for minor root/stem alternation, and look for that headword. That is, in this case, a stemmer (in the computational morphology sense of the word) works fine as a stemmer (in the electronic dictionary sense of the word), since headwords and abstract stems are synonymous in Young & Morgan (1992).
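Such a stemmer might be sketched as follows. The root-to-stem table holds only the single "ní"/"nih" correspondence from the example above, and the syllable-finding is deliberately crude -- a real implementation would syllabify Navajo forms properly and cover the full inventory of roots.

```python
# A sketch of the simple stemmer described above for a lexicon
# organized like Young & Morgan (1992): isolate the final root
# syllable and map it to its underlying stem (the headword).

ROOT_TO_STEM = {"ní": "nih"}   # durative root -> underlying stem

def last_syllable(form):
    """Crudely isolate the final syllable by splitting on the
    apostrophe and hyphen; a real stemmer would syllabify properly."""
    for sep in ("'", "-"):
        form = form.replace(sep, " ")
    return form.split()[-1]

def stem_lookup(query):
    """Return the headword (stem) for an occurring verb form,
    falling back to the raw final syllable if no alternation applies."""
    root = last_syllable(query)
    return ROOT_TO_STEM.get(root, root)
```

Fed the occurring form "adi'ní", this routine lands on the headword "nih", exactly where Young & Morgan (1992) files the entry.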

However, a stemmer for an electronic dictionary organized like Young & Morgan (1987) would need to be much more complex; it would have to go from one occurring form (the one in the user's query) to another (the canonical form used as the headword). This would require that the algorithm model at least a significant subset of the morphology of Navajo prefix complexes. In fact, given the incredible complexity of precisely this aspect of Navajo morphology (consider Kari's book-length treatment (1976) of the subject), writing such an algorithm would be a major undertaking, requiring, say, a lookup table containing correspondences of tens of thousands of possible prefix complexes (in all combinations of object and subject person and number, in all tenses, etc.) to their canonical forms; or, alternately, a comparable number of lines of program code to model the morphophonology underlying the generation of these complexes.

While Navajo is quite an extreme case as far as problems in stemmer design are concerned, comparable issues are to be found in constructing stemmers for languages such as Arabic, where we also find complex nonconcatenative morphology, very different formalizations of headwords in different dictionaries (see Haywood 1965), and different formalizations of stem structure in existing stemmer algorithms. (And this is to say nothing of the special difficulties presented by the complexity and variability of the Arabic writing system.)

As daunting a task as stemmer design may be for some languages, it is precisely such languages which most need stemmers in the lookup routines for their electronic dictionaries; if it's difficult for a lexicographer-programmer to write a stemmer for such a language, then it's certainly harder still for a user (especially a metalinguistically naïve one) to use a dictionary which lacks a stemmer in its lookup routine.

Multiword Queries

Compared to the task of developing fuzzy matching routines and stemmers for single word queries, it is relatively simple to then add functionality to the lookup routine to handle multi-word lexical items, such as compounds or idioms. This solves (or obviates) a longstanding lexicographic problem: where in a dictionary should one define, for example, "North Star"? In the entry for "north"? In the entry for "star"? In an entry of its own? Whatever principled solution a particular dictionary settles on for dealing with multi-word lexical items such as "North Star", it will be arbitrary. However, in an online lexicon, the lookup routine can and should be designed so as to know the right place to look when the user runs a search on "North Star".
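A minimal sketch of how the lookup routine might know the right place: index every constituent word of a multiword headword, so that searches on "North Star", "north", and "star" all reach the relevant entries. The entries and glosses are invented for illustration.

```python
# Sketch of multiword lookup: index every constituent word of each
# headword, so a multiword lexical item is findable both as a whole
# and through any of its parts.

ENTRIES = {
    "north star": "Polaris, the star toward which the earth's axis points.",
    "north":      "the direction to the left of an observer facing the sunrise.",
    "star":       "a luminous celestial body.",
}

def build_word_index(entries):
    """Map each single word to the set of headwords containing it."""
    index = {}
    for headword in entries:
        for word in headword.split():
            index.setdefault(word, set()).add(headword)
    return index

def lookup(query, entries=ENTRIES):
    """Return exact matches plus every entry containing the query word."""
    key = query.lower()
    hits = set()
    if key in entries:                 # exact match, multiword or not
        hits.add(key)
    hits |= build_word_index(entries).get(key, set())
    return sorted(hits)
```

On this design the question of where "North Star" "lives" never arises for the user: whichever fragment he searches on, the lookup routine finds the entry.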
