The Voynich Manuscript has been described as the most mysterious book in the world. It’s a book written in a unique script, illustrated with images which are a mixture of the ordinary and the bizarre. It’s never been deciphered.
This article describes what happened when I used Search Visualizer to look at the distribution of common syllables in the manuscript.
Update: there’s more detail about this work on the Hyde & Rugg website:
The manuscript was discovered by Wilfrid Voynich in 1912. It has been an enigma ever since. When you look at the text, you swiftly start to see regularities in it, even though you can’t understand what it’s saying. For instance, the characters that look like “4o” almost always occur at the start of a word, and the characters that look like “89” usually occur at the end of a word.
When researchers tried to decipher the manuscript, the regularities in the text became increasingly puzzling.
One obvious assumption was that the manuscript was simply written in an unidentified language, whether an invented language or a natural one like Basque or Georgian. That was plausible; there are plenty of cases where someone has invented a new script for a language. However, the regularities in “Voynichese” were very different from the regularities in every other known language. For instance, Voynichese is much more repetitive than any known language; there are numerous cases of the same very common word appearing two or three times in a row. That can occasionally happen in real languages, but not on anything like the scale that occurs in Voynichese. Conversely, Voynichese doesn’t have the regular word patterns that do occur in real languages: for instance, when you see the words “on top” in English, you normally see the word “of” occurring after them. Similar patterns occur in all other known languages, but they don’t occur in Voynichese. If it’s an unknown language, it’s very different from anything known.
Another obvious assumption was that the manuscript contained only meaningless gibberish, perhaps produced as a deliberate hoax, perhaps produced via speaking in tongues, or maybe as an art project. The problem with this theory is that there are a lot of regularities in some features of Voynichese. Although Voynichese doesn’t have many regularities at the level of which words occur together, it has a lot of regularities about which syllables occur together. For example, Voynichese words typically have a three-syllable structure, of prefix, root and suffix. There are regularities about which syllables occur as prefixes, or as roots, or as suffixes; for instance the syllable “4o” almost always occurs as a prefix, whereas the syllable “89” usually occurs as a suffix, occasionally occurs as a prefix, but hardly ever occurs as a root. There are also statistical regularities in the lengths of Voynichese words, which form a statistical pattern (a binomial distribution) that occurs in some real languages. It’s hard to see how anyone could have produced so many regularities as a hoax or an art project at the time when the manuscript appears to have been produced (somewhere between the 1420s when the vellum for the manuscript was produced, and 1588, when it is first documented).
By a process of exclusion, that left a code as the explanation that seemed most likely. The trouble is that there were deep problems with that explanation as well. Some of the world’s greatest codebreakers tried to crack the manuscript, without success. That’s a significant absence of success. By modern standards, codes from the fifteenth and sixteenth centuries are easy to crack; a good modern codebreaker could expect to crack a code from that period within a few days at most. But after ninety years of attempts by great modern codebreakers, the Voynich manuscript was still uncracked.
One possible explanation was that it was an extremely unusual code. That would have major implications, since the Internet and e-commerce depend on safe codes for security, and since the best modern codes are now nearing the end of their shelf life. The security industry is very, very keen to discover a new type of code, and there was the tantalising possibility that the Voynich Manuscript contained a radically different type of code that some unknown genius had invented centuries ago. That’s possible, but hard to reconcile with the way that the manuscript contains a lot of features that would normally make a code very easy to crack. It contains identifiable words, with an identifiable syllable structure; those are features that code makers normally avoid at all costs, because those features usually offer an easy way in for codebreakers.
There’s another odd feature of Voynichese which is hard to reconcile with a code. It’s long been recognised that Voynichese consists of two “dialects” known as Voynich A and Voynich B, with different syllable frequencies. That might be because of a switch between two different coding systems, but a problem is that the difference between the two dialects isn’t clear-cut; various researchers have argued for there being “flavours” of the two dialects, or for some sections of the manuscript showing features of both dialects. That doesn’t fit well with the idea of switching between two coding systems. It might fit with a switch between several coding systems, but there are problems with that explanation as well. We’ll return to this topic later.
Another possible explanation was that the codebreakers had failed because the manuscript simply didn’t contain a code. However, there appeared to be major problems with the other possible explanations for the manuscript, as outlined above. Most researchers concluded that although there were problems with the code theory, it looked like the least implausible, so most research focused on that approach.
The situation changed when I published an article in 2004 showing that meaningless gibberish text very similar to Voynichese could be easily produced using old, low-tech methods. That made the hoax theory much more plausible than it had been before. My method involved using very large tables of gibberish syllables, which were combined into words using a piece of card (a “grille”) with three holes cut into it to choose the syllables for each gibberish word. The tables were structured in a way that produced words consisting of prefix, root and suffix, like Voynichese. The user moved the card across the table semi-systematically (to break up any regularities in the output text) and wrote down the gibberish words that appeared in the three holes in the card. This method produces gibberish as fast as the user can write it down. Using different cards, each with a different set of positions for the holes, makes it easy to produce numerous different combinations of syllables from the same table.
When I started using this approach to produce large quantities of gibberish text, I noticed that some of the odd regularities in Voynichese could be simply explained as accidental side-effects of the tables and grilles. For instance, it’s easy to lose your place on the table, and to misalign the card, so that the “word” it shows begins with a suffix rather than a prefix; if you don’t notice the error in time, you’ll start writing a word that begins with a suffix. That would make sense of why some common suffixes in Voynichese also sometimes appear as prefixes, but not vice-versa.
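The table and grille procedure described above can be sketched in code. This is only a rough illustration under stated assumptions, not a reconstruction of the tables or grilles from the 2004 paper: the syllable lists, the table size, the layout (columns cycling prefix, root, suffix), the blank cells and all function names are invented for the example, and the semi-systematic movement of the card is approximated by a seeded random skip.

```python
import random

# Hypothetical syllable inventories; "" stands for a blank cell
# (an assumption made for illustration, allowing shorter words).
PREFIXES = ["4o", "o", "8", ""]
ROOTS = ["ke", "te", "cc", "da"]
SUFFIXES = ["89", "oe", "am", ""]
SLOTS = [PREFIXES, ROOTS, SUFFIXES]


def build_table(rows, words_per_row, rng):
    """Build a table whose columns cycle prefix, root, suffix."""
    return [
        [rng.choice(SLOTS[col % 3]) for col in range(3 * words_per_row)]
        for _ in range(rows)
    ]


def read_word(table, row, col, grille):
    """Read one word through a three-hole card placed at (row, col).

    `grille` holds one row offset per hole; the holes sit over three
    adjacent columns, so a card aligned on a prefix column yields
    prefix + root + suffix.
    """
    n_rows, n_cols = len(table), len(table[0])
    return "".join(
        table[(row + offset) % n_rows][(col + i) % n_cols]
        for i, offset in enumerate(grille)
    )


def generate(table, grille, n_words, rng):
    """Sweep the card across the table, occasionally skipping a position."""
    n_rows, n_cols = len(table), len(table[0])
    row = col = 0
    words = []
    for _ in range(n_words):
        words.append(read_word(table, row, col, grille))
        col += 3 * rng.choice([1, 1, 2])  # stay aligned on prefix columns
        if col >= n_cols:
            row, col = (row + 1) % n_rows, 0
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
table = build_table(rows=10, words_per_row=12, rng=rng)
text_a = generate(table, grille=(0, 1, 0), n_words=20, rng=rng)
text_b = generate(table, grille=(2, 0, 1), n_words=20, rng=rng)
print(" ".join(text_a))
print(" ".join(text_b))

# A misaligned card: starting on a suffix column (col % 3 == 2) reads
# suffix + prefix + root, so the "word" begins with a suffix.
misaligned = read_word(table, 0, 2, (0, 1, 0))
```

The two grilles read different combinations of syllables from the same table, and the misaligned read at the end mimics the slip described above, where a word starts with a syllable that should have been a suffix.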
One feature of the table and grille method is that even if you use different combinations of holes in the grille to produce different combinations of syllables from the same table, you eventually run out of new combinations, and have to create a new table so that you don’t start repeating text that you’ve created previously. I estimated that creating the Voynich Manuscript would require about half a dozen tables.
I found that it’s horribly easy to make mistakes when creating a new table, so that the new table has very different frequencies for some syllables from the old table. For instance, you may be planning to have 75 instances of a particular syllable in your new table, so that the syllable is as common there as it was in the old table. Suppose, though, that you lose your place in your list of syllables, and only notice that you’ve overlooked that syllable when there are only 32 free spaces left in the new table. If that happens, you either have to start the new table again from scratch, or settle for that syllable only appearing half as often as you originally planned. Starting again from scratch isn’t a very appealing option, and there’s also a good chance that you’ll make a similar error when you’re doing the started-from-scratch version, so a hoaxer would be strongly tempted just to keep going with the table with half the intended instances of that syllable, rather than starting again.
Showing syllable frequencies and distributions is something that can be done easily with Search Visualizer. I tried this on the Voynich Manuscript and some texts from real languages as a comparison. The results are as follows.
English syllable distributions
The image below shows common syllables in a Shakespeare play, Macbeth. This is a single-author text written for an audience.
Each column shows the frequency of a different syllable throughout the play. The syllables are, from left to right, de, er, ed and ing. Within each column, the distribution is homogeneous; there is no distinct banding that would reflect one syllable being unusually common or rare in a given section.
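For readers who want a rough numerical counterpart to a column like these, the sketch below counts how often a syllable starts in each of a fixed number of equal slices of a text. It is not how Search Visualizer works internally, which I'm not describing here; the function name, the number of slices and the toy text are all assumptions for illustration.

```python
def syllable_profile(text, syllable, n_bins=20):
    """Count occurrences of `syllable` starting in each of n_bins equal slices.

    Slices are by character position, so the list reads from the start
    of the text to the end, like one column read from top to bottom.
    """
    if not syllable or n_bins <= 0:
        raise ValueError("syllable must be non-empty and n_bins positive")
    counts = [0] * n_bins
    start = text.find(syllable)
    while start != -1:
        counts[start * n_bins // len(text)] += 1
        start = text.find(syllable, start + 1)
    return counts


# Toy text (made up): the first part uses one set of "words", the
# second part another, so each syllable is confined to one region.
toy_text = "4oke89 " * 50 + "oeam " * 50
profile_4o = syllable_profile(toy_text, "4o", n_bins=6)
profile_oe = syllable_profile(toy_text, "oe", n_bins=6)
print(profile_4o)
print(profile_oe)
```

A homogeneous distribution shows up as roughly equal counts in every slice; banding shows up as a run of slices with clearly higher or lower counts than their neighbours.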
The next example is from a text deliberately chosen to be very different. It’s one of the official war records from the American Civil War. It consists of collated documents written by multiple authors about many events.
The column on the left shows the distribution of the syllable er and the column on the right shows the distribution of de. Again, the distribution within each column is homogeneous from beginning to end of the column, with no banding.
German syllable distributions
As a comparison, here is an illustration of the distribution of the same two common syllables in a German book, with a set of English footnotes at the end (visible as a band of slightly lighter intensity).
This shows the same pattern of homogeneity within a real language, and also shows the variation between two related languages for the same syllable. The German and English sections are similar to each other in their frequency distributions, but they’re also different enough from each other for the change in languages to be clearly visible.
Voynich syllable distributions
When I looked at the distributions of common syllables in the Voynich Manuscript, the pattern was very different from the pattern within a single natural language.
The image below shows the distribution of four common syllables across the whole manuscript. The column on the left shows the distribution of “8AM”. The remaining columns show the distributions of 4o, 89 and OE respectively.
The syllable 8AM is slightly different from the others. It typically occurs on its own, rather than embedded in a larger word; it’s one of the commonest short words, and at first glance looks as if it might be a particle, like the word “the” or “a” in English, or der in German or le in French. However, it quite often occurs twice in a row, and sometimes three times in a row, which is very different from how particles behave in real languages. I’ve included it because it’s a common single-syllable word, so that we can compare its distribution with the distributions of syllables that normally occur within longer Voynichese words.
The syllable 4o almost always appears at the start of a word. The syllable 89 usually occurs at the end of a word, but can appear at the start of a word, or on its own; the syllable OE can occur at the start of a word, at the end of a word, or on its own.
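Positional habits like these can be tallied from a transcription with a few lines of code. The sketch below is a minimal illustration: the function name and the sample words are invented, and a real run would depend on the transcription system used.

```python
from collections import Counter


def position_counts(words, syllable):
    """Tally where `syllable` occurs: alone, at the start, at the end, or inside."""
    counts = Counter()
    for word in words:
        if word == syllable:
            counts["alone"] += 1
            continue
        if word.startswith(syllable):
            counts["start"] += 1
        if word.endswith(syllable):
            counts["end"] += 1
        if syllable in word[1:-1]:
            # Occurrence that touches neither the first nor the last character.
            counts["inside"] += 1
    return counts


# Made-up sample words, not taken from the manuscript.
sample = "4oke89 4occ89 89 OE 8AM 8AM 4odaOE 89ke".split()
counts_89 = position_counts(sample, "89")
counts_4o = position_counts(sample, "4o")
print(dict(counts_89))
print(dict(counts_4o))
```

Run over a full transcription, a tally like this would put numbers on statements such as "almost always at the start of a word".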
Unlike the examples from real languages, these images show banding within each column, where a syllable changes abruptly in frequency. These frequency changes appear to occur at the same place for different syllables. The image below shows horizontal lines drawn where these changes appear to occur.
The first frequency shift occurs about a fifth of the way down the image, in the two rightmost columns, both of which show a syllable that often occurs as a suffix. 89 suddenly becomes more common, and OE becomes less common.
The second frequency shift occurs about halfway down the image. There’s a narrow band where all of the syllables become less common.
In the next band, 8AM becomes even less common than in the narrow band. 4o suddenly becomes very common, as does 89. OE becomes about as common as it was in the first band, but with a more even distribution than in the first band, where it showed a lot of clumping.
There is then another narrow band, where 8AM becomes slightly more common, 4o and 89 become less common, and OE becomes clumpy again.
In the final band, 8AM begins sparsely, and then becomes moderately common. 4o and 89 each become more common than in the preceding narrow band, but not as common as in the broad band above that. OE becomes more evenly distributed than in the preceding narrow band, but not as common as in the broad band above that.
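I identified the bands above by eye. A crude way to cross-check that kind of judgement would be to compare the average count of a syllable just before and just after each point in the text, and flag the points where the two differ sharply. The sketch below does that; it is not the method I used, and the window size, the threshold ratio and the example counts are arbitrary assumptions.

```python
def flag_shifts(counts, window=3, ratio=2.0):
    """Return indices where the mean count changes by at least `ratio`.

    For each index i, the mean of the `window` values before i is
    compared with the mean of the `window` values from i onwards.
    """
    flagged = []
    for i in range(window, len(counts) - window + 1):
        before = sum(counts[i - window:i]) / window
        after = sum(counts[i:i + window]) / window
        low, high = sorted((before, after))
        # Treat very small means as 1 so near-empty slices don't dominate.
        if high >= ratio * max(low, 1.0):
            flagged.append(i)
    return flagged


# Made-up per-slice counts: low, then a high band, then low again.
example_counts = [10, 11, 9, 10, 30, 29, 31, 30, 12, 10, 11]
shifts = flag_shifts(example_counts)
print(shifts)
```

Neighbouring flagged indices usually point to the same shift, so they are best read as one boundary. Applying the same function to several syllables and seeing whether the flagged positions line up would be one way of testing whether the frequency changes really occur at the same places.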
So what’s going on in the Voynich Manuscript?
The syllables 89 and OE both occur commonly as a suffix, and are in effect competing for the same space, so we’d expect that if one becomes more common, then the other will become less common. That’s just what we see. However, that doesn’t explain why one should suddenly become more common. Nor does it explain why there should be several bands each showing a different frequency for the same syllable, or why some bands show more internal clumping than others, or why the frequencies of the prefix 4o and the word 8AM should change at the same point as the suffixes change frequency. Several bands also show strong hints of sub-bands within a bigger band – the third band, for 89, has particularly noticeable sub-banding.
This pattern is completely different from what happened in the real languages shown above. It’s completely inconsistent with the theory that Voynichese is a single unidentified language, or with the theory that Voynichese consists of two dialects of a single unidentified language.
If we’re looking at dialects, then there are at least six of them, and some appear to be more different from each other than English is from German, at least on the preliminary results from my work so far (I looked at other German texts, and saw the same distribution patterns as in the book example above).
If we’re looking at a coded text, then there appear to be at least half a dozen different versions of the code, or at least half a dozen different codes producing similar but not identical types of text.
Finally, we could be looking at what the table and grille theory predicts, namely about half a dozen different tables being used to produce text, with each table producing its own flavour of text. My guess is that there were two original “master” tables, each used by a different person, which were then both copied a couple of times. This would fit with the consensus in the field that there are at least two different handwriting styles in the manuscript. I think that the two “master” tables were different from each other because of the copying problems described above; that would produce the two main “dialects” of Voynichese. I also think that each master table was copied a couple of times, fairly accurately but not completely accurately, which would produce the “flavours” of the two “dialects”.
My view is that the table and grille model is the simplest explanation for the evidence above; it predicts just this sort of change in frequency. With the other theories, in contrast, the frequency changes are another complication which requires a further set of explanations.
The question is still far from settled, but I think that the evidence is increasingly consistent with the hoax explanation, and inconsistent with the unidentified language explanation and the code explanation. To use a historical misquote: Much can be said for and against this view, and doubtless will be…
Notes, references and links
There’s an article by Batya Ungar-Sargon about the broader context of this story in The Tablet:
René Zandbergen’s website gives an excellent overview of the Voynich Manuscript: http://www.voynich.nu/
The Beinecke library website Voynich Manuscript section is at: http://beinecke.library.yale.edu/digitallibrary/voynich.html
My Cryptologia paper on the Voynich Manuscript:
Rugg, G. (2004). ‘An elegant hoax? A possible solution to the Voynich manuscript’, Cryptologia, XXVIII (1), January 2004, pp 31-46.
My Scientific American paper about the Voynich Manuscript:
Rugg, G. (2004). ‘The mystery of the Voynich manuscript’. Scientific American, July 2004, pp. 104-109. It’s available online:
Joe D’Agnese’s article about the Voynich Manuscript in Wired: ‘Scientific Method Man: Gordon Rugg cracked the 400-year-old mystery of The Voynich Manuscript’, by Joseph D’Agnese [co-author], WIRED Magazine, September 2004.
Andreas Schinner’s Cryptologia paper about the Voynich Manuscript:
Schinner, A. (2007). ‘The Voynich manuscript: Evidence of the hoax hypothesis’. Cryptologia, 31 (2), pp 95-107.
There’s a lot more about this story in my book with Joe D’Agnese:
Blind Spot: Why we fail to see the solution right in front of us
published by HarperOne, available on Amazon:
The Search Visualizer is available for online use, free, at:
Notes on VMS transcription systems
There are several systems for transliterating Voynichese into Roman script. They encounter numerous problems from the nature of Voynichese. For instance, several common characters in Voynichese are combinations of characters which also occur on their own. Some transliteration systems use a single Roman symbol to represent a Voynichese compound character composed of two single characters joined together; other transliteration systems use two or more Roman characters to represent a Voynichese compound character.
For the visualisations above, I used an old transcription edited by Jim Gillogly and others. It probably contains some errors, since it was produced from old low-quality images of the manuscript, and it contains some commentary in English. I edited out most of that commentary, but left some in because the comments would be useful if any odd features emerged during the visualisation. The common Voynichese syllables that I chose do not appear in the English commentary.
From eyeballing the transcript, I’d guesstimate that the errors and English commentary should together compose less than 2% of the text. Given that the visualisations are of common Voynichese syllables that do not appear in the commentary, any confounding effects from the errors and commentary should be swamped by the volume of true positives.
I don’t know the copyright situation for that transcription, which is why I haven’t put it up on the SV website.
If anyone has a modern, high-quality transcription on their website, and would be happy for readers to search it with Search Visualizer, then I’ll be happy to pass that information on.
Notes on pagination in the manuscript
Because of the way that the pages in the manuscript were bound together, in traditional bookbinding style, there will probably be some cases where the sequence of pages in the manuscript won’t reflect the order in which the pages were written. That’s why I haven’t made a big issue of the cases of internal banding within the main frequency bands above – those cases may be due to pagination effects, or may just be a statistical fluke.