Visualizing textual structures in the Voynich Manuscript

The Voynich Manuscript has been described as the most mysterious book in the world. It’s a book written in a unique script, illustrated with images which are a mixture of the ordinary and the bizarre. It’s never been deciphered.

vms page

This article describes what happened when I used Search Visualizer to look at the distribution of common syllables in the manuscript.

Update: there’s more detail about this work on the Hyde & Rugg website:

http://hydeandrugg.wordpress.com/

Background

The manuscript was discovered by Wilfrid Voynich in 1912. It has been an enigma ever since. When you look at the text, you swiftly start to see regularities in it, even though you can’t understand what it’s saying. For instance, the characters that look like “4o” almost always occur at the start of a word, and the characters that look like “89” usually occur at the end of a word.

voynich closeup
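Positional regularities of this kind are easy to count once the text has been transliterated into ordinary characters. The following is a minimal Python sketch, assuming a transliteration in which words are separated by spaces and the syllables of interest appear as plain substrings; the function name `position_counts` and the sample tokens are invented for illustration and are not taken from the manuscript.

```python
from collections import Counter

def position_counts(words, syllable):
    """Tally where `syllable` appears within each word of a transliterated text."""
    counts = Counter()
    for word in words:
        if word == syllable:
            # The syllable stands on its own as a complete word.
            counts["alone"] += 1
            continue
        if word.startswith(syllable):
            counts["start"] += 1
        if word.endswith(syllable):
            counts["end"] += 1
        # "middle" means an occurrence that touches neither the first
        # nor the last character of the word.
        if syllable in word[1:-1]:
            counts["middle"] += 1
    return counts

# Invented tokens, for illustration only (not real manuscript text).
sample = "4oFCC89 4oFAM 8AM 89AM OE89 4oE89".split()
print(position_counts(sample, "4o"))
print(position_counts(sample, "89"))
```

On a real transcription, the interesting figure is the ratio between the "start", "middle" and "end" tallies for each syllable, which depends on the transliteration system used.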

When researchers tried to decipher the manuscript, the regularities in the text became increasingly puzzling.

One obvious assumption was that the manuscript was simply written in an unidentified language, whether an invented language or a natural one like Basque or Georgian. That was plausible; there are plenty of cases where someone has invented a new script for a language. However, the regularities in “Voynichese” were very different from the regularities in every other known language. For instance, Voynichese is much more repetitive than any known language; there are numerous cases of the same very common word appearing two or three times in a row. That can occasionally happen in real languages, but not on anything like the scale that occurs in Voynichese. Conversely, Voynichese doesn’t have the regular word patterns that do occur in real languages: for instance, when you see the words “on top” in English, you normally see the word “of” occurring after them. Similar patterns occur in all other known languages, but they don’t occur in Voynichese. If it’s an unknown language, it’s very different from anything known.

voynich repetitive text
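One simple way to put a number on this repetitiveness is to count how often a word is immediately followed by itself, and how long the longest such run is. The sketch below is an illustrative assumption about how that might be measured, not a description of any published analysis; `count_repeats` and the sample tokens are made up.

```python
def count_repeats(words):
    """Return (number of adjacent identical pairs, length of the longest run)."""
    pairs = 0
    longest = 1 if words else 0
    run = 1
    for prev, cur in zip(words, words[1:]):
        if cur == prev:
            pairs += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 1  # the run is broken, so start counting again
    return pairs, longest

# Invented tokens, for illustration only.
sample = "8AM 8AM 8AM OE 4oFAM 4oFAM".split()
pairs, longest = count_repeats(sample)
print(pairs, longest)
```

Running the same function over a natural-language text of similar length gives a baseline for comparison.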

Another obvious assumption was that the manuscript contained only meaningless gibberish, perhaps produced as a deliberate hoax, perhaps produced via speaking in tongues, or maybe as an art project. The problem with this theory is that there are a lot of regularities in some features of Voynichese. Although Voynichese doesn’t have many regularities at the level of which words occur together, it has a lot of regularities about which syllables occur together. For example, Voynichese words typically have a three-syllable structure, of prefix, root and suffix. There are regularities about which syllables occur as prefixes, or as roots, or as suffixes; for instance the syllable “4o” almost always occurs as a prefix, whereas the syllable “89” usually occurs as a suffix, occasionally occurs as a prefix, but hardly ever occurs as a root. There are also statistical regularities in the lengths of Voynichese words, which form a statistical pattern (a binomial distribution) that occurs in some real languages. It’s hard to see how anyone could have produced so many regularities as a hoax or an art project at the time when the manuscript appears to have been produced (somewhere between the 1420s when the vellum for the manuscript was produced, and 1588, when it is first documented).
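As a rough illustration of what a binomial pattern in word lengths means, the sketch below fits a binomial distribution to a list of word lengths and sets the observed proportions beside the expected ones. The choice of `n` (the largest length the model allows), the fitting of `p` from the mean, and the sample lengths are all assumptions made here; the text above does not say which parameters were used in the published analyses.

```python
from collections import Counter
from math import comb

def binomial_fit(lengths, n):
    """Compare observed word-length proportions with a Binomial(n, p) model.

    p is estimated from the mean length (mean = n * p). Lengths above n are
    outside the model, so n should be at least the longest word length.
    """
    total = len(lengths)
    p = sum(lengths) / (total * n)
    observed = Counter(lengths)
    rows = []
    for k in range(n + 1):
        expected = comb(n, k) * p**k * (1 - p) ** (n - k)
        rows.append((k, observed.get(k, 0) / total, expected))
    return p, rows

# Invented word lengths, for illustration only.
lengths = [2, 3, 3, 4, 4, 4, 5, 5, 6]
p, rows = binomial_fit(lengths, n=8)
for k, obs, exp in rows:
    print(f"length {k}: observed {obs:.3f}, binomial {exp:.3f}")
```

A close match between the two columns on real data would be consistent with the binomial pattern described above; a formal goodness-of-fit test would be needed to say more.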

By a process of exclusion, that left a code as the explanation that seemed most likely. The trouble is that there were deep problems with that explanation as well. Some of the world’s greatest codebreakers tried to crack the manuscript, without success. That’s a significant absence of success. By modern standards, codes from the fifteenth and sixteenth centuries are easy to crack; a good modern codebreaker could expect to crack a code from that period within a few days at most. But after ninety years of attempts by great modern codebreakers, the Voynich Manuscript was still uncracked.

One possible explanation was that it was an extremely unusual code. That would have major implications, since the Internet and e-commerce depend on safe codes for security, and since the best modern codes are now nearing the end of their shelf life. The security industry is very, very keen to discover a new type of code, and there was the tantalising possibility that the Voynich Manuscript contained a radically different type of code that some unknown genius had invented centuries ago. That’s possible, but hard to reconcile with the way that the manuscript contains a lot of features that would normally make a code very easy to crack. It contains identifiable words, with an identifiable syllable structure; those are features that code makers normally avoid at all costs, because those features usually offer an easy way in for codebreakers.

There’s another odd feature of Voynichese which is hard to reconcile with a code. It’s long been recognised that Voynichese consists of two “dialects” known as Voynich A and Voynich B, with different syllable frequencies. That might be because of a switch between two different coding systems, but a problem is that the difference between the two dialects isn’t clear-cut; various researchers have argued for there being “flavours” of the two dialects, or for some sections of the manuscript showing features of both dialects. That doesn’t fit well with the idea of switching between two coding systems. It might fit with a switch between several coding systems, but there are problems with that explanation as well. We’ll return to this topic later.

Another possible explanation was that the codebreakers had failed because the manuscript simply didn’t contain a code. However, there appeared to be major problems with the other possible explanations for the manuscript, as outlined above. Most researchers concluded that although there were problems with the code theory, it looked like the least implausible, so most research focused on that approach.

The situation changed when I published an article in 2004 showing that meaningless gibberish text very similar to Voynichese could be easily produced using old, low-tech methods. That made the hoax theory much more plausible than it had been before. My method involved using very large tables of gibberish syllables, which were combined into words using a piece of card (a “grille”) with three holes cut into it to choose the syllables for each gibberish word. The tables were structured in a way that produced words consisting of prefix, root and suffix, like Voynichese. The user moved the card across the table semi-systematically (to break up any regularities in the output text) and wrote down the gibberish words that appeared in the three holes in the card. This method produces gibberish as fast as the user can write it down. Using different cards, each with a different set of positions for the holes, makes it easy to produce numerous different combinations of syllables from the same table.
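The table and grille procedure described above can be simulated in a few lines. What follows is a simplified Python sketch, not a reconstruction of the original tables: it assumes each table cell holds a prefix, a root and a suffix, that the grille is a set of three (row, column) offsets, and that the card is moved one to three cells at a time. The syllable lists, table size and offsets are all invented for illustration.

```python
import random

# Hypothetical syllable inventories; an empty string means "no prefix".
PREFIXES = ["4o", "o", ""]
ROOTS = ["FC", "FA", "D"]
SUFFIXES = ["89", "AM", "OE"]

def make_table(rows, cols, rng):
    """Build a table whose cells each hold a (prefix, root, suffix) triple."""
    return [[(rng.choice(PREFIXES), rng.choice(ROOTS), rng.choice(SUFFIXES))
             for _ in range(cols)] for _ in range(rows)]

def read_word(table, row, col, grille):
    """Read one word through a three-hole grille placed at (row, col).

    `grille` lists a (row_offset, col_offset) pair for the prefix hole,
    the root hole and the suffix hole, in that order.
    """
    n_rows, n_cols = len(table), len(table[0])
    parts = []
    for part_index, (d_row, d_col) in enumerate(grille):
        cell = table[(row + d_row) % n_rows][(col + d_col) % n_cols]
        parts.append(cell[part_index])
    return "".join(parts)

def generate(table, grille, n_words, rng):
    """Move the grille semi-systematically and record the word it shows."""
    words, row, col = [], 0, 0
    for _ in range(n_words):
        words.append(read_word(table, row, col, grille))
        col += rng.choice([1, 2, 3])  # irregular steps break up regularities
        if col >= len(table[0]):
            col %= len(table[0])
            row = (row + 1) % len(table)
    return words

rng = random.Random(0)
table = make_table(rows=5, cols=12, rng=rng)
words = generate(table, grille=[(0, 0), (1, 1), (0, 2)], n_words=20, rng=rng)
print(" ".join(words))
```

Swapping in a different `grille` reuses the same table to produce a different stream of words, which mirrors the point about using several cards with one table.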

When I started using this approach to produce large quantities of gibberish text, I noticed that some of the odd regularities in Voynichese could be simply explained as accidental side-effects of the tables and grilles. For instance, it’s easy to lose your place on the table, and to misalign the card, so that the “word” it shows begins with a suffix rather than a prefix; if you don’t notice the error in time, you’ll start writing a word that begins with a suffix. That would make sense of why some common suffixes in Voynichese also sometimes appear as prefixes, but not vice-versa.

One feature of the table and grille method is that even if you use different combinations of holes in the grille to produce different combinations of syllables from the same table, you eventually run out of new combinations, and have to create a new table so that you don’t start repeating text that you’ve created previously. I estimated that creating the Voynich Manuscript would require about half a dozen tables.

I found that it’s horribly easy to make mistakes when creating a new table, so that the new table has very different frequencies for some syllables from the old table. For instance, you may be planning to have 75 instances of a particular syllable in your new table, so that the syllable is as common there as it was in the old table. Suppose, though, that you lose your place in your list of syllables, and only notice that you’ve overlooked that syllable when there are only 32 free spaces left in the new table. If that happens, you either have to start the new table again from scratch, or settle for that syllable only appearing half as often as you originally planned. Starting again from scratch isn’t a very appealing option, and there’s also a good chance that you’ll make a similar error when you’re doing the started-from-scratch version, so a hoaxer would be strongly tempted just to keep going with the table with half the intended instances of that syllable, rather than starting again.

Showing syllable frequencies and distributions is something that can be done easily with Search Visualizer. I tried this on the Voynich Manuscript and some texts from real languages as a comparison. The results are as follows.

English syllable distributions

The image below shows common syllables in a Shakespeare play, Macbeth. This is a single-author text written for an audience.

Each column shows the frequency of a different syllable throughout the play. The syllables are, from left to right, de er ed ing. Within each column, the distribution is homogeneous; there is no distinct banding that would reflect one syllable being unusually common or rare in a given section.

macbeth common syllables
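The idea behind a column of this kind can be approximated without any special tool. The sketch below is an assumption about one way to do it, and is not a description of how Search Visualizer works internally: it splits a text into consecutive chunks, measures the share of words in each chunk that contain the syllable, and maps each share to a text "shade". The function names and sample data are invented.

```python
def syllable_profile(words, syllable, n_bins):
    """Share of words containing `syllable` in each of `n_bins` consecutive chunks."""
    profile = []
    for b in range(n_bins):
        start = b * len(words) // n_bins
        end = (b + 1) * len(words) // n_bins
        chunk = words[start:end]
        hits = sum(syllable in word for word in chunk)
        profile.append(hits / len(chunk) if chunk else 0.0)
    return profile

def render(profile, shades=" .:*#"):
    """Map each value to a character; darker characters mean higher frequency."""
    top = max(profile) or 1.0  # avoid dividing by zero for an all-zero profile
    return [shades[round(v / top * (len(shades) - 1))] for v in profile]

# A homogeneous toy text: the syllable is spread evenly, so no banding appears.
even = ["her", "cat"] * 10
print(syllable_profile(even, "er", 5))

# A banded toy text: the syllable is common in the first half, absent in the second.
banded = ["aer"] * 10 + ["b"] * 10
print(render(syllable_profile(banded, "er", 2)))
```

Printing the rendered characters one per line gives a crude vertical column in which abrupt changes of shade correspond to banding.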

The next example is from a text deliberately chosen to be very different. It’s one of the official war records from the American Civil War. It consists of collated documents written by multiple authors about many events.

The column on the left shows the distribution of the syllable er and the column on the right shows the distribution of de. Again, the distribution within each column is homogeneous from beginning to end of the column, with no banding.

acw er de

German syllable distributions

As a comparison, here is an illustration of the distribution of the same two common syllables in a German book, with a set of English footnotes at the end (visible as a band of slightly lighter intensity).

This shows the same pattern of homogeneity within a real language, and also shows the variation between two related languages for the same syllable. The German and English sections are similar to each other in their frequency distributions, but they’re also different enough from each other for the change in languages to be clearly visible.

german er de

Voynich syllable distributions

When I looked at the distributions of common syllables in the Voynich Manuscript, the pattern was very different from the pattern within a single natural language.

common voynichese syllables

The image below shows the distribution of four common syllables across the whole manuscript. The column on the left shows the distribution of “8AM”. The remaining columns show the distributions of 4o, 89 and OE respectively.

The syllable 8AM is slightly different from the others. It typically occurs on its own, rather than embedded in a larger word; it’s one of the commonest short words, and at first glance looks as if it might be a particle, like the word “the” or “a” in English, or der in German or le in French. However, it quite often occurs twice in a row, and sometimes three times in a row, which is very different from how particles behave in real languages. I’ve included it because it’s a common single-syllable word, so that we can compare its distribution with the distributions of syllables that normally occur within longer Voynichese words.

The syllable 4o almost always appears at the start of a word. The syllable 89 usually occurs at the end of a word, but can appear at the start of a word, or on its own; the syllable OE can occur at the start of a word, at the end of a word, or on its own.

common syllables in VMS no lines

Unlike the examples from real languages, these images show banding within each column, where a syllable changes abruptly in frequency. These frequency changes appear to occur at the same place for different syllables. The image below shows horizontal lines drawn where these changes appear to occur.

common voynich syllables with lines

The first frequency shift occurs about a fifth of the way down the image, in the two rightmost columns, both of which show a syllable that often occurs as a suffix. 89 suddenly becomes more common, and OE becomes less common.

The second frequency shift occurs about halfway down the image. There’s a narrow band where all of the syllables become less common.

In the next band, 8AM becomes even less common than in the narrow band. 4o suddenly becomes very common, as does 89. OE becomes about as common as it was in the first band, but with a more even distribution than in the first band, where it showed a lot of clumping.

There is then another narrow band, where 8AM becomes slightly more common, 4o and 89 become less common, and OE becomes clumpy again.

In the final band, 8AM begins sparsely, and then becomes moderately common. 4o and 89 each become more common than in the preceding narrow band, but not as common as in the broad band above that. OE becomes more evenly distributed than in the preceding narrow band, but not as common as in the broad band above that.
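The band boundaries described above were identified by eye. For readers who want a rough automated check, the sketch below flags places where the average frequency of a syllable jumps between one window of chunks and the next. It is a deliberately simple assumption, not a rigorous change-point method; `find_shifts`, the window size, the threshold and the sample values are all invented for illustration.

```python
def find_shifts(profile, window, threshold):
    """Return indices where the mean of the next `window` values differs from
    the mean of the previous `window` values by more than `threshold`.

    `profile` is a list of per-chunk frequencies for one syllable,
    in reading order.
    """
    shifts = []
    for i in range(window, len(profile) - window + 1):
        before = sum(profile[i - window:i]) / window
        after = sum(profile[i:i + window]) / window
        if abs(after - before) > threshold:
            shifts.append(i)
    return shifts

# Invented profile: a rare syllable that abruptly becomes common halfway through.
profile = [0.1] * 6 + [0.4] * 6
print(find_shifts(profile, window=3, threshold=0.25))

# A flat profile, as in the natural-language examples, produces no shifts.
print(find_shifts([0.2] * 12, window=3, threshold=0.05))
```

Running this separately for each syllable and comparing the flagged indices is one way to test whether the shifts really do fall at the same places.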

Discussion

So what’s going on in the Voynich Manuscript?

The syllables 89 and OE both occur commonly as a suffix, and are in effect competing for the same space, so we’d expect that if one becomes more common, then the other will become less common. That’s just what we see. However, that doesn’t explain why one should suddenly become more common. Nor does it explain why there should be several bands each showing a different frequency for the same syllable, or why some bands show more internal clumping than others, or why the frequencies of the prefix 4o and the word 8AM should change at the same point as the suffixes change frequency. Several bands also show strong hints of sub-bands within a bigger band; the third band, for 89, has particularly noticeable sub-banding.
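The "competing for the same space" expectation can be expressed as a negative correlation between the per-chunk frequencies of the two suffixes. The sketch below computes a plain Pearson correlation; the two frequency lists are invented numbers chosen to show the idea, not measurements from the manuscript.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (nan if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-chunk frequencies for two suffixes that share one slot:
# whenever one rises, the other falls by the same amount.
freq_89 = [0.1, 0.3, 0.3, 0.2]
freq_OE = [0.3, 0.1, 0.1, 0.2]
r = pearson(freq_89, freq_OE)
print(round(r, 3))
```

A strongly negative value for two suffixes would fit the competition explanation, but it would say nothing about why a prefix such as 4o should shift at the same boundaries.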

This pattern is completely different from what happened in the real languages shown above. It’s completely inconsistent with the theory that Voynichese is a single unidentified language, or with the theory that Voynichese consists of two dialects of a single unidentified language.

If we’re looking at dialects, then there are at least six of them, and some appear to be more different from each other than English is from German, at least on the preliminary results from my work so far (I looked at other German texts, and saw the same distribution patterns as in the book example above).

If we’re looking at a coded text, then there appear to be at least half a dozen different versions of the code, or at least half a dozen different codes producing similar but not identical types of text.

Finally, we could be looking at what the table and grille theory predicts, namely about half a dozen different tables being used to produce text, with each table producing its own flavour of text. My guess is that there were two original “master” tables, each used by a different person, which were then both copied a couple of times. This would fit with the consensus in the field that there are at least two different handwriting styles in the manuscript. I think that the two “master” tables were different from each other because of the copying problems described above; that would produce the two main “dialects” of Voynichese. I also think that each master table was copied a couple of times, fairly accurately but not completely accurately, which would produce the “flavours” of the two “dialects”.

My view is that the table and grille model is the simplest explanation for the evidence above; it predicts just this sort of change in frequency. With the other theories, in contrast, the frequency changes are another complication which requires a further set of explanations.

The question is still far from settled, but I think that the evidence is increasingly consistent with the hoax explanation, and inconsistent with the unidentified language explanation and the code explanation. To use a historical misquote: Much can be said for and against this view, and doubtless will be…

Notes, references and links

There’s an article by Batya Ungar-Sargon about the broader context of this story in The Tablet:

http://www.tabletmag.com/

René Zandbergen’s website gives an excellent overview of the Voynich Manuscript: http://www.voynich.nu/

The Beinecke library website Voynich Manuscript section is at: http://beinecke.library.yale.edu/digitallibrary/voynich.html

My Cryptologia paper on the Voynich Manuscript:

Rugg, G. (2004). ‘An elegant hoax? A possible solution to the Voynich manuscript’, Cryptologia, XXVIII (1), January 2004, pp 31-46.

My Scientific American paper about the Voynich Manuscript:

Rugg, G. (2004). ‘The mystery of the Voynich manuscript’. Scientific American, July 2004, pp. 104-109. It’s available online:

http://www.scientificamerican.com/article.cfm?id=the-mystery-of-the-voynic-2004-07

Joe D’Agnese’s article about the Voynich Manuscript in Wired:

D’Agnese, J. (2004). ‘Scientific Method Man: Gordon Rugg cracked the 400-year-old mystery of The Voynich Manuscript’. Wired, September 2004.

http://www.wired.com/wired/archive/12.09/rugg.html

Andreas Schinner’s Cryptologia paper about the Voynich Manuscript:

Schinner, A. (2007). ‘The Voynich manuscript: Evidence of the hoax hypothesis’. Cryptologia 31 (2), pp 95-107.

There’s a lot more about this story in my book with Joe D’Agnese:

Blind Spot: Why we fail to see the solution right in front of us

published by HarperOne, available on Amazon:

http://www.amazon.com/Blind-Spot-Solution-Right-Front/dp/0062097903

The Search Visualizer is available for online use, free, at:

www.searchvisualizer.com

Notes on VMS transcription systems

There are several systems for transliterating Voynichese into Roman script. They encounter numerous problems from the nature of Voynichese. For instance, several common characters in Voynichese are combinations of characters which also occur on their own. Some transliteration systems use a single Roman symbol to represent a Voynichese compound character composed of two single characters joined together; other transliteration systems use two or more Roman characters to represent a Voynichese compound character.

For the visualisations above, I used an old transcription edited by Jim Gillogly and others. It probably contains some errors, since it was produced from old low-quality images of the manuscript, and it contains some commentary in English. I edited out most of that commentary, but left some in because the comments would be useful if any odd features emerged during the visualisation. The common Voynichese syllables that I chose do not appear in the English commentary.

From eyeballing the transcript, I’d guesstimate that the errors and English commentary should together compose less than 2% of the text. Given that the visualisations are of common Voynichese syllables that do not appear in the commentary, any confounding effects from the errors and commentary should be swamped by the volume of true positives.

I don’t know the copyright situation for that transcription, which is why I haven’t put it up on the SV website.

If anyone has a modern, high-quality transcription on their website, and would be happy for readers to search it with Search Visualizer, then I’ll be happy to pass that information on.

Notes on pagination in the manuscript

Because of the way that the pages in the manuscript were bound together, in traditional bookbinding style, there will probably be some cases where the sequence of pages in the manuscript won’t reflect the order in which the pages were written. That’s why I haven’t made a big issue of the cases of internal banding within the main frequency bands above – those cases may be due to pagination effects, or may just be a statistical fluke.

About searchvisualizer

We welcome debate and disagreement, but not abuse, trolling or thread derailment. We reserve the time-honoured right of blog owners and moderators to be arbitrary, capricious and autocratic in our wielding of the ban hammer. Gordon Rugg is a former timberyard worker, archaeologist and English lecturer who ended up in computer science via psychology. He’s the same Gordon Rugg who did the Voynich Manuscript work, and the books with Marian Petre about research. He’s co-inventor of the Search Visualizer.

22 Responses to Visualizing textual structures in the Voynich Manuscript

  5. R.K. says:

    Though I’m still scratching my head over your conclusions, your work was fascinating enough that I had to start hammering out code to duplicate your results.

    After comparing my graphical results, my belief is that your tests are very subjective. You’re comparing apples to oranges. First, comparing the Voynich to Macbeth or a Civil War history is no comparison. Macbeth is a play and a Civil War history deals with one subject. The Voynich has very obvious sections and deals with different subjects using very different sentence structure in each. While the herbal section is in paragraphs, the astrology section is mostly single words and short sentences. I would almost expect some banding to occur.

    Secondly, eyeballing a transcription is not exactly a scientific method for determining margin of error. All of the transcriptions are wrong, some much worse than others. My comparison so far has only been on Voynich suffixes and yes, I found similar banding. But there’s an obvious difference in banding even between the transcriptions of Takahashi and Friedman. One reason for this is that one transcriber may see a letter as an o where another sees an a. With Macbeth and a history book, there is no such discrepancy.

    Finally, your banding model doesn’t work in every case. After repeating your tests and finding banding in vastly different areas and with very different patterns it would appear that you first created a hypothesis and then presented only the data that fits.

    Again, fascinating work and creative use of a data modeling tool, but as you said, the questions are far from settled.

    • Thanks for the comment; you’re right about words being likely to form bands, but I was looking at common syllables that don’t occur as words in their own right. There’s no inherent reason why they should form bands, and in the natural language texts I looked at, they didn’t form bands. With Voynichese, by contrast, there were very striking bands. That’s exactly what you’d expect from text produced using a succession of tables, but it’s hard to explain as being from a natural language.

      Gordon

  6. B.R. says:

    My first thought, having studied botany, is that apples should be compared to apples, so to speak. I’m interested in how, say, the botanical section would compare to a modern botanical such as the diagnostic Jepson Manual. In other words, let’s assume the Voynich is exactly what it appears to be. In such a diagnostic, words would often be repeated and patterns would change as the book progresses through the different families of plants. I would expect the same patterns in all of nature, i.e. a book about insects. The same would be said for the astrological section, etc. Just thoughts….I really enjoyed this article, thank you.

    • Exactly. There’s a significant absence of the regular patterns that you describe, which implies that if the text is meaningful, then its content isn’t related to the images (which would be odd), or that the text isn’t meaningful. (There is the possibility that the text is a highly sophisticated code that has somehow managed to hide those regularities, but I find that hard to believe.)

  7. P.A. says:

    I’m a bit surprised the topic of schizophrenia or unusual psychologies doesn’t come up here. The relation between the text and any possibility that it is purely invented – the difficulty of inventing it at such a sophisticated level… the fact that it is extraordinarily complex *yet* not quite like ‘real’ (familiar) languages suggests the possibility of an extraordinary mind in an extraordinary state. It could then be gibberish to the world, non-gibberish from the point of view of that mind in that state – isn’t this the most probable explanation?

    • The main problem with this idea is the statistics, in particular the binomial distribution of word lengths. It’s easy to produce that distribution as an unintended side-effect of a hoax, but I find it hard to imagine that happening in glossolalia, or in schizophrenic speech. I’ve posted about this on the Hyde & Rugg blog site (hydeandrugg.wordpress.com).

  8. Nora Wertz says:

    Thank you for the fascinating analysis. It does seem likely that the manuscript is indeed a hoax. Why would anyone go to such trouble to create such an elaborate fake? Mental illness or religious rapture seem to be the circumstances most likely to produce such a labor intensive and expensive product. My entirely unscientific desire is that it will one day be decoded and provide insight and delight at the workings of the mind of its creator. With that sentiment as a disclaimer, is it possible that inaccuracies in the transcription created at least some of the banding so brilliantly illustrated in this article?

  9. Oscar V says:

    Loved your article…. when I first saw the images of the manuscript, my first impression was that I was seeing some kind of mantras and/or hymns

  10. R. Phillips says:

    My impression is that the style of it is poetic in nature, for lack of a better word. It feels like the writer is trying to be clever and entertaining in their manner of writing, despite the possible factual nature of the subjects. If so, assonance, alliteration, and rhyming as we know them in English may be present here.
    Just a thought.

    • It’s very repetitive, and there are features that resemble assonance, but it’s not in any poetic form that anyone has identified. There are paragraph breaks that look just like ordinary prose, and the lines don’t show end-rhyme or alliteration or metre.

      There are also a lot of odd statistical regularities, such as some letters being very rare at the start of a line, but very common in the middle of it. The “straightforward unidentified language” theory soon falls apart when you get into the details.

      There’s more about this on the Hyde & Rugg website:
      http://hydeandrugg.wordpress.com/

  12. Norwegian Blue says:

    A fascinating and very detailed examination of the statistical properties of the text.

    I was almost tempted to try some similar type of analysis myself. But having read your analysis, and seen your graphs of the banding properties of the text, I think I’ll skip, and rather save myself the time and energy. 🙂

    I think you have me convinced… It’s a hoax.

    The question is why someone would create a work like this. My guess is that it has to do with the high price of books around year 1500, and the high status associated with owning one. Probably someone saw an opportunity to earn some money by creating a fake book, and selling it at a high price…

    The simplicity of most of the drawings in the book also seems to suggest that it was kind of done in a hurry.

    Though, I do kind of like some of the imaginative illustrations of bathing women, in the mid section of the book. 🙂

  13. Rose S says:

    My opinion is that it’s the Genetic Code of Life. Not all of it. Just the important parts of it that reference the images. Pretty cool. If you were to overlay all the graphs it seems that it’s mainly 4 parts. Just like our code, it only consists of 4 letters (A,T,C,G).

    • Anonymous says:

      The whole genetic code is important. Much of it has been labeled “junk DNA” by the ignorant and arrogant simply because it is not understood.
