I suspect there's a second code hidden in there. From the article, describing the code symbols that are Roman letters:
These unaccented Roman letters appeared with the frequency
you’d expect in a European language. But they don’t
represent letters—they mark the spaces between words.
It's implausible that these characters just happen to appear with a language-like frequency distribution and are all meaningless spaces. I suspect they actually have a meaning and provide a second message.
To clarify, it's like taking "SthisEisCtheRfirstEmessageT" and assuming all the capitals just indicate spaces.
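A quick sketch of that idea in Python, using the toy string above: splitting on the capitals recovers the first message, and the "separator" capitals themselves turn out to carry a second one.

```python
import re

ciphertext = "SthisEisCtheRfirstEmessageT"

# Treat every capital letter as a word separator and discard it.
words = [w for w in re.split(r"[A-Z]", ciphertext) if w]
print(" ".join(words))  # this is the first message

# But read off the "separators" in order and they spell a second message.
print("".join(c for c in ciphertext if c.isupper()))  # SECRET
```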
I wouldn't doubt it, since codes within codes were common. However, it's not clear that they're necessarily related. There were some secret codices where the additional codes acted like a watermark, telling you whose copy of the notes it was, so that if they leaked out you could go deal with the owner.
As a kid I was always making up codes and ciphers. Much more of a spy vs spy kid than a cops and robbers kid.
Unless they simply took random letters from some other text to get their spaces, in which case, given a sufficiently large text, you'd get the correct frequency distribution with no meaning at all.
My initial thought was that they were inserted as a diversion to those who thought a simple frequency analysis would break what appeared to be a simple substitution cipher. Whereas in practice, the 'expected' substitutions yield gibberish, and the actual message is encoded using poly-alphabetic substitutions to throw off simple attacks.
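A minimal sketch of what a polyalphabetic substitution looks like, using the textbook Vigenère cipher (not the Copiale's actual scheme): the same plaintext letter encrypts differently depending on its position, which defeats a naive single-letter frequency attack.

```python
def vigenere(plaintext, key):
    # Shift each letter by the corresponding (repeating) key letter, so
    # identical plaintext letters map to different ciphertext letters.
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord("a")
        out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
    return "".join(out)

# The four a's in the plaintext become l, o, e, and n in the ciphertext.
print(vigenere("attackatdawn", "lemon"))  # lxfopvefrnhr
```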
Probably depends on how you were picking them. If you were using something like Diceware to pick letters from an English document, then yeah, you would nail the frequency. But if you were picking letters at random in your head, I would be very surprised if you got anything more than the very roughest of distributions. Anything beyond "lots of e's, few q's" would surprise me.
Humans rather suck at picking random numbers. We skew towards picking numbers that seem "more random", whatever that means (which would probably work in your favour for picking random letters with an English frequency, though I'm not too sure), but we also avoid "randomly" generating streaks of numbers, because we feel those are "less random".
If you put two teams in a room, one flipping a fair coin (and writing down the results), and the other pretending to flip a coin but just faking the results, it is usually very trivial to pick out which team actually flipped the coin. They are going to have surprisingly long streaks of heads or tails.
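A small simulation of that streak effect (my own illustration, not from the article): the longest run of identical outcomes in a genuinely random sequence grows roughly like log2 of its length, so 200 fair flips will usually contain a run of six or more — far longer than most people are comfortable faking.

```python
import random

def longest_run(seq):
    # Length of the longest streak of identical consecutive elements.
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(42)  # arbitrary seed, just for repeatability
flips = [random.choice("HT") for _ in range(200)]
print(longest_run(flips))
```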
I don't have hard evidence for it, but I suspect this anti-streak tendency would be strong enough to interfere with any correct frequencies that might otherwise appear. ("Oh my, this is far too many e's in a row..")
Oh, heh, I can see it that way now. I had intended my comment to say that, since you'd be trying to reach that set of ratios to hide things, you'd probably fail miserably against any competent analysis.
A wonderful read. I know a little about frequency analysis and was surprised to see how straightforward its application was (in theory). I'm even more surprised that, after a decade of Google, this approach wasn't one of the first things tried, given the length of the text. As the OP describes, it took a chance encounter at a conference for machine learning to finally be introduced into the problem. Until that point, the linguist had been trying in vain to decipher the text... Is there still such a gap between researchers and the computational experts who know how to implement solutions?
* to put it in a less-polite way: how the F else would you solve a problem like this, with non-computational methods?
>Until that point, the linguist had been trying in vain to decipher the text
Well no, the linguist tried in vain to do frequency analysis by hand on ~88 symbols across ~100 pages for a couple of months before saying "bugger this for a game of soldiers" and getting on with her life.
"She tried a few times to catalog the symbols, in hopes of figuring out how often each one appeared. This kind of frequency analysis is one of the most basic techniques for deciphering a coded alphabet. But after 40 or 50 symbols, she’d lose track. After a few months, Schaefer put the cipher on a shelf."
Like a lot of people that played Fez, I recently did some frequency analysis by hand, to crack the alphabet in that game. It was pretty tedious, and I messed up frequently. I wouldn't blame her for giving up after a few mistakes.
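Tallying symbol frequencies is exactly the kind of bookkeeping a computer never loses track of after 40 or 50 symbols. A sketch in Python, using an English sentence as a stand-in for the Copiale glyphs:

```python
from collections import Counter

sample = (
    "this kind of frequency analysis is one of the most basic "
    "techniques for deciphering a coded alphabet"
)
counts = Counter(c for c in sample if c.isalpha())
# 'e' comes out on top, as expected for English text.
print(counts.most_common(3))
```

The same `Counter` works just as well on arbitrary glyph tokens, as long as the ciphertext has been transcribed into some machine-readable form first.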
I suppose so. The article said Schaefer didn't have an internet connection. But how else? It's hard to say. How many people really worked on this one? The article doesn't say, but it does suggest that not many people had seen or even heard of the Oculists or this piece of writing. Many other ciphers of equal or similar difficulty were cracked in the pre-computer era; the difference is how many eyes were on them.
This is an excellent article. When Wired writes a good article, it is always amazing.
I don't think there's really such a gap in general. Classicists have been using computer-assisted frequency analysis for decades now to do things such as authorship attribution, and classics is otherwise a pretty staid and conservative field. I would guess this particular researcher just wasn't familiar with computational methods.
Wow, it's not often that accomplishments from people you actually know and have had as teachers end up on the front page of HN. I was at the same talk by Kevin Knight as Schaefer, and I can vouch that it was a mighty interesting one! I actually changed my curriculum a bit (to include cryptography) as a result of his talk.
Good catch, nice read, with a computational angle.
Take a walk down some of the older lanes in London, say near Borough Market or back up towards Southwark, or the other side between Brick Lane and Petticoat Lane, and imagine yourself back in the 1700s.
Coffee houses, close groups having meetings, private rooms upstairs in narrow houses. The feeling that true knowledge was being passed on. The meaning people found in the processes of the primitive technology.
It strikes me that the boring bits of the decoding (tokenising the symbols, entering the tokens) could be farmed out using a web site hosting scans of texts. The computational resource could perhaps be spare cycles on a PC with an appropriate application. Scope for lay science of a particularly interesting kind, and the refinement of algorithms as they are applied to a larger corpus of texts.
> Eventually we turned to the last items in the Oculist trove: nine copies of a four-page document written in a mixture of old German, Latin, and the Copiale’s coded script. The message was more or less identical in every set.
I feel kind of sorry for them, that at the end of their journey they found what was essentially a Rosetta Stone for the code they were decoding.
this: "The unaccented Roman letters didn’t spell out the code. They were the spaces that separated the words of the real message, which was actually written in the glyphs and accented text." makes me think of a cyphertext within a cyphertext, something like an early form of steganography.
...maybe the symbols used as spaces are not actually random and there's another message hidden there, under another cypher, offering the writers "plausible deniability" about its existence: they could give up the key to the first level of encryption and claim that's all there is, while the really important information stayed hidden in the "space characters"...
(... now putting my tinfoil hat back in the closet :) )
Yes, unfortunately the frequency and language analysis didn't turn up anything useful, except some vague hints that the encoded language might be Asian, perhaps written by a Westerner who had travelled there.
There's a composition I like very much, by GY!BE, "She Dreamt She Was a Bulldozer, She Dreamt She Was Alone in an Empty Field".
Maybe you could do something in a similar genre?
Question (maybe a dumb one) but how does an algorithm account for symbols that might mean a series of letters? Or a symbol that stands for a different letter depending on the symbol before or after it?
In general, using n-grams[1], probably at the character level. (So, as the article mentions, the bigram "ch" is common in German, and "qu" is much more common than "qX" for any X in English)
You can analyse texts you believe to be similar (in language, period, subject, etc) to the coded message you are attempting to crack, and use that to build tables of these n-grams in various semantic units.
Of course, these are useful for much more than code-breaking, and Google have various datasets they make publicly available.
The Google books ngram viewer[2] is a fun tool to play around with, or for the more serious, you can download a corpus of ~24GB of analysed web data they've crawled (from around 1 trillion source words)[3]
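A rough sketch of building such a table at the character level — the kind of bigram counts you'd extract from a reference corpus in the target language:

```python
from collections import Counter

def char_ngrams(text, n):
    # Slide a window of width n across the text and count every n-gram.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

sample = "the quick brown fox jumps over the lazy dog"
bigrams = char_ngrams(sample.replace(" ", ""), 2)
print(bigrams.most_common(3))
```

Build one such table per candidate language, and you can score a trial decryption by how well its bigram statistics match — roughly what comparing a guess against a language model amounts to once single-letter frequencies alone aren't enough.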
One actual example of a code constructed in the manner described is the Playfair cipher[4] which was used for a time in the late 1800s, but is now thoroughly broken.