I suspect there's a second code hidden in there. From the article, describing the code symbols that are Roman letters:
These unaccented Roman letters appeared with the frequency
you’d expect in a European language. But they don’t
represent letters—they mark the spaces between words.
It's implausible that these characters just happen to appear with a language-like frequency distribution and are all meaningless spaces. I suspect they actually have a meaning and provide a second message.
To clarify, it's like taking "SthisEisCtheRfirstEmessageT" and assuming all the capitals just indicate spaces.
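A quick sketch of that idea in Python, using the toy string above: splitting on the capitals recovers the first message, and the "separator" capitals themselves turn out to carry a second one.

```python
import re

ciphertext = "SthisEisCtheRfirstEmessageT"

# Treat every capital letter as a word separator and discard it.
words = [w for w in re.split(r"[A-Z]", ciphertext) if w]
print(" ".join(words))  # this is the first message

# But read off the "separators" in order and they spell a second message.
print("".join(c for c in ciphertext if c.isupper()))  # SECRET
```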
I wouldn't doubt it, since codes within codes were common. However, it's not clear that they're necessarily related. There were some secret codices where the additional codes acted like a watermark, telling you whose copy of the notes it was, so that if they leaked out you could go deal with the owner.
As a kid I was always making up codes and ciphers. Much more of a spy vs spy kid than a cops and robbers kid.
Unless they simply took random letters from some other text to get their spaces, in which case, given a sufficiently large text, you'd get the correct frequency distribution with no meaning at all.
My initial thought was that they were inserted as a diversion to those who thought a simple frequency analysis would break what appeared to be a simple substitution cipher. Whereas in practice, the 'expected' substitutions yield gibberish, and the actual message is encoded using poly-alphabetic substitutions to throw off simple attacks.
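A minimal sketch of what a polyalphabetic substitution looks like, using the textbook Vigenère cipher (not the Copiale's actual scheme): the same plaintext letter encrypts differently depending on its position, which defeats a naive single-letter frequency attack.

```python
def vigenere(plaintext, key):
    # Shift each letter by the corresponding (repeating) key letter, so
    # identical plaintext letters map to different ciphertext letters.
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord("a")
        out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
    return "".join(out)

# The four a's in the plaintext become l, o, e, and n in the ciphertext.
print(vigenere("attackatdawn", "lemon"))  # lxfopvefrnhr
```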
Probably depends on how you were picking them. If you were using something like Diceware to pick letters from an English document, then yeah, you would nail the frequency. But if you were picking letters at random in your head, I would be very surprised if you got anything more than the very roughest of distributions. Anything beyond "lots of e's, few q's" would surprise me.
Humans rather suck at picking random numbers. We skew towards picking numbers that seem "more random", whatever that means (which would probably work in your favour for picking random letters with an English frequency, though I'm not too sure), but we also avoid "randomly" generating streaks of numbers, because we feel those are "less random".
If you put two teams in a room, one flipping a fair coin (and writing down the results), and the other pretending to flip a coin but just faking the results, it is usually very trivial to pick out which team actually flipped the coin. They are going to have surprisingly long streaks of heads or tails.
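A small simulation of that streak effect (my own illustration, not from the article): the longest run of identical outcomes in a genuinely random sequence grows roughly like log2 of its length, so 200 fair flips will usually contain a run of six or more — far longer than most people are comfortable faking.

```python
import random

def longest_run(seq):
    # Length of the longest streak of identical consecutive elements.
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(42)  # arbitrary seed, just for repeatability
flips = [random.choice("HT") for _ in range(200)]
print(longest_run(flips))
```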
I don't have hard evidence for it, but I suspect this anti-streak tendency would be strong enough to interfere with any correct frequencies that might otherwise appear. ("Oh my, this is far too many e's in a row..")
Oh, heh, I can see it that way now. I had intended my comment to say that, since you'd be trying to reach that set of ratios to hide things, you'd probably fail miserably against any competent analysis.
A wonderful read. I know a little about frequency analysis and was surprised to see how straightforward its application was (in theory). I'm even more surprised that, after a decade of Google, this approach wasn't one of the first things tried, given the length of the text. As the OP describes, it took a chance encounter at a conference for machine learning to finally be introduced into the problem. Until that point, the linguist had been trying in vain to decipher the text... Is there still such a gap between researchers and the computational experts who know how to implement solutions?
* to put it in a less-polite way: how the F else would you solve a problem like this, with non-computational methods?
>Until that point, the linguist had been trying in vain to decipher the text
Well no, the linguist tried in vain to do frequency analysis by hand on ~88 symbols across ~100 pages for a couple of months before saying "bugger this for a game of soldiers" and getting on with her life.
"She tried a few times to catalog the symbols, in hopes of figuring out how often each one appeared. This kind of frequency analysis is one of the most basic techniques for deciphering a coded alphabet. But after 40 or 50 symbols, she’d lose track. After a few months, Schaefer put the cipher on a shelf."
Like a lot of people that played Fez, I recently did some frequency analysis by hand, to crack the alphabet in that game. It was pretty tedious, and I messed up frequently. I wouldn't blame her for giving up after a few mistakes.
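Tallying symbol frequencies is exactly the kind of bookkeeping a computer never loses track of after 40 or 50 symbols. A sketch in Python, using an English sentence as a stand-in for the Copiale glyphs:

```python
from collections import Counter

sample = (
    "this kind of frequency analysis is one of the most basic "
    "techniques for deciphering a coded alphabet"
)
counts = Counter(c for c in sample if c.isalpha())
# 'e' comes out on top, as expected for English text.
print(counts.most_common(3))
```

The same `Counter` works just as well on arbitrary glyph tokens, as long as the ciphertext has been transcribed into some machine-readable form first.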
I suppose so. The article said Schaefer didn't have an internet connection. But how else? It's hard to say. How many people really worked on this one? The article doesn't say, but it does suggest that not many people had seen or even heard of the Oculists or this piece of writing. Many other ciphers of equal or similar difficulty were cracked in the pre-computer era; the difference is how many eyes were on them.
This is an excellent article. When Wired writes a good article, it is always amazing.
I don't think there's really such a gap in general. Classicists have been using computer-assisted frequency analysis for decades now to do things such as authorship attribution, and classics is otherwise a pretty staid and conservative field. I would guess this particular researcher just wasn't familiar with computational methods.
Wow, it's not often that accomplishments from people you actually know and have had as teachers end up on the front page of HN. I was at the same talk by Kevin Knight as Schaefer, and I can vouch that it was a mighty interesting one! I actually changed my curriculum a bit (to include cryptography) as a result of his talk.
Good catch, nice read, with a computational angle.
Take a walk down some of the older lanes in London, say near Borough Market or back up towards Southwark, or the other side between Brick Lane and Petticoat Lane, and imagine yourself back in the 1700s.
Coffee houses, close groups having meetings, private rooms upstairs in narrow houses. The feeling that true knowledge was being passed on. The meaning people found in the processes of the primitive technology.
It strikes me that the boring bits of the decoding (tokenising the symbols, entering the tokens) could be farmed out using a web site hosting scans of texts. The computational resource could perhaps be spare cycles on a PC with an appropriate application. Scope for lay science of a particularly interesting kind, and the refinement of algorithms as they are applied to a larger corpus of texts.
> Eventually we turned to the last items in the Oculist trove: nine copies of a four-page document written in a mixture of old German, Latin, and the Copiale’s coded script. The message was more or less identical in every set.
I feel kind of sorry for them, that at the end of their journey they found what was essentially a Rosetta Stone for the code they were decoding.
this: "The unaccented Roman letters didn’t spell out the code. They were the spaces that separated the words of the real message, which was actually written in the glyphs and accented text." makes me think of a cyphertext within a cyphertext, something like an early form of steganography.
...maybe the symbols used as spaces are not actually random and there's another message hidden there, under another cypher, offering the writers "plausible deniability" about its existence: they could give up the key to the first level of encryption and claim that's all there is, while the really important information stayed hidden in the "space characters"...
(... now putting my tinfoil hat back in the closet :) )
Yes, unfortunately the frequency and language analysis didn't turn up anything useful, except some vague hints that the encoded language might be Asian, perhaps written by a Westerner who had travelled there.
There's a composition I like very much, by GY!BE, "She Dreamt She Was a Bulldozer, She Dreamt She Was Alone in an Empty Field".
Maybe you could do something in a similar genre?
Question (maybe a dumb one) but how does an algorithm account for symbols that might mean a series of letters? Or a symbol that stands for a different letter depending on the symbol before or after it?
In general, using n-grams[1], probably at the character level. (So, as the article mentions, the bigram "ch" is common in German, and "qu" is much more common than "qX" for any X in English)
You can analyse texts you believe to be similar (in language, period, subject, etc) to the coded message you are attempting to crack, and use that to build tables of these n-grams in various semantic units.
Of course, these are useful for much more than code-breaking, and Google have various datasets they make publicly available.
The Google books ngram viewer[2] is a fun tool to play around with, or for the more serious, you can download a corpus of ~24GB of analysed web data they've crawled (from around 1 trillion source words)[3]
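A rough sketch of building such a table at the character level — the kind of bigram counts you'd extract from a reference corpus in the target language:

```python
from collections import Counter

def char_ngrams(text, n):
    # Slide a window of width n across the text and count every n-gram.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

sample = "the quick brown fox jumps over the lazy dog"
bigrams = char_ngrams(sample.replace(" ", ""), 2)
print(bigrams.most_common(3))
```

Build one such table per candidate language, and you can score a trial decryption by how well its bigram statistics match — roughly what comparing a guess against a language model amounts to once single-letter frequencies alone aren't enough.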
One actual example of a code constructed in the manner described is the Playfair cipher[4] which was used for a time in the late 1800s, but is now thoroughly broken.