That insight led to a generation of statistics-based language programs like Google Translate — and, not so incidentally, to new tools for breaking codes that go back to the Middle Ages.
Now a team of Swedish and American linguists has applied statistics-based translation techniques to crack one of the most stubborn of codes: the Copiale Cipher, a hand-lettered 105-page manuscript that appears to date from the late 18th century. They described their work at a meeting of the Association for Computational Linguistics in Portland, Ore.
Discovered in an academic archive in the former East Germany, the elaborately bound volume of gold and green brocade paper holds 75,000 characters, a perplexing mix of mysterious symbols and Roman letters. The name comes from one of only two non-coded inscriptions in the document.
Kevin Knight, a computer scientist at the Information Sciences Institute at the University of Southern California, collaborated with Beata Megyesi and Christiane Schaefer of Uppsala University in Sweden to decipher the first 16 pages. They turn out to be a detailed description of a ritual from a secret society that apparently had a fascination with eye surgery and ophthalmology.
It began as a weekend project this year, Dr. Knight said in an interview, adding: “I don’t have much experience in cryptography. My background is primarily in computational linguistics and machine translation.”
Uncertain of the original language, the researchers went down several blind alleys before following their hunches. First, they assumed the Roman characters and not the abstract symbols contained all of the information.
But when that approach failed, they figured that the code was what cryptographers call a homophonic cipher: a substitution code in which a single letter of the original text can be represented by several different cipher symbols, so there is no one-to-one correspondence between the original and encoded information. And they decided the original language was probably German.
Eventually they concluded that the Roman letters were so-called nulls, meant to mislead the code breaker, and that the letters represented spaces between words made up of elaborate symbols. Another crucial discovery was that a colon indicated the doubling of the previous consonant.
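As a rough illustration of those three rules, here is a minimal sketch in Python. The symbol names ("s1", "s2" and so on) and the key itself are invented for the example and are not the actual Copiale key; the sketch assumes only what is described above: several symbols can stand for one letter, a Roman letter marks a space, and a colon doubles the previous letter.

```python
# Hypothetical key: several cipher symbols map to the same plaintext letter.
# These symbol names are invented for illustration, not taken from the manuscript.
KEY = {
    "s1": "e", "s2": "e", "s3": "e",  # three homophones for a frequent letter
    "s4": "n", "s5": "n",
    "s6": "m", "s7": "a", "s8": "d", "s9": "r",
}

def decode(tokens):
    """Decode a list of cipher tokens under the toy rules described above."""
    out = []
    for tok in tokens:
        if tok == ":":
            # A colon doubles the previously decoded letter, if there is one.
            if out and out[-1] != " ":
                out.append(out[-1])
        elif tok in KEY:
            out.append(KEY[tok])
        elif len(tok) == 1 and tok.isalpha():
            # A single Roman letter carries no content; it only separates words.
            if out and out[-1] != " ":
                out.append(" ")
        else:
            out.append("?")  # unknown symbol
    return "".join(out).strip()

# "s2" and "s1" both decode to "e"; "x" is a Roman letter acting as a space.
print(decode(["s8", "s2", "s9", "x", "s6", "s7", "s4", ":"]))  # der mann
print(decode(["s8", "s1", "s9"]))                              # der
```

Because "s1", "s2" and "s3" all decode to the same letter, simply counting the most common symbol does not reveal the most common letter, which is what makes this kind of cipher hard to break.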
The researchers used techniques from machine translation, like expected word frequency, to guess which German letter a symbol might stand for.
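The idea of guessing by expected word frequency can be sketched as follows. This is not the researchers' actual method, only a toy version of the principle: try each candidate letter for an unknown symbol and keep the one that makes the decoded text look most like German. The word frequencies, the floor value for unseen words and the helper names are all assumptions made up for the example.

```python
import math

# Illustrative relative frequencies for a few common German words.
# The numbers are invented for this sketch, not measured from a corpus.
WORD_FREQ = {"der": 0.035, "die": 0.034, "und": 0.029, "das": 0.012, "ist": 0.010}
UNSEEN = 1e-6  # assumed floor probability for words outside the table

def score(text):
    """Sum of log-probabilities of the words in a candidate decoding."""
    return sum(math.log(WORD_FREQ.get(w, UNSEEN)) for w in text.split())

def best_guess(cipher_words, symbol, candidates, partial_key):
    """Try each candidate letter for one unknown symbol; return the best-scoring one."""
    def decode(letter):
        key = {**partial_key, symbol: letter}
        return " ".join(
            "".join(key.get(s, "?") for s in word) for word in cipher_words
        )
    return max(candidates, key=lambda letter: score(decode(letter)))

# Two cipher words; the symbol "C" is still unknown.
words = [["A", "B", "C"], ["D", "E", "A"]]
known = {"A": "d", "B": "e", "D": "u", "E": "n"}
print(best_guess(words, "C", "xrs", known))  # r
```

Here "der und" scores higher than "dex und" or "des und" because "der" appears in the frequency table, so the sketch settles on "r" for the unknown symbol.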
“It turned out that we can apply a lot of those techniques to code breaking,” Dr. Knight said.
The work is being praised by other experts. “Cracking the Copiale Cipher was a neat bit of work by Kevin Knight and his collaborators,” said Nick Pelling, a British software designer and a security specialist who maintains Cipher Mysteries, a cryptography news blog.
But while the cipher was a notable success, Dr. Knight and his colleagues have been frustrated by other, more impenetrable ciphers.
“There are these books and ancient languages of real historical value that contain historical information that we just can’t get out yet, and that’s of interest to a lot of people,” he said in a filmed interview describing the Copiale project.
The work has value to historians who are trying to understand the spread of political ideas. Secret societies were all the rage in the 18th century, Dr. Knight said, and they had an influence on both the American and French Revolutions. He recently shared the decoded Copiale text with Andreas Onnerfors, a historian at Lund University in Sweden and an expert on secret societies.
“When he saw the book and the decoded version, he was very excited about it,” Dr. Knight said. “He found a political commentary at the end that talked about the natural rights of man. That was pretty interesting and early.”
Modern examples of challenging ciphers include the communications the Zodiac killer sent to the police in California in the 1960s and ’70s, and the “Kryptos” sculpture, commissioned for the C.I.A. headquarters, which has been only partly decoded.
But the white whale of the code-breaking world is the Voynich manuscript. Comprising 240 lavishly illustrated vellum pages, it has defied the world’s best code breakers. Though cryptographers have long wondered if it is a hoax, it was recently dated to the early 1400s.
With a University of Chicago computer scientist, Dr. Knight this year published a detailed analysis of the manuscript that falls short of answering the hoax question, but does find some evidence that it contains patterns that match the structure of natural language.
“It’s been called the most mysterious manuscript in the world,” he said. “It’s super full of patterns, and so for somebody to have created something like that would have been a lot of work. So I feel that it’s probably a code.”
By JOHN MARKOFF
taken from http://www.nytimes.com/2011/10/25/science/25code.html?_r=3