Jiaming Luo grew up in mainland China thinking about neglected languages. When he was younger, he wondered why the different languages his mother and father spoke were often lumped together as Chinese “dialects.”
When he became a computer science doctoral student at MIT in 2015, his interest collided with his advisor’s long-standing fascination with ancient scripts. After all, what could be more neglected — or, to use Luo’s more academic term, “lower resourced” — than a long-lost language, left to us as enigmatic symbols on scattered fragments? “I think of these languages as mysteries,” Luo told Rest of World over Zoom. “That’s definitely what attracts me to them.”
In 2019, Luo made headlines when, working with a team of fellow MIT researchers, he brought his machine-learning expertise to the decipherment of ancient scripts. He and his colleagues developed an algorithm informed by patterns in how languages change over time. They fed their algorithm words in a lost language and in a known related language; its job was to align words from the lost language with their counterparts in the known language. Crucially, the same algorithm could be applied to different language pairs.
Luo and his colleagues tested their model on two ancient scripts that had already been deciphered: Ugaritic, which is related to Hebrew, and Linear B, which was first discovered among Bronze Age–era ruins on the Greek island of Crete. It took professional and amateur epigraphists — people who study ancient written matter — nearly six decades of mental wrangling to decode Linear B. Officially, 30-year-old British architect Michael Ventris is primarily credited with its decipherment, although the private efforts of classicist Alice Kober laid the groundwork for his breakthrough. Sitting night after night at her dining table in Brooklyn, New York, Kober compiled a makeshift database of Linear B symbols, comprising 180,000 paper slips filed in cigarette boxes, and used those to draw important conclusions about the nature of the script. She died in 1950, two years before Ventris cracked the code. Linear B is now recognized as the earliest form of Greek.
Luo and his team wanted to see if their machine-learning model could get to the same answer, but faster. The algorithm yielded what was called “remarkable accuracy”: it was able to correctly translate 67.3% of Linear B’s words into their modern-day Greek equivalents. According to Luo, it took between two and three hours to run the algorithm once it had been built, cutting out the days or weeks — or months or years — that it might take to manually test out a theory by translating symbols one by one. The results for Ugaritic showed an improvement on previous attempts at automatic decipherment.
The work raised an intriguing proposition. Could machine learning assist researchers in their quests to crack other, as-yet undeciphered scripts — ones that have so far resisted all attempts at translation? What historical secrets might be unlocked as a result?
British India, 1872–73. Alexander Cunningham, an English army engineer turned archaeological surveyor, clomped about the ruins of a town in Punjab province that locals called Harappa. On the face of it, there wasn’t much to survey: about two decades earlier, engineers working to link the cities of Lahore and Multan had stumbled across the site and used many of the bricks they found — perfectly preserved, fire kilned — as ballast for nearly 100 miles of railway track, blithely unaware they were remnants of one of the world’s oldest civilizations.
Cunningham didn’t know this either — the Indus Valley civilization wouldn’t be formally “discovered” until the 1920s — but he knew the site had some historical value. Burrowing through the ruins, he and his team chanced upon stone implements they surmised were used for scraping wood or leather. They gathered shards of ancient pottery and what appeared to be a clay ladle. The most striking discovery, though, was a tiny stone tablet, roughly 1.5 inches by 1.5 inches. “On it is engraved very deeply a bull, without a hump, looking to the right, with two stars under the neck,” Cunningham wrote in his report. “Above the bull there is an inscription in six characters, which are quite unknown to me. They are certainly not Indian letters; and as the bull which accompanies them is without a hump, I conclude that the seal is foreign to India.”
I have a cheap replica of that first seal, bought years ago from a museum gift shop at one of the Indus Valley sites: the animal on it has a thick neck, a lumpen torso, and a single swooping horn. Some people insist it is a unicorn. The inscription scrawled above it resembles a string of hieroglyphics; one character looks like a fish. In the century and a half since the discovery of the first seal, thousands more have been unearthed: 90% of them along the Indus River in modern-day Pakistan, the rest in India or as far afield as modern-day Iraq.
We know now that these tablets, described by one excavator as “little masterpieces of controlled realism,” are indigenous to the Indian subcontinent; researchers believe they were probably used to close documents and mark packages of goods, which is why they are referred to as seals. In part because of how the symbols in the inscriptions jostle each other at one end, almost as if the inscriber had run out of space, researchers have concluded that the inscriptions are meant to be read right to left. But we still don’t know what they actually say.
This isn’t for lack of trying. Scholars often point out that the Indus script (as the collection of some 4,000 excavated inscriptions, comprising between 400 and roughly 700 unique symbols, is known) might be one of the most deciphered scripts in history: more than a hundred attempts have been published since the 1920s. One theory links it to the Rongorongo script of Easter Island, also still undeciphered; another, offered by a German tantric guru claiming to have achieved his solution through meditation, links it to the cuneiform script used to write the Sumerian language.
For some groups in South Asia, the quest to decode the Indus script is almost existential. India and Pakistan, increasingly riven by their respective strains of religious nationalism, have markedly different relationships to their shared ancient past. The Pakistani state, deeply wedded to the idea of itself as a Muslim homeland, largely ignores its pre-Islamic heritage; its Indian counterpart, on the other hand, has taken to scouring history to find justification for the claim that India has always been a Hindu nation.
Up until the discovery of Harappa, the earliest Indians were believed to be people who lived between 1500 and 500 B.C. and composed the Vedas, the Sanskrit texts that form the basis of modern-day Hinduism. The discovery of a civilization of people who lived before the Vedic people upended the story of India. Given that it undermines their claims of indigeneity, proponents of Hindutva — the most mainstream strain of Hindu nationalism — balk at the theory of a pre-Vedic civilization, even as evidence for it accumulates across disciplines, including archaeology, genetics, and linguistics.
The smallest of advances in Indus Valley research, therefore, tends to reverberate far beyond the confines of academia. Attempts to prove that the Indus people worshipped Hindu gods and spoke an earlier form of Sanskrit continue unabated. In 2000, one researcher even digitally distorted an image of an Indus seal to make the animal on it look like a horse, which figures prominently in Sanskrit literature.
Politics aside, it is remarkable how little we know about the original people of the Indus Valley, who at one point constituted nearly 10% of the world’s inhabitants. It is especially galling given how much more we know about their contemporaries, such as the people of the Egyptian and Mesopotamian civilizations. Part of the reason for this is the continued elusiveness of the Indus script.
Putting machines to work on the Indus script is trickier than using them to reverse-engineer Linear B. We don’t have a great deal of information about the Indus script: most crucially, we don’t know what other language it may be related to. As a result, a model like Luo’s wouldn’t work for the Indus script. That’s not to say technology can’t help, though. In some ways, computer modeling has already played a crucial role: by showing that the Indus script is a language at all.
For most of the 20th century, the Indus inscriptions were widely accepted as representations of an undeciphered language. Then, in 2004, a group of Harvard researchers — cultural neurobiologist and comparative historian Steve Farmer, computational theorist Richard Sproat, and philologist Michael Witzel — published a paper essentially rubbishing nearly all existing research on the matter. The Indus seals, they claimed, were nothing more than a collection of religious or political symbols — similar to, say, highway signs — and all attempts to decipher them as a language were a waste of time. To underscore their point, Farmer offered a $10,000 reward to anyone who could find an Indus inscription containing at least 50 symbols.
Most Indologists and other Indus script researchers dismissed these arguments. One group of mathematicians, however, turned to computers to investigate the claims. Ronojoy Adhikari, a professor of statistical physics at the University of Cambridge, was one of them.
Before Cambridge, Adhikari worked at the Institute of Mathematical Sciences, in Chennai. In 2009, he attended a talk by Iravatham Mahadevan, an Indian civil servant turned epigraphist. Mahadevan, who died in 2018, had already cracked Tamil-Brahmi, a previously undeciphered script, before turning his attention to the Indus script.
Adhikari remembers being fascinated. “I’m a person from the sciences; I don’t have a humanities background,” he said. “But what I found very attractive in Mahadevan’s way of looking at the problem was that he had a very quantitative, almost scientific, approach. He was asking, how many times does a particular symbol occur? What does it occur against? What is the context in which it is occurring? And it appeared to me that because it had already been so quantified, it would be easy to translate this into a formal mathematical analysis.”
A few other data scientists in attendance joined forces with Adhikari. They knew they couldn’t decipher the script. “So the question we asked was: Can we at least tell whether it’s conveying any sort of linguistic information?”
Led by computer scientist Rajesh Rao, the researchers devised a computer program to see if they could answer this question: Was the Indus script a language? “You can give me any sequence of symbols, I don’t care what they are — hieroglyphics, written language, sheet music, computer code — and I will look at them from the point of view of a mathematician,” explained Adhikari. “Meaning, I will simply count how many times one sign occurs next to another.”
Their program drew on the work of Claude E. Shannon, a mid-century American mathematician, engineer, and decoder of wartime codes, who formulated the notion of information entropy — essentially a mathematical measure of disorder. In linguistic systems, symbols occur with somewhat fixed frequencies. “For instance, I just can’t pick up a letter from the alphabet, string it with another letter from the alphabet, and expect to get an English word,” explained Adhikari. In written English, the letter “q” is nearly always followed by “u.” This semiflexibility is a marker of all linguistic systems. Computer code, on the other hand, is completely rigid: the slightest deviation, and it falls apart.
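The kind of counting Adhikari describes — tallying how often one sign follows another, then measuring how flexible those pairings are — can be sketched as a bigram conditional-entropy calculation. The snippet below is a simplified illustration on toy symbol strings, not the team’s actual model or data:

```python
import math
from collections import Counter

def conditional_entropy(seq):
    """Bigram conditional entropy H(next sign | current sign), in bits."""
    pairs = Counter(zip(seq, seq[1:]))   # how often each sign follows another
    firsts = Counter(seq[:-1])           # how often each sign leads a pair
    total = sum(pairs.values())
    h = 0.0
    for (a, b), n in pairs.items():
        p_pair = n / total               # joint probability P(a, b)
        p_cond = n / firsts[a]           # conditional probability P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

rigid = "abcabcabcabcabcabc"             # each sign fully determines the next
loose = "aabbacbcabacbbcaacbcba"         # signs follow each other freely
print(conditional_entropy(rigid))        # → 0.0
print(conditional_entropy(loose))
```

A fully rigid sequence scores zero bits, since each sign determines its successor; a near-random one approaches the maximum. Natural languages sit in between — the band into which, in the team’s analysis, the Indus inscriptions fell.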
The researchers fed their program the 4,000 inscriptions that form the entirety of the Indus script. For good measure, they also ran the program on other linguistic samples (English characters and words, Sanskrit, Tamil, Sumerian, and Tagalog) and some nonlinguistic sequences (DNA, protein, Beethoven’s Sonata no. 32, and computer code written in Fortran). The program took about 45 minutes to run.
“I remember the first time that plot was generated,” recalled Adhikari. On the graph, the curves depicting music, protein, and DNA sequences hovered high, close to the maximum level of entropy, indicating a high level of randomness. Lower down, the known languages sat in a tight cluster. Fortran appeared further below.
As for the Indus script, it appeared with the other languages, just under Sanskrit, mapping almost cleanly onto Tamil. “It felt fantastic. It really felt very good. It’s nice to have a hunch, but to be able to prove it — I remember thinking, Yes, we’ve really got something here.”
There is a big difference, of course, between showing that a script encodes a language and decoding what it says.
Bahata Ansumali Mukhopadhyay met Adhikari over a decade ago. At the time, she was a disenchanted software developer looking for an escape route. When Adhikari, who had begun exploring deep learning approaches to work on the script, was in the market for an assistant, she eagerly volunteered.
Deep learning is the dominant technique in artificial intelligence today. It is primarily a form of pattern recognition: the more data you feed a machine, the better it becomes at interpreting future data. But the large-dataset approach isn’t particularly useful when it comes to low-resource (to use Luo’s term) subjects, such as the Indus script, where data is limited. Mukhopadhyay was quick to realize this.
“I was supposed to be coding,” she said sheepishly. “But I spent most of my time reading.”
Mukhopadhyay went down one rabbit hole after another. She parsed Mesopotamian, Akkadian, Sumerian, and Old Persian dictionaries. She taught herself how to read Egyptian hieroglyphics. “I realized just how subtle symbolism can be,” she said. “Like the god Horus, his eye was torn into fragments. Each part is imagined as a fraction — and then from there, the ancient Egyptians created their symbols for fractions.”
Even as she helped build software to aid research on the Indus script, her doubts about the approach were building. “See, if the Indus script were an alpha syllabary [a writing system split into units of consonants and vowels, as in Urdu/Hindi], then machine learning and artificial intelligence would have been very suitable,” she explained. But because the inscriptions appear to be pictorial in nature, they posed a greater challenge. “Here you have to understand the historical symbolism used in India. How will artificial intelligence tackle that? How would AI know these symbols represent the fragments of Horus’ eye?”
For the past few years, Mukhopadhyay has been independently researching the Indus inscriptions, focusing on individual symbols. This involves coming up with a particular theory and then testing it — something computers aren’t very good at.
Mukhopadhyay’s theory, for which she made a case in a peer-reviewed paper in Nature, is that the Indus seals were used for taxation and trade control — a collector might carry one around, for instance, as a sort of license. In a subsequent paper, by examining words used for “elephant” — piri, piru, pilu — and “ivory” — pirus — in Near Eastern languages at the time of the Indus civilization, she has argued that the Indus people spoke an earlier form of Dravidian, the linguistic ancestor of current languages like Telugu, Tamil, and Kannada. If researchers can successfully identify a contemporary linguistic relation to the Indus script, it could hold the key to deciphering it. As Mukhopadhyay explains her work, her earrings jiggle. They are artsy depictions of elephant heads. “Pilu,” she said, smiling.
Current iterations of AI aren’t designed to deploy the sort of approach adopted by Mukhopadhyay. Adhikari, who is now also less bullish about the prospect of machine decipherment, is skeptical that it will ever happen. “I think there are many aspects of cognition we cannot encode in a convenient framework,” he said. “I wouldn’t hazard a guess, but I don’t see it happening in my lifetime. I think we need to understand our brains much better.” Moreover, he added, not all information is quantifiable in a way that computers can understand. “A machine understands one, two, three very well. Two plus two equals four, yes. But …” His gaze drifted beyond his computer screen. “But that this sunset here looks like a beautiful flame — well, it is this sort of abstraction that holds the key to this script.”
Regardless of the approach used, AI is dependent on high-quality data being available in a machine-readable format. This remains a key challenge when it comes to ancient texts, given that they often come to us chipped, eroded, or incomplete in some other form. Scholars can spend decades debating the uniqueness of symbols: Is that a scratch next to a known character, for instance, or a new character altogether? Given how little there is to work with when it comes to long-lost languages, noisy or incomplete data can seriously curtail decipherment efforts.
For the past two decades, Vancouver-based Bryan K. Wells and Berlin-based Andreas Fuls have been quietly digitizing all known Indus seals and symbols. They append contextual information — such as where they were excavated, when, and alongside what artifacts — and add new ones as they are excavated. The Interactive Corpus of Indus Texts (ICIT) currently contains information about 4,537 inscribed artifacts, 5,509 texts, and 19,616 sign occurrences, featuring a total of 707 unique Indus symbols — a much higher number than the 417 previously identified.
The earlier corpora were compiled by hand. As a result, Wells argues, they were so limited that they risked undermining script research. “You know the old computer saying,” he said recently over Skype, “Garbage in, garbage out.” Nearly 50 researchers around the world currently use the database.
For now, the mysteries of the Indus script continue to elude decipherment. Last year, in a follow-up paper to their work automating the decoding of Ugaritic and Linear B, Luo and his team made a small but crucial advance: an algorithm aimed at identifying possible related languages of undeciphered writing systems. Potentially, this could help address the problem of deciphering scripts that don’t yet have a known language they can be compared against. When Luo and his team tested their model on the Iberian language, which has historically been linked to Basque, their findings suggested the two languages were not in fact close enough to be related — a conclusion that corroborated recent scholarship on the matter.
But while Iberian, said Luo, has at least 80 unique symbols, the Indus script has at least 400, making it exponentially more challenging. Still, theoretically speaking, modern machines can handle this level of computation. Could it be possible simply to “brute force” a problem like the Indus script — to analyze it against all contemporary South Asian languages and see which emerges as its closest linguistic relation? “That’s a good thought,” Luo said, after pausing to think. “If I had time, I would definitely try that.”
Luo is quick to point out that he doesn’t expect any decipherment of lost languages to be fully automated. “My thinking is: Let the system propose a list of candidates and let the experts see, Okay, maybe this theory is more correct than the other,” he said. “It definitely reduces the effort and the number of hours that experts have to expend.”
Not everyone is willing to entertain help from machines. Before settling on Iberian, Luo and his colleagues had considered tackling Etruscan, the still poorly understood language of pre-Roman Italy. “One of our co-authors emailed a bunch of professors working in this field,” recalled Luo, chuckling. One of them wrote back, shooing them away. “He replied in quite angry tones, ‘Machines can never compete with humans.’”