Editor’s note: This article was initially published in The Daily Gazette, Swarthmore’s online, daily newspaper founded in Fall 1996. As of Fall 2018, the DG has merged with The Phoenix. See the about page to read more about the DG.
“Sexual overall picture of health, which currently circulates the campus, hopes, to take instantaneous sexual health on the campus.” This would have been the opening sentence of this paper’s leading article yesterday if machine translation had been in charge of interpreting its meaning. Even among typologically similar languages like English, French, and German, the syntax of this sentence comes out seriously mangled. The second installment of Swarthmore’s linguistics lecture series, “Capturing Crosslinguistic Generalizations: Multilingual Metagrammars,” was presented yesterday by Tatjana Scheffler, a PhD candidate at UPenn. The talk focused on an approach to computational linguistics that takes advantage of broad similarities between languages. The series is organized by sophomores Miranda Weinberg and Elizabeth Bogal-Allbritten.
Computational linguistics (CL) lies at the intersection of computer science and linguistics. CL can be broken up into theoretical and applied areas. Theoretical CL deals with finding formalized descriptions of languages as well as modeling the cognition of language with the aid of computers. Applied CL was first inspired during the Cold War, when Americans dreamed of creating computers that could translate Russian into English. Beyond machine translation, applied CL also involves the study of natural language processing, human-machine interaction, and search engine technology. Though Google may be expert at finding the website you want, automated telephone services are renowned for their inability to understand human speech. The main focus of the talk was the formalization of syntax in machine translation.
As the opening sentence highlighted, the products of machine translation are often wordy and poorly ordered. It is highly difficult to construct a generative, that is to say strictly stepwise, approach to translation. All the same, it would be useful to automate the translation of mundane information. Better still, it would be ideal not to need linguists to write every rule of translation between every possible pair of languages. Such resources are not available to the speakers of many languages, as evidenced by the mere 11 languages represented in Google’s online machine translation service.
When translating between two languages, machines often apply some sort of analysis, though direct transformation into the target language is another possibility. This analysis can also be put into some sort of interlingua before the result is generated in the target language. In order to produce well-formed sentences, some sort of grammar must exist for the target language. To simplify and generalize this process, computational linguists are attempting to describe metagrammars. A metagrammar describes the different parts of speech that languages may contain, the possible relationships they can have to each other, and the various constraints on how those can be put together. These are in effect the universals that span languages. The specific grammar of one language is then compiled from these possibilities.
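To make the idea concrete, here is a minimal sketch in Python of how a language-specific grammar might be compiled from a shared metagrammar. The category names, orderings, and parameters are invented for illustration and are not taken from Scheffler’s formalism.

```python
# Hypothetical illustration of a metagrammar: a shared inventory of
# categories and possible orderings, from which language-specific grammars
# are "compiled" by fixing a few parameters.
METAGRAMMAR = {
    "categories": ("S", "V", "O"),      # subject, verb, object
    "possible_orders": {
        "SVO": ("S", "V", "O"),         # e.g. English
        "SOV": ("S", "O", "V"),         # e.g. Hindi, Turkish
    },
}

def compile_grammar(language, basic_order, allows_scrambling=False):
    """Select one set of options from the metagrammar for a given language."""
    return {
        "language": language,
        "order": METAGRAMMAR["possible_orders"][basic_order],
        "allows_scrambling": allows_scrambling,  # free clause ordering
    }

english = compile_grammar("English", "SVO")
german = compile_grammar("German", "SOV", allows_scrambling=True)
print(english["order"])             # ('S', 'V', 'O')
print(german["allows_scrambling"])  # True
```

The point is only that the language-particular grammar is derived by selecting among options the metagrammar already enumerates, rather than being written from scratch for every language.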
The results of this approach were shown in two example problems of machine translation. The first was the problem of word order scrambling. As a native speaker of German, Scheffler described how free the ordering of clauses in a German sentence can be. With four elements, there can often be 4! = 24 perfectly intelligible variants of the same sentence. Meanwhile, other languages have stricter constraints. Machine translation runs into trouble when it tries to go between languages of contrasting word order. This was further elucidated by the second example: languages can order their subject (S), verb (V), and object (O) in various ways. Common orderings are SVO, that of English, and SOV, the more common ordering found in many Indic, Turkic, and Germanic languages.
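The arithmetic behind the scrambling example is easy to verify: four freely ordered elements yield 4! = 24 candidate sentences, as the following hypothetical snippet shows (the labels are placeholders, not an actual German sentence).

```python
from itertools import permutations

# Four clause elements that a scrambling language like German can reorder freely.
elements = ["subject", "object", "verb", "adverbial"]
print(len(list(permutations(elements))))  # 24 possible orderings
```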
Scheffler’s solution to these syntactic problems is to map elements from the original sentence onto linguistic categories in the metagrammar. The specific grammar of the target language is then used to dictate what relationships the categories can have to each other. The parts of speech are tagged with the constraints they need to fulfill, and their positions in the output sentence are checked against all of these. The result is that ungrammatical sentences are weeded out as other solutions are found. Scheffler’s work in applied CL shows one of the many ways new abilities in information processing have revolutionized the social sciences.
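A rough sketch of that weeding-out step, under the simplifying assumption of a rigidly SVO target language, might look like the following. The single ordering constraint here is an illustrative stand-in for the richer constraints an actual metagrammar would state, and the example words are invented.

```python
from itertools import permutations

# Sketch of the generate-and-filter idea: tag each element with a category,
# enumerate candidate orderings, and keep only those satisfying the target
# language's constraints.
elements = {"the dog": "S", "bites": "V", "the man": "O"}

def satisfies_constraints(candidate, required_order=("S", "V", "O")):
    """Accept a candidate only if its categories appear in the required order."""
    categories = tuple(elements[word] for word in candidate)
    return categories == required_order

candidates = permutations(elements)                   # all 3! = 6 orderings
grammatical = [c for c in candidates if satisfies_constraints(c)]
print(grammatical)  # [('the dog', 'bites', 'the man')]
```

Scaled up to real grammars, the same generate-and-filter logic lets the system output only the orderings the target language permits.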
Students are encouraged to watch for future installments of the linguistics lecture series.