Archive for August, 2006

A new view on Translation

Sunday, August 20th, 2006

So the other day, I was sitting in the hallway of my University’s Residence Halls, around midnight, listening to a theology discussion the RAs were having. There were people of all different backgrounds there, but the most vocal was a young man of the Mormon faith. At one point, the question arose of Bible translation and the fallibility of human translators.

The young Mormon piped up with a very innovative analogy on translation that he had learned in Seminary, and which I found quite interesting. I’ll roughly paraphrase it below:

The word of God is a lot like a picture hanging on a bulletin board. It only has one tack to secure it [representing the Old and New Testament], so anybody can spin it around as they’d like, changing the perspective, even though the picture stays the same. The translators each tilt it a bit differently, and it’s tough to see exactly what the right orientation is.

For us [those of the Mormon Faith], the Book of Mormon is a second tack. It provides a second hold, and keeps you from spinning the picture. Whenever there’s a question about the perspective and translation in one, you can consult the other. What might be unsure with one tack, is securely locked with two.

Whether you believe in the validity of either work, this is an interesting analogy. It seems to imply a distinct split between the actual “word” or message of God, and the written words used to pass it on, much like the split between concept and language used to describe it.

A similar idea is actually used frequently in the translation of a seminal work in Mahayana Buddhism, the Bodhicharyavatara (‘Guide to the Bodhisattva’s Way of Life’) by Shantideva. Very early after its transcription (originally in Sanskrit), two highly authoritative versions of the work were created, one in Tibetan, and one in Sanskrit, and both are treated as equal by the Buddhist community. In modern translations, many of the translators choose to base their work on one version or the other, but use the other version to clarify difficult passages. My personal favorite translation, by Stephen Batchelor, was based on a 12th Century Commentary on the Tibetan text, but uses the Sanskrit for clarification in footnotes. When you’re dealing with differences as extreme as that between “May all women become men” and “May all women attain the rights and privileges of men”, a point of clarification is wonderful.

Now, let’s use a similar idea in a secular sense. I would like to describe an event, something complex, emotional, and generally slightly vague. Take, for example, an account of one’s first day leaving for College. Imagine a bilingual author were to write the story, once in, say, English, and once in Spanish. Not so much translating one into the other, but actually telling the story twice (with an effort to include much of the same information in both). Would the Spanish be a “second tack” for the English version and vice-versa? Could one use the Spanish to clarify the English ambiguities, and vice-versa? Most importantly, would another bilingual reader have a better idea what the author meant by reading both versions, rather than just one?

The more I look at it, translation seems messier and messier. I’ve begun to suspect that there is no such thing as a one-to-one translation, and that any time you switch languages or rephrase, something is lost or gained. This isn’t necessarily bad, but it, like all other things, needs to be studied further.

I hope this post made sense. If not, maybe I’ll try writing the same thing right next to it in Spanish. If it helps, I’ve just found a thesis.

Automated Censorship B*******

Tuesday, August 8th, 2006

(By the way, the title of the post is “Automated Censorship Breakage”, but it shows how people assume the worst when unnecessary asterisks are involved.)

Hot on the heels of my post on automated language translation, Arnold Zwicky over at the Language Log made a wonderful post, entitled “C*m sancto spirito”, about automated censorship systems and the havoc they can wreak.

The example that jumped out at me the most is from the title: the Latin “cum” (‘with’) was censored because it’s spelled like the American slang word for “semen”. Incidentally, Wikipedia tells me that “cum” is also Irish for ‘invent’ and Bengali for ‘kiss’, so Latin isn’t the only exception when it comes to innocent meanings.

This isn’t merely an example of context-insensitive automatic censorship (“My poor p***y cat is sick!”) or overzealous word blocking (“Fertilization occurs when the s***m travels through the v****a into the uterus..”); this is a whole new area. This is censoring English letter combinations in other languages, regardless of what they may actually mean.
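To make the failure concrete, here is a minimal sketch in Python of the kind of filter that produces these results. I don’t know how any real service implements its censoring, so the blocklist, the function names, and the “keep the first and last letter” masking style are all my own assumptions, built only from the examples above.

```python
import re

# Hypothetical blocklist, limited to words mentioned in this post.
BLOCKLIST = {"cum", "pussy", "sperm", "vagina"}


def mask(word):
    """Keep the first and last letters and star out everything between."""
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]


def naive_censor(text, blocklist=BLOCKLIST):
    """Mask every blocklisted word, with no idea of language or context."""
    def replace(match):
        word = match.group(0)
        return mask(word) if word.lower() in blocklist else word

    # Look at each run of ASCII letters on its own, one word at a time.
    return re.sub(r"[A-Za-z]+", replace, text)


print(naive_censor("My poor pussy cat is sick!"))
print(naive_censor("Cum sancto spiritu"))
```

The filter only ever sees one string of letters at a time, so it has no way of knowing that the second line is Latin.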

This starts to set a dangerous precedent. In addition to quickly losing words like the Latin “cum”, it’s only a matter of time before other words start getting blocked too. It also starts to lead to troubles even within a language. For instance, in British English, “fag” means “cigarette” (coming from the original “faggot”, a small bundle of wood for starting a fire), but in American English, “fag” is a slur referring to a homosexual man. So, would a Londoner step outside for a fag or for a f**?

Other examples come to mind. In Castilian Spanish (Spain), “coger” means “to grab”, but in Latin American Spanish, it is a vulgar word, meaning (roughly) ‘to fuck’. If one started a Spanish language forum, would that be censored? How about in a Spanish song on a music store like iTunes?

Russian-American comic Yakov Smirnoff has a section of his routine where he discusses the fact that in Russian, “Yep” (or “Yeb”) is a vulgar root for “to fuck”, and his initial shock when, in America, a vulgar word in his language is used so frequently in conversation (I believe his routine went something like “Do you want to go out?” “Yep” “Wow! Great!”). Of course, this is an extreme (and exaggerated) example, but the concept still stands.

Many languages share parts of their phonetic inventories (library of sounds), so it’s certainly not uncommon to see such surface correspondences in spelling or pronunciation between innocent words and swear words among different languages. The trick is training computers to understand when you’re dealing with “cum” and when you’re dealing with “c*m”.

The trouble with automated language anything is that computers are pretty bad at figuring out which language is being used. There are language guesser programs, and although they’re amazingly good at what they do (and handy in a pinch), they’re also a whole new step in the process, and one that I suspect most webmasters and content providers would be reluctant to include. For a service like iTunes (or an internet forum) to first run everything through a language guesser and then censor appropriately would require much more computer time, and most language guessers need large banks of text to function well. Not to mention that it’s common to find little dabs of languages like Latin in vast English posts.
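As a rough illustration of the “guess first, then censor” idea, here is a toy sketch in Python. It is not how real language guessers work: the tiny word lists, the per-language blocklists, and the function names are all hypothetical, and a real guesser needs far more data than this.

```python
import re

# Hypothetical, tiny lists of common words; a real guesser needs far more.
COMMON_WORDS = {
    "English": {"the", "and", "of", "is", "to", "a"},
    "Latin": {"et", "est", "non", "cum", "sed", "ad"},
    "Spanish": {"el", "la", "de", "que", "los", "y"},
}

# Hypothetical per-language blocklists: "cum" is only blocked in English.
BLOCKLISTS = {"English": {"cum"}, "Latin": set(), "Spanish": set()}


def guess_language(text):
    """Return the language whose common words show up most, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        language: sum(word in common for word in words)
        for language, common in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def is_blocked(word, text):
    """Block a word only if it is on the list for the guessed language."""
    language = guess_language(text)
    return word.lower() in BLOCKLISTS.get(language, set())


print(guess_language("Cum sancto spiritu in gloria Dei Patris"))
print(is_blocked("cum", "He will graduate summa cum laude and the family is proud"))
```

The second call shows the “little dabs” problem: the sentence as a whole is guessed to be English, so the Latin phrase inside it still gets blocked.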

One potential way of training these programs is to look for context and correspondence. For instance, “pussy” is not censored if “cat” appears directly afterwards, and “cock” is not censored if “hen” or “fight” occurs within that same sentence. This too creates additional computing time, and it’s not entirely reliable. For instance, in a Google search for “cum latin”, even though much of the page deals with Latin grammar, the first two sponsored links are to adult sites (In retrospect, I should’ve known better, but I was looking for other examples of usage). So, filtering based purely on context and correspondence will likely have issues too.
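Here is a minimal sketch in Python of those two context rules. The rule tables, the masking style, and the crude sentence splitting are my own assumptions for illustration, not anybody’s actual filter.

```python
import re

# Hypothetical rules, taken only from the two examples in this post.
BLOCKLIST = {"pussy", "cock"}
NEXT_WORD_EXCEPTIONS = {"pussy": {"cat"}}
SAME_SENTENCE_EXCEPTIONS = {"cock": {"hen", "fight"}}


def mask(word):
    """Keep the first and last letters and star out everything between."""
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor_sentence(sentence):
    """Mask blocklisted words unless one of the context rules excuses them."""
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", sentence)
    # (token index, lowercased word) for every run of ASCII letters.
    words = [(i, t.lower()) for i, t in enumerate(tokens)
             if t.isascii() and t.isalpha()]
    present = {word for _, word in words}
    for pos, (i, word) in enumerate(words):
        if word not in BLOCKLIST:
            continue
        next_word = words[pos + 1][1] if pos + 1 < len(words) else None
        if next_word in NEXT_WORD_EXCEPTIONS.get(word, set()):
            continue  # e.g. "pussy" directly followed by "cat"
        if present & SAME_SENTENCE_EXCEPTIONS.get(word, set()):
            continue  # e.g. "cock" with "hen" somewhere in the sentence
        tokens[i] = mask(tokens[i])
    return "".join(tokens)


def censor(text):
    """Split crudely on ., ! and ? and filter each sentence separately."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(censor_sentence(s) for s in sentences)


print(censor("My poor pussy cat is sick!"))
print(censor("The cock woke the hen. That cock is loud."))
```

Note that the second sentence of the last example is still masked, because the “hen” that would have excused it is in the sentence before. The rules are only as good as the window they look at.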

Perhaps the most obvious solution is to sort by hand, and use a little common sense. However, for those of you seeking a thesis, an intelligent, language-aware filter could probably make life a whole lot easier for a number of forums and services. Who knows, if you do well, you might even graduate summa c*m laude.

0ccasional probable he translation of the machine of controll with

Monday, August 7th, 2006

I’d like to showcase a wonderful little piece of webcoding, a program by Carl Tashian called Multibabel.

This program takes any given English sentence, and then runs it through Altavista’s Babelfish translation engine 7 times, translating into one language, then plugging the result in and translating back to English, then translating that into the next language, continuing through Japanese, Chinese, French, Italian, Portuguese, Spanish and German. It takes a little while, but in the end, you’re presented with a step-by-step view of what happened. The results are sometimes strange, occasionally funny, and always dead wrong.
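The round-trip loop described above can be sketched in a few lines of Python. This is not Multibabel’s actual code, which I haven’t seen. The `translate` argument is a hypothetical stand-in for whatever engine you plug in, and `toy_translate` is a fake one that simply loses a word on every trip back to English, so the example runs without any translation service.

```python
# Language order as listed in this post; treat it as an assumption.
LANGUAGES = ["Japanese", "Chinese", "French", "Italian",
             "Portuguese", "Spanish", "German"]


def round_trip_chain(sentence, translate, languages=LANGUAGES):
    """Translate out to each language and back to English, keeping every step.

    `translate(text, source, target)` is a hypothetical callable supplied
    by the caller; no real translation API is assumed here.
    """
    steps = [("English", sentence)]
    current = sentence
    for language in languages:
        foreign = translate(current, "English", language)
        steps.append((language, foreign))
        current = translate(foreign, language, "English")
        steps.append(("English", current))
    return steps


def toy_translate(text, source, target):
    """Fake engine: drops the last word on every trip back to English."""
    words = text.split()
    if target == "English" and len(words) > 1:
        return " ".join(words[:-1])
    return text


for language, text in round_trip_chain(
        "Computer translation can sometimes prove troublesome",
        toy_translate):
    print(f"{language}: {text}")
```

Each round trip feeds the previous English result back in, so whatever one pass loses, the next pass never gets back.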

For instance, the title of this post is 0ccasional probable he translation of the machine of controll with. This, as you might suspect, is the product of Multibabel. The original phrase I entered was “Computer translation can sometimes prove troublesome”.

Although this program is a slightly biased test (I doubt that asking a series of professional translators to pass along a phrase like this would result in the same phrase at the end), it does show the difficulties of current web-based automated translation engines. I particularly enjoyed this quote, by the program author on the program website:

As of September 2003, translation software is almost good enough to turn grammatically correct, slang-free text from one language into grammatically incorrect, barely readable approximations in another.

Sadly, this hasn’t changed all that much since then. There are people working hard to improve it, and strides are being made, but there’s still a long journey ahead.

So, try Multibabel, get a few chuckles, and just in case anybody you know is tempted to just translate that essay for their language class online, remember:

if immediate translation of the stupid we-machine or the thymus of
vist two, collapse; of that disabled person, you with you

(Which was originally: “When you use machine translation to cheat, you’ll look stupid, dishonest, or both; and no matter what, you’ll fail.”)