(By the way, the title of the post is “Automated Censorship Breakage”, but it shows how people assume the worst when unnecessary asterisks are involved.)
Hot on the heels of my post on automated language translation, Arnold Zwicky over at the Language Log made a wonderful post, entitled “C*m sancto spirito”, about automated censorship systems and the havoc they can wreak.
The example that jumped out at me the most is the one from the title: the Latin “cum” (‘with’) was censored because it’s spelled like the American slang word for semen. Incidentally, Wikipedia tells me that “cum” is also Irish for ‘invent’ and Bengali for ‘kiss’, so Latin isn’t the only language where the word has an innocent meaning.
This isn’t merely an example of context-insensitive automatic censorship (“My poor p***y cat is sick!”) or overzealous word blocking (“Fertilization occurs when the s***m travels through the v****a into the uterus.”); this is a whole new area. This is censoring English letter combinations in other languages, regardless of what they may actually mean.
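To make the failure concrete, here is a minimal sketch of the kind of blocklist filter that could produce these results. I don't know what any real site actually runs, so the word list, the function names, and the keep-first-and-last-letter masking are all my own assumptions.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKLIST = ["cum", "pussy", "sperm", "vagina"]

# One case-insensitive pattern matching any blocklisted word as a whole word.
PATTERN = re.compile(r"\b(" + "|".join(BLOCKLIST) + r")\b", re.IGNORECASE)


def mask(word):
    """Keep the first and last letters and replace the middle with asterisks."""
    return word[0] + "*" * (len(word) - 2) + word[-1]


def naive_censor(text):
    """Mask every blocklisted word, with no regard for language or context."""
    return PATTERN.sub(lambda m: mask(m.group(0)), text)


print(naive_censor("She graduated summa cum laude."))
print(naive_censor("My poor pussy cat is sick!"))
```

Nothing in this filter knows or cares which language it is looking at, which is exactly the problem: a Latin preposition and an English vulgarity are the same three letters to it.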
This starts to set a dangerous precedent. In addition to quickly losing words like the Latin “cum”, it’s only a matter of time before other words start getting blocked too. It also leads to trouble even within a single language. For instance, in British English, “fag” means ‘cigarette’ (coming from the original “faggot”, a small bundle of wood for starting a fire), but in American English, “fag” is a slur referring to a homosexual man. So, would a Londoner step outside for a fag or for a f**?
Other examples come to mind. In Castilian Spanish (Spain), “coger” means ‘to grab’, but in Latin American Spanish, it is a vulgar word, meaning (roughly) ‘to fuck’. If one started a Spanish-language forum, would that be censored? How about in a Spanish song on a music store like iTunes?
Russian-American comic Yakov Smirnoff has a section of his routine where he discusses the fact that in Russian, “Yep” (or “Yeb”) is a vulgar root for “to fuck”, and his initial shock when, in America, a vulgar word in his language is used so frequently in conversation (I believe his routine went something like “Do you want to go out?” “Yep” “Wow! Great!”). Of course, this is an extreme (and exaggerated) example, but the concept still stands.
Many languages share parts of their phonetic inventories (library of sounds), so it’s certainly not uncommon to see such surface correspondences in spelling or pronunciation between innocent words and swear words among different languages. The trick is training computers to understand when you’re dealing with “cum” and when you’re dealing with “c*m”.
The trouble with automated language anything is that computers are pretty bad at figuring out which language is being used. There are language guesser programs, and although they’re amazingly good at what they do (and handy in a pinch), they’re also a whole new step in the process, and one that I suspect most webmasters and content providers would be reluctant to include. For a service like iTunes (or an internet forum) to first run everything through a language guesser and then censor appropriately would require much more computer time, and most language guessers need large banks of text to function well. Not to mention that it’s common to find little dabs of languages like Latin in vast English posts.
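As a rough illustration of why short snippets are so hard, here is a toy language guesser that simply counts common function words. The tiny word lists and the `guess_language` name are my own assumptions, and real guessers work quite differently (usually from statistics gathered over large amounts of text), but the weakness it shows is the same one described above.

```python
import re
from typing import Dict, Optional, Tuple

# Tiny hand-picked function-word lists (an assumption for illustration).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "it", "that", "with"},
    "latin": {"et", "in", "cum", "est", "non", "ad", "qui", "sed", "ex"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "es", "con"},
}


def guess_language(text: str) -> Tuple[Optional[str], Dict[str, int]]:
    """Return (best guess or None, per-language counts of function words)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    # Refuse to guess when there is no evidence at all, or when it is a tie.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None, scores
    return best, scores


print(guess_language("Yep"))
print(guess_language("He graduated summa cum laude, and it is the talk of the town."))
```

A one-word reply gives the guesser nothing to work with, and in the second sentence the Latin phrase is outvoted by the English around it. A per-post language label would therefore still leave “cum” looking like English to the censor.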
One potential way of training these programs is to look for context and correspondence. For instance, “pussy” is not censored if “cat” appears directly afterwards, or “cock” is not censored if “hen” or “fight” occurs within that same sentence. This too creates additional computing time, and it’s not entirely reliable. For instance, in a Google search for “cum latin”, even though much of the page deals with Latin grammar, the first two sponsored links are to adult sites (in retrospect, I should’ve known better, but I was looking for other examples of usage). So, filtering based purely on context and correspondence will likely have issues too.
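The two rules just described can be sketched in a few lines. This is only a sketch under my own assumptions: the rule tables, the function names, the masking style, and the crude split on sentence-ending punctuation are all hypothetical.

```python
import re

# Hypothetical exemption rules taken from the two examples above.
NEXT_WORD_EXEMPT = {"pussy": {"cat"}}              # safe if the next word matches
SAME_SENTENCE_EXEMPT = {"cock": {"hen", "fight"}}  # safe if the sentence contains one

WORD = re.compile(r"[A-Za-z]+")


def mask(word):
    """Keep the first and last letters and replace the middle with asterisks."""
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor_sentence(sentence):
    """Apply the exemption rules to a single sentence."""
    tokens = [m.group(0).lower() for m in WORD.finditer(sentence)]
    token_set = set(tokens)
    index = -1  # position of the word currently being examined

    def replace(match):
        nonlocal index
        index += 1
        word = match.group(0)
        key = word.lower()
        if key in NEXT_WORD_EXEMPT:
            following = tokens[index + 1] if index + 1 < len(tokens) else None
            return word if following in NEXT_WORD_EXEMPT[key] else mask(word)
        if key in SAME_SENTENCE_EXEMPT:
            return word if token_set & SAME_SENTENCE_EXEMPT[key] else mask(word)
        return word

    return WORD.sub(replace, sentence)


def context_censor(text):
    """Split on sentence-ending punctuation and censor each sentence separately."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(censor_sentence(s) for s in sentences)


print(context_censor("My poor pussy cat is sick!"))
print(context_censor("The hen ran. That cock is loud."))
```

The second example shows the unreliability: the rooster is censored because the hen is mentioned one sentence earlier. The rules also only match exact words, so “fights” or “cockfight” would slip past them in one direction or the other.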
Perhaps the most obvious solution is to sort by hand, and use a little common sense. However, for those of you seeking a thesis, an intelligent, language-aware filter could probably make life a whole lot easier for a number of forums and services. Who knows, if you do well, you might even graduate summa c*m laude.
Tagged with Computational Linguistics, Conventional Linguistics, Language Usage, Language, Computers, and the Internet | 3 Comments
I’d like to showcase a wonderful little piece of webcoding, a program by Carl Tashian called Multibabel.
This program takes any given English sentence and runs it through Altavista’s Babelfish translation engine 7 times. Each time, it translates the sentence into one language, translates the result back to English, and then feeds that English into the next language, continuing through Japanese, Chinese, French, Italian, Portuguese, Spanish and German. It takes a little while, but in the end, you’re presented with a step-by-step view of what happened. The results are sometimes strange, occasionally funny, and always dead wrong.
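The round-trip loop itself is simple to sketch. I haven't seen Multibabel's source, so this is only my guess at the shape of it: `round_trip_chain` and the `translate` callable are hypothetical names, and `toy_translate` is a stand-in that just drops English articles so the example runs without any translation service.

```python
from typing import Callable, List, Tuple

# The language order described above.
LANGUAGES = ["Japanese", "Chinese", "French", "Italian",
             "Portuguese", "Spanish", "German"]

# A translator takes (text, source language, target language) and returns text.
Translator = Callable[[str, str, str], str]


def round_trip_chain(text: str, translate: Translator,
                     languages: List[str] = LANGUAGES) -> List[Tuple[str, str]]:
    """Translate English -> language -> English for each language in turn,
    recording every intermediate result as (language, text)."""
    steps = [("English", text)]
    for language in languages:
        foreign = translate(text, "English", language)
        steps.append((language, foreign))
        text = translate(foreign, language, "English")
        steps.append(("English", text))
    return steps


def toy_translate(text: str, source: str, target: str) -> str:
    """Stand-in translator: loses the articles on the way out of English."""
    if target != "English":
        return " ".join(w for w in text.split()
                        if w.lower() not in {"a", "an", "the"})
    return text


for language, result in round_trip_chain("The cat sat on a mat", toy_translate):
    print(language + ": " + result)
```

Because each pass starts from the previous pass’s English, anything lost in one round trip stays lost, and the errors pile up from one language to the next.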
For instance, the title of this post is 0ccasional probable he translation of the machine of controll with. This, as you might suspect, is the product of Multibabel. The original phrase I entered was “Computer translation can sometimes prove troublesome”.
Although this program is a slightly biased test (I doubt that asking a series of professional translators to pass along a phrase like this would produce the same mangled phrase at the end), it does show the difficulties of current web-based automated translation engines. I particularly enjoyed this quote, by the program author on the program website:
As of September 2003, translation software is almost good enough to turn grammatically correct, slang-free text from one language into grammatically incorrect, barely readable approximations in another.
Sadly, this hasn’t changed all that much since then. There are people working hard to improve it, and strides are being made, but there’s still a long journey ahead.
So, try Multibabel, get a few chuckles, and just in case anybody you know is tempted to just translate that essay for their language class online, remember:
if immediate translation of the stupid we-machine or the thymus of
vist two, collapse; of that disabled person, you with you
(Which was originally: “When you use machine translation to cheat, you’ll look stupid, dishonest, or both; and no matter what, you’ll fail.”)
Tagged with Computational Linguistics, Conventional Linguistics, Translation and Translation Theory | 1 Comment