(By the way, the title of the post is “Automated Censorship Breakage”, but it shows how people assume the worst when unnecessary asterisks are involved.)
Hot on the heels of my post on automated language translation, Arnold Zwicky over at the Language Log wrote a wonderful post, entitled “C*m sancto spiritu”, about automated censorship systems and the havoc they can wreak.
The example that jumped out at me the most is from the title: the Latin “cum” (‘with’) was censored because it’s spelled like the American slang word for “semen”. Incidentally, Wikipedia tells me that “cum” is also Irish for ‘invent’ and Bengali for ‘kiss’, so Latin isn’t the only language where the word has an innocent meaning.
This isn’t merely an example of context-insensitive automatic censorship (“My poor p***y cat is sick!”) or overzealous word blocking (“Fertilization occurs when the s***m travels through the v****a into the uterus.”). This is a whole new area: censoring English letter combinations in other languages, regardless of what they may actually mean.
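To make the failure concrete, here is a minimal sketch of the kind of word-list filter that would produce output like the examples above. It is a guess at how such filters work, not the code of any real service: the block list, `mask`, and `naive_censor` are all hypothetical names, and the list contains only words already quoted in this post.

```python
import re

# Hypothetical block list, drawn only from the examples in this post.
BLOCKED = {"cum", "pussy", "sperm", "vagina"}


def mask(word):
    """Keep the first and last letters and star out everything between."""
    return word[0] + "*" * (len(word) - 2) + word[-1]


def naive_censor(text):
    """Mask every blocked word, ignoring both context and language."""
    def replace(match):
        word = match.group(0)
        return mask(word) if word.lower() in BLOCKED else word

    # Visit each run of ASCII letters and mask it if it is on the list.
    return re.sub(r"[A-Za-z]+", replace, text)


print(naive_censor("Cum sancto spiritu"))
print(naive_censor("My poor pussy cat is sick!"))
```

The filter has no idea that the first line is Latin or that the second is about a pet. It only compares letter sequences against the list, which is exactly the problem.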
This starts to set a dangerous precedent. In addition to quickly losing words like the Latin “cum”, it’s only a matter of time before other words start getting blocked too. It also leads to trouble even within a single language. For instance, in British English, “fag” means “cigarette” (coming from the original “faggot”, a small bundle of wood for starting a fire), but in American English, “fag” is a slur referring to a homosexual man. So, would a Londoner step outside for a fag or for a f**?
Other examples come to mind. In Castilian Spanish (Spain), “coger” means “to grab”, but in Latin American Spanish, it is a vulgar word, meaning (roughly) ‘to fuck’. If one started a Spanish language forum, would that be censored? How about in a Spanish song on a music store like iTunes?
Russian-American comic Yakov Smirnoff has a section of his routine where he discusses the fact that in Russian, “Yep” (or “Yeb”) is a vulgar root for “to fuck”, and his initial shock when, in America, a vulgar word in his language is used so frequently in conversation (I believe his routine went something like “Do you want to go out?” “Yep” “Wow! Great!”). Of course, this is an extreme (and exaggerated) example, but the concept still stands.
Many languages share parts of their phonetic inventories (library of sounds), so it’s certainly not uncommon to see such surface correspondences in spelling or pronunciation between innocent words and swear words among different languages. The trick is training computers to understand when you’re dealing with “cum” and when you’re dealing with “c*m”.
The trouble with automated language anything is that computers are pretty bad at figuring out which language is being used. There are language guesser programs, and although they’re amazingly good at what they do (and handy in a pinch), they’re also a whole new step in the process, and one that I suspect most webmasters and content providers would be reluctant to include. For a service like iTunes (or an internet forum) to first run everything through a language guesser and then censor appropriately would require much more computer time, and most language guessers need large banks of text to function well. Not to mention that it’s common to find little dabs of languages like Latin in vast English posts.
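As a rough illustration of why short snippets are hard, here is a toy language guesser. It is a deliberately simplified stand-in that counts common function words. It does not show how real language guessers are built, and the word lists, the `min_hits` threshold, and the `guess_language` name are all assumptions made up for this sketch.

```python
import re

# Tiny, hypothetical lists of common function words per language.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "it", "that"},
    "latin": {"et", "in", "cum", "est", "non", "ad", "qui", "sed"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "es"},
}


def guess_language(text, min_hits=2):
    """Return the best-scoring language, or None if the evidence is too thin."""
    words = re.findall(r"[a-z]+", text.lower())
    # Score each language by how many words in the text are on its list.
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


print(guess_language("The cat is in the house and it is sick"))
print(guess_language("summa cum laude"))
```

With a full English sentence there is plenty to count, so the guess is easy. With “summa cum laude” only one word is on any list, so this sketch declines to guess, and a filter relying on it would be back to treating the phrase as English.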
One potential way of training these programs is to look for context and correspondence. For instance, “pussy” is not censored if “cat” appears directly afterwards, or “cock” is not censored if “hen” or “fight” occurs within that same sentence. This too adds computing time, and it’s not entirely reliable. For instance, in a Google search for “cum latin”, even though much of the page deals with Latin grammar, the first two sponsored links are to adult sites (in retrospect, I should’ve known better, but I was looking for other examples of usage). So, filtering based purely on context and correspondence will likely have issues too.
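The two context rules described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the rule tables and the `censor_sentence` name are hypothetical, the input is assumed to be a single sentence that has already been split out of the post, and the only rules included are the two examples given in the text.

```python
import re

WORD = re.compile(r"[A-Za-z]+")

# Hypothetical rules: a word is left alone if the very next word is listed.
NEXT_WORD_EXEMPT = {"pussy": {"cat"}}
# Hypothetical rules: a word is left alone if any listed word is in the sentence.
SAME_SENTENCE_EXEMPT = {"cock": {"hen", "fight"}}


def mask(word):
    """Keep the first and last letters and star out everything between."""
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor_sentence(sentence):
    """Mask flagged words unless one of the context rules clears them."""
    words_in_sentence = {w.lower() for w in WORD.findall(sentence)}

    def replace(match):
        word = match.group(0)
        low = word.lower()
        if low in NEXT_WORD_EXEMPT:
            # Look at the first word that starts after this match ends.
            following = WORD.search(sentence, match.end())
            next_word = following.group(0).lower() if following else ""
            safe = next_word in NEXT_WORD_EXEMPT[low]
        elif low in SAME_SENTENCE_EXEMPT:
            safe = bool(words_in_sentence & SAME_SENTENCE_EXEMPT[low])
        else:
            return word  # not a flagged word at all
        return word if safe else mask(word)

    return WORD.sub(replace, sentence)


print(censor_sentence("My poor pussy cat is sick!"))
print(censor_sentence("The cock woke the hen."))
print(censor_sentence("What a cock."))
```

The weakness is easy to see in the rules themselves: any sentence that happens to mention a hen gets a free pass, and every new innocent usage needs its own hand-written exception.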
Perhaps the most obvious solution is to sort by hand, and use a little common sense. However, for those of you seeking a thesis, an intelligent, language-aware filter could probably make life a whole lot easier for a number of forums and services. Who knows, if you do well, you might even graduate summa c*m laude.
Tagged with Computational Linguistics, Conventional Linguistics, Language Usage, Language, Computers, and the Internet | 3 Comments
Comments
[...] Posted in Translation, Language Humor at 10:09 pm by will So, I was sent a magnificent link today. Nominally, it’s an article about offensive terms sneaking their way onto personalized (or “vanity”) license plates. Some of them are a little humorous, but one in particular jumped out at me. From the letter (uncorrected): “I would like to share my deepest concern about custom plates that your department issuing to the customers.” … “I would like to give you an example of such custom plate. The number is “CTO XYEB” registered in NY. In Russian it mean “one hundred penises” in a very dirty language.” Now, having studied some Russian in the past, I nearly fell out of my chair laughing at this. Although it could easily have been an unfortunate random letter combination, the English letters “CTO XYEB” correspond to the cyrillic letters spelling a vulgar equivalent of “one hundred penises” in Russian, and with amazing grammatical correctness, too. [...]
Well, “Два хуй” is incorrect. The plural form of “two dicks” would end with the letter “я” as in “Два хуя”.
… and no, this plate is no accident either…
And neither is mine for that matter, the meaning of which shall remain concealed for the time being.
Thanks Anonymous. I had edited that out in the original post (as I wasn’t sure), but it seems to have lived on in the Trackback. If you ever feel the need to share your plate (anonymously), I’d happily post it up.