Notes from a Linguistic Mystic

(By the way, the title of the post is “Automated Censorship Breakage”, but it shows how people assume the worst when unnecessary asterisks are involved.)

Hot on the heels of my post on automated language translation, Arnold Zwicky over at the Language Log made a wonderful post, entitled “C*m sancto spirito”, about automated censorship systems and the havoc they can wreak.

The example that jumped out at me the most is the one from the title: the Latin “cum” (‘with’) was censored because it’s spelled like the American slang word for semen. Incidentally, Wikipedia tells me that “cum” is also Irish for ‘invent’ and Bengali for ‘kiss’, so Latin isn’t the only language where the word has an innocent meaning.

This isn’t merely an example of context-insensitive automatic censorship (“My poor p***y cat is sick!”) or overzealous word blocking (“Fertilization occurs after the s***m travels through the v****a into the uterus.”); it is a whole new area. This is censoring English letter combinations in other languages, regardless of what they may actually mean.
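To make the failure concrete, here is a minimal sketch of the kind of word-list filter that would produce these results. I don't know how any real service implements its filter, so the blocklist and the function names `mask` and `naive_censor` are my own assumptions, made up for illustration.

```python
import re

# Hypothetical blocklist for illustration; a real filter would use a far longer list.
BLOCKLIST = {"cum", "pussy"}

def mask(word):
    """Keep the first and last letters and star out the middle."""
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]

def naive_censor(text):
    """Mask every blocklisted word, with no regard for language or context."""
    def replace(match):
        word = match.group(0)
        return mask(word) if word.lower() in BLOCKLIST else word
    return re.sub(r"[A-Za-z]+", replace, text)

print(naive_censor("Cum sancto spirito"))          # C*m sancto spirito
print(naive_censor("My poor pussy cat is sick!"))  # My poor p***y cat is sick!
```

The filter only ever sees a string of letters. Nothing in it knows or cares that the first example is Latin.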

This starts to set a dangerous precedent. In addition to quickly losing words like the Latin “cum”, it’s only a matter of time before other words start getting blocked too. It also leads to trouble even within a single language. For instance, in British English, “fag” means “cigarette” (coming from the original “faggot”, a small bundle of wood for starting a fire), but in American English, “fag” is a slur referring to a homosexual man. So, would a Londoner step outside for a fag or for a f**?

Other examples come to mind. In Castilian Spanish (Spain), “coger” means “to grab”, but in Latin American Spanish, it is a vulgar word, meaning (roughly) ‘to fuck’. If one started a Spanish-language forum, would that be censored? How about in a Spanish song on a music store like iTunes?

Russian-American comic Yakov Smirnoff has a section of his routine where he discusses the fact that in Russian, “Yep” (or “Yeb”) is a vulgar root for “to fuck”, and his initial shock when, in America, a vulgar word in his language is used so frequently in conversation (I believe his routine went something like “Do you want to go out?” “Yep” “Wow! Great!”). Of course, this is an extreme (and exaggerated) example, but the concept still stands.

Many languages share parts of their phonetic inventories (libraries of sounds), so it’s certainly not uncommon to see such surface correspondences in spelling or pronunciation between innocent words and swear words in different languages. The trick is training computers to understand when you’re dealing with “cum” and when you’re dealing with “c*m”.

The trouble with automated language anything is that computers are pretty bad at figuring out which language is being used. There are language guesser programs, and although they’re amazingly good at what they do (and handy in a pinch), they’re also a whole new step in the process, and one that I suspect most webmasters and content providers would be reluctant to include. For a service like iTunes (or an internet forum) to first run everything through a language guesser and then censor appropriately would require much more computer time, and most language guessers need large banks of text to function well. On top of that, it’s common to find little dabs of languages like Latin in vast English posts.
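Here is a toy sketch of why both of those problems bite. It guesses a language by counting common function words; real language guessers typically work from character statistics learned from large corpora, so this is a deliberate simplification. The word lists and the name `guess_language` are my own assumptions.

```python
import re

# Tiny, hypothetical function-word lists, chosen only for illustration.
FUNCTION_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "it", "that", "was", "with"},
    "latin": {"et", "in", "est", "non", "cum", "ad", "ut", "sed", "qui", "ex"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "es", "por", "con"},
}

def guess_language(text):
    """Return the language whose function words appear most often, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Too little text: no function words at all, so the guesser has nothing to go on.
print(guess_language("Sancto spirito"))  # None

# A dab of Latin in an English post: the whole post is labelled English.
post = "It was the end of the concert and the choir sang cum sancto spirito"
print(guess_language(post))  # english
```

In the second case, a censor that trusts the guess would still star out the Latin word, because the guess is made for the post as a whole.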

One potential way of training these programs is to look for context and correspondence. For instance, “pussy” is not censored if “cat” appears directly afterwards, or “cock” is not censored if “hen” or “fight” occurs within that same sentence. This too adds computing time, and it’s not entirely reliable. For instance, in a Google search for “cum latin”, even though much of the page deals with Latin grammar, the first two sponsored links are to adult sites (in retrospect, I should’ve known better, but I was looking for other examples of usage). So, filtering based purely on context and correspondence will likely have issues too.
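A rough sketch of those two rules, using only the examples above, might look like the following. The rule tables, the function names, and the crude sentence splitting are all my own assumptions rather than anything a real filter is known to use.

```python
import re

# Hypothetical rule tables built from the examples in this post.
NEXT_WORD_EXCEPTIONS = {"pussy": {"cat"}}         # innocent if this word comes next
SENTENCE_EXCEPTIONS = {"cock": {"hen", "fight"}}  # innocent if any of these share the sentence
BLOCKLIST = set(NEXT_WORD_EXCEPTIONS) | set(SENTENCE_EXCEPTIONS)

def mask(word):
    """Keep the first and last letters and star out the middle."""
    return word[0] + "*" * (len(word) - 2) + word[-1]

def is_innocent(word, following, sentence_words):
    """Apply the next-word rule and the same-sentence rule."""
    if following in NEXT_WORD_EXCEPTIONS.get(word, set()):
        return True
    return bool(SENTENCE_EXCEPTIONS.get(word, set()) & sentence_words)

def censor_sentence(sentence):
    matches = list(re.finditer(r"[A-Za-z]+", sentence))
    words = [m.group(0).lower() for m in matches]
    present = set(words)
    pieces, last = [], 0
    for i, m in enumerate(matches):
        following = words[i + 1] if i + 1 < len(words) else None
        pieces.append(sentence[last:m.start()])  # keep spacing and punctuation
        if words[i] in BLOCKLIST and not is_innocent(words[i], following, present):
            pieces.append(mask(m.group(0)))
        else:
            pieces.append(m.group(0))
        last = m.end()
    pieces.append(sentence[last:])
    return "".join(pieces)

def censor(text):
    # Crude sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(censor_sentence(s) for s in sentences)

print(censor("My poor pussy cat is sick!"))  # unchanged
print(censor("The cock chased the hen."))    # unchanged
print(censor("The cock crowed at dawn."))    # The c**k crowed at dawn.
```

The last line is the catch: a perfectly innocent rooster still gets starred out, because nobody thought to add “crowed” to the list. Every rule needs someone to anticipate the context in advance.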

Perhaps the most obvious solution is to sort by hand, and use a little common sense. However, for those of you seeking a thesis, an intelligent, language-aware filter could probably make life a whole lot easier for a number of forums and services. Who knows, if you do well, you might even graduate summa c*m laude.
