Periodically, one goes through periods of deep metaphysical malaise. You look around at the world, wondering how such evil could flourish and such suffering could endure. You descend deeper into darkness, your faith in humanity waning, wondering why we were ever born into this cruel world. Then, suddenly, you realize that somebody has written a programming language based off of the dialect of Lolcats/Cat Macros, and your faith in humanity’s inherent good is completely restored.

LOLCode is a computer programming language concept which draws its vocabulary from the recent internet sensation of captioned cat pictures. Although not fully functional yet, it’s still linguistically fascinating on many different levels, and deserves mention.

i has dialect

One of the most interesting parts of this programming language is that it can exist at all, and the fact that it can goes a long way towards establishing the legitimacy of a feline dialect.

Imagine that I wanted to create a programming language based solely off of Star Wars vocabulary. I would likely start by finding a donor language, whose basic syntax and ideas I would borrow. Then, I would begin to slowly find equivalents and their translations.

Some equivalent/translation pairs might be obvious. ‘Death Star’ for a verb which meant “remove file”, maybe ‘carbonite’ for “pause process”. One could even get a bit more ornate and incorporate some movie quotes. Perhaps “there is an error” could be coded with ‘It’s a Trap!’, and “load this program” could be ‘Commence Primary Ignition’.

However, no matter how nerdy I felt at the time, my plan would be fatally flawed from the outset. Sooner or later, I would find an expression that was too niche (fulfilling just a small purpose) to have a Star Wars equivalent. I’d have to rely on a set canon of phrases to fill in the blanks, and there’s no way to work around it and still maintain the Star Wars theme.

The reason that LOLCode is so awesome is that, from what I’ve seen so far, it doesn’t seem to have that limit. Based on my highly scientific research at icanhascheezburger.com, it would appear that LOLCat has become a full-fledged dialect. There are many captioned images there, each slightly different, and each seems to fit a coherent grammatical pattern. Some linguists are starting to pick up on distinct patterns and grammatical rules, and given that any sentence can now be LOLCatted, I’m quite tempted to say that LOLCat has become a productive and functional dialect of English.

Because of this productivity of the LOLCat dialect, it would be quite possible for somebody to take any given sentence or idea and put it into LOLCat, thus ensuring that LOLCode could, in theory, become fully functional without ever breaking character. This is very exciting, and very awesome.

mai translationz r not straitforwerd

LOLCode is a very special sort of translation. Conventionally, when one sits down to label a cat, the source is an English sentence (I’m yet to find any cats “en mi refrigeradora, comiendo mis comidaz”). However, here, what people are doing is finding equivalents in human/feline language for concepts, verbs, and ideas within a computer language.

Rather than being able to simply translate, they’re forced to create the inflexible, ambiguity-free grammar required to tell a computer what to do. This is tough enough to do even using all sorts of abstract symbols, but to do it within LOLCat dialect and syntax is wonderfully difficult. They’re adapting a human language into a dialect, then bending it into a computer language. This is by no means an easy ask, and it’s a far more complex sort of translation than most.

For this alone, I salute the creator and contributors to LOLCode. Although it may seem silly to some, this is really some top-of-the-line linguistic work.

d00d. ur dialect is teh suxx0rs

Perhaps even more interesting than the mere fact that LOLCat has become a translatable dialect is the fact that, well, there are already people who are arguing about the “correct” way to say something in LOLCat. Take, for instance, this post on the LOLCode wiki:

I know VISIBLE is the current output command, but it’s so not LOLCAT. What if we used LOL as the output instead? So, the Count-1 example becomes:

(Code)

I think this works very well, is funny to read and matches actual LOLCAT protocol, sorta. I guess the LOL would be at the end normally.

As a linguist, this is really, really exciting. People are already trying to step in and enforce the “rules” of the LOLCat dialect. It seems like, as a “native speaker” of LOLCat, the author of this page had a distinct intuition about the “proper” means of expressing a concept in this dialect. Truly incredible.

Although this community of people has only arisen recently, I’m very excited at the potential for the later discussions of “proper” LOLCat, and the sociolinguistic goodness sure to arise from it.

o hai. i discussed ur werk.

So, author of (and contributors to) LOLCode: I salute you. This is a unique, wonderful, and groundbreaking project, and I really hope that it continues to yield such fascinating linguistic insight into the future.

Keep up the good work, and don’t let anybody convince you that what you’re building is silly or unnecessary. If there are two things that the world of technology needs, it’s probably humor and cute, fuzzy animals, and really, I can’t think of a better way to combine the two.

Alright, I’m done. kthxbye

Tagged with Computational Linguistics, Conventional Linguistics, Dialects and Idiolects, Language Humor, Language Usage, Language, Computers, and the Internet, Sociolinguistics, Translation and Translation Theory | 32 Comments


So, today, my girlfriend sent me a link to this comic by XKCD (CAUTION: Not Safe for Work language, namely, the F-Bomb). What’s funny about it is not so much the content, but the fact that they’re ripping not only on Linguistics, but specifically on computational linguistics. Having done a bit of corpus linguistics myself, I couldn’t help but laugh, and, quite frankly, he does have a bit of a point at times. So, I applaud him for taking a strong stand.

As a counterstrike, I propose that some bored Computational Linguist create a corpus of all the text from XKCD’s website. That way, no matter their feelings, fears, or secret desires, XKCD will always be aiding the cause of Computational Linguistics. Our revenge will be swift and searchable.

Tagged with Computational Linguistics, Language Humor, Notes, Tirades | 2 Comments


(By the way, the title of the post is “Automated Censorship Breakage”, but it shows how people assume the worst when unnecessary asterisks are involved.)

Hot on the heels of my post on automated language translation, Arnold Zwicky over at the Language Log made a wonderful post, entitled “C*m sancto spirito”, about automated censorship systems and the havoc they can wreak.

The example that jumped out at me the most is from the title: the Latin “cum” (‘with’) was censored because it’s spelled like the American slang word for “semen”. Incidentally, Wikipedia tells me that “cum” is also Irish for ‘invent’ and Bengali for ‘kiss’, so Latin isn’t the only language where the word has an innocent meaning.

This isn’t merely an example of context-insensitive automatic censorship (“My poor p***y cat is sick!”) or overzealous word blocking (“Fertilization occurs when the s***m travels through the v****a into the uterus..”); this is a whole new area. This is censoring English letter combinations in other languages, regardless of what they may actually mean.

This starts to set a dangerous precedent. In addition to quickly losing words like the Latin “cum”, it’s only a matter of time before other words start getting blocked too. It also leads to trouble even within a language. For instance, in British English, “fag” means “cigarette” (coming from the original “faggot”, a small bundle of wood for starting a fire), but in American English, “fag” is a slur referring to a homosexual man. So, would a Londoner step outside for a fag or for a f**?

Other examples come to mind. In Castilian Spanish (Spain), “coger” means “to grab”, but in Latin American Spanish, it is a vulgar word, meaning (roughly) ‘to fuck’. If one started a Spanish language forum, would that be censored? How about in a Spanish song on a music store like iTunes?

Russian-American comic Yakov Smirnoff has a section of his routine where he discusses the fact that in Russian, “Yep” (or “Yeb”) is a vulgar root for “to fuck”, and his initial shock when, in America, a vulgar word in his language is used so frequently in conversation (I believe his routine went something like “Do you want to go out?” “Yep” “Wow! Great!”). Of course, this is an extreme (and exaggerated) example, but the concept still stands.

Many languages share parts of their phonetic inventories (library of sounds), so it’s certainly not uncommon to see such surface correspondences in spelling or pronunciation between innocent words and swear words among different languages. The trick is training computers to understand when you’re dealing with “cum” and when you’re dealing with “c*m”.

The trouble with automated language anything is that computers are pretty bad at figuring out which language is being used. There are language guesser programs, and although they’re amazingly good at what they do (and handy in a pinch), they’re also a whole new step in the process, and one that I suspect most webmasters and content providers would be reluctant to include. For a service like iTunes (or an internet forum) to first run everything through a language guesser and then censor appropriately would require much more computer time, and most language guessers need large banks of text to function well. Not to mention that it’s common to find little dabs of languages like Latin in vast English posts.
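To make the “little dabs of Latin” problem concrete, here is a toy language guesser in Python. It is only a sketch under loud assumptions: the `FUNCTION_WORDS` lists and the `guess_language` function are made up for illustration, and real language guessers use far more data and smarter statistics than counting a handful of common words.

```python
import re

# Hypothetical, deliberately tiny lists of common function words per language.
# A real guesser would be trained on large banks of text.
FUNCTION_WORDS = {
    "english": {"the", "and", "of", "is", "to", "in", "that", "it", "with"},
    "latin": {"et", "in", "cum", "est", "non", "ad", "qui", "sed"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "con", "es"},
}


def guess_language(text):
    """Return the language whose function words appear most often, or None."""
    words = re.findall(r"[a-záéíóúñ]+", text.lower())
    scores = {
        language: sum(word in vocabulary for word in words)
        for language, vocabulary in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No known function words at all: refuse to guess.
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the kitchen with the dog"))
print(guess_language("Gloria in excelsis et in terra pax cum sancto"))
print(guess_language("He graduated summa cum laude from the university in the spring"))
```

The first two calls print `english` and `latin`. The third prints `english`, which is a fair verdict for the sentence as a whole, and exactly the problem: a filter that guesses one language per post would go on to treat the Latin “cum” as an English word.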

One potential way of training these programs is to look for context and correspondence. For instance, “pussy” is not censored if “cat” appears directly afterwards, or “cock” is not censored if “hen” or “fight” occurs within that same sentence. This too creates additional computing time, and it’s not entirely reliable. For instance, in a Google search for “cum latin”, even though much of the page deals with Latin grammar, the first two sponsored links are to adult sites (in retrospect, I should’ve known better, but I was looking for other examples of usage). So, filtering based purely on context and correspondence will likely have issues too.
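A rough Python sketch of that context-and-correspondence idea might look like the following. Everything in it is an assumption made for illustration: the `SAFE_CONTEXT` table, the function names, and the simplification that a flagged word is left alone whenever any of its “innocent” companions shows up anywhere in the same sentence (rather than only directly afterwards).

```python
import re

# Hypothetical rules: flagged word -> companion words that suggest innocent use.
SAFE_CONTEXT = {
    "pussy": {"cat"},
    "cock": {"hen", "fight"},
    "cum": {"laude", "sancto"},
}


def mask(word):
    """Star out everything except the first and last letter, e.g. cum -> c*m."""
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor_sentence(sentence):
    """Censor flagged words unless a safe companion occurs in the sentence."""
    present = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}

    def replace(match):
        word = match.group(0)
        companions = SAFE_CONTEXT.get(word.lower())
        if companions is None or companions & present:
            return word  # not flagged, or the context looks innocent
        return mask(word)

    return re.sub(r"[A-Za-z]+", replace, sentence)


def censor(text):
    """Split on sentence-ending punctuation and censor sentence by sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(censor_sentence(s) for s in sentences)


print(censor("My poor pussy cat is sick!"))
print(censor("The cock crowed. The cock and hen slept."))
```

The first line comes through untouched, while the second prints `The c**k crowed. The cock and hen slept.` So the poor rooster still gets censored the moment no hen is in the sentence, which is the unreliability in a nutshell: the table of safe companions can never be complete.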

Perhaps the most obvious solution is to sort by hand, and use a little common sense. However, for those of you seeking a thesis, an intelligent, language-aware filter could probably make life a whole lot easier for a number of forums and services. Who knows, if you do well, you might even graduate summa c*m laude.

Tagged with Computational Linguistics, Conventional Linguistics, Language Usage, Language, Computers, and the Internet | 3 Comments

