
Your Literary Masterpiece Was Delicious

For the past few days I've been working on a little project that reads Russian novels and converts them to pretty pictures. Specifically, it turns them into pastel-colored graphs that show you the contextual relationships between characters, based on how many times their names occur together in the text.
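The counting step can be sketched in a few lines. This is a minimal illustration in Python, not the Perl used for the project; the function name, the choice of the paragraph as the co-occurrence window, and the sample names are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(paragraphs, names):
    """Count, for each pair of names, the paragraphs where both appear.

    Assumption: two characters "occur together" when their names show up
    in the same paragraph. A sentence or a sliding window would also work.
    """
    pair_counts = Counter()
    for paragraph in paragraphs:
        tokens = set(re.findall(r"\w+", paragraph))
        present = sorted(name for name in names if name in tokens)
        # Each unordered pair is stored once, in alphabetical order.
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


if __name__ == "__main__":
    sample = [
        "Margarita met Woland.",
        "Woland spoke to Pilat.",
        "Margarita and Woland left.",
    ]
    for pair, count in cooccurrence_counts(sample, {"Margarita", "Woland", "Pilat"}).items():
        print(pair, count)
```

The resulting counts are what would drive the thickness of each connection in the graph.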

The point of the exercise is to try to create character maps for novels that can then serve as clickable, interactive maps to the text, and to do so with as little supervision as possible. Combining such an interface with flexible, full-text search and annotation tools could open up a whole new way of interacting with literary texts.

Here, for example, is a section of the graph for Mikhail Bulgakov's "The Master and Margarita". The size of the rectangles indicates the relative importance of each character, and the thickness of the connections shows you the strength of the relationship between them (you can click through to a full version of this graph if you have the SVG plugin):

This project poses an interesting problem in natural language processing: given a text, how can you isolate the names of characters? Is it possible to do it just based on capitalization patterns? And how do you resolve characters that have multiple variant names? For example, the character of Pontius Pilate appears in the Russian text as "Pontii Pilat", "Pilat", "prokurator", "hegemon". Is there any way to flag the many variants as one?
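One way to try the capitalization idea is to count capitalized words that are not the first word of a sentence, since sentence-initial capitals carry no information. The sketch below is a guess at such a heuristic, written in Python for illustration; the function name, the sentence splitter, and the `min_count` threshold are assumptions, not the project's actual method.

```python
import re
from collections import Counter


def candidate_names(text, min_count=2):
    """Return capitalized words seen mid-sentence at least min_count times."""
    counts = Counter()
    # Crude sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence)  # runs of letters, any alphabet
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper():
                counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = (
        "Pilat looked up. The prisoner faced Pilat. "
        "Then Pilat spoke to Margarita. Nobody saw Margarita leave."
    )
    print(candidate_names(sample))
```

A heuristic like this cannot tell a person from a place, and it does nothing about titles such as "hegemon" that are never capitalized, so its output is a list of guesses rather than a list of characters.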

If you're working in Russian (as in this example), you also have to face the issue of normalization (Russian names can take one of six case endings, which depend on gender and number), and in whatever language you work, you run into the problem of anaphora resolution (figuring out which antecedent words like "he" and "it" refer to, sometimes over a span of many paragraphs). The key to success in NLP is knowing where the computer should apply brute force, and where it needs to call in a human being to help with the hard bits. For example, to generate the graph above, the machine asked me to validate some guesses it made ("Is 'Margarita' the same person as 'Margo'?"), and then I had to give it a final push over the cliff, as a Spinal Tap fan might say, by disabusing it of the notion that "Petersburg" was a proper name, or that "Ivan" and "Bosoi" were two different characters.
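The division of labor described here, where the machine guesses and a human confirms, might look something like the following Python sketch. It uses a shared prefix as a crude stand-in for real morphological normalization; the function names, the `min_prefix` threshold, and the transliterated sample forms are assumptions for illustration only.

```python
from itertools import combinations


def propose_merges(names, min_prefix=4):
    """Pair up names that share their first min_prefix letters."""
    return [
        (a, b)
        for a, b in combinations(sorted(names), 2)
        if a[:min_prefix].lower() == b[:min_prefix].lower()
    ]


def resolve_aliases(names, confirm, min_prefix=4):
    """Map every name to a canonical form, asking confirm(a, b) per guess."""
    canonical = {name: name for name in names}
    for a, b in propose_merges(names, min_prefix):
        if confirm(a, b):
            target, source = canonical[a], canonical[b]
            # Fold everything already grouped with b into a's group.
            for name, representative in canonical.items():
                if representative == source:
                    canonical[name] = target
    return canonical


if __name__ == "__main__":
    def ask(a, b):
        return input(f"Is '{a}' the same person as '{b}'? [y/n] ").strip().lower() == "y"

    forms = ["Margarita", "Margarity", "Margo", "Pilat", "Pilata"]
    print(resolve_aliases(forms, ask))
```

Prefix matching would also pair names that merely look alike, which is why the human gets the final say, and it has no hope of connecting "Pilat" with a title. Those cases still need to be supplied by hand.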

More importantly, this kind of project poses an interesting cultural litmus test. To a computer hacker, treating a literary work like a data stream and mining it for patterns is an amusing challenge, but there are many people who would frown on this kind of thing as a reductionist attack on a living text. When I was first learning Perl, I remember writing a toy program that read all my letters to my girlfriend, and her letters to me, and spit out lists of words that only one of us had ever used. I thought it was a neat trick, while she was livid with me for treating our correspondence as just so much raw material for my computer games. It was a surprisingly touchy topic.

For as long as there has been writing, there have been cabalists eager to quantify, rearrange, and play with texts to reveal hidden inner meanings. The humanities have an understandable suspicion of such textual games, and a less understandable reluctance to explore the many worlds that fast computers and the mass digitization of texts are opening up. But as natural language processing techniques improve, the humanities are going to have to come to terms with computers, whether they like it or not. We geeks are going to eat all of their favorite books.

Teachers are already battling the problem of Google plagiarism, or in its mild form, Google-only research. And soon it will be possible to auto-generate meaningful summaries and essays on many topics with minimal effort. Software already exists that grades essays as well as a human TA, so it's not hard to imagine a nice NLP essay generator that would withstand the scrutiny of a harried graduate student with two hundred papers to grade. The only way around this challenge will be to adapt to the new technologies, or find ways to successfully exclude them. Just like the programming example I gave, the trick will be finding where to draw the line between the computer and the reader. It will be interesting to see how the humanities cope with the problem that math ran into twenty years ago, with the advent of the cheap pocket calculator.

My own interest is finding a way to make these pretty pictures useful in some way to the reader. For example, the image above links to a full-size version of the graph in scalable vector graphics (SVG) format, which among other things lets you create hyperlinks to web pages and interact with the drawn image. So you can do things like link every edge in the graph to a page that displays all the paragraphs where those two characters in the book interact, or even start to explore connections across many books (consider the grand cycles of Balzac or Zola). Perhaps scripted visualizations like this will open the texts to more intensive study, and enable some kinds of scholarship that up to now have been very tedious. There's already been a small revolution in lexicography thanks to automated corpora. Why not extend it to literature?

I'm happy to announce that my talk proposal on hacking literature has been accepted for this year's O'Reilly Open Source Convention, which takes place in July. If the topic is something you find interesting, I hope you will come to the presentation, or download the notes and code from this site as they become available. And please email me your own "wouldn't it be cool if" ideas about visualizing literature, so I can steal them and win glory and fame. I'll keep posting little demos (and begin posting links to code) as time allows.

For the technically curious, this graph was generated by an unholy combination of Perl, neato (graph layout software), SVG, and the amazing lib.ru website, which has full-text copies of every Russian novel you can name. Because of the way Russian copyright law works, you can use the texts for educational purposes without restriction.
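For readers who want to try a similar pipeline, here is a minimal sketch of the hand-off to neato, written in Python rather than Perl. It turns character and pair counts into Graphviz DOT text. The function name, the scaling constants, the fill color, and the URL scheme are all assumptions for illustration; only the DOT attributes themselves (`shape`, `width`, `height`, `penwidth`, `URL`) are standard Graphviz.

```python
def to_dot(importance, edges, link_for=None):
    """Render a character graph as Graphviz DOT text.

    importance: dict of name -> occurrence count (drives box size)
    edges:      dict of (name_a, name_b) -> co-occurrence count (drives line width)
    link_for:   optional function (name_a, name_b) -> URL attached to the edge
    """
    top = max(importance.values(), default=1)
    strongest = max(edges.values(), default=1)
    lines = [
        "graph characters {",
        "  node [shape=box, style=filled, fillcolor=lavender];",
    ]
    for name, count in sorted(importance.items()):
        size = 0.5 + 1.5 * count / top  # box width in inches, scaled to the top character
        lines.append(f'  "{name}" [width={size:.2f}, height={size / 2:.2f}];')
    for (a, b), count in sorted(edges.items()):
        attrs = [f"penwidth={1 + 4 * count / strongest:.2f}"]
        if link_for:
            attrs.append(f'URL="{link_for(a, b)}"')
        lines.append(f'  "{a}" -- "{b}" [{", ".join(attrs)}];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    dot = to_dot(
        {"Margarita": 10, "Woland": 20},
        {("Margarita", "Woland"): 5},
        link_for=lambda a, b: f"pairs/{a}-{b}.html",  # hypothetical page layout
    )
    print(dot)
```

Saving the output as `characters.dot` and running `neato -Tsvg characters.dot -o characters.svg` should produce an SVG in which each edge carrying a `URL` attribute is a hyperlink.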

Idle Words

Your Host

Maciej Cegłowski


Please ask permission before reprinting full-text posts or I will crush you.