Computer-generated articles are a new “long tail” of journalism, not the downfall of journalism
Tuesday, October 25th, 2011
The New York Times recently ran a story about Narrative Science, a startup with some very clever software for computer-generating new articles based on structured data feeds. Automatically generated content as a strategy has been around for a long time, and for an SEO-minded online publisher, it makes a ton of sense. Here are some examples:
- If you look at espn.com’s page about a sporting event that hasn’t happened yet, you’ll find it’s full of all kinds of stats, scheduling data, and general information about the venue and the two teams involved, presented in a prose-like format. What this means SEO-wise is that their page gets a head start in search engines’ indices, and when game time comes and passes, they are locked and loaded to be the top search result when people search for scores and recaps of the game.
- I personally developed a quiz site that takes in facts about movies from the Netflix API and automatically generates a trivia quiz, one for every title in Netflix’s catalog — over 100,000 movie quizzes.
- Google Finance aggregates raw financial data in a way that is compelling and useful, and seems almost editorial, but isn’t.
- SEO spam blogs have gone further than any of these examples, grabbing snippets of real, topical text from around the internet and concatenating them into a pseudo-grammatical amalgam of prose that feels like it’s saying something but isn’t. Often, the approximation of real, meaningful text is close enough to fool a search engine, attract search traffic, and monetize that traffic through contextually targeted advertising. If it fools the search engines, it’ll fool the contextual targeting algorithms, too, and what better way to get people to go straight for your ads than making your ads relevant and your content meaningless!
Of course, the SEO blog and Twitter community picked up on this article and it spread like wildfire. For someone in the business of doing whatever it takes to get search engine traffic at a minimum cost, machine-generated content has great appeal. However, the appeal to a true journalistic organization remains minimal. As Times writer Steve Lohr points out, the typical problem with this type of content is the lack of editorial quality. ESPN’s pre-game SEO warmup pages are not particularly interesting to read a month ahead of the match (e.g.), and I confess some of my machine-generated Netflix movie quizzes are not particularly compelling (here’s a real gem).
The disadvantage to these, of course, is that the editorial quality is quite poor. While Google might be fooled by this prose, no intelligent reader is.
Lohr goes on to point out that unlike the text generators that have come before, Narrative Science’s articles have a very natural tone, and they use contextual cues to create a real narrative story with a theme, rather than just listing facts (e.g., an article about a sports team’s victory might create a narrative thread around the fact that the team had been on a losing streak).
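To make the idea of a "contextual cue" concrete, here is a minimal sketch of how a generator might pick a theme from structured data before writing a sentence. This is not Narrative Science’s actual method, which the Times article does not describe; the `GameResult` fields, the thresholds, and the templates are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class GameResult:
    # Hypothetical structured-data record; field names are invented for this sketch.
    winner: str
    loser: str
    winner_score: int
    loser_score: int
    winner_prior_losses: int  # consecutive losses the winner had before this game


def pick_angle(game: GameResult) -> str:
    """Choose a narrative theme from cues in the data (thresholds are arbitrary)."""
    if game.winner_prior_losses >= 3:
        return "streak_snapped"
    if game.winner_score - game.loser_score >= 20:
        return "blowout"
    return "plain"


# One sentence template per theme; a real system would need far more variety.
TEMPLATES = {
    "streak_snapped": (
        "After {winner_prior_losses} straight losses, {winner} finally broke "
        "through, beating {loser} {winner_score}-{loser_score}."
    ),
    "blowout": "{winner} rolled over {loser}, {winner_score}-{loser_score}.",
    "plain": "{winner} beat {loser} {winner_score}-{loser_score}.",
}


def write_recap(game: GameResult) -> str:
    """Fill the template for the chosen theme with the record's fields."""
    return TEMPLATES[pick_angle(game)].format(**vars(game))


print(write_recap(GameResult("Tigers", "Bears", 24, 17, 4)))
```

The same score produces a different sentence depending on what else the data says about the team, which is roughly the difference between listing facts and telling a story with a theme.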
This seems to be enough to send the author, who is of course a professional journalist, into a mild panic about the prospect of computers replacing humans in journalism more generally. He quotes Narrative Science co-founder Kris Hammond:
“In five years,” he says, “a computer program will win a Pulitzer Prize — and I’ll be damned if it’s not our technology.”
This is the sort of puffery for which we cannot really fault a CEO, who is courting venture capital, but it is by no means realistic. Nonetheless, many journalists, Lohr included, feel threatened by the prospect of the machines taking over their jobs. Those who worry about that fail to understand the market for ultra-low-quality, cheaply-produced news. It is not replacing articles that are currently written by humans. Rather, it is replacing articles that would never have been written if a human had to write them. It is creating a whole new long-tail market for brand-name news that didn’t exist before. Not every minor stock price fluctuation or Junior Varsity fencing match is deemed worthy of a write-up when a human journalist must actually write it, but that does not mean there isn’t some monetary value to these content assets. When the incremental cost of new articles is next-to-nothing, the threshold for whether the article gets written is much lower. When it comes to creative work like journalism, there’s no substitute for the human touch, but Narrative Science presents a viable substitute for creating no work at all.
Incremental advances in natural language products like Narrative Science and Apple’s “Siri” should be considered with a measure of historical perspective on the field of computational linguistics. This is a discipline with a long history of bold claims about how soon computers will be performing all kinds of human functions. During the past half century, computer hardware and software have improved enormously, but computers that communicate the way humans do have proved far more elusive than anyone anticipated.
There is another crucial component to the success of machine-generated prose: structured data streams. No one ever won a Pulitzer Prize for merely reporting the facts, let alone facts so clear and uniform as to be represented as structured data in a database. However, that’s what Narrative Science needs in order to work its magic. Consider this blog post. If it were written by a computer, what would the configuration file look like?
- format= review/response
- tone= skeptical
- reference_uri= http://www.nytimes.com/2011/09/11/business/computer-generated-articles-are-gaining-traction.html
- reference_author= Steve Lohr
What’s missing from the above database config params? The opinions in this post are greatly informed by my personal experiences with computational linguistics, journalism, and online publishing, which are intangible and unique to me.
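As a rough illustration of the limitation, here is a minimal sketch of a generator that has only those four parameters to work with. The `parse_config` and `opening_line` helpers and the sentence template are hypothetical, made up for this post rather than taken from any real product.

```python
# The four parameters from the list above, written as plain "key= value" lines.
CONFIG_TEXT = """\
format= review/response
tone= skeptical
reference_uri= http://www.nytimes.com/2011/09/11/business/computer-generated-articles-are-gaining-traction.html
reference_author= Steve Lohr
"""


def parse_config(text: str) -> dict[str, str]:
    """Split each 'key= value' line on the first '=' and trim whitespace."""
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


def opening_line(params: dict[str, str]) -> str:
    # Hypothetical template: it can only recombine what the config provides.
    return (
        f"A {params['tone']} {params['format']} to {params['reference_author']}'s "
        f"article ({params['reference_uri']})."
    )


params = parse_config(CONFIG_TEXT)
print(opening_line(params))
```

Everything the program can say is a rearrangement of those four values. There is no field where the author’s experience could go.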
Alternatively, consider a hypothetical news story in which a member of Congress is caught up in some bizarre sexual fiasco. Finding the maximally intriguing angle for this story depends greatly on whether the congressperson was a liberal eccentric, a right-wing bastion of family values, or something else. But the role that factor plays in the interestingness of the news story is very subtle and abstract; it is not even close to being the kind of structured data that Narrative Science needs. Successful journalists on topics that are not purely informational must appreciate irony, have a sense of humor, and possess a deep understanding of their topic area and why other people care about it. Computer software may take on more of the grunt work of low-profile journalism, but it will not be winning any Pulitzer Prizes in our lifetime, or even putting good journalists out of work.
Still, it is worth pointing out a threat this might pose to journalism in the long run. Those Junior Varsity fencing matches do occasionally get coverage, and when they do, it’s student journalists and unpaid interns who write them. Taking work away from these budding journalists and giving it to a computer may be good for a particular publisher, but in the long run, it’s a bad move for the field. If all the entry-level jobs reporting rote facts dried up, if the talent development system itself were replaced by automation, that might indeed spell the beginning of the end of journalism as we know it.