Computer-generated articles are a new “long tail” of journalism, not the downfall of journalism

October 25th, 2011

The New York Times recently ran a story about Narrative Science, a startup with very clever software for computer-generating news articles based on structured data feeds.  Automatically-generated content as a strategy has been around for a long time, and for an SEO-minded online publisher, it makes a ton of sense.  Here are some examples:

  • If you look at espn.com’s page about a sporting event that hasn’t happened yet, you’ll find it’s full of all kinds of stats, scheduling data, and general information about the venue and the two teams involved, presented in a prose-like format.  What this means SEO-wise is that their page gets a head start in search engines’ indices, and when game time comes and passes, they are locked and loaded to be the top search result when people search for scores and recaps of the game.
  • I personally developed a quiz site that takes in facts about movies from the Netflix API and automatically generates a trivia quiz, one for every title in Netflix’s catalog — over 100,000 movie quizzes.
  • Google Finance aggregates raw financial data in a way that is compelling and useful, and seems almost editorial, but isn’t.
  • SEO spam blogs have gone farther than any of these examples, grabbing snippets of real, topical text from around the internet and concatenating them into a pseudo-grammatical amalgam of prose that feels like it’s saying something but isn’t.  Often, the approximation of real, meaningful text is close enough to fool a search engine, attract search traffic, and monetize that traffic through contextually-targeted advertising.  If it fools the search engines, it’ll fool the contextual targeting algorithms, too, and what better way to get people to go straight for your ads than making your ads relevant and your content meaningless!

Of course, the SEO blog and Twitter community picked up on this article and it spread like wildfire.  For someone in the business of doing whatever it takes to get search engine traffic at minimal cost, machine-generated content has great appeal.  However, the appeal to a true journalistic organization remains minimal. As Times writer Steve Lohr points out, the typical problem with this type of content is the lack of editorial quality. ESPN’s pre-game SEO warmup pages are not particularly interesting to read a month ahead of the match (e.g.), and I confess some of my machine-generated Netflix movie quizzes are not particularly compelling (here’s a real gem).

The disadvantage to these, of course, is that the editorial quality is quite poor.  While Google might be fooled by this prose, no intelligent reader is.

Lohr goes on to point out that unlike the text-generators that have come before, Narrative Science’s articles have a very natural tone, and they use contextual cues to create a real narrative story with a theme, rather than just listing facts (e.g., an article about a sports team’s victory might create a narrative thread around the fact the team had been on a losing streak).
This seems to be enough to send the author, who is of course a professional journalist, into a mild panic about the prospect of computers replacing humans in journalism more generally.  He quotes Narrative Science co-founder Kris Hammond:

“In five years,” he says, “a computer program will win a Pulitzer Prize — and I’ll be damned if it’s not our technology.”

This is the sort of puffery for which we cannot really fault a CEO, who is courting venture capital, but it is by no means realistic.  Nonetheless, many journalists, Lohr included, feel threatened by the prospect of the machines taking over their jobs.  Those who worry about that fail to understand the market for ultra-low-quality, cheaply-produced news.  It is not replacing articles that are currently written by humans.  Rather, it is replacing articles that would never have been written if a human had to write them.  It is creating a whole new long-tail market for brand-name news that didn’t exist before.  Not every minor stock price fluctuation or Junior Varsity fencing match is deemed worthy of a write-up when a human journalist must actually write it, but that does not mean there isn’t some monetary value to these content assets.  When the incremental cost of new articles is next-to-nothing, the threshold for whether the article gets written is much lower.  When it comes to creative work like journalism, there’s no substitute for the human touch, but Narrative Science presents a viable substitute for creating no work at all.

Incremental advances in natural language products like Narrative Science and Apple’s “Siri” should be considered with a measure of historical perspective on the field of computational linguistics.  This is a discipline with a long history of bold claims about how soon computers will be performing all kinds of human functions.  During the past half century, computer hardware and software have improved enormously, but computers that communicate like humans do have proved far more elusive than anyone anticipated.

There is another crucial component to the success of machine-generated prose: structured data streams.  No one ever won a Pulitzer Prize for merely reporting the facts, let alone facts so clear and uniform as to be represented as structured data in a database.  However, that’s what Narrative Science needs in order to work its magic.  Consider this blog post.  If it were written by a computer, what would the configuration file look like?

  • format= review/response
  • tone= skeptical
  • reference_uri= http://www.nytimes.com/2011/09/11/business/computer-generated-articles-are-gaining-traction.html
  • reference_author= Steve Lohr

What’s missing from the above database config params? The opinions in this post are greatly informed by my personal experiences with computational linguistics, journalism, and online publishing, which are intangible and unique to me.
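To make the dependence on structured data concrete, here is a minimal sketch of template-based generation from a single structured record. It is only an illustration of the general idea, not Narrative Science’s actual method, which is not public in this post; the `GameResult` fields, the thresholds, the templates and the team names are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class GameResult:
    """One row of a hypothetical structured sports feed."""
    winner: str
    loser: str
    winner_score: int
    loser_score: int
    winner_prior_losses: int  # consecutive losses before this game


def pick_angle(game: GameResult) -> str:
    """Choose a narrative theme from contextual cues in the data.

    The thresholds here are arbitrary, illustrative choices.
    """
    if game.winner_prior_losses >= 3:
        return "streak_snapped"
    if game.winner_score - game.loser_score >= 20:
        return "blowout"
    return "plain"


# One sentence template per angle; the placeholders are GameResult field names.
TEMPLATES = {
    "streak_snapped": (
        "{winner} snapped a {winner_prior_losses}-game losing streak, "
        "beating {loser} {winner_score}-{loser_score}."
    ),
    "blowout": "{winner} routed {loser} {winner_score}-{loser_score}.",
    "plain": "{winner} beat {loser} {winner_score}-{loser_score}.",
}


def write_recap(game: GameResult) -> str:
    """Fill the template for the chosen angle with the record's fields."""
    return TEMPLATES[pick_angle(game)].format(**asdict(game))


# Hypothetical record, as it might arrive from a scores feed.
print(write_recap(GameResult("Wildcats", "Hoyas", 61, 58, 4)))
```

Every choice the program makes is driven by a named field in the record. An angle that depends on something the feed does not encode simply cannot be selected, which is the limitation the rest of this post is about.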

Alternatively, consider a hypothetical news story in which a member of congress is caught up in some bizarre sexual fiasco.  Finding the maximally intriguing angle for this story depends greatly on whether the congressperson was a liberal eccentric, a right-wing bastion of family values or something else. But what role that factor plays in the interestingness of the news story is very subtle and abstract; it is not even close to being the kind of structured data that Narrative Science needs.  Successful journalists on topics that are not purely informational must appreciate irony, have a sense of humor, and possess a deep understanding of their topic area and why other people care about it.  Computer software may take on more of the grunt work of low-profile journalism, but it will not be winning any Pulitzer Prizes in our lifetime, or even putting good journalists out of work.

Still, it is worth pointing out a threat this might pose to journalism in the long run.  Those Junior Varsity fencing matches do occasionally get coverage, and when they do it’s student journalists and unpaid interns who write them.  Taking work away from these budding journalists and giving it to a computer may be good for a particular publisher, but in the long run, it’s a bad move for the field.  If all the entry-level jobs reporting rote facts dried up, if the talent development system itself were replaced by automation, that might indeed spell the beginning of the end of journalism as we know it.

What if Super Mario Brothers was made by Zynga?

May 6th, 2010

Bazynga!

(via)

Brand Names and Linguistics

April 22nd, 2010

I gave a lecture to Kyle Rawlins’ “Language and Advertising” class at Johns Hopkins today, about brand names from the perspectives of the advertiser, consumer, and linguist.  PDF of slides is below.

Don’t Always Believe Google Trends

March 25th, 2009

Any guesses at what happened in January 2006?  I’m stumped!

Foot Fetish vs Fat Porn in Google Trends

PageRank and Social Network “Authority”

October 7th, 2008

Google just applied for a patent to apply its PageRank algorithm to social networks.  On the surface, this seems like a great idea for Google and for advertisers. But is it actually useful?

Search results are based on two completely independent factors: relevance and importance.  PageRank is about importance, but other parts of Google’s organic search algorithm address relevance.  Applying PageRank to social nets makes tons of sense for determining importance, but targeting ads requires relevance, not importance. (Google already has the best-in-class engine for determining the *relevance* of social network pages: Google AdSense / Content Match.)

The only added value I see for Google and advertisers in determining which social network pages are more *important* is twofold:
a) Improving organic search results by differently valuing links from different profiles (although PR already does this indirectly, since friends=links).
b) Charging more for ads placed on more prestigious pages, like on Tila Tequila’s MySpace profile.
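
For readers who want to see what “importance” means here, below is a minimal sketch of the textbook PageRank power iteration run on a tiny friend graph. It is a simplification for illustration, not Google’s production algorithm or whatever the patent application describes; the profile names and the link structure are hypothetical, and 0.85 is just the commonly cited damping factor.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each profile to the list of profiles it links to.
    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical profiles: everyone links to the celebrity, who links back to one fan.
friends = {
    "celebrity": ["alice"],
    "alice": ["celebrity", "bob"],
    "bob": ["celebrity"],
    "carol": ["celebrity"],
}

for profile, score in sorted(pagerank(friends).items(), key=lambda kv: -kv[1]):
    print(f"{profile}: {score:.3f}")
```

Note that the score says nothing about what any profile is about. It only measures how much link weight flows into it, which is exactly why it helps with pricing prestige but not with targeting.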

This seems really cool on the surface, but the more I think about it, the less impressed I am.

Collaborative Polling

April 30th, 2008

Quibblo.com just came out with a new “Collaborative Polling” feature, whereby Flash quizzes that you display on your site will generate trackbacks, and will also display all of the other places where the quiz or poll has been taken. The poll below is one good example.


Quizzes by Quibblo.com

Misspelling Quizes: How easy is it to be #1 in google for something that doesn’t exist

April 5th, 2008

There are, generally speaking, two philosophically different kinds of SEO. Honest SEO is a process applied to a site to obtain more organic search traffic from the target audience. Very often, this involves creating content that is tailored to the way that search engine users phrase their search queries. This is just smart, customer-centric marketing. The other kind of SEO — maybe we could call it spam SEO or black hat SEO — seeks to trick search engines into believing a page is about something that it’s not actually about. For example, this post is trying to capture Miley Cyrus related search traffic, while this post is trying to capture traffic where people search for (non-existent) photos or video of Zac Efron nude. These pages provide a poor experience to the users who land on them, because they patently do not offer what the user was looking for. They are just designed to attract traffic where traffic is available.

However, not everything that is a “trick” is a black hat technique, in the sense of deceiving the user. Misspellings are a prime example. Suppose someone searches Google for “quizes” when they probably meant “quizzes”. Someone has to rank #1 for “quizes”, and hopefully for the user’s sake it is a page that is all about quizzes. All too often, however, misspelled queries return results from whatever rinky-dink web pages happen to have misspelled the same word. It provides a service to the user when marketers create targeted content around the misspelling, such as this page about “quizes”. This page offers users a chance to get the quiz content they wanted from a high-quality source. So yes, it’s an SEO trick, but not a dirty one.

I actually just created the quizes page. I’m working on creating some more “quizes-related” content for it. I’ll update this post with info on the progress of this page in terms of organic search rankings and traffic.

By the way, here’s an example of quizes-related content:

Porn Star, Pony, or Politician?

April 4th, 2008

Racehorses, politicians, and adult film stars all have funny names. Can you tell who’s who when it comes to running hard, and beating out the competition?

porn star, pony or politician quiz | digg story


Maybe Your Keywords Do Not Mean What You Think They Mean

April 3rd, 2008

Mass media can drive search traffic on the internet like nothing else. And if you’re a performance based search advertiser, you’ve gotta be on the lookout for 800 pound gorillas impinging on your keyword space. Take, for example, this poor fellow, who is just trying to sell jars of gourmet French foods in the UK, and cannot figure out why he gets such huge volumes of search traffic for the keyword ratatouille. Someone had to break it to him that this was not click fraud, but simply the fact that the biggest animated film of the past three years had the same title as his delicious eggplant ragout.  So sad.

Click-through Magnet. “Wanna Get Me Drunk?”

September 20th, 2007

Just noticed this ad on YouTube, which I thought warranted some public attention.  A girl who looks to be about 15 asks, invitingly, “Wanna get me drunk?”

Wanna Get Me Drunk?
I found this ad to be a little bit ethically problematic. So what did I do? I clicked on it to see what lay on the other side.  Alas, there’s some damn effective advertising…that is, as long as this ad isn’t CPC.  Turns out, fubar is a virtual cantina with (gasp) no drinking age limit.  That is to say, it is a cleverly skinned mainstream social networking site.

Usually, you can tell that an online ad is not running on a cost-per-click basis when its content is absurd, salacious, or otherwise destined to attract a flurry of very speculative clicks. If your ad is CPM or CPA, there is no harm in eliciting impulsive clicks from anyone impulsive or curious enough to give it a click just because “eh, what the hell”. Consider the now-famous “Fart Button” ad.
Fart Button

Think of all the clicks this must get on a ‘tween gaming site (it was my 12-year-old cousin who first brought this glorious advertisement to my attention). After all, even if you’re not a fart fan, per se, the ad speaks to you: “you know you want to”. Who knows what lies behind the fart button, but who really cares? It’s a fart button. Click.
However, if you’re buying CPC ads, you should watch out with ads like this. You don’t want to get charged individually for all those fart-fancying clicks by people who probably don’t want whatever you’re selling. (Flatulence?)  There is a common misconception that click-throughs are inherently beneficial to an ad campaign, and that click-through rate is a stat to be monitored.  This is simply not the case for a PPC campaign.  Still, I think these ads can be effective, because they engage people, and some fraction of those people will be in the target market.  And, when you’re running CPM ads and hoping to get traffic, lewd, lascivious, and absurd may be just the way to go.
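
The arithmetic behind that warning can be sketched in a few lines. The function below compares what the same ad would cost under CPC and CPM pricing; all of the numbers (impressions, click-through rates, prices) are made up for illustration and are not figures from any real campaign.

```python
def campaign_cost(impressions, click_through_rate, pricing, rate):
    """Cost of a campaign under a given pricing model.

    pricing: "CPC" (rate is the price per click) or
             "CPM" (rate is the price per 1,000 impressions).
    """
    if pricing == "CPC":
        clicks = impressions * click_through_rate
        return clicks * rate
    if pricing == "CPM":
        return impressions / 1000 * rate
    raise ValueError(f"unknown pricing model: {pricing}")


# Hypothetical numbers: 100,000 impressions for each ad.
IMPRESSIONS = 100_000
ads = {
    "sober ad": 0.001,     # 0.1% of viewers click
    "fart button": 0.02,   # 2% of viewers click
}

for name, ctr in ads.items():
    cpc = campaign_cost(IMPRESSIONS, ctr, "CPC", rate=0.50)  # $0.50 per click
    cpm = campaign_cost(IMPRESSIONS, ctr, "CPM", rate=2.00)  # $2.00 per 1,000 views
    print(f"{name}: CPC ${cpc:,.2f} vs CPM ${cpm:,.2f}")
```

Under these assumed prices the CPM bill is the same $200 for both ads, while the CPC bill grows from about $50 to about $1,000 as the click-through rate rises twentyfold. A higher click-through rate is free under CPM and expensive under CPC, so it only pays off if those extra clicks come from people in the target market.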