Aron Pilhofer acts as editor of interactive news technologies at The New York Times, overseeing a news-focused team of journalist/developers who build dynamic, data-driven applications to enhance the Times' reporting online. He joined The Times in 2005. Previously, he was at the Center for Public Integrity in Washington, and before that at Investigative Reporters and Editors (IRE.org).
In this interview, we discuss:
- The purpose served by DocumentCloud
- Lack of technology in newsrooms, and how the cloud is making information more attainable and process-able by journalists
- How the elastic capabilities of cloud computing match with the event-based spikes in demand around news
- How a "document dump" can make thousands of documents appear at once, a load that the elasticity of the cloud handles better
- Use of "CloudCrowd", a Ruby-based MapReduce library
Robert Duffner: Could you take a moment to introduce yourself and to give us some background on DocumentCloud?
Aron Pilhofer: Sure. I wear a couple of different hats. At The New York Times, I'm editor of interactive news, which is a team of developers in the newsroom who are journalists. What we do is both editorially driven and data-driven. We operate like a news desk, but we're also a technology team.
My other day job is on DocumentCloud, which is a nonprofit funded by the Knight Foundation. I proposed a grant to fund it with Eric Umansky and Scott Cline. We were awarded the grant, and we're entering our second year right now.
The goal of the project is to improve journalism by creating a site that allows journalists to analyze, upload, share, and search public source documents that would be otherwise extremely difficult to find or analyze.
Robert: There's an old issue in journalism that journalists often cite documents that aren't available to the reader. DocumentCloud lets the journalist post those source documents in a public place, so the reader can go back to the source, just as a journalist could.
As far back as the '20s, though, guys like Walter Lippmann argued that the public just isn't that interested in the details. Do you find that people aside from journalists are benefiting from DocumentCloud?
Aron: Actually, no. Let me just explain a little bit what DocumentCloud is, how it started, and why the answer is no. There's DocumentCloud the software, which is one part of what we're building. It sort of sits on top of OpenCalais, which is an open API that does entity extraction and semantic markup.
Think of it as a set of tools we're providing to journalists to give them the ability to treat unstructured text more like structured data, so they can find links between documents that they could not have found through traditional means.
As an example, think of a case where you send through a document that includes a reference to the CIA. CIA is meaningful to a human being. You and I can look at that and go "Oh, that probably means the Central Intelligence Agency." Or, in other context, it might be the Culinary Institute of America. It's less clear to a traditional text search.
The Calais engine allows us, in an automated way, to go through and say "OK, that's the Central Intelligence Agency. And by the way, here's this other document also about the Central Intelligence Agency, and both of them reference the same individual that you are curious about." So, that's an example of some of the tools we're building with journalism in mind. That's DocumentCloud the software.
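To make the entity-linking idea concrete, here is a minimal Python sketch of the general technique: resolve an ambiguous mention from its surrounding words, then index documents by the resolved entity so related documents can be found together. It does not use the OpenCalais API; the `ALIASES` table, the cue words, and the function names are all hypothetical stand-ins for what a real entity extraction service would supply.

```python
import re
from collections import defaultdict

# Hypothetical alias table: each ambiguous mention maps to candidate
# entities, and each candidate carries a few context cue words.
ALIASES = {
    "CIA": {
        "Central Intelligence Agency": {"intelligence", "classified", "covert"},
        "Culinary Institute of America": {"chef", "culinary", "cooking"},
    },
}


def resolve(mention, text):
    """Return the candidate entity whose cue words overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        name: len(cues & words)
        for name, cues in ALIASES.get(mention, {}).items()
    }
    if not scores or max(scores.values()) == 0:
        return None  # no context to decide on, so leave it unresolved
    return max(scores, key=scores.get)


def build_index(docs):
    """Map each resolved entity to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for mention in ALIASES:
            if re.search(r"\b" + re.escape(mention) + r"\b", text):
                entity = resolve(mention, text)
                if entity is not None:
                    index[entity].add(doc_id)
    return index


docs = {
    "memo": "The CIA briefing covered classified covert programs.",
    "cable": "A classified cable from the CIA station.",
    "menu": "Every chef trained at the CIA learns classic cooking.",
}
index = build_index(docs)
```

With these sample texts, a lookup on "Central Intelligence Agency" returns the memo and the cable but not the menu, which is the kind of link a plain text search for "CIA" could not make on its own.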
Then there's DocumentCloud the community, which is the other piece of what we're trying to put together. Right now, it includes about 150 journalists and journalism organizations, with that number growing by leaps and bounds. They're joining the community to use this tool to improve their reporting.
In order to join that community, you pretty much need to be a journalist, by our definition. That is, you must be someone whose job, either paid or unpaid, involves the acquisition, analysis, and ultimately publishing of public source documents to benefit the public. Normally that means government documents, and a lot of those documents are acquired through FOIA, or they might exist on some other site.
Having said all that, we have been approached by any number of non-journalism organizations, such as law firms. We've gotten the sense that there is a need out there for sort of a lightweight document management tool, and we may explore that as a potential revenue generator, but that isn't really our main focus.
Robert: You talked about this idea of document management. One of the reasons that self publishing has been so popular has been the ease by which you can actually publish to a platform. Can you talk a little bit about how DocumentCloud removes some of the impediments traditionally associated with IT departments?
Aron: The genesis of DocumentCloud was a piece of software we developed at the Times called DocumentViewer, which is a really straightforward piece of software. It will take a PDF or a Word document, or pretty much anything else OpenOffice can open, break it up, extract the text, make it searchable, and then publish it to the web in an attractive way.
Our thinking going in was that most news organizations, even the smallish ones, would want something similar. So our original conception was that DocumentCloud would be sort of the hub. We would want your metadata, but generally speaking, we thought that all the member organizations would want this sort of viewer to be on their hardware, behind their firewalls.
We could not have been more wrong about that, for both good reasons and unfortunate ones. My perception is that newsrooms lack fundamental technology to deal with documents, and that is sort of scary. The traditional way that newsrooms deal with big document dumps is to split them up and have people sit down with yellow legal pads and pens and highlighters.
That is the highest technology, really, that most newsrooms currently employ. A lot of newsrooms don't have access to the simplest things, like OCR. That's surprising to a lot of people, but it's true, and in this little area of public source documents, we think we can help.
That's why we pivoted early on away from thinking about DocumentCloud as a federated thing running on hundreds of websites, to a vision where fundamentally it all goes through us. For the most part, we actually host the documents on behalf of news organizations.
All a news organization has to do is get a little embed code from us, which they can embed anywhere they want in their CMS. They can put it within a blank page on their own site, in a blog post, or whatever. It's really simple and really straightforward.
Robert: Some of these technologies like DocumentCloud coming out are pretty exciting. Can you talk a little bit about some other ways that the cloud might be fundamentally shaping journalism practices?
Aron: My team here at the Times couldn't do what we do without the cloud. We run everything off of Amazon. On an election night, we can suddenly go from four or five servers to 22 servers to handle all that traffic. A day later, we can just spin back down to five servers. There's no way you could do that in a traditional IT environment.
Robert: One concern that governments and corporations have about the cloud is where data is stored. Typically they want or need the data to be stored within their country's borders. But what's a drawback for some companies in this scenario actually looks like an advantage for journalism. Is one benefit of the cloud that it's possible to store any potentially embarrassing government documents out of the reach of that government?
Aron: That thought certainly has occurred to me, and I don't know that it's been adjudicated anywhere, really. To flip that idea on its head, consider that in the UK, there's this notion of Crown copyright, where the public doesn't really own public documents and data.
It's sort of bizarre. For example, postal codes are copyrighted under Crown copyright, and you have to pay a huge amount of money to get boundaries of postal codes in the UK. I don't know what would happen if somebody were to make that data publicly available on a server in the US. If there were some assertion of Crown copyright, would that even apply jurisdictionally to where that data is hosted?
It's a really good question, and I'm not sure I want to find out, because this is sort of new territory for everybody. We're pretty cautious about what we put up on the cloud and what we don't.
Robert: Looking at DocumentCloud, what was it that required something new to be built? I mean Microsoft has Office 365 with SkyDrive. Google obviously offers Google Docs. There's also Scribd. What did you need that you didn't find in these existing resources?
Aron: We looked at all those options early on, and while in 2007, this field obviously wasn't quite as crowded as it is now, none of them did what we wanted DocumentViewer to do. DocumentViewer is more than just a way of putting a document online.
For example, it also allows you to do annotations, which is kind of key from a journalistic standpoint. There's what we have come to refer to as kind of a journalistic layer on top of a document.
A reporter can go into DocumentViewer, highlight a key paragraph, click "Drag," and create an annotation. He or she can actually write a couple of paragraphs to identify the significance of a particular sentence, phrase, or paragraph and deep link into it.
That allows you to add a narrative to what is effectively a piece of raw data, and say to the reader, "OK, here's the document that we're basing our reporting on. But more than that, here are the key paragraphs, and here's why they're key. Here's really what this means."
Scribd didn't do that. Docstoc didn't do that. There was really no technology we could find that did it in a way that we thought accomplished our goals. We also wanted something that wasn't Flash-based, which Scribd at that time was.
Robert: That makes a lot of sense, particularly to support standards, when you consider all of the form factors that you can use to access the Web. I imagine various reporters want to use something like an iPad, a mobile phone, you name it.
Aron: Right. It's not the world's greatest experience, but you can actually use DocumentViewer on an iPhone. This is not an anti-Flash rant, or anything like that. It's just we felt that the right technology for this was to stick to web standards, and what we've come to refer to as HTML5.
Robert: On your blog, you've talked about how to use Amazon EC2 behind the scenes. Can you explain how the elasticity of the cloud, scaling up and down on demand, gets put to use by DocumentCloud?
Aron: Sure. It's a big challenge. Document processing is a very CPU-intensive process, and so we needed to be able to scale up rapidly when there's a big uploaded document, so we did two things. One is that we have built and released a fairly lightweight parallel processing library we call CloudCrowd. DocumentCloud has actually released a number of open source libraries. We haven't released the entire project, but that will come soon.
But the first piece was CloudCrowd, and that was sort of a lightweight, Ruby-based parallel processing library, which allows us to quickly add additional processing nodes if we get a 3,000 document dump from AP, which actually happened last week.
Relatively easily, we can add two, three, four, or 100 servers to the processing pool and split that job up. It's basically a MapReduce project at that point. So that's how the elasticity helps us on DocumentCloud. The front end isn't as much of an issue, because once the documents are actually rendered, it's 100% static content. We just serve those off of S3.
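The split-and-merge pattern described above can be sketched in a few lines of Python. This is not CloudCrowd's API (CloudCrowd is a Ruby library); it is a minimal illustration in which a local thread pool stands in for a pool of processing servers, and a word count stands in for the CPU-heavy work such as OCR or text extraction. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(batch, n_workers):
    """Deal (doc_id, text) pairs round-robin into at most n_workers chunks."""
    chunks = [batch[i::n_workers] for i in range(n_workers)]
    return [chunk for chunk in chunks if chunk]


def process_chunk(chunk):
    """Map step: placeholder for heavy per-document work (here, a word count)."""
    return {doc_id: len(text.split()) for doc_id, text in chunk}


def merge(partials):
    """Reduce step: combine the per-chunk results into one result."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


def run_job(batch, n_workers):
    """Split the batch, process the chunks in parallel, and merge the output."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, split(batch, n_workers)))
    return merge(partials)


# A small stand-in for a document dump: doc N contains N words.
batch = [(f"doc{i}", "word " * i) for i in range(1, 8)]
result = run_job(batch, n_workers=3)
```

The result is the same whether the job runs on one worker or many; only the elapsed time changes. That property is what makes it safe to add or remove processing nodes as the size of a dump varies.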
Robert: Can you talk about how much you're processing and what you expect that to grow to?
Aron: It fluctuates, obviously, and it's pretty spiky, which is why we couldn't really do this in a traditional environment. If you're building a data center, you have to size it to the biggest spike you expect to have, which means you've got a lot of time where you're sitting and idling with unused resources. Because we don't need to worry about that, we can spin up 10, 20, or whatever at a time.
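A rough back-of-the-envelope calculation shows why spiky load favors elastic capacity. The Python sketch below compares the server-hours used when capacity is fixed at the peak with the server-hours used when capacity follows the load hour by hour. The load profile and the per-server throughput are invented numbers for illustration only, not DocumentCloud figures.

```python
import math


def servers_needed(docs_per_hour, docs_per_server_hour):
    """Servers required for one hour of load, keeping at least one running."""
    return max(1, math.ceil(docs_per_hour / docs_per_server_hour))


def fixed_server_hours(load, rate):
    """Capacity sized to the worst hour and kept running all day."""
    peak = max(servers_needed(docs, rate) for docs in load)
    return peak * len(load)


def elastic_server_hours(load, rate):
    """Capacity resized every hour to match that hour's load."""
    return sum(servers_needed(docs, rate) for docs in load)


# Hypothetical day: 23 quiet hours of 10 documents, one hour with a 3,000-document dump.
load = [10] * 23 + [3000]
rate = 150  # assumed documents one server can process per hour

fixed = fixed_server_hours(load, rate)      # 20 servers for 24 hours
elastic = elastic_server_hours(load, rate)  # 1 server for 23 hours, 20 for one hour
```

Under these assumed numbers the fixed setup uses 480 server-hours and the elastic one uses 43, so most of the fixed capacity sits idle outside the spike.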
I think the most we've ever processed in a day is a few thousand documents. And then there are certain days where it's just a few dozen. We opened our beta this summer, and I think we're over 400,000 pages now, closing in on 500,000.
Robert: You mentioned already that DocumentCloud uses open source, and is itself open source.
Aron: Actually, it's MapReduce, but we don't use Hadoop. Our version of Hadoop is CloudCrowd. Think of the old Apple ad: CloudCrowd is Hadoop for the rest of us. It's a much simpler Ruby-based MapReduce library for doing parallel processing.
Robert: We definitely sense that investigative journalism is being cut from a lot of news organizations, because it's expensive and time-consuming. At the same time, computer-assisted reporting, which includes things like web scraping and data mining, is on the rise and has actually led to Pulitzer Prize-winning stories. Do you think that technology offers new hope to investigative journalism?
Aron: Certainly, and DocumentCloud, I think, is an example of how technology can be brought to bear on that. As I said before, most journalists do serious document reporting and analysis as a very analog process, and I think that the document piece is just one tiny fragment.
Part of what a lot of computer-assisted reporting folks are doing these days in newsrooms is acquiring the data and making it searchable, so it's easier for non-technical journalists to work with. I think the smart application of technology in newsrooms can be a force multiplier for shrinking staff.
The Times obviously has made a significant commitment to investigative reporting, which not every news organization has. Anyone who reads a newspaper knows the industry is struggling, which is a big part of why newspaper staffs are shrinking. The way I see it, technology can help overcome some inefficiencies, which can help preserve journalistic quality.
Robert: Hey, thanks so much for your time. I greatly appreciate it.
Aron: You bet.