David Hardtke's Blog


"How to Save the News" by James Fallows


Yesterday I read an excellent article by James Fallows, "How to Save the News," in The Atlantic. The article discussed how Google is working to help news organizations monetize their content in the online age, and it prompted me to submit the following letter to the editor:

The ability to quickly find quality content is the primary reason that people use Google, and it is good to see that Google is committed to helping professional journalism survive ("How to Save the News" by James Fallows). Journalists provide a large share of the material that draws people to the search engine, both Google News and the main search page. Consumers know that Google (and other search engines like Bing and Yahoo!) provides easy one-stop access to the information they seek on a large variety of topics. In exchange for this convenience, consumers allow Google to run sponsored links above the search results for the small fraction of search queries that are commercially valuable. A good analogy is commercial radio. Radio stations aggregate quality content that consumers want (songs from various artists) in exchange for the right to subject the consumer to the occasional advertisement. The radio versus search engine analogy breaks down, however, when we note that those who create the content that attracts consumers to the radio stations, the songwriters, are compensated for their creations in the form of performance royalties paid by the radio stations. The fact that such a performance royalty agreement does not exist between search engines and professional journalists is an accident of history, and it need not be the case in the future. Google and other search engines should compensate content providers directly for the right to use their creations.

In order to create such a system, journalists need to realize that their creations are largely interchangeable from the perspective of Google or the average web surfer. If individual news providers start charging in some way, consumers and Google will simply move their attention to free news sites. This is why micro-payments are not the answer. Instead, news organizations need to bargain collectively and require that Google and other search engines pay for the right to index their content. If a large fraction of news organizations were to simultaneously remove their content from Google, it would seriously impact the quality of the Google product (and give Bing a huge advantage if it were to pay for that body of content). In a performance royalty system, news organizations would be paid a small fixed fee each time Google linked to their articles, similar to the way songwriters are compensated via ASCAP when their songs are played on the radio. Google will argue that the legal concept of fair use allows it to aggregate short portions of content from copyrighted sources and therefore makes such a system unenforceable, but it is not clear that fair use applies (the creation of the search index requires that an entire copy of each document be stored on Google's servers). Additionally, journalists cannot be forced to participate, just as songwriters need not join ASCAP.

Clearly, such a system will not be invented by Google, as it is bad for its bottom line. Nonetheless, a performance royalty system would fairly compensate journalists for the value they provide both Google and consumers.


Data rates around the web


A few weeks ago I was at the Twitter Developer Conference. On the hack day there were many impressive presentations about the tools that Twitter has developed to manage all of the data going in and out of Twitter. Twitter is moving their back-end data store over to Cassandra. They threw out some impressive numbers -- 50 million tweets per day and 600 million searches per day. After the hack day, I had dinner with a friend from Twitter (@jeanpaul) and we were discussing the raw data volumes that they have to deal with.

My benchmark for "big data" is the STAR Experiment at RHIC. I worked on STAR from 1997-2003, and at that point I believe it was the largest volume data producer in existence. The raw data rates were enormous (a gigabyte or so per second), but it was fairly easy to compress that to 100 MB/s using electronics. At the end of the day, we had to put everything on tape, and the limit at the time was about 20 MB/s to tape. Using the technologies available at the time, 20 MB/s was the maximum you could record.
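
To put those rates in perspective, here is a quick back-of-the-envelope sketch in Python. The rates are just the round figures quoted above, from memory, not precise detector specs.

    # Daily data volumes for the STAR-era pipeline described above.
    # Rates are the round numbers from the text: ~1 GB/s raw,
    # ~100 MB/s after compression, ~20 MB/s archived to tape.
    SECONDS_PER_DAY = 86_400

    stages_mb_per_s = {
        "raw detector output": 1_000,
        "after compression":     100,
        "written to tape":        20,
    }

    for stage, rate in stages_mb_per_s.items():
        tb_per_day = rate * SECONDS_PER_DAY / 1_000_000  # MB -> TB
        print(f"{stage:>20}: {rate:>5} MB/s ~ {tb_per_day:,.1f} TB/day")

At the tape limit, that works out to a little under 2 TB archived per day, versus tens of terabytes per day coming off the detector.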

Today, of course, nobody uses tape for these sorts of problems. Tape is the same price as it was 10 years ago, but disk is about 1000 times cheaper. One would assume, then, that people are recording data at much higher rates than the physicists were 10 years ago. It turns out that, for human-generated data, the rates are not as high as one might think. I compiled the following numbers from various places. This is data that needs to be archived -- when Ashton Kutcher sends a 4 kB tweet it causes 20 GB of bandwidth to be used, but only the 4 kB tweet needs to be saved.
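
The arithmetic behind that last example, sketched in Python; the follower count is an assumption on my part, chosen to roughly match the 20 GB figure, not an official number.

    # Fan-out bandwidth vs. archival storage for one celebrity tweet.
    # The ~5 million follower count is an assumed value for illustration.
    tweet_bytes = 4 * 1024        # a ~4 kB tweet (text plus metadata)
    followers = 5_000_000         # assumed follower count

    bandwidth_bytes = tweet_bytes * followers  # copies pushed out to timelines
    stored_bytes = tweet_bytes                 # only one copy is archived

    print(f"delivered: {bandwidth_bytes / 1e9:.1f} GB")  # ~20.5 GB
    print(f"archived:  {stored_bytes / 1e3:.1f} kB")     # ~4.1 kB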

Source                     Rate        Data to Storage
Twitter                    700/s       2 MB/s
Facebook Status Updates    600/s       2 MB/s
Facebook Photos            400/s       40 MB/s
Google Search Queries      34,000/s    30 MB/s

All of this content is humans typing at a keyboard (except for the Facebook photos). We see something interesting -- human-generated unique content, integrated over all of humanity, is not a very difficult data problem. Everything we generate is of order 100 MB/s, or perhaps 1 GB/s if we include emails and SMS.
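
As a rough sanity check on that order-of-magnitude claim, here is a short Python sketch that sums the table; the implied per-item sizes are just what the table works out to, not official figures.

    # Sum the archival rates from the table above and back out the
    # implied average item size for each source.
    sources = {
        # name: (events per second, MB/s written to storage)
        "Twitter":                 (700,     2),
        "Facebook status updates": (600,     2),
        "Facebook photos":         (400,    40),
        "Google search queries":   (34_000, 30),
    }

    total_mb_per_s = 0
    for name, (rate, mb_per_s) in sources.items():
        avg_kb = mb_per_s * 1000 / rate  # implied average item size in kB
        total_mb_per_s += mb_per_s
        print(f"{name:>24}: {mb_per_s:>3} MB/s (~{avg_kb:.1f} kB per item)")

    print(f"{'total':>24}: {total_mb_per_s:>3} MB/s")  # ~74 MB/s, i.e. order 100 MB/s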
