David Hardtke's Blog

 

Core value: Best Result First


Today I watched an old video of Steve Jobs talking about branding that had risen to the top of Hacker News. In the video, he talks about how a company needs to have a core value, and that the brand should be about that core value. It got me thinking -- what is the core value of Stinky Teddy? We're not actually a company at this point, just an experiment. Nonetheless, we have a simple core value and it is best result first.

Let me elaborate. As Larry Page pointed out, the perfect search engine should have only a single result in many cases, and that one result is the exact link you are looking for. Our goal at Stinky Teddy, simply put, is to use all available information to try to figure out what that exact result is for you at a particular moment in time.

You might ask, don't all search engines try to do that? To see if "best result first" is a core value of your search engine, do a search for something highly commercial like "mortgage refinance process." In all likelihood, the first result is a sponsored link. That is not what you want, and it is certainly not the best result.


 
 
 
 

open source code on github


I've started the process of open-sourcing some of the code we've developed for Stinky Teddy. Eventually, we'll open-source everything except for the ranking algorithms and intelligence. Stinky Teddy uses multiple search APIs, so a lot of effort went into normalizing their responses. Our first stab at open-sourcing is a simple Java class that converts the various string representations of a date (used by various search APIs) into a java.util.Date and also into the date string format used by Lucene (org.apache.lucene.document.DateTools.dateToString()).

I'm using github to host the code: Stinkteddy @ github.
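The kind of date normalization described above can be sketched roughly as follows. This is not the class hosted on github: the class name DateNormalizer, its method names, and the three input patterns are all assumptions made for illustration. To avoid a Lucene dependency, the sketch formats the index string itself as yyyyMMddHHmmss in UTC, which is my understanding of what DateTools.dateToString() produces at second resolution.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not the class published on github.
final class DateNormalizer {

    // Assumed input patterns; real search APIs may use others.
    // Order matters: SimpleDateFormat ignores trailing text, so the
    // shortest pattern goes last.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",     // e.g. 2010-05-14T18:30:00Z
        "EEE, dd MMM yyyy HH:mm:ss Z",  // e.g. Fri, 14 May 2010 18:30:00 +0000
        "yyyy-MM-dd"                    // e.g. 2010-05-14
    };

    // Try each known pattern in turn and return the first successful parse.
    static Date parse(String text) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            try {
                return format.parse(text);
            } catch (ParseException e) {
                // Not this pattern; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("Unrecognized date: " + text);
    }

    // Sortable UTC string, assumed to mirror Lucene's second-resolution layout.
    static String toIndexString(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }
}
```

Calling DateNormalizer.toIndexString(DateNormalizer.parse(s)) on any of the three example strings gives a fixed-width string that sorts chronologically, which is the property an index needs for date range queries.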

 
 
 
 

"How to Save the News" by James Fallows


Yesterday I read an excellent article by James Fallows on How to Save the News in the Atlantic. The article discussed how Google was working to help news organizations monetize their content in the online age. The article prompted me to submit the following letter to the editor:

The ability to quickly find quality content is the primary reason that people use Google, and it is good to see that Google is committed to helping professional journalism survive ("How to Save the News" by James Fallows). Journalists provide a large share of the material that draws people to the search engine, both Google News and the main search page. Consumers know that Google (and other search engines like Bing and Yahoo!) provide easy one-stop access to the information they seek on a large variety of topics. In exchange for this convenience, consumers allow Google to run sponsored links above the search results for the small fraction of search queries that are commercially valuable. A good analogy is commercial radio. Radio stations aggregate quality content that consumers want (songs from various artists) in exchange for the right to subject the consumer to the occasional advertisement. The radio versus search engine analogy breaks down, however, when we note that those who create the content that attracts consumers to the radio stations, the songwriters, are compensated for their creations in the form of performance royalties paid by the radio station. The fact that such a performance royalty agreement does not exist between search engines and professional journalists is an accident of history, and need not be the case in the future. Google and other search engines should compensate content providers directly for the right to use their creations.

In order to create such a system, journalists need to realize that their creations are largely interchangeable from the perspective of Google or the average web surfer. If individual news providers start charging in some way, consumers and Google will simply move their attention to free news sites. This is why micro-payments are not the answer. Instead, news organizations need to bargain collectively and require that Google and other search engines pay for the right to index their content. If a large fraction of news organizations were to simultaneously remove their content from Google, it would seriously impact the quality of the Google product (and give Bing a huge advantage if they were to pay for that body of content). In a performance royalty system, news organizations would be paid a small fixed fee each time Google linked to their articles, similar to the way songwriters are compensated via ASCAP when their songs are played on the radio. Google will argue that the legal concept of fair use allows them to aggregate short portions of content from copyrighted sources and therefore prevents such a system from being enforceable, but it is not clear that fair use applies (the creation of the search index requires that an entire copy of the document be stored on the Google servers). Additionally, journalists cannot be forced to participate, just as songwriters need not join ASCAP.

Clearly, such a system will not be invented by Google as it is bad for their bottom line. Nonetheless, a performance royalty system would fairly compensate journalists for the value they provide both Google and consumers.

 
 
 
 

Data rates around the web


A few weeks ago I was at the Twitter Developer Conference. On the hack day there were many impressive presentations about the tools that Twitter has developed to manage all of the data going in and out of Twitter. Twitter is moving their back-end data store over to Cassandra. They threw out some impressive numbers: 50 million tweets per day and 600 million searches per day. After the hack day, I had dinner with a friend from Twitter (@jeanpaul) and we were discussing the raw data volumes that they have to deal with.

My benchmark for "big data" is the STAR Experiment at RHIC. I worked on STAR from 1997-2003, and at that point I believe it was the largest volume data producer in existence. The raw data rates were enormous (a gigabyte or so per second) but it was fairly easy to compress that to 100 MB/s using electronics. At the end of the day, we had to put everything on tape, and with the technologies available at the time about 20 MB/s was the maximum you could record.

Today, of course, nobody uses tape for these sorts of problems. Tape is the same price as it was 10 years ago but disk is about 1000 times cheaper. One would assume then that people are recording data at much higher rates than the physicists were 10 years ago. It turns out that for human-generated data, the data rates are not as high as one might think. I compiled the following numbers from various places. This is data that needs to be archived -- when Ashton Kutcher sends a 4 kB tweet it causes 20 GB of bandwidth to be used, but only the 4 kB tweet needs to be saved.

Source                     Rate        Data to Storage
Twitter                    700/s       2 MB/s
Facebook Status Updates    600/s       2 MB/s
Facebook Photos            400/s       40 MB/s
Google Search Queries      34,000/s    30 MB/s

All of this content is humans typing at a keyboard (except for the Facebook photos). We see something interesting -- human-generated unique content, integrated over all humanity, is not a very difficult data problem. Everything we generate is of order 100 MB/s, or perhaps 1 GB/s if we include emails and SMS.
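As a quick check on the arithmetic, the sketch below takes the four rows of the table above, works out the implied size of each stored item, and adds up the storage rates. The class name DataRates and the convention that 1 MB is 10^6 bytes are my assumptions; the numbers themselves are copied from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Back-of-the-envelope check of the table; names here are illustrative.
final class DataRates {

    // Each entry: {items per second, megabytes per second to storage}.
    static final Map<String, double[]> SOURCES = new LinkedHashMap<>();
    static {
        SOURCES.put("Twitter", new double[] {700, 2});
        SOURCES.put("Facebook Status Updates", new double[] {600, 2});
        SOURCES.put("Facebook Photos", new double[] {400, 40});
        SOURCES.put("Google Search Queries", new double[] {34000, 30});
    }

    // Implied average size of one stored item, assuming 1 MB = 10^6 bytes.
    static double bytesPerItem(String source) {
        double[] row = SOURCES.get(source);
        return row[1] * 1_000_000 / row[0];
    }

    // Sum of the storage rates over all four sources.
    static double totalMegabytesPerSecond() {
        double total = 0;
        for (double[] row : SOURCES.values()) {
            total += row[1];
        }
        return total;
    }

    public static void main(String[] args) {
        for (String source : SOURCES.keySet()) {
            System.out.printf("%-24s %8.0f bytes per item%n", source, bytesPerItem(source));
        }
        System.out.printf("Total: %.0f MB/s%n", totalMegabytesPerSecond());
    }
}
```

The four rows add up to 74 MB/s, which is where the "of order 100 MB/s" figure comes from. The implied item sizes also look plausible: roughly 3 kB per tweet or status update, 100 kB per photo, and under 1 kB per search query.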

 
 
 
 

Internet Explorer 8 Goodies


This week Microsoft approved two applications that integrate Stinky Teddy's Gossip Powered search directly into your browser and posted them in the Internet Explorer Add-ons Gallery. These tools were built to take advantage of some great features that Microsoft added to Internet Explorer 8. Browsers are becoming like smart phones where the actual phone is not as important as the apps that are available (in the browser world, "apps" are known as "add-ons"). Mozilla's Firefox is the king of the add-on business. Firefox was built as a lightweight shell that could be customized by the user. There are more than 10,000 add-ons in the Mozilla add-on gallery. Google's Chrome has recently enabled third-party add-ons and many Mozilla developers have ported their applications.

Add-ons and toolbars have long existed for Internet Explorer, but there has been a fundamental barrier to their widespread adoption -- the tools used to build add-ons and toolbars for Internet Explorer are also used by hackers to steal your information and infect your computer. Installing add-ons required that you install system software on your computer, and once you hit that button you were at the mercy of the software developer. Often they enticed you to hit the button by offering something useful like smiley face emoticons or access to games. Mozilla's Firefox built a sandboxing mechanism that keeps the add-ons separate from the operating system. Mozilla also has a good system of community policing that keeps the Mozilla community safe from malicious hackers.

Internet Explorer is the default browser for most users, so there has always been a desire to bring add-on features to Internet Explorer without requiring the user to install potentially malicious software on their computer. Enter Internet Explorer 8, with the concept of the Accelerator. Accelerators allow developers to interact with web pages that are rendered in your browser. The applications are completely sandboxed in the browser, and are only activated when you explicitly call for them. Hence, they are safe to install and use.

The Stinky Teddy Abracadabra Search Accelerator allows you to launch a search directly from a web page, either by highlighting terms on the page or by simply right-clicking and selecting our accelerator. A little search preview box will pop up, so in many cases you can navigate directly to the page you are looking for. What I've described is pretty standard, but we've added a special ingredient. The Stinky Teddy Abracadabra Search Accelerator uses the page you are currently visiting as context for your search. The concepts on the page are used as a frame of reference that guides us when we decide which search results to show you. The word "base" means different things if you are on a page about baseball or a page about furniture. Where you are helps us to know where you want to go. Although this idea is obvious, no other search engine uses this information. To be clear, we aren't tracking you -- all we use is your current screen to provide context. We don't save any information about you.
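To make the "base" example concrete, here is a toy sketch of using the current page as context. It is not Stinky Teddy's ranking code, which is not public. It simply counts how many distinct words each result snippet shares with the page text and sorts by that count. The class ContextRanker and its methods are hypothetical, and a real system would presumably use concepts rather than raw word overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: re-rank result snippets by word overlap with the page.
final class ContextRanker {

    // Lowercased distinct words of three or more characters.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() > 2) {
                out.add(token);
            }
        }
        return out;
    }

    // Number of distinct words the snippet shares with the page.
    static int overlap(String snippet, Set<String> pageTerms) {
        Set<String> shared = terms(snippet);
        shared.retainAll(pageTerms);
        return shared.size();
    }

    // Returns a new list with the highest-overlap snippets first.
    // The sort is stable, so ties keep the original search order.
    static List<String> rerank(List<String> snippets, String pageText) {
        Set<String> pageTerms = terms(pageText);
        List<String> ranked = new ArrayList<>(snippets);
        ranked.sort(Comparator.comparingInt((String s) -> overlap(s, pageTerms)).reversed());
        return ranked;
    }
}
```

Given one snippet about lamp bases and one about stealing second base, rerank puts the baseball snippet first when the page text is about a baseball game and the lamp snippet first when the page text is about furniture, even though the query "base" is the same in both cases.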

A second cool feature added to Internet Explorer 8 is Visual Search Suggestions. Firefox allows for search suggestions in a limited fashion (one line of text). After installing our Search Box Plugin we show you a preview of the search page as you type in your search query. Most search providers show query suggestions -- we show the search page. The search page preview we show has most of our usual content types (web, video, real-time, twitter, news), and the "buzzing" content is shown first. We wonder why other search engines don't show you search results as you type, and we suspect the answer is that this is a case where the business of search gets in the way of the user experience. The business of search is to show sponsored links above the search results. Search engines want you to go to their page, even when that step is unnecessary. Direct navigation from the search box makes more sense to us.

Check out our Internet Explorer 8 goodies and let us know what you think.

Visual Search Bar Plugin:


Accelerator:


 
 
 
 
 

    © Stinky Teddy