The Inner Workings of a Realtime Search Engine

6/22/09 - Posted by Tobias Peggs under Featured, Industry, OneRiot News, Science

UPDATE: This post is now available as a completed white paper. Get the PDF download here.

This blog post is a summary of the forthcoming white paper from OneRiot, “The Inner Workings of a Realtime Search Engine.” For an advance copy, just ping Tobias. In the meantime, please leave comments and ask questions on this blog post. Let us know if we’ve covered enough ground, or gone into enough depth. We will try to address each point both on the blog and while completing the white paper.

40% of users perform search queries which display an intent that is best satisfied by realtime search results. Industry numbers aside, Iran – the country, the situation and the search query – has conclusively proven that users want search results from the realtime web.

Users Want Realtime Search
Across all the major search engines, including Google, Yahoo, Bing and Ask, industry numbers indicate that 40% of users are performing search queries which display an intent that is best satisfied by realtime search results. Irrespective of industry numbers, Iran – the country, the situation, and the search query – has proved beyond doubt that there is huge demand for search results from the realtime web. The question on everybody’s lips is: “What’s going on right now?” In order to answer that question, they need to find the news, images, conversation, stories and videos with the most social relevance right now. Realtime search results meet that need.

Every day, hundreds of millions of search engine users type something as heavyweight as “Obama,” or as entertaining as “Britney,” into the search box and expect to find out what’s going on right now for that topic. These types of searches are commonly called “Browse” searches, as people are browsing for information. They don’t have a particular URL in mind. They just want to know what’s going on right now – the source of information being less important than the information itself. Those users are best satisfied by search results from the realtime web.

Making up the remaining 60% of searches on the web are “Navigation” searches (20%) and specific “Informative” searches (40%). An example of a Navigation search is when a user is trying to get to Sony.com, or Yahoo.com. They will enter a search query in an attempt to find a recognized home page. An example of an Informative search is when a user is trying to find a specific recipe for Cabbage Soup that is definitely “out there somewhere.” They enter a query in an attempt to find that specific information.

The best traditional search engines are very good at finding navigation search results, and specific information. The best realtime web search engines are very good at finding Browse search results – addressing fully 40% of the market. With 1% of the search market worth $1bn per year, 40% is a huge target to go after.

Traditional Search – A Broad Overview

Traditional search engines treat the web like a library. Web pages are crawled, and the content is stored in an index for efficient retrieval of information. Those web pages also build up a “Rank” over time (e.g. Google’s “PageRank”). Pages with the highest Rank percolate to the top of the results.

A page’s Rank is constructed from many factors, but one of the most important is citation importance – broadly, the number of inbound links to that web page. This approach tends to favor highly referenced resources like Wikipedia. For example, search for “Britney Spears” on a traditional search engine and the top result is likely to be a Wikipedia page. This approach produces dependable results, but results that are not necessarily reflective of why the user would be searching for Britney at any particular time (i.e. to find out what’s going on right now). Additionally, a page’s Rank is relatively static. It changes periodically, but not at a pace to keep up with the realtime world of changing interests in a topic. A page with high rank might be tremendously relevant yesterday, but not tomorrow. A traditional search engine is only able to return yesterday’s relevant result.
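To make “citation importance” a little more concrete, here is a minimal sketch of link-based ranking by power iteration, in the spirit of PageRank. It is an illustration only: the toy link graph, the damping value of 0.85 and the function name `link_rank` are assumptions, and real engines combine many more factors than inbound links.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Toy link-based rank: pages gain score from the pages that link to them.

    `links` maps each page to the list of pages it links out to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outbound links: spread its score over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: many pages cite the encyclopedia entry, few cite the news story.
toy_links = {
    "encyclopedia": [],
    "fan_site": ["encyclopedia"],
    "blog": ["encyclopedia", "fresh_news_story"],
    "forum": ["encyclopedia"],
    "fresh_news_story": ["encyclopedia"],
}
ranks = link_rank(toy_links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the heavily cited reference page comes out on top, while the fresh story, which has had little time to collect inbound links, scores much lower. That is the behavior described above.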

Traditional search engines struggle to surface the hyper-fresh and socially relevant “realtime” results that satisfy users performing Browse searches. OneRiot, a realtime search engine, is focused exclusively on solving that problem and addressing that 40% of the market. To do that, we have had to:

Invent new ways to index the web: by harnessing the power of the realtime social web.

Invent new ways to rank the content in that index: at search time, to deliver the most relevant result right now.

We will now consider each of these two innovations in turn.

New ways to Index the realtime web
Traditional search engines crawl the web by systematically following links between billions of pages, then indexing the content on those pages. Broadly, they consider the link to be a signal to an important piece of content.

OneRiot, in contrast, considers realtime activity on the social web when determining which pages to index. We consider the links people are tweeting, or digging, or sharing on other services, as a signal to an important piece of content.

In the last two years there has been an explosion in the number of links being shared across the realtime social web. This in part has been driven by the phenomenal growth of Twitter. But the realtime web is much wider than Twitter.

Services like Digg and Delicious – whose user communities provide a wealth of explicit “social signals” to important pieces of content – also continue to grow. Meanwhile, the rise of sharing services like Shareaholic has promoted additional realtime sharing of content on the web. URL shorteners like Bit.ly and TinyURL make this even easier for users. Facebook and other social networks make it easy to share links across users’ social graphs. Some of this information is publicly available, some is not. But there is a plethora of tools and services that have made sharing of links commonplace among the 230 million US users of the internet, and millions more internationally.

At OneRiot we aggregate that realtime activity across the social web, considering the links people are sharing right now. We then crawl to the pages those links point to, and index the content on those pages – and we do it fast. Currently we index the content of the page and make it ready to search in less than 0.8 seconds.
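The flow described above (aggregate shared links, fetch the pages they point to, index the content) can be sketched roughly as follows. This is an illustration under stated assumptions, not OneRiot’s implementation: the `ShareEvent` record, the `SocialIndex` class and the `fetch_page` callback are all hypothetical, and the page fetch is stubbed with a dictionary so the example runs offline.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ShareEvent:
    """One link shared on some service (hypothetical record shape)."""
    url: str
    service: str      # e.g. "twitter", "digg"
    user: str
    timestamp: float  # seconds since epoch


class SocialIndex:
    """Tiny in-memory inverted index driven by share events, not by link crawling."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page       # callable: url -> page text
        self.postings = defaultdict(set)   # term -> set of urls containing it
        self.shares = defaultdict(list)    # url -> share events seen so far

    def ingest(self, event):
        first_time = event.url not in self.shares
        self.shares[event.url].append(event)
        if first_time:
            # Index the content of the page the link points to.
            text = self.fetch_page(event.url)
            for term in set(re.findall(r"[a-z0-9]+", text.lower())):
                self.postings[term].add(event.url)

    def search(self, query):
        """Return the urls whose page content contains every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        results = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            results &= self.postings.get(term, set())
        return results


# Stand-in for the web: only pages that someone shares ever get indexed.
fake_web = {
    "http://example.com/a": "Election protests continue in Tehran",
    "http://example.com/b": "Recipe for cabbage soup",
}
index = SocialIndex(fetch_page=fake_web.__getitem__)
index.ingest(ShareEvent("http://example.com/a", "twitter", "alice", 1000.0))
index.ingest(ShareEvent("http://example.com/a", "digg", "bob", 1005.0))
hits = index.search("tehran protests")
```

Note that the cabbage soup page never enters the index because nobody shared it. In this sketch, as in the description above, the people sharing links decide what gets indexed.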

It’s a completely new way to index the web. Effectively, users of the social web are curating the search index as they Tweet, Digg or share links on other services. Those pages inherently have social buzz and implicitly reflect “what’s going on right now” for their subject matter. Meanwhile, we provide the infrastructure to keep it all up to date in “realtime.”

In addition, OneRiot also draws upon its own panel of users to help determine what webpages will be indexed. Similar to Compete.com or other internet measurement services, OneRiot manages a significant panel of users (almost 3 million strong at this point) who have opted in to pass back anonymous data about what pages are important to them as they surf the web.

This aggregation of data from our own panel alongside realtime sharing activity on services like Twitter and Digg helps create a huge realtime index of the web. While the volume of shared links on Twitter is exploding, they account for only a fraction of the web pages in our index. This is important. There is no doubt that Twitter provides a tremendously valuable stream of data for us, but Harvard Business Review recently reported that 10% of Twitter users create 90% of the content. If a search index is based exclusively on tweets, its results will be heavily biased towards the social activity of that subset of power users. So OneRiot’s search index is constantly being updated with the web pages that are generating social buzz across the whole web right now, not just on one service. We index hyper-fresh, socially relevant pages. Pages that perhaps haven’t been published long enough to start building up a traditional Rank in Google. In other words, our index is full of potential results for that 40% of users performing Browse queries. When the user wants to know “what’s going on right now,” we’ve got the pages indexed to help answer that question – powered by the social web, and a lot of realtime infrastructure.

Naturally there are some challenges to creating a realtime index of the social web. Chief among them is spam. Indeed, many observers think there’s a tsunami of spam heading for the social web – especially in the realtime conversations that often act as the platform for link sharing (ref: Danny Sullivan’s excellent blog here). Undoubtedly, there is tremendous value in following the stream of realtime conversation on services like Twitter (e.g. “Iran”). But I can also tweet something like “Obama is awesome <link to a porn movie>” and see the link to that porn movie showing up in search results for “Obama” on any search engine that only indexes tweets. At OneRiot, we’ve chosen to index the content behind the link – whether that link has been tweeted or dugg, or shared elsewhere. So our search results focus on the content that the social web is buzzing about, in addition to the conversation it is having. In the Obama example above, our crawler would go to the page behind the link that was tweeted, then index and categorize the content. A search on OneRiot for “Obama” would not return that porn movie. Our index is realtime, but also reliable.
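The difference between indexing the tweet and indexing the content behind the link can be shown with a small sketch. Everything here is hypothetical (the function names, the example URLs, the naive word matching), and it leaves out the categorization step mentioned above. It only illustrates why matching on page content keeps a misleading tweet out of the results.

```python
def matches(query, text):
    """True if every query word appears in the text (naive word match)."""
    return set(query.lower().split()) <= set(text.lower().split())


def search_tweet_text(query, tweets):
    """Strategy A: index only what the tweet says."""
    return [t["url"] for t in tweets if matches(query, t["text"])]


def search_linked_content(query, tweets, pages):
    """Strategy B: index the page behind the shared link."""
    return [t["url"] for t in tweets if matches(query, pages[t["url"]])]


tweets = [
    # Misleading tweet: the text mentions the topic, the page does not.
    {"text": "Obama is awesome http://example.com/spam", "url": "http://example.com/spam"},
    # Genuine share: the text is vague, the page is on topic.
    {"text": "Big speech today http://example.com/news", "url": "http://example.com/news"},
]
pages = {
    "http://example.com/spam": "unrelated adult video listing",
    "http://example.com/news": "Obama addresses Congress on health care",
}

by_tweet_text = search_tweet_text("Obama", tweets)
by_page_content = search_linked_content("Obama", tweets, pages)
```

Matching on tweet text surfaces only the misleading link, while matching on page content surfaces only the on-topic story.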

New ways to rank the realtime web – PulseRank
Now that you’ve got a realtime index of the web, how do you rank the pages within it? When you search, what results should be retrieved from the index and placed at the top of the search results page? In other words, what are the news, stories and videos with the most social relevance right now?

Firstly, being a realtime search engine, OneRiot ranks its results at search-time. That’s key. Realtime search results need to be ordered based on social relevance right now, not sometime recently.

Secondly, we have invented a new ranking algorithm – PulseRank – to drive the realtime ordering of our search results. Think of PulseRank as PageRank for the realtime web. If PageRank reflects historical dependability, then PulseRank reflects current social buzz. PulseRank is the ranking algorithm for the 40% of searches that traditional search engines struggle with.

Our PulseRank algorithm actually looks at dozens of factors that give “weight” to certain results in realtime. As a previous blog post noted in detail, these include:

Freshness: A story published 2 minutes ago is probably more interesting than one published 2 weeks ago, if the user is performing a Browse search. But the ranking algorithm also accounts for the fact that the most recently published content is not necessarily the most relevant. The realtime stream – aka the firehose – can be noisy and filled with spam.

Domain Authority: Just because I’ve published a post on my own personal blog about Obama, should that be weighted more highly than a post from, say, the New York Times, on the same subject published at the same time? PulseRank considers factors like the number of links being shared from a particular domain right now, and increases the weight for links from currently popular domains.

People Authority: PulseRank considers who shared the link on the social web. Known spammers tend to pummel their social graph with the same link many times a day. Links shared in this manner will get a lower weight in our system. More thoughtful social web users share links that tend to get retweeted and heavily dugg. Those links get a higher weight.

Acceleration: PulseRank considers whether a link is increasing in hotness or decreasing in hotness. For example, we assess whether more people are sharing the link right now than they were 2 minutes ago. The algorithm is weighted to favor “emerging” webpages rather than popular ones that everyone already knows about.

These are just four of dozens of factors that combine, at search-time, to calculate a page’s PulseRank, which determines where the link sits on our search results page. The end result to you, the user, should be better results. In short, the most socially relevant content on the web, related to your search query, should be the top result.
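This post does not give the actual PulseRank formula, so the following is only a sketch of how the four factors named above might be combined into a single score at search time. The field names, the weights, the one-hour freshness half-life and the formulas themselves are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkStats:
    """Signals for one indexed page (all field names are hypothetical)."""
    age_seconds: float        # time since the page was published
    domain_shares_now: int    # links currently being shared from the page's domain
    sharer_reputation: float  # 0.0 (known spammer) .. 1.0 (trusted sharer)
    shares_last_2_min: int    # shares in the most recent 2-minute window
    shares_prev_2_min: int    # shares in the 2-minute window before that


def pulse_score(s, half_life_seconds=3600.0):
    """Illustrative search-time score; weights and formulas are assumptions."""
    # Freshness: exponential decay, halving every `half_life_seconds`.
    freshness = 0.5 ** (s.age_seconds / half_life_seconds)
    # Domain authority: grows with current sharing volume, with diminishing returns.
    domain = math.log1p(s.domain_shares_now)
    # People authority: reputation of whoever shared the link.
    people = s.sharer_reputation
    # Acceleration: relative change in sharing rate between the two windows.
    acceleration = (s.shares_last_2_min - s.shares_prev_2_min) / (s.shares_prev_2_min + 1)
    return 0.35 * freshness + 0.2 * domain + 0.25 * people + 0.2 * acceleration


candidates = {
    # Two minutes old, trusted sharers, sharing rate rising.
    "emerging": LinkStats(120, 50, 0.8, 40, 10),
    # Two weeks old, same domain and sharers, sharing rate falling.
    "stale": LinkStats(14 * 24 * 3600, 50, 0.8, 10, 40),
    # Fresh, but pushed by a known spammer from an unshared domain.
    "spammy": LinkStats(120, 0, 0.0, 5, 5),
}
# Scores are computed when the query arrives, not stored ahead of time.
ranked = sorted(candidates, key=lambda name: pulse_score(candidates[name]), reverse=True)
```

With these made-up numbers, the fresh and accelerating page ranks first, and freshness alone is not enough to lift the spammer’s link. The real algorithm weighs dozens of factors, so this should be read as a toy model only.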

Delivering Realtime Web Search results at Scale
Clearly, delivering search results at speed and scale is critical. Every new data stream that the system ingests adds a layer of complexity at scale. As the size of the index grows, the system needs to grow too – to be able to index, return results, and rank to provide relevance, all in realtime. We’ve built some fantastic technology to be able to deal with that – including a highly optimized in-memory index to support super-fast retrieval of search results. Our technology also includes a robust partner API that powers many partners, helping to deliver realtime search results to their users. And Microsoft recently released a new version of Internet Explorer 8 bundled with OneRiot search. Being able to deliver realtime web search results at scale is key – we owe it to our users and our partners.

The Future: Monetizing Realtime Web Search
Contextual advertising against search results pages is clearly a proven model, and it definitely has its place in realtime search. However, because search results from the realtime web keep updating, our studies have shown that users search many more times per day per query with OneRiot than they do on a traditional engine – because they want to stay on top of the latest buzz. That alone offers many more opportunities to monetize the same user using this well understood model. Our belief, however, is that new realtime monetization models we are working on will deliver even better results. But that is for the future. For now, our primary focus is on delivering user value at OneRiot.com and to our partners through our API. That puts the focus on speed, relevancy, scale and distribution. We’re excited about the work ahead.

  1. Terence Pua June 22, 2009 5:41 pm

    Where did you get the % breakdown for the search types? Would be good to include a footnote to the source.

  2. James Pearce June 22, 2009 7:44 pm

    Very interesting. Just one thing though.

    Is “what’s going on right now?” the same as “what’s being linked a lot on the realtime web?”

    Isn’t the reliance on hyperlinks to imply importance a rather traditional philosophy?

    Since the twin challenge to ‘realtime’ for next-gen search is mobile & ‘local’, how could you see your approach being used to answer an even more interesting question:

    “What’s going on right now, near me?”

    Exciting times, guys! Good luck.

  3. [...] Peggs, GM of OneRiot, posted a great blog today that says a lot about what realtime search is, and how OneRiot fits into the space. A great read [...]

  4. [...] A description of the algorithms and processes used in building a realtime search engine, from the creators of one such engine, OneRiot. In brief: indexes need to be created that update continuously rather than periodically. At the same time, sources have to be juggled so as to cover not only popular resources (Digg, Twitter) but also interesting resources from individual users, to avoid becoming a simple mirror of Digg and Twitter. There is also the problem of spam, since the indexed “social mass media” often become a platform for promoting spam projects. [...]

  5. [...] service today. Or maybe things like its role in major events — such as the popularity of its real-time search during the Iran election crisis — will make it more and more [...]

  6. [...] too much noise. Other approaches attempt to add in other factors. OneRiot, for instance, is developing what it calls PulseRank, which takes into account the freshness of the information, the link authority of the Webpage where [...]

  10. [...] our lifestyles. Among these transformations, Schonfeld includes a series of new search engines (OneRiot, Collecta that are trying to address the need of real time [...]

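  12. [...] However, simply displaying messages newest-first inevitably produces a lot of noise. Services that take a different approach to reduce that noise have also appeared. For example, OneRiot has developed a method called PulseRank. In addition to the time criterion, it considers the importance of the linked web page, the importance of the user who posted the link, the speed at which the message spread, and so on. Such methods make sense in theory, but compared with simply prioritizing new messages, they may still be slower to report in cases such as sudden major incidents. [...]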

  13. [...] The Inner Workings of a Realtime Search Engine | Blog.OneRiot.com – Blogging the Pulse of the … This blog post is a summary of the forthcoming white paper from OneRiot, “The Inner Workings of a Realtime Search Engine.” (tags: news blog search twitter whitepaper realtime oneriot) [...]

  14. [...] The Inner Workings of a Realtime Search Engine (oneriot.com) [...]

  15. [...] more, see our OneRiot Offers Twitter Search … With a Twist post. Also take a look at The Inner Workings of a Realtime Search Engine from OneRiot, which discusses how it tried to measure the “pulse” of content from a [...]

  16. [...] more, see our OneRiot Offers Twitter Search … With a Twist post. Also take a look at The Inner Workings of a Realtime Search Engine from OneRiot, which discusses how it tried to measure the “pulse” of content from a variety of [...]

  17. [...] The Inner Workings of a Realtime Search Engine [...]

  18. [...] want to subscribe to the RSS feed for updates on this topic.Powered by WP Greet BoxAn interesting white paper published at OneRiot’s blog claims that 40% of users’ search queries across major [...]
