UPDATE: This post is now available as a completed white paper. Get the PDF download here. 
This blog post is a summary of the forthcoming white paper from OneRiot, “The Inner Workings of a Realtime Search Engine.” For an advance copy, just ping Tobias. In the meantime, please leave comments and ask questions on this blog post. Let us know if we’ve covered enough ground, or gone into enough depth. We will try to address each point both on the blog and while completing the white paper.
40% of users perform search queries which display an intent that is best satisfied by realtime search results. Industry numbers aside, Iran – the country, the situation and the search query – has conclusively proven that users want search results from the realtime web.
Users Want Realtime Search
Across all the major search engines, including Google, Yahoo, Bing and Ask, industry numbers indicate that 40% of users are performing search queries which display an intent that is best satisfied by realtime search results. Irrespective of industry numbers, Iran – the country, the situation, and the search query – has proved beyond doubt that there is huge demand for search results from the realtime web. The question on everybody’s lips is: “What’s going on right now?” In order to answer that question, they need to find the news, images, conversation, stories and videos with the most social relevance right now. Realtime search results meet that need.
Every day, hundreds of millions of search engine users type something as heavyweight as “Obama,” or as entertaining as “Britney,” into the search box and expect to find out what’s going on right now for that topic. These types of searches are commonly called “browse” searches, as people are browsing for information. They don’t have a particular URL in mind. They just want to know what’s going on right now – the source of information being less important than the information itself. Those users are best satisfied by search results from the realtime web.
Making up the remaining 60% of searches on the web are “Navigation” searches (20%), and specific “Informative” searches (40%). An example of a navigation search is when a user is trying to get to Sony.com, or Yahoo.com. They will enter a search query in an attempt to find a recognized home page. An example of an informative search is when a user is trying to find a specific recipe for Cabbage Soup that is definitely “out there somewhere.” They enter a query in an attempt to find that specific information.
The best traditional search engines are very good at finding navigation search results, and specific information. The best realtime web search engines are very good at finding Browse search results – addressing fully 40% of the market. With 1% of the search market worth $1bn per year, 40% is a huge target to go after.
Traditional Search – A Broad Overview
Traditional search engines treat the web like a library. Web pages are crawled, and the content is stored in an index for efficient retrieval of information. Those web pages also build up a “Rank” over time (e.g. Google’s “PageRank”). Pages with the highest Rank percolate to the top of the results.
A page’s Rank is constructed from many factors, but one of the most important is citation importance – broadly, the number of inbound links to that web page. This approach tends to favor highly referenced resources like Wikipedia. For example, search for “Britney Spears” on a traditional search engine and the top result is likely to be a Wikipedia page. This approach produces dependable results, but results that are not necessarily reflective of why the user would be searching for Britney at any particular time (i.e. to find out what’s going on right now). Additionally, a page’s Rank is relatively static. It changes periodically, but not at a pace to keep up with the realtime world of changing interests in a topic. A page with high rank might be tremendously relevant yesterday, but not tomorrow. A traditional search engine is only able to return yesterday’s relevant result.
Traditional search engines struggle to surface the hyper-fresh and socially relevant “realtime” results that satisfy users performing Browse searches. OneRiot, a realtime search engine, is focused exclusively on solving that problem and addressing that 40% of the market. To do that, we have had to:
Invent new ways to index the web: by harnessing the power of the realtime social web.
Invent new ways to rank the content in that index: at search time, to deliver the most relevant result right now.
We will now consider each of these two innovations in turn.
New ways to Index the realtime web
Traditional search engines crawl the web by systematically following links between billions of pages, then indexing the content on those pages. Broadly, they consider the link to be a signal to an important piece of content.
OneRiot, in contrast, considers realtime activity on the social web when determining which pages to index. We consider the links people are tweeting, or digging, or sharing on other services, as a signal to an important piece of content.
In the last two years there has been an explosion in the number of links being shared across the realtime social web. This in part has been driven by the phenomenal growth of Twitter. But the realtime web is much wider than Twitter.
Services like Digg and Delicious – whose user communities provide a wealth of explicit “social signals” to important pieces of content – also continue to grow. Meanwhile, the rise of sharing services like Shareaholic has promoted additional realtime sharing of content on the web. URL shorteners like Bit.ly and TinyURL make this even easier for users. Facebook and other social networks make it easy to share links across users’ social graphs. Some of this information is publicly available, some is not. But there is a plethora of tools and services that have made sharing of links commonplace among the 230 million US users of the internet, and millions more internationally.
At OneRiot we aggregate that realtime activity across the social web, considering the links people are sharing right now. We then crawl to the pages those links point to, and index the content on those pages – and we do it fast. Currently we index the content of the page and make it ready to search in less than 0.8 seconds.
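As a rough illustration of the aggregation step described above, the sketch below counts how many distinct users shared each link across several services and picks the links worth crawling next. The `ShareEvent` record, the service names and the `min_sharers` threshold are assumptions made for this example; they are not a description of OneRiot's actual pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ShareEvent:
    """One observed share of a link (hypothetical record shape)."""
    service: str  # e.g. "twitter" or "digg"
    user: str
    url: str


def links_to_crawl(events, min_sharers=2):
    """Return URLs shared by at least `min_sharers` distinct users,
    ordered from most to least widely shared."""
    sharers = defaultdict(set)
    for event in events:
        # Key on (service, user) so repeat shares by one account count once.
        sharers[event.url].add((event.service, event.user))
    popular = [(url, len(users)) for url, users in sharers.items()
               if len(users) >= min_sharers]
    # Sort by descending sharer count, then by URL for a stable order.
    popular.sort(key=lambda item: (-item[1], item[0]))
    return [url for url, _ in popular]
```

Counting distinct sharers rather than raw share events is one simple way to keep a single account that repeats the same link from dominating the crawl queue.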
It’s a completely new way to index the web. Effectively, users of the social web are curating the search index as they Tweet, Digg or share links on other services. Those pages inherently have social buzz and implicitly reflect “what’s going on right now” for their subject matter. Meanwhile, we provide the infrastructure to keep it all up to date in “realtime.”
In addition, OneRiot also draws upon its own panel of users to help determine what webpages will be indexed. Similar to Compete.com or other internet measurement services, OneRiot manages a significant panel of users (almost 3 million strong at this point) who have opted in to pass back anonymous data about what pages are important to them as they surf the web.
This aggregation of data from our own panel alongside realtime sharing activity on services like Twitter and Digg helps create a huge realtime index of the web. While the volume of shared links on Twitter is exploding, they account for a fraction of the web pages in our index. This is important. There is no doubt that Twitter provides a tremendously valuable stream of data for us, but Harvard Business Review recently reported that 10% of the Twitter users create 90% of the content. If a search index is exclusively based on tweets its results will be heavily biased towards the social activity of that subset of power users. So OneRiot’s search index is constantly being updated with the web pages that are generating social buzz across the whole web right now, not just on one service. We index hyper-fresh, socially relevant pages. Pages that perhaps haven’t been published long enough to start building up a traditional Rank in Google. In other words, our index is full of potential results for that 40% of users performing Browse queries. When the user wants to know “what’s going on right now,” we’ve got the pages indexed to help answer that question – powered by the social web, and a lot of realtime infrastructure.
Naturally there are some challenges to creating a realtime index of the social web. Chief among them is spam. Indeed, many observers think there’s a tsunami of spam heading for the social web – especially in the realtime conversations that often act as the platform for link sharing (ref: Danny Sullivan’s excellent blog here). Undoubtedly, there is tremendous value from following the stream of realtime conversation on services like Twitter (e.g. “Iran”). But I can also tweet something like “Obama is awesome <link to a porn movie>” and see the link to that porn movie showing up in search results for “Obama” on any search engine that only indexes tweets. At OneRiot, we’ve chosen to index the content behind the link – whether that link has been tweeted or dugg, or shared elsewhere. So our search results focus on the content that the social web is buzzing about, in addition to the conversation it is having. In the Obama example above, our crawler would go to the page behind the link that was tweeted, then index and categorize the content. A search on OneRiot for “Obama” would not return that porn movie. Our index is realtime, but also reliable.
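The difference between indexing the tweet and indexing the page behind the link can be shown with a small sketch. Everything here is hypothetical: `searchable_terms`, the `fetch_page` callback and the `FAKE_WEB` dictionary standing in for a crawler are illustrative names, and real content categorization would be far more involved.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def searchable_terms(tweet_text, url, fetch_page):
    """Return the terms a shared link becomes searchable under.

    Only the fetched page content contributes terms. The tweet text is
    ignored on purpose, so a misleading tweet cannot attach its own
    keywords to an unrelated page.
    """
    return tokenize(fetch_page(url))


# Stand-in for a real crawler: maps URLs to page text (hypothetical data).
FAKE_WEB = {
    "http://example.com/spam": "adult movie clips",
    "http://example.com/news": "Obama addresses the nation on health care",
}

spam_terms = searchable_terms("Obama is awesome", "http://example.com/spam", FAKE_WEB.get)
news_terms = searchable_terms("worth a read", "http://example.com/news", FAKE_WEB.get)
```

In this toy setup the spam page is never reachable through the query “Obama”, because that word appears only in the tweet and not in the page content.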
New ways to rank the realtime web – PulseRank
Now that you’ve got a realtime index of the web, how do you rank the pages within it? When you search, what results should be retrieved from the index and placed at the top of the search results page? In other words, what are the news, stories and videos with the most social relevance right now? Firstly, being a realtime search engine, OneRiot ranks its results at search-time. That’s key. Realtime search results need to be ordered based on social relevance right now, not sometime recently.
Secondly, we have invented a new ranking algorithm – PulseRank – to drive the realtime ordering of our search results. Think of PulseRank as PageRank for the realtime web. If PageRank reflects historical dependability, then PulseRank reflects current social buzz. PulseRank is the ranking algorithm for the 40% of searches that traditional search engines struggle with.
Our PulseRank algorithm actually looks at dozens of factors that give “weight” to certain results in realtime. As a previous blog post noted in detail, these include:
Freshness: A story published 2 minutes ago is probably more interesting than one published 2 weeks ago, if the user is performing a browse search. But the ranking algorithm also accounts for the fact that the most recently published content is not necessarily the most relevant. The realtime stream – aka the firehose – can be noisy and filled with spam.
Domain Authority: Just because I’ve published a post on my own personal blog about Obama, should that be weighted more highly than a post from, say, the New York Times, on the same subject published at the same time? PulseRank considers factors like the number of links being shared from a particular domain right now, and increases the weight for links from currently popular domains.
People Authority: PulseRank considers who shared the link on the social web. Known spammers tend to pummel their social graph with the same link many times a day. Links shared in this manner will get a lower weight in our system. More thoughtful social web users share links that tend to get retweeted and heavily dugg. Those links get a higher weight.
Acceleration: PulseRank considers whether a link is increasing in hotness or decreasing in hotness. For example, we assess whether more people are sharing the link right now than they were 2 minutes ago. The algorithm is weighted to favor “emerging” webpages rather than popular ones that everyone already knows about.
These are just four of dozens of factors that combine, at search-time, to calculate a page’s PulseRank, which determines where the link sits on our search results page. The end result to you, the user, should be better results. In short, the most socially relevant content on the web, related to your search query, should be the top result.
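To make the idea of combining these factors at search time more concrete, here is a minimal sketch. It is not the PulseRank algorithm, whose details are not given in this post: the `LinkSignals` fields, the one-hour half-life, the logarithmic domain term and the multiplicative combination are all assumptions chosen only to show how four such signals could feed one score.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkSignals:
    """Hypothetical per-link signals gathered at search time."""
    age_seconds: float        # time since the page was published
    domain_shares_now: int    # links currently being shared from the domain
    sharer_reputation: float  # 0.0 (known spammer) to 1.0 (trusted sharer)
    shares_last_window: int   # shares in the most recent time window
    shares_prev_window: int   # shares in the window before that


def pulse_score(s, half_life_seconds=3600.0):
    """Combine the four signals into a single illustrative score."""
    # Freshness: the weight halves every `half_life_seconds`.
    freshness = 0.5 ** (s.age_seconds / half_life_seconds)
    # Domain authority: grows slowly with current sharing volume.
    domain = math.log1p(s.domain_shares_now)
    # People authority: used directly as a multiplier.
    people = s.sharer_reputation
    # Acceleration: ratio of recent to earlier shares; above 1 means emerging.
    acceleration = (s.shares_last_window + 1) / (s.shares_prev_window + 1)
    return freshness * (1.0 + domain) * people * acceleration
```

With these assumed weights, a link shared two minutes ago scores higher than an otherwise identical link from two weeks ago, a link shared only by a known spammer scores zero, and a link whose sharing is speeding up outranks one whose sharing is slowing down.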
Delivering Realtime Web Search results at Scale
Clearly, delivering search results at speed and scale is critical. Every new data stream that the system ingests adds a layer of complexity at scale. As the size of the index grows, the system needs to grow too – to be able to index, return results, and rank to provide relevance all in realtime. We’ve built some fantastic technology to be able to deal with that – including a highly optimized in-memory index to support super-fast retrieval of search results. Our technology also includes a robust partner API, that’s powering many partners, helping to deliver realtime search results to their users. And Microsoft recently released a new version of Internet Explorer 8 bundled with OneRiot search. Being able to deliver realtime web search results at scale is key – we owe it to our users and our partners.
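For readers unfamiliar with the term, an in-memory index can be pictured as a mapping from each word to the set of pages that contain it, held entirely in RAM. The toy class below is a sketch of that general technique, an inverted index, and makes no claim about how OneRiot's optimized index is built; `InMemoryIndex` and its methods are names invented for this example.

```python
import re
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index: term -> set of URLs containing that term."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, url, content):
        """Make `url` searchable under every term in `content`."""
        for term in self._tokenize(content):
            self._postings[term].add(url)

    def search(self, query):
        """Return the URLs that contain every term in the query."""
        terms = self._tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            # .get avoids creating empty entries for unknown terms.
            urls = self._postings.get(term, set())
            result = set(urls) if result is None else result & urls
        return result
```

Because a page becomes searchable as soon as `add` returns, a structure like this suits content that has to be queryable moments after it is crawled; ranking the matching URLs would be a separate step at search time.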
The Future: Monetizing Realtime Web Search
Contextual advertising against search results pages is clearly a proven model, and it definitely has its place in realtime search. However, because search results from the realtime web keep updating, our studies have shown that users search many more times per day per query with OneRiot than they do on a traditional engine – because they want to stay on top of the latest buzz. That alone offers many more opportunities to monetize the same user using this well-understood model. Our belief, however, is that new realtime monetization models we are working on will deliver even better results. But that is for the future. For now, our primary focus is on delivering user value at OneRiot.com and to our partners through our API. That puts the focus on speed, relevancy, scale and distribution. We’re excited about the work ahead.








Where did you get the % breakdown for the search types? Would be good to include a footnote to the source.
Very interesting. Just one thing though.
Is “what’s going on right now?” the same as “what’s being linked a lot on the realtime web?”
Isn’t the reliance on hyperlinks to imply importance a rather traditional philosophy?
Since the twin challenge to ‘realtime’ for next-gen search is mobile & ‘local’, how could you see your approach being used to answer an even more interesting question:
“What’s going on right now, near me?”
Exciting times, guys! Good luck.
[...] Peggs, GM of OneRiot, posted a great blog today that says a lot about what realtime search is, and how OneRiot fits into the space. A great read [...]
[...] A description of the algorithms and processes used in building a realtime search engine, from the creators of one such engine, OneRiot. In brief: indexes need to be updated continuously rather than periodically. At the same time, you have to juggle sources, trying to cover not only the popular resources (Digg, Twitter) but also interesting resources from individual users, so as not to become a mere mirror of Digg and Twitter. In addition, there is the problem of spam, since the indexed “social mass media” often become a platform for promoting spam projects. [...]
[...] service today. Or maybe things like its role in major events — such as the popularity of its real-time search during the Iran election crisis — will make it more and more [...]
[...] too much noise. Other approaches attempt to add in other factors. OneRiot, for instance, is developing what it calls PulseRank, which takes into account the freshness of the information, the link authority of the Webpage where [...]
[...] our lifestyles. Among these transformations, Schonfeld includes a series of new search engines (OneRiot, Collecta that are trying to address the need of real time [...]
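[...] However, simply displaying messages in reverse chronological order inevitably produces a lot of noise. Services that take a different approach to reduce that noise have also appeared. For example, OneRiot has developed a technique called PulseRank. In addition to the time-based criterion, it considers the importance of the linked web page, the importance of the user who posted the link, the speed at which the message spread, and so on. Such techniques make sense in theory, but compared with simply prioritizing new messages, they may still be slower to report in cases such as sudden major events. [...]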
[...] The Inner Workings of a Realtime Search Engine | Blog.OneRiot.com – Blogging the Pulse of the … This blog post is a summary of the forthcoming white paper from OneRiot, “The Inner Workings of a Realtime Search Engine.” (tags: news blog search twitter whitepaper realtime oneriot) [...]
[...] The Inner Workings of a Realtime Search Engine (oneriot.com) [...]
[...] more, see our OneRiot Offers Twitter Search … With a Twist post. Also take a look at The Inner Workings of a Realtime Search Engine from OneRiot, which discusses how it tried to measure the “pulse” of content from a [...]
[...] The Inner Workings of a Realtime Search Engine [...]
[...] want to subscribe to the RSS feed for updates on this topic.Powered by WP Greet BoxAn interesting white paper published at OneRiot’s blog claims that 40% of users’ search queries across major [...]