<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5716516924669268476</id><updated>2011-07-08T05:55:08.233-04:00</updated><title type='text'>Eric Gaumer</title><subtitle type='html'>Hacking The Planet</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>6</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-6480812451932042206</id><published>2009-09-27T11:50:00.001-04:00</published><updated>2009-09-29T07:34:00.707-04:00</updated><title type='text'>Pypes: The Topological Scheduler</title><content type='html'>Pypes provides a simple and robust scheduling algorithm. Each component is derived from a &lt;a href="http://www.stackless.com/"&gt;stackless&lt;/a&gt; tasklet which is an extremely lightweight thread of execution often referred to as a &lt;a href="http://en.wikipedia.org/wiki/Microthread"&gt;microthread&lt;/a&gt;. Components are then chained together in dependency order allowing the output of one component to be &lt;em&gt;pyped&lt;/em&gt; as input to another. But how does the scheduler handle branching and merging?&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;The answer is quite simple; it uses topological sorting. In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. Every DAG has one or more topological sorts and every dataflow you design with pypes is a DAG. This also helps explain why pypes &lt;a href="http://gaumer.blogspot.com/2009/08/loop-type-networks-in-pypes.html"&gt;does not support Loop-Type Networks&lt;/a&gt;. Loops introduce cycles in the graph.&lt;br /&gt;&lt;br /&gt;While this type of scheduling prohibits cycles, it allows us to achieve very high performance. Pypes was originally designed to process hundreds of millions of documents in preparation for indexing. Imagine we have 30 components designed to handle different processing aspects and data normalization. At just 1 million input documents we have created 30 million context switches. A context switch is the act of switching control from one component to another. This accounts for a significant portion of the overall processing time (the total throughput). The longer this takes, the lower the throughput.&lt;br /&gt;&lt;br /&gt;The topological scheduler in pypes never has to decide which component to run. This ordering is defined by the dataflow you've created (the DAG) and the linear dependency order provided by the topological sorting. This sorting happens once and is utilized throughout the entire execution which means our scheduler never has to decide which component to select next. If we add the lightweight (and extremely fast) mechanism of context switching that stackless python offers, we end up with a very well performing scheduler that's simple to understand as easy to maintain.&lt;br /&gt;&lt;br /&gt;Of course others have provided &lt;a href="http://entitycrisis.blogspot.com/2009/03/concurrent-scaling-benchmarks.html"&gt;benchmarks&lt;/a&gt; showing how well stackless performs against other concurrency frameworks. The decision to use stackless in pypes was based mainly on performance and the astute reader will recognize that we're not achieving any true concurrency using stackless. Rather, were using a cooperative multitasking approach which fits the nature of flow based programming. If component B depends on component A then it makes no sense to run B before A is complete. It also doesn't make sense to run A partially so we can run B partially. This might be a goal in a realtime situation such as playing an audio/video stream where we don't want to starve components but pypes is all about throughput. Our goal is to minimize the number of context switches and scheduling semantics.&lt;br /&gt;&lt;br /&gt;Of course this is not to say that pypes doesn't offer any true concurrency. Incoming documents are load balanced across an instance of the graph running on each CPU/core. If you have a dual core machine then you are essentially processing two documents in parallel. Typical use cases for us are quad Xeon machines running 4 instances of the graph in parallel (i.e., 4 documents being processed at once).&lt;br /&gt;&lt;br /&gt;Pypes can also scale out very easily. Since pypes uses a &lt;a href="http://en.wikipedia.org/wiki/REST"&gt;REST&lt;/a&gt; architecture (i.e., use HTTP POST to submit content) you can place several machines behind a load balancer and requests will be executed across the cluster. This allows you to achieve very high throughput.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-6480812451932042206?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/6480812451932042206/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=6480812451932042206' title='36 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/6480812451932042206'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/6480812451932042206'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/09/pypes-topological-scheduler.html' title='Pypes: The Topological Scheduler'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>36</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-3786774182777357247</id><published>2009-08-17T21:22:00.001-04:00</published><updated>2009-08-17T21:22:40.704-04:00</updated><title type='text'>Loop-Type Networks in Pypes</title><content type='html'>&lt;a href="http://yeoldeclue.com/blog"&gt;Michael Sparks&lt;/a&gt;, the author of &lt;a href="http://www.kamaelia.org/Home"&gt;Kamaelia&lt;/a&gt;, took a look at &lt;a href="http://www.pypes.org/"&gt;Pypes&lt;/a&gt; this weekend and sent out some &lt;a href="http://twitter.com/kamaelian?utm_source=follow&amp;utm_campaign=twitter20080331162631&amp;utm_medium=email"&gt;tweets&lt;/a&gt; mentioning the similarities, limitations, and possible synergies with Kamaelia. I'm a huge fan of Kamaelia and so the similarities should not be too surprising. I like the style of messaging passing used in Kamaelia and the only real difference there is that I chose the &lt;a href="http://jpaulmorrison.com/fbp/"&gt;Flow-Based Programming&lt;/a&gt; (FBP) terminology, calling them "ports" rather than "in/out boxes". At the same time, there are some fundamental differences both in design and objectives.&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;Pypes was designed with a specific goal in mind; to quickly process high volumes of data in a completely modular and service oriented way. When I was first faced with this task, I naturally turned to Kamaelia but the "service oriented" requirement proved to be a problem. It's possible that I just didn't understand Kamaelia well enough to address the issue but considering it uses generators, it inherently provides &lt;i&gt;lazy&lt;/i&gt; data-flow.&lt;br /&gt;&lt;br /&gt;By lazy data-flow, what I mean to say is that generators do not compute any data until it's asked for. This essentially means data is pulled through the system by creating components, that at some point, generate data. This is in contrast to &lt;i&gt;eager&lt;/i&gt; data-flow models where data is pushed into the system without ever being requested.&lt;br /&gt;&lt;br /&gt;Why does this matter? Pypes was originally designed to be a content processing framework capable of processing large volumes of data prior to being indexed for search (normalization, classification, information extraction, etc.). When it comes to indexing content, there are lots of systems/sources in which data resides (CMS, DB, Filesystem, Email, etc.). It's quite inefficient, and typically fragile, to poll these systems for new data. Pypes allows these systems to push new data out to a service oriented processing framework.&lt;br /&gt;&lt;br /&gt;Another big concern here is that many of these systems have language specific APIs. Writing a component that pulls data from FileNet using Python isn't possible. I felt it was important that this "connector" functionality be isolated from the task of content processing. Pypes uses the notion of Adapters to adapt various incoming data formats to a unified data model called a &lt;i&gt;Packet&lt;/i&gt; (using FBP terminology). Adapters can even consume batches of packets that are disassembled and streamed through the system.&lt;br /&gt;&lt;br /&gt;Why not just create a linear pipeline? Quite simply, we wanted the ability to publish to multiple locations. An editor might create a article that needs to be indexed but also needs to be pushed out to some location as an Atom feed. At the same time, we might process packets containing certain attributes slightly different than others so the ability to branch helps solve this.&lt;br /&gt;&lt;br /&gt;This leads me to the limitation that Michael mentioned regarding cycles. Everything I've described to this point is collectively referred to as &lt;i&gt;Batch-Type Networks&lt;/i&gt; in FBP. This topology means that the network generally has a left-to-right/top-to-bottom flow, with packets being created on the left/top side and disposed of on the right/bottom side of the network.&lt;br /&gt;&lt;br /&gt;When we introduce cycles then the topology changes to a &lt;a href="http://jpaulmorrison.com/fbp/loop.htm"&gt;Loop-Type Network&lt;/a&gt;. This sort of topology adds some additional complexity and overhead to the system that I just haven't wrapped my brain around yet. My ambition right now is focused more on building a richer component library and better documentation for &lt;a href="https://sourceforge.net/project/screenshots.php?group_id=271766"&gt;Visual Design Studio&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;I envision pypes (Visual Design Studio) as more of a mashup or ETL tool/framework. Similar to &lt;a href="http://pipes.yahoo.com/"&gt;Yahoo Pipes&lt;/a&gt; with the ability to write custom components while also scaling to handle large volumes of data. I actually did a demo at a search conference a few months back that dealt completely with RSS feeds. The basic concept was to mashup a bunch of RSS content for indexing which was a pretty trivial task for pypes. Visual Design Studio provides an interface that allows business users (non-developers) to wire up these components to produce different applications.&lt;br /&gt;&lt;br /&gt;See the Python &lt;a href="http://wiki.python.org/moin/Concurrency"&gt;concurrency wiki&lt;/a&gt; for a &lt;a href="http://wiki.python.org/moin/Concurrency/99Bottles"&gt;simple example&lt;/a&gt; of pypes as well as a comparison of the same solution across several different frameworks. You can also see a few examples at &lt;a href="http://www.pypes.org"&gt;pypes.org&lt;/a&gt; where we're in the process of adding more.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-3786774182777357247?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/3786774182777357247/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=3786774182777357247' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/3786774182777357247'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/3786774182777357247'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/08/loop-type-networks-in-pypes.html' title='Loop-Type Networks in Pypes'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-7430518071466998645</id><published>2009-08-16T09:33:00.001-04:00</published><updated>2009-08-16T10:34:17.707-04:00</updated><title type='text'>Flow-Based Programming with Pypes</title><content type='html'>I finally had a chance to launch a beta release of &lt;a href="http://www.pypes.org"&gt;pypes&lt;/a&gt; this week. Pypes is a framework for designing data flow applications using flow-based programming techniques. It originated from the necessity to process huge amounts of data prior to indexing. It was inspired by a wide range of concepts which resulted in it being a hybrid of several different techniques and/or architectural styles.&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;I would say that pypes most closely resembles J. Paul Morrison's &lt;a href="http://jpaulmorrison.com/fbp/"&gt;Flow-Based&lt;/a&gt; architectural style as it satisfies what he refers to as the "three main legs" of FBP:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;1. asynchronous processes&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;2. data packets with a lifetime of their own&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;3. external definition of connections&lt;br /&gt;&lt;br /&gt;Pypes achieves its asynchronous behavior by leveraging &lt;a href="http://www.stackless.com/"&gt;Stackless Python&lt;/a&gt;. There's been a lot of buzz (and frameworks) designed around Python's generator syntax and I started down that path as well. I ultimately turned to Stackless because performance was a priority.&lt;br /&gt;&lt;br /&gt;It's not uncommon to create data-flow graphs that range anywhere from 5 to 30 different components. In the same breath, it's not uncommon to process documents numbering in the millions. Ten million documents switching across 30 different components equates to a whole lot of context switching.&lt;br /&gt;&lt;br /&gt;This is not to say that context switching between generators is necessarily slow, but task scheduling also plays a role. I personally haven't done any benchmarks in this area but &lt;a href="http://entitycrisis.blogspot.com/2009/03/concurrent-scaling-benchmarks.html"&gt;others have&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In the context in which pypes was designed, I simply needed blazing fast performance with superior scalability. Pypes scales up using the (new) Python 2.6 multiprocessing module. Essentially an instance of the graph is placed on each CPU and/or core in order to achieve parallel processing. Incoming packets are then load balanced between instances using a multiprocessing Queue.&lt;br /&gt;&lt;br /&gt;Pypes scales out using REST (well actually &lt;a href="https://sourceforge.net/project/screenshots.php?group_id=271766"&gt;Visual Data Studio&lt;/a&gt; adds this functionality on top of pypes). Visual Data Studio is a WSGI application that provides a REST interface to the underlying pypes framework. Users can simply POST documents to the system for processing (a very common use case in search indexing). VDS also provides a plugin architecture that allows users to drop new/custom components into the system (using Python's egg format) as well as templates (via Paste) for generating the boilerplate code necessary for creating custom components. Essentially, VDS is to pypes what &lt;a href="http://lucene.apache.org/solr/"&gt;SOLR&lt;/a&gt; is to &lt;a href="http://lucene.apache.org/java/docs/"&gt;Lucene&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;It's one thing to achieve modularity at a code level but providing that same modularity for non-developers is a more difficult goal. This is where Morrison's idea of external definitions of connections is essential. Without this it becomes impossible to achieve true modularity. VDS/pypes leverages this concept to provide a completely visual development and design environment that allows business users to build business processing logic without the need for any programming experience whatsoever. This was another highly prioritized design goal considering business logic tends to fluctuate long after the original search architects and engineers complete.&lt;br /&gt;&lt;br /&gt;Documentation at this time is pretty scarce but I'm working on getting together an extensive set of examples and tutorials (including podcasts). In the meantime, feel free to email with any questions, concerns, or complaints and I'll be happy to address them. It's always interesting to see people using software in ways you never originally designed it for. Of course this also helps shape its future.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-7430518071466998645?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/7430518071466998645/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=7430518071466998645' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/7430518071466998645'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/7430518071466998645'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/08/flow-based-programming-with-pypes.html' title='Flow-Based Programming with Pypes'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-5832555805273131831</id><published>2009-05-04T19:39:00.001-04:00</published><updated>2009-05-04T19:39:31.187-04:00</updated><title type='text'>Bringing Entity Extraction to SOLR</title><content type='html'>We've done a proof-of-concept that brings &lt;a href="http://en.wikipedia.org/wiki/Named_entity_recognition"&gt;NER&lt;/a&gt; capabilities to &lt;a href="http://lucene.apache.org/solr/"&gt;SOLR&lt;/a&gt;. &lt;a href="http://en.wikipedia.org/wiki/Faceted_browser"&gt;Faceted navigation&lt;/a&gt; enhances the search experience and provides a more exploratory driven process, allowing the user to glean information about data they wouldn't typically get from a standard result set.&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;At the moment we're using machine learning techniques to extract names from the documents at index time. We've written some code that leverages a custom version of &lt;a href="http://mallet.cs.umass.edu/"&gt;MALLET&lt;/a&gt;. Our approach to entity extraction is to view it as a classification problem which can then be solved using sequence labeling. &lt;a href="http://en.wikipedia.org/wiki/Conditional_random_field"&gt;CRF&lt;/a&gt; seems to be at the forefront of that research, augmenting maximum entropy Markov models (MEMM) and solving what is known as the &lt;i&gt;label bias&lt;/i&gt; problem.&lt;br /&gt;&lt;br /&gt;Our reasoning for using a more sophisticated approach was the potential ability to extract more difficult entities where grammar based approaches typically fail. Grammar based approaches also tend to be language dependent and cumbersome to build and maintain. Statistical methods require large volumes of annotated data but the process itself is simple and can be performed by novice developers who have no formal background in computational linguistics.&lt;br /&gt;&lt;br /&gt;The extractor we built for the POC was modestly trained but the focus was more on integration points. The extracted names are indexed in a multi-valued field which we use to create a facet at query time.&lt;br /&gt;&lt;br /&gt;Here's a screenshot showing the POC work we did. We then crawled a site about a sax player named Jon Smith. You can see the extracted names along with the facet counts displaying the number of occurrences each entity represents within the result set. &lt;br /&gt;&lt;br /&gt;&lt;img src="http://web.me.com/egaumer/blogger/images/ee-sample.tiff"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-5832555805273131831?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/5832555805273131831/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=5832555805273131831' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/5832555805273131831'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/5832555805273131831'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/05/bringing-entity-extraction-to-solr.html' title='Bringing Entity Extraction to SOLR'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-513078922004919066</id><published>2009-05-03T12:55:00.001-04:00</published><updated>2009-05-03T13:03:07.723-04:00</updated><title type='text'>Boolean Retrieval: What You're Not Being Told</title><content type='html'>Despite the wide-scale criticism by many researchers, boolean retrieval models continue to dominate the commercial search space. The long recognized limitations and inadequacies of boolean retrieval models seem to have had no discernible effect on the industry. Entities such as Google have conditioned users into believing their information needs are being met but past research proves otherwise.&lt;br /&gt;&lt;br /&gt;In a study conducted by Blair &amp; Maron in 1985 involving a 40,000 document case, lawyers estimated that a manual search would find 75% of relevant documents, when in fact the research showed only 20% or so had actually been found. Our information needs have grown exponentially over the past few decades while search technology has remained relatively stagnant. Statistics such as these are particularly alarming from the perspective of e-discovery where justice is at stake.&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;So what exactly is boolean retrieval anyway? For starters, it the basis by which most of you search for information. Boolean retrieval models use Boolean algebra to partition a set of documents into two distinct classes; documents which are relevant to the user's information needs and those which are not.&lt;br /&gt;&lt;br /&gt;Allow me to start by providing some background knowledge and insight as to how boolean retrieval models operate.&lt;br /&gt;&lt;br /&gt;Assume we have a set of four documents in a fictitious search engine D where D = { &lt;i&gt;d&lt;sub&gt;A&lt;/sub&gt;, d&lt;sub&gt;B&lt;/sub&gt;, d&lt;sub&gt;C&lt;/sub&gt;,  d&lt;sub&gt;D&lt;/sub&gt;&lt;/i&gt; }.&lt;br /&gt;&lt;br /&gt;Now consider that each document in D contains tokens (words) from a finite set T where T = {&lt;i&gt;t&lt;sub&gt;1&lt;/sub&gt;, t&lt;sub&gt;2&lt;/sub&gt;, t&lt;sub&gt;3&lt;/sub&gt;, ..., t&lt;sub&gt;n&lt;/sub&gt;&lt;/i&gt; }.&lt;br /&gt;&lt;br /&gt;This can be represented internally as a &lt;i&gt;binary incidence matrix&lt;/i&gt; M where M is shown just below.&lt;br /&gt;&lt;br /&gt;&lt;center&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;d&lt;sub&gt;A&lt;/sub&gt;&lt;i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;d&lt;sub&gt;B&lt;/sub&gt;&lt;i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;d&lt;sub&gt;C&lt;/sub&gt;&lt;i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;d&lt;sub&gt;D&lt;/sub&gt;&lt;i&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;4&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;5&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;6&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;7&lt;/sub&gt;&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;...&lt;/i&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;...&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;...&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;...&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;...&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&lt;i&gt;t&lt;sub&gt;n&lt;/sub&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&lt;center&gt;0&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;td&gt;&lt;center&gt;1&lt;/center&gt;&lt;/td&gt;&lt;tr/&gt;&lt;/table&gt;&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;The idea here is simple; if a term &lt;i&gt;t&lt;sub&gt;x&lt;/sub&gt;&lt;/i&gt; appears in a document &lt;i&gt;d&lt;sub&gt;x&lt;/sub&gt;&lt;/i&gt; then we record a 1 in our incidence matrix where &lt;i&gt;d&lt;sub&gt;x&lt;/sub&gt;&lt;/i&gt; and &lt;i&gt;t&lt;sub&gt;x&lt;/sub&gt;&lt;/i&gt; intersect, otherwise we record a 0. For example, &lt;i&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/i&gt; appears in documents &lt;i&gt;d&lt;sub&gt;A&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;d&lt;sub&gt;B&lt;/sub&gt;&lt;/i&gt;, and &lt;i&gt;d&lt;sub&gt;D&lt;/sub&gt;&lt;/i&gt; but does not appear in document &lt;i&gt;d&lt;sub&gt;C&lt;/sub&gt;&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Once we have constructed our incidence matrix, issuing boolean queries becomes trivial. We apply boolean algebra to the term vectors representing each query term. If the resulting column is a 1 then the document corresponding to that column is part of the result set R. Consider the following example queries.&lt;br /&gt;&lt;br /&gt;Return all documents that contain the term (keyword) &lt;i&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/i&gt; would yield R = { &lt;i&gt;d&lt;sub&gt;A&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;d&lt;sub&gt;C&lt;/sub&gt;&lt;/i&gt; } because 1010 = 1010 &lt;br/&gt;&lt;br /&gt;Return all documents that contain the term &lt;i&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/i&gt; AND &lt;i&gt;t&lt;sub&gt;7&lt;/sub&gt;&lt;/i&gt; would yield R = { &lt;i&gt;d&lt;sub&gt;C&lt;/sub&gt;&lt;/i&gt; } because 1010 AND 0111 = 0010 &lt;br/&gt;&lt;br /&gt;Return all documents that contain the term &lt;i&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/i&gt; OR &lt;i&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/i&gt; would yield R = { &lt;i&gt;d&lt;sub&gt;A&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;d&lt;sub&gt;B&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;d&lt;sub&gt;C&lt;/sub&gt;&lt;/i&gt;, &lt;i&gt;d&lt;sub&gt;D&lt;/sub&gt;&lt;/i&gt; } because 0101 OR 1010 = 1111 &lt;br/&gt;&lt;br /&gt;Return all documents that contain the term &lt;i&gt;t&lt;sub&gt;7&lt;/sub&gt;&lt;/i&gt; AND &lt;i&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/i&gt; AND NOT &lt;i&gt;t&lt;sub&gt;5&lt;/sub&gt;&lt;/i&gt; would yield R = { &lt;i&gt;d&lt;sub&gt;B&lt;/sub&gt;&lt;/i&gt; } because 0111 AND 1101 AND NOT 0001 = 0101 AND 1110 = 0100&lt;br /&gt;&lt;br /&gt;From this simple example you should now have a very basic understanding of how boolean retrieval models operate. With that, it should be evident that any given document represented in the incidence matrix is either part of a particular result set or not; either relevant or not relevant to the user's information need. &lt;br /&gt;&lt;br /&gt;In the real world however, relevance is not so black and white. Information typically has a degree of relevance and cannot simply be partitioned into two distinct sets.&lt;br /&gt;&lt;br /&gt;The astute reader might argue that typical search engines rank documents by weighting metrics that are applied to each document in the result set. This is typically referred to as &lt;i&gt;extended Boolean retrieval&lt;/i&gt; and it still doesn't address the primary concern.&lt;br /&gt;&lt;br /&gt;What this does is shift responsibility from the designer to the user. It requires users to have sophisticated knowledge about boolean operators. At the same time, the search engine ranking heuristics have to be tuned in such a way as to align themselves with the expectations of the user. The problem is that users expectations will vary from one user to another. The search engine might be configured to boost (favor) documents that contain matching query terms in the title (as opposed to the document body for instance). Can such rigid constraints possibly meet the needs of every user?&lt;br /&gt;&lt;br /&gt;Google has had some success with their &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt; algorithm which essentially provides a popularity score to each document to help determine overall relevance. Does popularity really have any relation to relevance? A document can be extremely relevant regardless of its popularity and one has to wonder just how effective this approach is.&lt;br /&gt;&lt;br /&gt;To complicate matters, algorithms like PageRank don't really apply to information residing outside the World Wide Web (hypermedia in general). As organizations around the globe continue to store larger volumes of data, search is becoming more commonplace. Just how effective are current techniques used within enterprise search? Are the current ranking algorithms sufficient to meet the growing demands of information seekers?&lt;br /&gt;&lt;br /&gt;To provide yet another argument, consider a document titled: &lt;i&gt;&lt;u&gt;The Guide to domestic car repair for all models excluding Ford&lt;/u&gt;&lt;/i&gt;. For arguments sake let assume we have a user whose information need involves domestic car repair for any model with the exception of Ford. A reasonable user would construct a Boolean query such as &lt;b&gt;domestic cars AND NOT ford&lt;/b&gt;. &lt;br /&gt;&lt;br /&gt;One might be surprised to find out that the document above is excluded from the result set. The problem here is that the term &lt;i&gt;Ford&lt;/i&gt; appears in the title so the boolean NOT operator forces exclusion. Of course no amount a rank tuning can prevent this from happening. In a controlled test environment these issues become evident but in searching larger systems of potentially unexplored data, these sort of relevance problems go unnoticed. I would speculate that this is one indication of why Boolean retrieval models are still common in the enterprise search space.&lt;br /&gt;&lt;br /&gt;So what options are available? Luckily these very same questions and concerns have sparked new life into research based on  &lt;i&gt;Probabilistic Information Retrieval&lt;/i&gt;. Probabilistic IR models predict the probability that a given document will be relevant to a given query then ranks the document according to its probability of relevance. Documents are no longer considered to be simply relevant or not, rather they contain a degree of relevance.&lt;br /&gt;&lt;br /&gt;There's a lot of research to support this type of model and it's actually been around for decades. If you're more pragmatic by nature then I suggest giving &lt;a href="http://xapian.org/"&gt;Xapian&lt;/a&gt; a try. It's an open source, free search engine you can download and use. It incorporates a probabilistic relevance model (&lt;a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)"&gt;Okapi BM25&lt;/a&gt;) to score documents and it's used to drive several high volume, high profile, sites on the Internet including &lt;a href="http://www.debian.org/"&gt;Debian&lt;/a&gt; and &lt;a href="http://gmane.org/"&gt;Gmame&lt;/a&gt; to name just a few.&lt;br /&gt;&lt;br /&gt;Another exciting area of research is the use of language models for information retrieval. There's an ongoing cooperative effort between the University of Massachusetts and Carnegie Mellon University to build language modeling information retrieval tools. In particular they have developed &lt;a href="http://www.lemurproject.org/indri/"&gt;Indri&lt;/a&gt;, an IR system that merges ideas from &lt;a href="http://en.wikipedia.org/wiki/Bayesian_inference"&gt;Bayesian inference&lt;/a&gt; networks and &lt;a href="http://en.wikipedia.org/wiki/Statistical_language_modeling"&gt;statistical language modeling&lt;/a&gt; approaches. Indri is also free to download and use in both academic and commercial environments (and of course, personal exploration).&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-513078922004919066?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/513078922004919066/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=513078922004919066' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/513078922004919066'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/513078922004919066'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/05/boolean-retrieval-what-you-not-being.html' title='Boolean Retrieval: What You&amp;#39;re Not Being Told'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5716516924669268476.post-1965983259715354157</id><published>2009-04-30T00:49:00.001-04:00</published><updated>2009-04-30T01:29:58.595-04:00</updated><title type='text'>Open Source Enterprise Search</title><content type='html'>With the 2009 &lt;a href="http://www.infonortics.com/searchengines/"&gt;Infonortics Search Meeting&lt;/a&gt; concluded, it's even more apparent that enterprise search (search in general) has become a commodity. With most of the industry leaders showcasing their offerings, I had a hard time differentiating the choices. They all seemed to offer some variation of semantic analysis and faceted browsing has become a standard search feature as well.&lt;br /&gt;&lt;br /&gt;More importantly, I didn't see many commercial value add-ons that I can't get from &lt;a href="http://lucene.apache.org/java/docs/"&gt;Lucene&lt;/a&gt; or more appropriately, &lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt;. With &lt;a href="http://www.lucidimagination.com/"&gt;Lucid Imagination&lt;/a&gt; now offering &lt;a href="http://www.businesswire.com/portal/site/google/?ndmViewId=news_view&amp;newsId=20090126005439&amp;newsLang=en"&gt;commercial support&lt;/a&gt;, it's hard to imagine (pun intended) why people would continue paying the six digit price tag tied to these proprietary search products.&lt;br /&gt;&lt;br /&gt;&lt;span class="fullpost"&gt;I'm not the only one noticing hesitation in the search market, &lt;a href="http://www.arnoldit.com/"&gt;Steve Arnold&lt;/a&gt; also blogged about &lt;a href="http://arnoldit.com/wordpress/2009/04/29/autonomy-thrives-in-lousy-economic-climate/"&gt;harsh times for commercial search vendors&lt;/a&gt; due to the economic downturn. With the current state of the economy it's even harder to imagine paying an inflated premium for enterprise search when you already have folks like Apple, Comcast, Netflix, CNET, IBM, AOL, LinkedIn, and MySpace (just to name a few) using Solr powered Lucene. This should surely provide some level of comfort about the validity of the technology itself.&lt;br /&gt;&lt;br /&gt;We've implemented search architectures and systems for just about every major player out there including The Dow Jones, The Associated Press, Disney, AOL, The Financial Times, McGraw Hill, Standard &amp; Poors, and Wolters Kluwer to name but a few. There aren't too many features you won't get from Solr at a fraction of the price. I think one of the limiting factors was definitely support and you'd see companies like AOL, who had a big enough internal IT staff, supporting it themselves. This strategy wasn't feasible for many organizations who wanted accountability and support.&lt;br /&gt;&lt;br /&gt;With Lucid paving the way, accountability is now a reality. I'm very excited to see open source search at a breaking point and we're in the process of forming a business relationship with Lucid to build out their professional services offerings (an area where we have extensive knowledge). &lt;br /&gt;&lt;br /&gt;We've also been quietly putting together our own Solr offering referred to internally as &lt;i&gt;Plasma&lt;/i&gt;. We've essentially taken Solr and polished it up in terms of user interfaces. We're providing a default search interface and some pre-configured settings that provide a shrink wrapped package that meets the needs of most general users. With this, we've done the work of integrating &lt;a href="http://lucene.apache.org/nutch/"&gt;Nutch&lt;/a&gt; as the default crawler for a complete "out of the box" solution.&lt;br /&gt;&lt;br /&gt;Semantic analysis is another area of interest for us. In fact, we spent the last year developing named entity taggers based on &lt;a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf"&gt;Conditional Random Fields&lt;/a&gt; [Lafferty, McCallum, Pereira]. We took a stochastic approach so we'd have the ability to deal with tougher entities like &lt;i&gt;Products&lt;/i&gt; for which no pattern can be derived.&lt;br /&gt;&lt;br /&gt;That work was recently integrated into Plasma and so we've brought entity extraction capabilities to Solr and we're offering that functionality as part of Plasma. Another really cool technology we've been working on is a component based content processing framework called &lt;i&gt;Pypes&lt;/i&gt;. The framework uses &lt;a href="http://en.wikipedia.org/wiki/Coroutine"&gt;coroutines&lt;/a&gt; to drive a sophisticated component model that eliminates the need for threading and all the nasty semantics (and bugs) that go along with it. &lt;a href="http://www-cs-faculty.stanford.edu/~knuth/"&gt;Dr. Knuth&lt;/a&gt; would be proud!&lt;br /&gt;&lt;br /&gt;Pypes is a product that evolved from necessity. Every enterprise search implementation is tasked with transforming content and moving it into the index. The idea of Pypes was taken from Unix pipes and the notion of breaking a complex task into smaller components that focus on one particular aspect. Pypes takes that ideology to the next level and allows users to create graph structures, making branching and merging of data possible (what you'd expect from an ETL application). Best of all, we're releasing the framework as an open source project and we'll be providing publishing components for Solr.&lt;br /&gt;&lt;br /&gt;I think every commercial search vendor out there should be worried. Open Source Enterprise Search is about to take search into all the various organizations that couldn't warrant the six digit price tag previously associated with enterprise search products. I'd imagine it's going to displace quite a few commercial vendor deployments along the way and in fact, I know of a few already.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5716516924669268476-1965983259715354157?l=gaumer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gaumer.blogspot.com/feeds/1965983259715354157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5716516924669268476&amp;postID=1965983259715354157' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/1965983259715354157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5716516924669268476/posts/default/1965983259715354157'/><link rel='alternate' type='text/html' href='http://gaumer.blogspot.com/2009/04/open-source-enterprise-search.html' title='Open Source Enterprise Search'/><author><name>Eric Gaumer</name><uri>http://www.blogger.com/profile/12100268913296713952</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
