Pypes provides a simple and robust scheduling algorithm. Each component is derived from a Stackless tasklet, an extremely lightweight thread of execution often referred to as a microthread. Components are then chained together in dependency order, allowing the output of one component to be pyped as input to another. But how does the scheduler handle branching and merging?
The answer is quite simple; it uses topological sorting. In graph theory, a topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. Every DAG has one or more topological sorts and every dataflow you design with pypes is a DAG. This also helps explain why pypes does not support Loop-Type Networks. Loops introduce cycles in the graph.
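As a minimal sketch of the idea (this is not the pypes source), Kahn's algorithm produces one such ordering for a small dataflow that branches and merges. The component names and the `dataflow` dictionary are hypothetical, chosen only for illustration.

```python
from collections import deque

def topological_sort(edges):
    """Return one topological ordering of a DAG given as {node: [successors]}."""
    # Count inbound edges for every node that appears anywhere in the graph.
    indegree = {node: 0 for node in edges}
    for successors in edges.values():
        for succ in successors:
            indegree[succ] = indegree.get(succ, 0) + 1

    # Nodes with no inbound edges have no unmet dependencies.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in edges.get(node, []):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    # Any node left unscheduled sits on a cycle, which a DAG cannot contain.
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical dataflow: "reader" branches to two components that merge at "indexer".
dataflow = {
    "reader": ["tokenizer", "classifier"],
    "tokenizer": ["indexer"],
    "classifier": ["indexer"],
    "indexer": [],
}
order = topological_sort(dataflow)
```

A graph with a loop never empties its dependency counts, so the function raises instead of returning an order, which is the same reason a loop cannot be scheduled this way.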
While this type of scheduling prohibits cycles, it allows us to achieve very high performance. Pypes was originally designed to process hundreds of millions of documents in preparation for indexing. Imagine we have 30 components designed to handle different processing aspects and data normalization. At just 1 million input documents we have created 30 million context switches. A context switch is the act of switching control from one component to another. Context switching accounts for a significant portion of the overall processing time, and the longer it takes, the lower the total throughput.
The topological scheduler in pypes never has to decide which component to run next. The ordering is defined by the dataflow you've created (the DAG) and the linear dependency order provided by the topological sort. The sort happens once and is reused throughout the entire execution. If we add the lightweight (and extremely fast) context switching that Stackless Python offers, we end up with a well-performing scheduler that's simple to understand as well as easy to maintain.
Of course others have provided benchmarks showing how well Stackless performs against other concurrency frameworks. The decision to use Stackless in pypes was based mainly on performance, and the astute reader will recognize that we're not achieving any true concurrency using Stackless. Rather, we're using a cooperative multitasking approach which fits the nature of flow-based programming. If component B depends on component A then it makes no sense to run B before A is complete. It also doesn't make sense to run A partially so we can run B partially. This might be a goal in a realtime situation such as playing an audio/video stream where we don't want to starve components, but pypes is all about throughput. Our goal is to minimize the number of context switches and to keep the scheduling semantics simple.
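To make the "decide once, reuse for every document" point concrete, here is a rough sketch in which plain function calls stand in for Stackless tasklets. The components, the packet fields, and the hand-written `SCHEDULE` list are all assumptions made for the example, not part of pypes.

```python
# Hypothetical components: each takes a packet (a dict) and updates it in place.
def reader(packet):
    packet["text"] = packet["raw"].strip()

def tokenizer(packet):
    packet["tokens"] = packet["text"].lower().split()

def classifier(packet):
    packet["is_long"] = len(packet["text"]) > 20

def indexer(packet):
    # Merge point: needs the output of both branches.
    packet["indexed"] = (tuple(packet["tokens"]), packet["is_long"])

# The dependency order is fixed once, up front (written by hand here).
SCHEDULE = [reader, tokenizer, classifier, indexer]

def run(packets):
    """Push every packet through the fixed schedule and count component switches."""
    switches = 0
    for packet in packets:
        for component in SCHEDULE:  # no per-step scheduling decision is made
            component(packet)
            switches += 1
    return switches

docs = [{"raw": "  Hello World  "}, {"raw": "Pypes is a flow-based framework"}]
switches = run(docs)
```

The switch count is simply documents times components, which is why the cost of each individual switch matters so much at scale.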
Of course this is not to say that pypes doesn't offer any true concurrency. Incoming documents are load balanced across an instance of the graph running on each CPU/core. If you have a dual core machine then you are essentially processing two documents in parallel. Typical use cases for us are quad Xeon machines running 4 instances of the graph in parallel (i.e., 4 documents being processed at once).
Pypes can also scale out very easily. Since pypes uses a REST architecture (i.e., use HTTP POST to submit content) you can place several machines behind a load balancer and requests will be executed across the cluster. This allows you to achieve very high throughput.
Sunday, September 27, 2009
Monday, August 17, 2009
Loop-Type Networks in Pypes
Michael Sparks, the author of Kamaelia, took a look at Pypes this weekend and sent out some tweets mentioning the similarities, limitations, and possible synergies with Kamaelia. I'm a huge fan of Kamaelia and so the similarities should not be too surprising. I like the style of message passing used in Kamaelia and the only real difference there is that I chose the Flow-Based Programming (FBP) terminology, calling them "ports" rather than "in/out boxes". At the same time, there are some fundamental differences both in design and objectives.
Pypes was designed with a specific goal in mind: to quickly process high volumes of data in a completely modular and service-oriented way. When I was first faced with this task, I naturally turned to Kamaelia but the "service-oriented" requirement proved to be a problem. It's possible that I just didn't understand Kamaelia well enough to address the issue but considering it uses generators, it inherently provides lazy data-flow.
By lazy data-flow, what I mean to say is that generators do not compute any data until it's asked for. This essentially means data is pulled through the system by creating components that, at some point, generate data. This is in contrast to eager data-flow models where data is pushed into the system without ever being requested.
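The difference between the two models can be sketched in a few lines of plain Python. Neither `lazy_source` nor `EagerSink` comes from Kamaelia or pypes; both are hypothetical names used only to show pull versus push.

```python
def lazy_source(items, log):
    """Generator: produces an item only when a consumer asks for it (pull)."""
    for item in items:
        log.append(f"produced {item}")
        yield item

class EagerSink:
    """Receives items pushed by a producer without having requested them."""
    def __init__(self):
        self.received = []

    def send(self, item):
        self.received.append(item)

log = []
stream = lazy_source(["a", "b", "c"], log)
produced_before_pull = list(log)   # creating the generator runs none of its body
first = next(stream)               # pulling one item advances it by one step

sink = EagerSink()
for item in ["x", "y"]:            # an external system pushes data in
    sink.send(item)
```

In the lazy case nothing happens until the consumer calls `next`, so something has to keep asking. In the eager case the producer decides when data arrives, which is the behavior a push-based service needs.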
Why does this matter? Pypes was originally designed to be a content processing framework capable of processing large volumes of data prior to being indexed for search (normalization, classification, information extraction, etc.). When it comes to indexing content, there are lots of systems/sources in which data resides (CMS, DB, Filesystem, Email, etc.). It's quite inefficient, and typically fragile, to poll these systems for new data. Pypes allows these systems to push new data out to a service oriented processing framework.
Another big concern here is that many of these systems have language specific APIs. Writing a component that pulls data from FileNet using Python isn't possible. I felt it was important that this "connector" functionality be isolated from the task of content processing. Pypes uses the notion of Adapters to adapt various incoming data formats to a unified data model called a Packet (using FBP terminology). Adapters can even consume batches of packets that are disassembled and streamed through the system.
Why not just create a linear pipeline? Quite simply, we wanted the ability to publish to multiple locations. An editor might create an article that needs to be indexed but also needs to be pushed out to some location as an Atom feed. At the same time, we might process packets containing certain attributes slightly differently from others, so the ability to branch helps solve this.
This leads me to the limitation that Michael mentioned regarding cycles. Everything I've described to this point is collectively referred to as Batch-Type Networks in FBP. This topology means that the network generally has a left-to-right/top-to-bottom flow, with packets being created on the left/top side and disposed of on the right/bottom side of the network.
When we introduce cycles then the topology changes to a Loop-Type Network. This sort of topology adds some additional complexity and overhead to the system that I just haven't wrapped my brain around yet. My ambition right now is focused more on building a richer component library and better documentation for Visual Design Studio.
I envision pypes (Visual Design Studio) as more of a mashup or ETL tool/framework, similar to Yahoo Pipes but with the ability to write custom components while also scaling to handle large volumes of data. I actually did a demo at a search conference a few months back that dealt completely with RSS feeds. The basic concept was to mash up a bunch of RSS content for indexing, which was a pretty trivial task for pypes. Visual Design Studio provides an interface that allows business users (non-developers) to wire up these components to produce different applications.
See the Python concurrency wiki for a simple example of pypes as well as a comparison of the same solution across several different frameworks. You can also see a few examples at pypes.org where we're in the process of adding more.
Sunday, August 16, 2009
Flow-Based Programming with Pypes
I finally had a chance to launch a beta release of pypes this week. Pypes is a framework for designing data flow applications using flow-based programming techniques. It originated from the necessity to process huge amounts of data prior to indexing. It was inspired by a wide range of concepts which resulted in it being a hybrid of several different techniques and/or architectural styles.
I would say that pypes most closely resembles J. Paul Morrison's Flow-Based architectural style as it satisfies what he refers to as the "three main legs" of FBP:
1. asynchronous processes
2. data packets with a lifetime of their own
3. external definition of connections
Pypes achieves its asynchronous behavior by leveraging Stackless Python. There's been a lot of buzz (and frameworks) designed around Python's generator syntax and I started down that path as well. I ultimately turned to Stackless because performance was a priority.
It's not uncommon to create data-flow graphs that range anywhere from 5 to 30 different components. In the same breath, it's not uncommon to process documents numbering in the millions. Ten million documents switching across 30 different components equates to a whole lot of context switching.
This is not to say that context switching between generators is necessarily slow, but task scheduling also plays a role. I personally haven't done any benchmarks in this area but others have.
In the context in which pypes was designed, I simply needed blazing fast performance with superior scalability. Pypes scales up using the (new) Python 2.6 multiprocessing module. Essentially an instance of the graph is placed on each CPU and/or core in order to achieve parallel processing. Incoming packets are then load balanced between instances using a multiprocessing Queue.
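The load-balancing arrangement described here can be approximated with the standard `multiprocessing` module. This is a sketch under stated assumptions rather than the pypes implementation: `process_packet` is a hypothetical stand-in for running a packet through one graph instance, and a `None` sentinel is used to shut the workers down.

```python
import multiprocessing as mp

def process_packet(packet):
    """Stand-in for running one packet through a graph instance (hypothetical)."""
    return packet.upper()

def worker(inbox, outbox):
    """Consume packets from the shared queue until the None sentinel arrives."""
    while True:
        packet = inbox.get()
        if packet is None:
            break
        outbox.put(process_packet(packet))

def run_parallel(packets, instances=None):
    """Start one worker per CPU (or `instances`) and balance packets across them."""
    instances = instances or mp.cpu_count()
    inbox, outbox = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(inbox, outbox)) for _ in range(instances)]
    for proc in procs:
        proc.start()
    for packet in packets:
        inbox.put(packet)          # whichever worker is free takes the next packet
    for _ in procs:
        inbox.put(None)            # one sentinel per worker
    results = [outbox.get() for _ in packets]  # drain results before joining
    for proc in procs:
        proc.join()
    return sorted(results)
```

Calling `run_parallel(["doc one", "doc two"], instances=2)` should be done under an `if __name__ == "__main__":` guard, since some platforms start workers by re-importing the main module. Because all workers read from the same queue, no component has to decide which instance gets which packet.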
Pypes scales out using REST (well actually Visual Data Studio adds this functionality on top of pypes). Visual Data Studio is a WSGI application that provides a REST interface to the underlying pypes framework. Users can simply POST documents to the system for processing (a very common use case in search indexing). VDS also provides a plugin architecture that allows users to drop new/custom components into the system (using Python's egg format) as well as templates (via Paste) for generating the boilerplate code necessary for creating custom components. Essentially, VDS is to pypes what SOLR is to Lucene.
It's one thing to achieve modularity at a code level but providing that same modularity for non-developers is a more difficult goal. This is where Morrison's idea of external definitions of connections is essential. Without this it becomes impossible to achieve true modularity. VDS/pypes leverages this concept to provide a completely visual development and design environment that allows business users to build business processing logic without the need for any programming experience whatsoever. This was another highly prioritized design goal considering business logic tends to fluctuate long after the original search architects and engineers have completed their work.
Documentation at this time is pretty scarce but I'm working on getting together an extensive set of examples and tutorials (including podcasts). In the meantime, feel free to email with any questions, concerns, or complaints and I'll be happy to address them. It's always interesting to see people using software in ways you never originally designed it for. Of course this also helps shape its future.
Monday, May 4, 2009
Bringing Entity Extraction to SOLR
We've done a proof-of-concept that brings named entity recognition (NER) capabilities to SOLR. Faceted navigation enhances the search experience and provides a more exploratory driven process, allowing the user to glean information about data they wouldn't typically get from a standard result set.
At the moment we're using machine learning techniques to extract names from the documents at index time. We've written some code that leverages a custom version of MALLET. Our approach to entity extraction is to view it as a classification problem which can then be solved using sequence labeling. Conditional random fields (CRFs) seem to be at the forefront of that research, improving on maximum entropy Markov models (MEMMs) and solving what is known as the label bias problem.
Our reasoning for using a more sophisticated approach was the potential ability to extract more difficult entities where grammar based approaches typically fail. Grammar based approaches also tend to be language dependent and cumbersome to build and maintain. Statistical methods require large volumes of annotated data but the process itself is simple and can be performed by novice developers who have no formal background in computational linguistics.
The extractor we built for the POC was modestly trained but the focus was more on integration points. The extracted names are indexed in a multi-valued field which we use to create a facet at query time.
Here's a screenshot showing the POC work we did. We then crawled a site about a sax player named Jon Smith. You can see the extracted names along with the facet counts displaying the number of occurrences each entity represents within the result set.
Sunday, May 3, 2009
Boolean Retrieval: What You're Not Being Told
Despite the wide-scale criticism by many researchers, boolean retrieval models continue to dominate the commercial search space. The long recognized limitations and inadequacies of boolean retrieval models seem to have had no discernible effect on the industry. Entities such as Google have conditioned users into believing their information needs are being met but past research proves otherwise.
In a study conducted by Blair & Maron in 1985 involving a 40,000 document case, lawyers estimated that their searches had found 75% of relevant documents, when in fact the research showed only 20% or so had actually been found. Our information needs have grown exponentially over the past few decades while search technology has remained relatively stagnant. Statistics such as these are particularly alarming from the perspective of e-discovery where justice is at stake.
So what exactly is boolean retrieval anyway? For starters, it is the basis by which most of you search for information. Boolean retrieval models use Boolean algebra to partition a set of documents into two distinct classes: documents which are relevant to the user's information needs and those which are not.
Allow me to start by providing some background knowledge and insight as to how boolean retrieval models operate.
Assume we have a set of four documents in a fictitious search engine D where D = { dA, dB, dC, dD }.
Now consider that each document in D contains tokens (words) from a finite set T where T = {t1, t2, t3, ..., tn }.
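This can be represented internally as a binary incidence matrix M. The rows for the terms used in the example queries are shown just below.
| | dA | dB | dC | dD |
| t1 | 1 | 1 | 0 | 1 |
| t2 | 0 | 1 | 0 | 1 |
| t3 | 1 | 0 | 1 | 0 |
| t5 | 0 | 0 | 0 | 1 |
| t7 | 0 | 1 | 1 | 1 |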
The idea here is simple: if a term ti appears in a document dj then we record a 1 in our incidence matrix where ti and dj intersect; otherwise we record a 0. For example, t1 appears in documents dA, dB, and dD but does not appear in document dC.
Once we have constructed our incidence matrix, issuing boolean queries becomes trivial. We apply boolean algebra to the term vectors (the rows of M) representing each query term. If a position in the resulting vector is a 1 then the document corresponding to that position is part of the result set R. Consider the following example queries.
Return all documents that contain the term (keyword) t3 would yield R = { dA, dC } because 1010 = 1010
Return all documents that contain the term t3 AND t7 would yield R = { dC } because 1010 AND 0111 = 0010
Return all documents that contain the term t2 OR t3 would yield R = { dA, dB, dC, dD } because 0101 OR 1010 = 1111
Return all documents that contain the term t7 AND t1 AND NOT t5 would yield R = { dB } because 0111 AND 1101 AND NOT 0001 = 0101 AND 1110 = 0100
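The four queries above can be reproduced with ordinary integer bit operations. This is only an illustrative sketch: the term vectors are copied from the examples, with dA as the leftmost bit, and `result_set` is a hypothetical helper.

```python
DOCS = ["dA", "dB", "dC", "dD"]
MASK = 0b1111  # one bit per document, dA is the leftmost bit

# Term vectors taken from the example queries above.
t1, t2, t3, t5, t7 = 0b1101, 0b0101, 0b1010, 0b0001, 0b0111

def result_set(vector):
    """Map a 4-bit result vector back to document names."""
    width = len(DOCS)
    return [doc for i, doc in enumerate(DOCS) if vector & (1 << (width - 1 - i))]

only_t3 = result_set(t3)
t3_and_t7 = result_set(t3 & t7)
t2_or_t3 = result_set(t2 | t3)
# NOT must be masked so Python's unbounded integers stay 4 bits wide.
t7_t1_not_t5 = result_set(t7 & t1 & (~t5 & MASK))
```

Each query reduces to a handful of bitwise operations, which is a large part of why the boolean model is so cheap to evaluate.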
From this simple example you should now have a very basic understanding of how boolean retrieval models operate. With that, it should be evident that any given document represented in the incidence matrix is either part of a particular result set or not; either relevant or not relevant to the user's information need.
In the real world however, relevance is not so black and white. Information typically has a degree of relevance and cannot simply be partitioned into two distinct sets.
The astute reader might argue that typical search engines rank documents by weighting metrics that are applied to each document in the result set. This is typically referred to as extended Boolean retrieval and it still doesn't address the primary concern.
What this does is shift responsibility from the designer to the user. It requires users to have sophisticated knowledge about boolean operators. At the same time, the search engine ranking heuristics have to be tuned in such a way as to align themselves with the expectations of the user. The problem is that users' expectations will vary from one user to another. The search engine might be configured to boost (favor) documents that contain matching query terms in the title (as opposed to the document body for instance). Can such rigid constraints possibly meet the needs of every user?
Google has had some success with their PageRank algorithm which essentially provides a popularity score to each document to help determine overall relevance. Does popularity really have any relation to relevance? A document can be extremely relevant regardless of its popularity and one has to wonder just how effective this approach is.
To complicate matters, algorithms like PageRank don't really apply to information residing outside the World Wide Web (hypermedia in general). As organizations around the globe continue to store larger volumes of data, search is becoming more commonplace. Just how effective are current techniques used within enterprise search? Are the current ranking algorithms sufficient to meet the growing demands of information seekers?
To provide yet another argument, consider a document titled: The Guide to domestic car repair for all models excluding Ford. For argument's sake, let's assume we have a user whose information need involves domestic car repair for any model with the exception of Ford. A reasonable user would construct a Boolean query such as domestic cars AND NOT ford.
One might be surprised to find out that the document above is excluded from the result set. The problem here is that the term Ford appears in the title so the boolean NOT operator forces exclusion. Of course no amount of rank tuning can prevent this from happening. In a controlled test environment these issues become evident but in searching larger systems of potentially unexplored data, these sorts of relevance problems go unnoticed. I would speculate that this is one indication of why Boolean retrieval models are still common in the enterprise search space.
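A few lines of Python show the exclusion happening. The `matches` function is a hypothetical, deliberately naive boolean matcher, and the query terms assume that "cars" has already been stemmed to "car".

```python
def tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())

def matches(doc, required, excluded):
    """Boolean match: every required term present and no excluded term present."""
    words = tokens(doc)
    return required <= words and not (excluded & words)

title = "The Guide to domestic car repair for all models excluding Ford"

# domestic AND car AND NOT ford
hit = matches(title, required={"domestic", "car"}, excluded={"ford"})
# The same query without the NOT clause.
hit_without_not = matches(title, required={"domestic", "car"}, excluded=set())
```

The document satisfies the positive part of the query, yet the single occurrence of "ford" removes it from the result set before any ranking takes place.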
So what options are available? Luckily these very same questions and concerns have sparked new life into research based on Probabilistic Information Retrieval. Probabilistic IR models predict the probability that a given document will be relevant to a given query and then rank documents according to their probability of relevance. Documents are no longer considered to be simply relevant or not; rather, they have a degree of relevance.
There's a lot of research to support this type of model and it's actually been around for decades. If you're more pragmatic by nature then I suggest giving Xapian a try. It's an open source, free search engine you can download and use. It incorporates a probabilistic relevance model (Okapi BM25) to score documents and it's used to drive several high volume, high profile sites on the Internet including Debian and Gmane to name just a few.
Another exciting area of research is the use of language models for information retrieval. There's an ongoing cooperative effort between the University of Massachusetts and Carnegie Mellon University to build language modeling information retrieval tools. In particular they have developed Indri, an IR system that merges ideas from Bayesian inference networks and statistical language modeling approaches. Indri is also free to download and use in both academic and commercial environments (and of course, personal exploration).
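For readers who want to see what a BM25 score looks like, here is a small self-contained sketch of the textbook formula. It is not Xapian's implementation: the `k1` and `b` defaults are common choices rather than Xapian's, the IDF uses the non-negative "+1" variant, and the tiny corpus is made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "domestic car repair guide".split(),
    "ford truck repair".split(),
    "gardening tips".split(),
]
scores = bm25_scores(["car", "repair"], corpus)
```

Instead of a yes/no partition, every document receives a graded score, so a partial match ranks below a full match rather than disappearing.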
In a study conducted by Blair & Maron in 1985 involving a 40,000 document case, lawyers estimated that a manual search would find 75% of relevant documents, when in fact the research showed only 20% or so had actually been found. Our information needs have grown exponentially over the past few decades while search technology has remained relatively stagnant. Statistics such as these are particularly alarming from the perspective of e-discovery where justice is at stake.
So what exactly is boolean retrieval anyway? For starters, it the basis by which most of you search for information. Boolean retrieval models use Boolean algebra to partition a set of documents into two distinct classes; documents which are relevant to the user's information needs and those which are not.
Allow me to start by providing some background knowledge and insight as to how boolean retrieval models operate.
Assume we have a set of four documents in a fictitious search engine D where D = { dA, dB, dC, dD }.
Now consider that each document in D contains tokens (words) from a finite set T where T = {t1, t2, t3, ..., tn }.
This can be represented internally as a binary incidence matrix M where M is shown just below.
| dA | dB | dC | dD | |
| t1 | ||||
| t2 | ||||
| t3 | ||||
| t4 | ||||
| t5 | ||||
| t6 | ||||
| t7 | ||||
| ... | ||||
| tn | ||||
The idea here is simple; if a term tx appears in a document dx then we record a 1 in our incidence matrix where dx and tx intersect, otherwise we record a 0. For example, t1 appears in documents dA, dB, and dD but does not appear in document dC.
Once we have constructed our incidence matrix, issuing boolean queries becomes trivial. We apply boolean algebra to the term vectors representing each query term. If the resulting column is a 1 then the document corresponding to that column is part of the result set R. Consider the following example queries.
Return all documents that contain the term (keyword) t3 would yield R = { dA, dC } because 1010 = 1010
Return all documents that contain the term t3 AND t7 would yield R = { dC } because 1010 AND 0111 = 0010
Return all documents that contain the term t2 OR t3 would yield R = { dA, dB, dC, dD } because 0101 OR 1010 = 1111
Return all documents that contain the term t7 AND t1 AND NOT t5 would yield R = { dB } because 0111 AND 1101 AND NOT 0001 = 0101 AND 1110 = 0100
From this simple example you should now have a very basic understanding of how boolean retrieval models operate. With that, it should be evident that any given document represented in the incidence matrix is either part of a particular result set or not; either relevant or not relevant to the user's information need.
In the real world however, relevance is not so black and white. Information typically has a degree of relevance and cannot simply be partitioned into two distinct sets.
The astute reader might argue that typical search engines rank documents by weighting metrics that are applied to each document in the result set. This is typically referred to as extended Boolean retrieval and it still doesn't address the primary concern.
What this does is shift responsibility from the designer to the user. It requires users to have sophisticated knowledge about boolean operators. At the same time, the search engine ranking heuristics have to be tuned in such a way as to align themselves with the expectations of the user. The problem is that users expectations will vary from one user to another. The search engine might be configured to boost (favor) documents that contain matching query terms in the title (as opposed to the document body for instance). Can such rigid constraints possibly meet the needs of every user?
Google has had some success with their PageRank algorithm which essentially provides a popularity score to each document to help determine overall relevance. Does popularity really have any relation to relevance? A document can be extremely relevant regardless of its popularity and one has to wonder just how effective this approach is.
To complicate matters, algorithms like PageRank don't really apply to information residing outside the World Wide Web (hypermedia in general). As organizations around the globe continue to store larger volumes of data, search is becoming more commonplace. Just how effective are current techniques used within enterprise search? Are the current ranking algorithms sufficient to meet the growing demands of information seekers?
To provide yet another argument, consider a document titled: The Guide to domestic car repair for all models excluding Ford. For argument's sake, let's assume we have a user whose information need involves domestic car repair for any model with the exception of Ford. A reasonable user would construct a Boolean query such as domestic cars AND NOT ford.
One might be surprised to find out that the document above is excluded from the result set. The problem here is that the term Ford appears in the title, so the boolean NOT operator forces exclusion. Of course, no amount of rank tuning can prevent this from happening. In a controlled test environment these issues become evident, but when searching larger systems of potentially unexplored data, these sorts of relevance problems go unnoticed. I would speculate that this is one reason why Boolean retrieval models are still common in the enterprise search space.
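The exclusion is easy to reproduce. The sketch below is only an illustration under a few assumptions: the second document is hypothetical, matching is done on lowercased whitespace tokens with no stemming, and the query uses the singular "car" so that it matches the title without a stemmer.

```python
# Tiny collection: d1 is the title from the text, d2 is a hypothetical document.
docs = {
    "d1": "The Guide to domestic car repair for all models excluding Ford",
    "d2": "Domestic car repair for Chevrolet owners",
}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming, no stop words)."""
    return set(text.lower().split())


def boolean_search(docs, required, excluded):
    """Return ids of documents containing all required and no excluded terms."""
    hits = []
    for doc_id, title in docs.items():
        terms = tokenize(title)
        if required <= terms and not (excluded & terms):
            hits.append(doc_id)
    return hits


# domestic AND car AND NOT ford
print(boolean_search(docs, {"domestic", "car"}, {"ford"}))
```

Only d2 comes back. The NOT operator removes d1 before any ranking takes place, so there is nothing left for a scoring function to promote.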
So what options are available? Luckily these very same questions and concerns have sparked new life into research based on Probabilistic Information Retrieval. Probabilistic IR models predict the probability that a given document will be relevant to a given query, then rank documents according to their probability of relevance. Documents are no longer considered to be simply relevant or not; rather, they have a degree of relevance.
There's a lot of research to support this type of model and it's actually been around for decades. If you're more pragmatic by nature then I suggest giving Xapian a try. It's a free, open source search engine you can download and use. It incorporates a probabilistic relevance model (Okapi BM25) to score documents and it's used to drive several high-volume, high-profile sites on the Internet, including Debian and Gmane to name just a few.
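To give a feel for what "degree of relevance" means in practice, here is a rough sketch of the textbook Okapi BM25 formula. It is not Xapian's implementation or API; the function name, the sample documents, the smoothed IDF variant, and the parameter values k1=1.2 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it positive for common terms).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents.
sample = [
    "domestic car repair guide",
    "car repair for domestic and foreign models",
    "gardening tips for beginners",
]
for doc, score in zip(sample, bm25_scores("car repair", sample)):
    print(round(score, 3), doc)
```

Instead of a yes/no answer, every document receives a score, and the result list is simply the documents sorted by that score.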
Another exciting area of research is the use of language models for information retrieval. There's an ongoing cooperative effort between the University of Massachusetts and Carnegie Mellon University to build language modeling information retrieval tools. In particular they have developed Indri, an IR system that merges ideas from Bayesian inference networks and statistical language modeling approaches. Indri is also free to download and use in both academic and commercial environments (and of course, personal exploration).
Thursday, April 30, 2009
Open Source Enterprise Search
With the 2009 Infonortics Search Meeting concluded, it's even more apparent that enterprise search (search in general) has become a commodity. With most of the industry leaders showcasing their offerings, I had a hard time differentiating the choices. They all seemed to offer some variation of semantic analysis and faceted browsing has become a standard search feature as well.
More importantly, I didn't see many commercial value add-ons that I can't get from Lucene or, more appropriately, Solr. With Lucid Imagination now offering commercial support, it's hard to imagine (pun intended) why people would continue paying the six-digit price tag tied to these proprietary search products.
I'm not the only one noticing hesitation in the search market; Steve Arnold also blogged about harsh times for commercial search vendors due to the economic downturn. With the current state of the economy it's even harder to imagine paying an inflated premium for enterprise search when you already have folks like Apple, Comcast, Netflix, CNET, IBM, AOL, LinkedIn, and MySpace (just to name a few) using Lucene-powered Solr. This should surely provide some level of comfort about the validity of the technology itself.
We've implemented search architectures and systems for just about every major player out there including Dow Jones, The Associated Press, Disney, AOL, The Financial Times, McGraw Hill, Standard & Poor's, and Wolters Kluwer to name but a few. There aren't too many features you won't get from Solr at a fraction of the price. I think one of the limiting factors was definitely support, and you'd see companies like AOL, who had a big enough internal IT staff, supporting it themselves. This strategy wasn't feasible for many organizations that wanted accountability and support.
With Lucid paving the way, accountability is now a reality. I'm very excited to see open source search at a tipping point and we're in the process of forming a business relationship with Lucid to build out their professional services offerings (an area where we have extensive knowledge).
We've also been quietly putting together our own Solr offering, referred to internally as Plasma. We've essentially taken Solr and polished it up in terms of user interfaces. We're providing a default search interface and some pre-configured settings, resulting in a shrink-wrapped package that meets the needs of most general users. On top of this, we've done the work of integrating Nutch as the default crawler for a complete "out of the box" solution.
Semantic analysis is another area of interest for us. In fact, we spent the last year developing named entity taggers based on Conditional Random Fields [Lafferty, McCallum, Pereira]. We took a stochastic approach so we'd have the ability to deal with tougher entities like Products for which no pattern can be derived.
That work was recently integrated into Plasma, bringing entity extraction capabilities to Solr. Another really cool technology we've been working on is a component-based content processing framework called Pypes. The framework uses coroutines to drive a sophisticated component model that eliminates the need for threading and all the nasty semantics (and bugs) that go along with it. Dr. Knuth would be proud!
Pypes is a product that evolved from necessity. Every enterprise search implementation is tasked with transforming content and moving it into the index. The idea of Pypes was taken from Unix pipes and the notion of breaking a complex task into smaller components that focus on one particular aspect. Pypes takes that ideology to the next level and allows users to create graph structures, making branching and merging of data possible (what you'd expect from an ETL application). Best of all, we're releasing the framework as an open source project and we'll be providing publishing components for Solr.
I think every commercial search vendor out there should be worried. Open Source Enterprise Search is about to take search into all the various organizations that couldn't justify the six-digit price tag previously associated with enterprise search products. I'd imagine it's going to displace quite a few commercial vendor deployments along the way and, in fact, I know of a few already.