CSharp: A simple Lucene.Net indexing and search class

From CodeMinima <Code Concised>

Description

This code sample demonstrates a simple class that indexes arbitrary text strings and provides a method for searching the resulting index for a particular piece of text.

Prerequisites

  1. C# compiler with any plain-text editor such as Notepad, or
  2. Visual Studio Express with C# compiler, or
  3. Visual Studio.Net or better with C# compiler.

Prior preparations

  1. The Lucene.Net assembly DLL has been added to the project. You can download the Lucene.Net assembly from the official Lucene home page at apache.org. To learn how to add an assembly reference to your project, see CSharp: How to add an assembly reference to your Visual Studio project.

Code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Store;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;

class MyLuceneIndexer {
    // Name of the field that stores the caller-supplied document id.
    private const string DOC_ID_FIELD_NAME = "ID_FIELD";

    private string _fieldName;
    private string _indexDir;

    public MyLuceneIndexer (string indexDir, string fieldName) {
        _indexDir = indexDir;
        _fieldName = fieldName;
    }

    /// <summary>
    /// This method indexes the content that is sent across to it. Each piece of content (or "document")
    /// that is indexed has to have a unique identifier (so that the caller can take action based on the
    /// document id). Therefore, this method accepts key-value pairs in the form of a dictionary. The key
    /// is a ulong which uniquely identifies the string to be indexed. The string itself is the value
    /// within the dictionary for that key. Be aware that stop words (like the, this, at, etc.) are _not_
    /// indexed.
    /// </summary>
    /// <param name="txtIdPairToBeIndexed">A dictionary of key-value pairs that are sent by the caller
    /// to uniquely identify each string that is to be indexed.</param>
    /// <returns>The number of documents indexed.</returns>
    public int Index (Dictionary<ulong, string> txtIdPairToBeIndexed) {
        // The third argument (true) creates a new index, replacing any existing one.
        IndexWriter indexWriter = new IndexWriter (_indexDir, new StandardAnalyzer (), true);
        try {
            indexWriter.SetUseCompoundFile (false);

            foreach (KeyValuePair<ulong, string> pair in txtIdPairToBeIndexed) {
                Document document = new Document ();
                // The text itself: stored and tokenized so that it can be searched word by word.
                Field bodyField = new Field (_fieldName, pair.Value, Field.Store.YES, Field.Index.TOKENIZED);
                document.Add (bodyField);
                // The caller's id: stored so that it can be read back from each hit.
                Field idField = new Field (DOC_ID_FIELD_NAME, pair.Key.ToString (), Field.Store.YES, Field.Index.TOKENIZED);
                document.Add (idField);
                indexWriter.AddDocument (document);
            }

            int numIndexed = indexWriter.DocCount ();
            indexWriter.Optimize ();
            return numIndexed;
        } finally {
            // Close the writer even if indexing fails, so the index lock is released.
            indexWriter.Close ();
        }
    }

    /// <summary>
    /// This method searches for the search term passed by the caller.
    /// </summary>
    /// <param name="searchTerm">The search term as a string that the caller wants to search for within the
    /// index as referenced by this object.</param>
    /// <param name="ids">An out parameter that is populated by this method for the caller with document ids.</param>
    /// <param name="results">An out parameter that is populated by this method for the caller with document text.</param>
    /// <param name="scores">An out parameter that is populated by this method for the caller with document scores.</param>
    public void Search (string searchTerm, out ulong[] ids, out string[] results, out float[] scores) {
        IndexSearcher indexSearcher = new IndexSearcher (_indexDir);
        try {
            // Use the same analyzer as at indexing time so that query terms are normalized identically.
            QueryParser queryParser = new QueryParser (_fieldName, new StandardAnalyzer ());
            Query query = queryParser.Parse (searchTerm);
            Hits hits = indexSearcher.Search (query);
            int numHits = hits.Length ();

            ids = new ulong[numHits];
            results = new string[numHits];
            scores = new float[numHits];

            for (int i = 0; i < numHits; ++i) {
                Document doc = hits.Doc (i);
                ids[i] = UInt64.Parse (doc.Get (DOC_ID_FIELD_NAME));
                results[i] = doc.Get (_fieldName);
                scores[i] = hits.Score (i);
            }
        } finally {
            indexSearcher.Close ();
        }
    }
}

class Program {
    static void Main (string[] args) {
        string indexDir = @"C:\Lucene\";
        string fieldName = "TEXT_MATTER";
        MyLuceneIndexer indexer = new MyLuceneIndexer (indexDir, fieldName);

        string txt1 = "Patience and faith is what the sea teaches.";
        string txt2 = "Nothing happens until something moves.";
        string txt3 = "Behold the turtle. He makes progress only when he sticks his neck out.";
        string txt4 = "All that we need to make us happy is something to be enthusiastic about.";
        string txt5 = "Nothing in this world can take the place of persistence.";

        Dictionary<ulong, string> contentIdPairs = new Dictionary<ulong, string> ();
        contentIdPairs.Add (1, txt1);
        contentIdPairs.Add (3, txt2);
        contentIdPairs.Add (5, txt3);
        contentIdPairs.Add (7, txt4);
        contentIdPairs.Add (9, txt5);

        // Indexing:
        int numIndexed = indexer.Index (contentIdPairs);
        Console.WriteLine ("Indexed {0} docs.", numIndexed);
        Console.WriteLine ();

        // Searching: the same steps are run for each of the five search terms.
        string[] searchTerms = { "patience", "something", "happy turtle", "\"happy turtle\"", "that" };

        foreach (string searchTerm in searchTerms) {
            ulong[] ids;
            string[] results;
            float[] scores;

            Console.WriteLine ("Searching for the term \"{0}\"...", searchTerm);
            indexer.Search (searchTerm, out ids, out results, out scores);
            int numHits = ids.Length;
            Console.WriteLine ("Number of hits == {0}.", numHits);
            for (int i = 0; i < numHits; ++i) {
                Console.WriteLine ("{0}) Doc-id: {1}; Content: \"{2}\" with score {3}.", i + 1, ids[i], results[i], scores[i]);
            }
            Console.WriteLine ();
        }
    }
}

Output

Indexed 5 docs.

Searching for the term "patience"...
Number of hits == 1.
1) Doc-id: 1; Content: "Patience and faith is what the sea teaches." with score 0.8383772.

Searching for the term "something"...
Number of hits == 2.
1) Doc-id: 3; Content: "Nothing happens until something moves." with score 0.6609862.
2) Doc-id: 7; Content: "All that we need to make us happy is something to be enthusiastic about." with score 0.472133.

Searching for the term "happy turtle"...
Number of hits == 2.
1) Doc-id: 7; Content: "All that we need to make us happy is something to be enthusiastic about." with score 0.2117222.
2) Doc-id: 5; Content: "Behold the turtle. He makes progress only when he sticks his neck out." with score 0.1693778.

Searching for the term ""happy turtle""...
Number of hits == 0.

Searching for the term "that"...
Number of hits == 0.

Explanation

At the most basic level, the class MyLuceneIndexer exposes two public methods, Index() and Search(). As their names imply, they index the text passed by the calling code and search the index for a given term, respectively.

The Index() method takes a dictionary as its parameter. Why? Because for each piece of text that it indexes, it expects the caller to supply a unique id under which to save it. This is very similar to what happens in web-based search engines: a search for a particular term returns not just the matching text on the results page, but also the URL that uniquely identifies the page where that text occurs. In our case we simply use an unsigned long to identify each piece of text to be indexed. We pass these pairs of unsigned longs (keys) and strings (values) as a dictionary to the Index() method. We use the following five adages or proverbs for indexing:

  1. Patience and faith is what the sea teaches.
  2. Nothing happens until something moves.
  3. Behold the turtle. He makes progress only when he sticks his neck out.
  4. All that we need to make us happy is something to be enthusiastic about.
  5. Nothing in this world can take the place of persistence.

Within the Index() method itself, we walk through the dictionary, and for each key we retrieve the value (the string to be indexed) and index it. Each piece of text to be stored in the Lucene index is represented by a Document object. A Document holds fields whose names are chosen by the program. Thus, we save the body of the string itself in a field named TEXT_MATTER, and the id of that string in a field named ID_FIELD. (These fields come in handy during search.) The fields are added to the Document object, and the Document object is then added to the IndexWriter, which indexes the content according to the field settings.
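
The listing indexes the id field with Field.Index.TOKENIZED, which works here because the ids are plain numbers. As a minimal sketch, assuming the Lucene.Net 2.x API that the listing appears to target, the id could instead be indexed as a single untokenized term with Field.Index.UN_TOKENIZED, so that the analyzer never alters it. The DocumentFactory class and its Create method are hypothetical names used only for illustration.

```csharp
using Lucene.Net.Documents;

// Hypothetical helper, not part of the original listing.
static class DocumentFactory {
    // Builds one Lucene document from a caller-supplied id and its text.
    public static Document Create (string fieldName, string idFieldName, ulong id, string text) {
        Document document = new Document ();
        // Body: stored and tokenized so that individual words are searchable.
        document.Add (new Field (fieldName, text, Field.Store.YES, Field.Index.TOKENIZED));
        // Id: stored and indexed as one exact term (assumes Lucene.Net 2.x, where UN_TOKENIZED exists).
        document.Add (new Field (idFieldName, id.ToString (), Field.Store.YES, Field.Index.UN_TOKENIZED));
        return document;
    }
}
```

Inside Index(), the loop body would then reduce to a call such as indexWriter.AddDocument (DocumentFactory.Create (_fieldName, DOC_ID_FIELD_NAME, id, text)).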

During the search process, the Search() method returns results for the search term passed as a string. The various search terms yield different numbers of hits, as can be seen in the output. The Search() method defines four parameters: the search term itself, followed by three out parameters that receive the ids, the result strings and their scores, respectively. The caller reads the results from these out arguments, which are parallel arrays: element i of each array describes the same hit.
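
Three parallel out arrays can be awkward to pass around. The following sketch shows one possible way for a caller to merge them into a single list of result objects. SearchResult and SearchResults.Combine are hypothetical names introduced here; they are not part of the original class or of Lucene.Net.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one hit: the id, the stored text and the score.
struct SearchResult {
    public ulong Id;
    public string Text;
    public float Score;
}

static class SearchResults {
    // Merges the three parallel arrays filled by Search() into one list.
    public static List<SearchResult> Combine (ulong[] ids, string[] results, float[] scores) {
        if (ids.Length != results.Length || ids.Length != scores.Length) {
            throw new ArgumentException ("The three arrays must have the same length.");
        }

        List<SearchResult> combined = new List<SearchResult> (ids.Length);
        for (int i = 0; i < ids.Length; ++i) {
            SearchResult hit = new SearchResult ();
            hit.Id = ids[i];
            hit.Text = results[i];
            hit.Score = scores[i];
            combined.Add (hit);
        }
        return combined;
    }
}
```

A caller would invoke indexer.Search (term, out ids, out results, out scores) as before and then iterate over SearchResults.Combine (ids, results, scores).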

Two things stand out in the output. The first two search terms (patience and something) return the strings one would expect, but the last three behave less obviously. While happy turtle returns two results, each containing just one of the two terms, the search term "happy turtle" (with the quotes) returns zero hits. Why? This is very similar to what is known in search engine parlance as broad match and exact match. The unquoted search term happy turtle produces a broad match, because the query parser combines unquoted terms with OR by default, so a document containing either term is returned. The quoted search term "happy turtle" is parsed as a phrase and produces an exact match; since the phrase "happy turtle" does not occur in exactly that form in any of the indexed strings, we get zero hits.
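
The difference between the two kinds of match can be illustrated without Lucene at all. The sketch below is a toy model only, written with plain string operations; it is not how Lucene.Net evaluates queries internally, and MatchDemo, Tokenize, BroadMatch and PhraseMatch are hypothetical names.

```csharp
using System;
using System.Linq;

// Toy model of broad versus exact (phrase) matching; not Lucene's implementation.
static class MatchDemo {
    // Lowercases the text and splits it on spaces and simple punctuation.
    public static string[] Tokenize (string text) {
        return text.ToLowerInvariant ()
                   .Split (new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Broad match: at least one query term occurs somewhere in the text.
    public static bool BroadMatch (string text, string query) {
        string[] tokens = Tokenize (text);
        return Tokenize (query).Any (term => tokens.Contains (term));
    }

    // Exact match: all query terms occur consecutively and in order.
    public static bool PhraseMatch (string text, string query) {
        string[] tokens = Tokenize (text);
        string[] terms = Tokenize (query);
        if (terms.Length == 0) {
            return false;
        }
        for (int start = 0; start + terms.Length <= tokens.Length; ++start) {
            if (tokens.Skip (start).Take (terms.Length).SequenceEqual (terms)) {
                return true;
            }
        }
        return false;
    }
}
```

Applied to the fourth proverb, BroadMatch with the query happy turtle succeeds because happy is present, while PhraseMatch fails because turtle does not directly follow happy.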

The last search term, that, occurs in the fourth phrase that we indexed (All that we need to make us happy is something to be enthusiastic about.). Yet the Search() method did not return any results for this term. The reason is that StandardAnalyzer discards common English stop words, such as the, that, this, at, and so forth, so they never reach the index. The same analyzer is applied to the search term, which removes that from the query as well.
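
To see which words are treated as stop words, the default list can be printed. This is a minimal sketch that assumes Lucene.Net 2.x, where StopAnalyzer.ENGLISH_STOP_WORDS is exposed as a string array; later versions changed this member, so check the version you have referenced. The StopWordList class is a hypothetical name.

```csharp
using System;
using Lucene.Net.Analysis;

// Hypothetical console program, separate from the listing above.
class StopWordList {
    static void Main () {
        // Assumes Lucene.Net 2.x: ENGLISH_STOP_WORDS is a string[] of the default English stop words.
        foreach (string word in StopAnalyzer.ENGLISH_STOP_WORDS) {
            Console.WriteLine (word);
        }
    }
}
```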

Additional notes

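The third argument in the IndexWriter constructor is a boolean. If it is true, a new index is created, replacing any index that already exists in that directory; if it is false, documents are appended to the existing index. Because Index() always passes true, every call rebuilds the index from scratch.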
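
If documents should be added to an existing index instead of replacing it, the flag can be chosen at run time. The following is a sketch under the assumption that the referenced Lucene.Net 2.x assembly provides the static IndexReader.IndexExists (string) method; IndexWriterFactory and OpenForAppend are hypothetical names.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

// Hypothetical helper, not part of the original listing.
static class IndexWriterFactory {
    // Opens a writer that appends to an existing index, or creates one if none exists yet.
    public static IndexWriter OpenForAppend (string indexDir) {
        // Assumes Lucene.Net 2.x: IndexExists reports whether the directory already holds an index.
        bool create = !IndexReader.IndexExists (indexDir);
        return new IndexWriter (indexDir, new StandardAnalyzer (), create);
    }
}
```

With such a helper, repeated calls to Index() would accumulate documents, so the caller would become responsible for not indexing the same id twice.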

See also

Lucene at Apache
CSharp: How to add an assembly reference to your Visual Studio project

Further reading

Wikipedia article on Lucene.net
CodeMinima blog article on Lucene.Net

Author link

Najeeb (talk)
