Incremental searching using LINQ to XML

I have a requirement in the current project to provide a way of converting a title entered in an edit box into a corresponding code number. The customer's current implementation is page based (web style): you enter words and click search, and after a couple of seconds the results appear in a paged grid when there are too many to display at once. The search also matches ANY of the terms, and doesn't match on partial words. The time constraints on the telephone operators using this system are quite harsh, and any delay in finding the correct codes is a real problem.

I think I can provide something much better than the default searching capabilities, which will improve our customer's experience and provide a better service for their clients.

I've decided that a real-time incremental search will give the best results, and since my current project is on .NET 3.5, this seemed as good a time as any to dive in and figure out how to use LINQ properly. I've played with the 101 samples but haven't really had a chance to apply anything I've learned in my own projects.

So, the requirement is to provide incremental matching of multiple partial terms across an XML file with 28,000 nodes, where we're interested in the TITLE and CODE elements. I also want to match on all of the terms entered, so that each extra term narrows the set of matches rather than widening it.

So, for the first spike, I used an XPath query and a loop to extract the nodes with matching titles. I needed to get a feel for how fast the search would be given we have 28,000 nodes to parse!


Here's the first pass at getting it working:

    listBox1.Items.Clear();
    XPathDocument xpd = new XPathDocument(@"f:text.xml");
    string[] terms = textBox1.Text.Trim().ToLower().Split(new char[] { ' ' });
    if (terms.Length < 1 || terms[0].Length < 1)
        return;
    StringBuilder sb = new StringBuilder();
    for (int c = terms.Length; c != 0; c--)
    {
        sb.Append("contains(translate(TITLE,'ABCDEFGHIJKLMNOPQRSTUVWXYZ'," +
                    "'abcdefghijklmnopqrstuvwxyz'),'");
        sb.Append(terms[c - 1]);
        sb.Append("')");
        if (c - 1 > 0)
            sb.Append(" and ");
    }
    XPathNavigator xpn = xpd.CreateNavigator();
    XPathNodeIterator iterator = xpn.Select("results/row[" + sb.ToString() + "]");
    int resultCount = 0;
    while (iterator.MoveNext() && resultCount++ < 100)
    {
        listBox1.Items.Add(iterator.Current.SelectSingleNode("TITLE").Value);
    }

Since we're not going to edit the XML, we're using an XPathDocument for speed. It also appears that the .NET implementation of XPath doesn't support the lower-case() function (that's XPath 2.0), so I have to use the translate() trick, passing upper- and lower-case alphabets as parameters, so that we only ever match against lower-case strings.

Note also that we need a loop which appends the AND conjunction for however many search terms have been passed in. Using AND means the search narrows the more terms you enter, unlike the customer's current implementation, which searches for each term in isolation and so increases the matches the more terms you enter!
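
To make that concrete, if an operator typed the (purely hypothetical) terms "heart failure", the loop above would build an XPath expression along these lines:

    results/row[contains(translate(TITLE,'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                    'abcdefghijklmnopqrstuvwxyz'),'failure')
                and contains(translate(TITLE,'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                    'abcdefghijklmnopqrstuvwxyz'),'heart')]

(The last term appears first because the loop counts down through the array, but since the predicates are ANDed together the order doesn't matter.)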

Using the performance counters, I timed searches for strings at both extremes of the file.
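
The timing code itself isn't shown here; a minimal sketch using System.Diagnostics.Stopwatch (which wraps the high-resolution performance counter) might look like this, where RunSearch() is just a hypothetical wrapper around the search code above:

    // Requires: using System.Diagnostics;
    Stopwatch sw = Stopwatch.StartNew();
    RunSearch();   // hypothetical method containing the XPath (or LINQ) search above
    sw.Stop();
    Console.WriteLine("Search took {0:0.000}s", sw.Elapsed.TotalSeconds);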

Searching for titles near the front of the list gives an average of ~0.06s

Searching for titles near the bottom of the list averages ~0.1s

Not too shabby for a 17 MB file! Now that I have a rough working version, I can create a LINQ to XML version and compare. There's no point in switching to LINQ if it lowers the performance!

So, after a couple of passes up the LINQ learning curve, I ended up with the version below.

    string[] terms = textBox1.Text.Trim().ToLower().Split(new char[] { ' ' });
    if (terms.Length < 1 || terms[0].Length < 1)
        return;
    listBox1.ItemsSource = (
        from row in titles.Elements()
        where (terms.All(s => row.Element("TITLE").Value.ToLower().Replace("'", "").Contains(s)))
        select string.Format("{0} : {1} ", row.Element("CODE").Value, row.Element("TITLE").Value)).Take(100);

Wow! The first thing you notice is that we've shrunk the code from 22 lines down to 7, and those 7 lines are really just one LINQ statement. Amazing!
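
One thing the snippet doesn't show is where titles comes from. A minimal sketch, assuming the same file as the XPath version and that titles is simply the loaded root element (the loading code is my assumption, not part of the original snippet):

    // Requires: using System.Xml.Linq;
    // Load the document once (e.g. in the window's constructor), not on every key press.
    // XElement.Load returns the root <results> element, so titles.Elements()
    // enumerates the <row> children queried above.
    XElement titles = XElement.Load(@"f:text.xml");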

This LINQ statement goes through each of the input terms (via the All method) and selects only the nodes where every term is present in the title. It returns an IEnumerable<string> which is assigned straight to the ItemsSource of the list box, so the control's data binding can take over and display the results. But is the LINQ code any faster?

Well, the timings for the same search terms as above are:

Items near the top of the list: ~0.03s

Items near the bottom: ~0.04s

So, based on my rudimentary timings at the two extremes, either approach is fast enough to use for incremental searches. For the final version I set up a key press timer and only perform the search 500ms after the last key press, as sketched below. This stops a search firing on every keystroke and keeps the edit box responsive while the operator is still typing.
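
Here's a minimal sketch of that debounce, assuming a WPF window (the code above uses ItemsSource); the searchTimer field, the Window1 class and the RunSearch() method are my names, not from the original:

    // Requires: using System; using System.Windows.Controls; using System.Windows.Threading;
    private readonly DispatcherTimer searchTimer =
        new DispatcherTimer { Interval = TimeSpan.FromMilliseconds(500) };

    public Window1()
    {
        InitializeComponent();
        // When the timer fires, stop it and run the LINQ search shown above.
        searchTimer.Tick += (s, e) => { searchTimer.Stop(); RunSearch(); };
    }

    private void textBox1_TextChanged(object sender, TextChangedEventArgs e)
    {
        // Restart the countdown on every key press, so the search only runs
        // once the operator has paused typing for 500ms.
        searchTimer.Stop();
        searchTimer.Start();
    }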

Although LINQ has quite a steep learning curve, I believe it's well worth the effort, and I'll definitely be thinking about iteration over collections in a whole new way!
