dallaway.com - Writing - Spindle
I switched a few site searches over from using a Swish-e CGI to using a Java-based package called Spindle. These are my notes. The summary is: it works very well.
Original: July 2002. This version: $Id: index.html,v 1.6 2004/06/20 21:34:51 richard Exp $
My needs for site search are pretty simple: a fast indexing tool to scan a web site; and a fast search tool that will find relevant stuff.
Swish-e is a C-based lump of code plus a bit of Perl to do the site crawling. It works fine, but I've found it gets a bit slow. Also, the CGI architecture for performing the search doesn't scale too well, and I wasn't in the mood for fixing it.
Lucene has been around for a while as an indexing tool but I've ignored it because in my day job I need multilingual search, and Lucene doesn't have stemmers for many languages -- not yet, check regularly because they are adding more. However, for a couple of personal web sites, I didn't need anything other than English, so I decided to try Lucene.
To make life easier, Bitmechanic have wrapped Lucene with a web crawler and called the package Spindle. I installed version 0.90 (dated 2002-03-30)....
The first step with Spindle is to create an index. I downloaded and installed the Spindle JAR files on a Linux server and used this script ("spindle") to fire off the indexing:
#!/bin/sh
# spindle -- Runs the spindle HTTP spider to index a site
SPINDLE=~/local/spindle-0.90
CLASSPATH=$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar
java -cp $CLASSPATH com.bitmechanic.spindle.Spider "$@"
Note that, contrary to the documentation, you do need jsse.jar for indexing (at least you do with JDK 1.3). You can download this jar from the Java Secure Socket Extension site.
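Because the JVM silently ignores classpath entries that do not exist, a missing jar only shows up as a class-loading error once the spider is already running. A minimal sketch of a pre-flight check follows; check_jars is a hypothetical helper name of mine, not something that ships with Spindle.

```shell
#!/bin/sh
# check_jars CLASSPATH -- hypothetical helper: report every entry on a
# colon-separated classpath that is not an existing file, and return
# non-zero if any entry is missing.
check_jars() {
    missing=0
    old_ifs=$IFS
    IFS=:                       # split the argument on colons
    for jar in $1; do
        if [ ! -f "$jar" ]; then
            echo "missing: $jar" >&2
            missing=1
        fi
    done
    IFS=$old_ifs
    return $missing
}

# Example, using the same paths as the spindle script:
# SPINDLE=~/local/spindle-0.90
# check_jars "$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar" || exit 1
```

Calling it just before the java line makes the script stop early with the name of the jar that is missing.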
The command to index this very site is:
spindle -u http://www.dallaway.com/ -d /home/richard/html/dallaway/spindle/ -e .cgi -e .jar -e .zip -e /comment -v -dt p -dt span -dt h1 -n
In summary this says: crawl www.dallaway.com, storing the index in the "spindle" directory, ignoring any .cgi, .jar or .zip files, oh, and be verbose about it. I'll explain the other command line arguments later, in the section on my hacks.
Having been used to Swish-e, when I ran my first site index I was expecting the indexing to be complete in an hour or so. What I saw was this: Indexed 1047 URLs (2546 KB) in 137 seconds. That's fast... take my word for it. Probably the main parameter to use to adjust the performance is the number of threads for crawling (the default, which I used, is 2).
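To put that figure in perspective, the rate can be worked out from the three numbers in the log line. This is only a small arithmetic sketch; crawl_rate is an illustrative helper name, not part of Spindle.

```shell
#!/bin/sh
# crawl_rate URLS KILOBYTES SECONDS -- print the average crawl rate
crawl_rate() {
    awk -v urls="$1" -v kb="$2" -v secs="$3" 'BEGIN {
        printf "%.1f URLs/s, %.1f KB/s\n", urls / secs, kb / secs
    }'
}

# Numbers taken from the "Indexed ..." line above
crawl_rate 1047 2546 137    # prints: 7.6 URLs/s, 18.6 KB/s
```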
Having built an index, you'll want to manually see what's in it from the command line. Here's the script I use ("search"):
#!/bin/sh
SPINDLE=~/local/spindle-0.90
CLASSPATH=$SPINDLE/lib/spindle.jar:$SPINDLE/lib/lucene-1.2-rc4.jar:$SPINDLE/lib/jsse.jar:~/local/listlib-0.91/lib/listlib.jar
java -cp $CLASSPATH com.bitmechanic.spindle.Search "$@"
The documentation doesn't mention that you also need Bitmechanic's listlib library on the classpath, but I found that I did need it.
The above script allows you to search the index using a Lucene search phrase. For example, to search the index for the word "auction":
$ ./search /home/richard/html/dallaway/spindle auction
http://www.dallaway.com/acad/cbd/; Title: Outsourced component development; Score: 0.21613583
Description: Outsourced component development Late in 1999 I was asked to develop a small Java applet for a colleague who was working on a bigger project for a client. This wasn't something I was really interested in and nor was it my speciality, but I thought it would
http://www.dallaway.com/acad/index.html; Title: Writing; Score: 0.213919
Description: Things I have written down My notes on getting going with Java Web Start for deploying applications over the internet, and learning how to digitally sign code on the cheap. My first attempt at commissioning some code development via an internet auction. My
Two results, showing the page, the title, the score and the description (summary) of the page.
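If you only want the URLs and scores, for instance to compare rankings between queries, the output can be filtered. The sketch below assumes that each hit starts with a line of the form "URL; Title: ...; Score: N", as in the example above; hits is a hypothetical helper name, and the exact line layout of the real output may differ.

```shell
#!/bin/sh
# hits -- hypothetical filter: reduce the search output on stdin to
# "score<TAB>url" lines, one per result. Description lines are skipped.
hits() {
    awk -F'; ' '/^https?:\/\// && $NF ~ /^Score: / {
        score = $NF
        sub(/^Score: /, "", score)      # keep only the number
        printf "%s\t%s\n", score, $1
    }'
}

# Example:
# ./search /home/richard/html/dallaway/spindle auction | hits | sort -rn
```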
To provide this search functionality on the web it's just a matter of changing the JSP that comes with Spindle. It'd probably be better to write a servlet or Struts action that forwards to a display JSP, for proper MVC, but the single JSP works fine for me on the small web sites where I'm using Spindle.
To deploy the JSP, you need to have commons-beanutils.jar, lucene-1.2-rc4.jar, listlib.jar and spindle.jar in WEB-INF/lib, and you also need to deploy WEB-INF/listlib.tld. These all come in the Spindle and/or listlib distributions from Bitmechanic. Later versions of Lucene will probably work, but I've not tried them out yet.
The index is set in the top of the JSP:
<jsp:setProperty name="search" property="dir" value="/home/richard/html/dallaway/spindle"/>
And that's pretty much it. You can try the searching by using the "Find" box at the top of this page.
I've made five changes to the Spindle distribution:
-dt (for "description tag"). This option allows you to specify which tags in the HTML should be considered for the description. In the example above I said: -dt p -dt span -dt h1, meaning take the description only from the <p>, <span> or <h1> tags.
... <a name='...'> tag, so I've modified Spindle so that it splits documents (in the index) based on the presence of <a name='...'> tags. You activate this with the -n option to Spider.
... ListContainer cannot be applied to (). The source code change is in ListContainer.java.
... TagToken.java file.
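To illustrate what restricting the description to certain tags means, here is a rough sketch in shell and awk. It is not the code from the modified Spider.java; the describe function, its pipe-separated tag list, and the assumption of lowercase tags without nested markup are all mine.

```shell
#!/bin/sh
# describe TAGS -- hypothetical illustration of the -dt idea: read HTML on
# stdin and print only the text that directly follows an allowed opening tag.
# TAGS is a pipe-separated list such as "p|span|h1" (lowercase tags assumed).
describe() {
    # Join all lines so that the text of a tag may span several input lines.
    tr '\n' ' ' | awk -v tags="$1" '
    {
        # An allowed opening tag (with optional attributes), then text up
        # to the next "<". Nested markup ends the captured text early.
        pat = "<(" tags ")( [^>]*)?>[^<]*"
        rest = $0
        out = ""
        while (match(rest, pat)) {
            chunk = substr(rest, RSTART, RLENGTH)
            sub(/^<[^>]*>/, "", chunk)      # drop the opening tag itself
            gsub(/^ +| +$/, "", chunk)      # trim surrounding spaces
            if (chunk != "")
                out = (out == "" ? chunk : out " " chunk)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

# Example: the equivalent of -dt p -dt span -dt h1
# describe "p|span|h1" < index.html
```

Text inside other elements, such as navigation in a <div> or code in a <pre>, is left out of the description, which is the point of the option.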
You can download the modified source of Spider.java, ListContainer.java and TagToken.java, or you can download a drop-in replacement for spindle.jar and listlib.jar (this is built for spindle 0.90 using JDK 1.4). Note that this version of spindle.jar includes listlib, so replace your copy of spindle.jar with this version and remove your copy of listlib.jar.
Lucene works and is shockingly fast. Spindle makes it an (almost) out-of-the-box no-brainer for web site search. However, check with the Lucene project because they are working on something which will replace Spindle, as Spindle does not seem to be actively developed and contains a few bugs.