ISEARCH TUTORIAL
The Complete Guide to Using and Configuring the CNIDR Isearch System
with Examples, Advice, and Editorial Comment
In the beginning, there was "grep".
Grep was good, but it lacked subtlety. It lacked speed. And while
grep was cheap and hence widely used, it wasn't a text searching system.
It wasn't even a text searching program.
It was a regular expression matching program. It didn't sort or rank
or try to get along well with others.
So the computer scientists of the world created Information Retrieval.
It wasn't a product, it was a subject. They wrote many papers.
From the Information Retrieval people (who now called themselves "IR" so
their colleagues would think they were developing remote controls for TVs)
came text search systems. Systems such as "SMART". Systems such as "WAIS".
WAIS was written by a guy who worked for a supercomputer company. He wanted
to create an application for those big, hulking machines. He gave away a
watered-down version of his program so people would see how nice a full-
strength version of it would be on big, hulking machines. The supercomputer
company that he worked for was going bankrupt, so he took his WAIS and he
started his own company to sell WAIS.
The National Science Foundation saw hope in the free version of WAIS, so
they worked with MCNC (it doesn't stand for anything, they're just MCNC) to
create CNIDR (and it does stand for something: Truth, Justice, and the...
no, actually, it used to stand for the Clearinghouse for Networked Information
Discovery and Retrieval, but lately they've changed the "Clearinghouse" to
"Center" since that's easier to type with one hand). CNIDR begat freeWAIS,
which was the old, free WAIS (hence the name) with a series of enhancements.
But freeWAIS had problems. At its heart was the crippled stump of a search
system. Small desktop machines had caught up with the old, hulking machines,
but freeWAIS still only handled small collections. CNIDR said, "This sucks."
I know, I heard them. So they created a new text search system from the
ground up. And they called it Isite. Isearch is the part of Isite that
actually does the searching.
And it was good.
ISEARCH TUTORIAL TABLE OF CONTENTS
1. A Quick Example of Isearch Use.
2. Indexing Collections.
2.1 Arranging the Text.
2.2 Running the Indexer.
2.3 Deciding on Doctypes.
3. Searching.
3.1 Simple Searches.
3.2 Searching in Subfields.
3.3 Boolean Searches.
3.4 Notes on Ranking.
3.5 Wildcards.
3.6 Prefixes and Suffixes.
4. Maintaining Your Data.
4.1 Removing old Files.
4.2 Adding new Files.
4.3 Moving Files Around.
4.4 Viewing Information.
5. Performance Issues.
5.1 Location of Files for Speed.
5.2 How Much RAM is Enough?
5.3 CPU Load vs. User Skill.
1. A Quick Example of Isearch Use.
This section assumes that Isearch has already been installed; we'll assume
it lives in the directory /local/project/Isearch-1.09.09.
Naturally, that name will change depending on the version number and the
preferences of your site, so remember to substitute your own directory name.
If you haven't installed Isearch yet, see the file "QuickStart" in the
Isearch documentation directory.
In this example, we'll index two text files from the Isearch distribution,
"COPYRIGHT" and "README". We only picked these files because we know everyone
has them. Create a directory for the textbase:
sti-gw% mkdir /local/text/example1
sti-gw% cp COPYRIGHT README /local/text/example1
Now create a directory to hold the indexes:
sti-gw% mkdir /local/indexes/example1
Now we can index the text:
sti-gw% Iindex -d /local/indexes/example1/EX1 /local/text/example1/*
Iindex 1.09.09
Building document list ...
Building database /local/indexes/example1/EX1:
Parsing files ...
Parsing /local/text/example1/COPYRIGHT ...
Parsing /local/text/example1/README ...
Indexing 1004 words ...
Merging index ...
Database files saved to disk.
sti-gw% ls -l /local/indexes/example1/
total 16
-rw-r--r-- 1 escott staff 85 Apr 11 12:24 EX1.dbi
-rw-r--r-- 1 escott staff 4016 Apr 11 12:24 EX1.inx
-rw-r--r-- 1 escott staff 24 Apr 11 12:24 EX1.mdg
-rw-r--r-- 1 escott staff 40 Apr 11 12:24 EX1.mdk
-rw-r--r-- 1 escott staff 520 Apr 11 12:24 EX1.mdt
That indexed the text in /local/text/example1 and created indexes in
/local/indexes/example1. Now we can use those indexes to
search for text:
sti-gw% Isearch -d /local/indexes/example1/EX1 fee fee
Isearch 1.09.09
Searching database /local/indexes/example1/EX1:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/example1/COPYRIGHT
Select file #:
The word "fee" occurs only in the file COPYRIGHT. Note that because of
a bug in version 1.09.09 (actually, it's believed to be a bug in gcc
that Isearch exposes) the search term has to be entered twice. This
is corrected in later versions.
2. Indexing Collections.
This section will discuss the issues surrounding indexing of text with
Isearch, including how to arrange the text, decide on a doctype to use,
and actually run the indexer.
2.1 Arranging the Text.
Arranging the text to be indexed is fairly straightforward. Good practice
dictates that the textbase be given its own directory hierarchy to make
maintenance simple. It's also a good idea to not mix up the text files
with the indexes. Maintenance is simpler when they are separate, and there
is also some performance gain to be realized when the indexes and text files
are on separate disks.
The Iindex command will need to be given a list of files to index and
directories to traverse. If you give Iindex the "-r" option then it will
recursively plunge headlong into subdirectories indexing everything it can
find. For simple collections this is probably a good thing, but for more
subtle applications you'll probably want to do the filespace traversal
yourself. This is especially true if you are using some kind of version
control system like SCCS or RCS.
Unix filesystems impose virtually no penalty for using subdirectories, so
feel free to impose quite a bit of organization on your tree of text.
Note that if you have a huge number of small files, you'll want to either
use a lot of subdirectories or else combine the small files into larger ones.
Consider the case of 100,000 files of one line each (an extreme example, but
we see it all the time). If you put 100,000 files into one Unix directory,
file access will be incredibly slow. Consider one of two solutions:
1) Create 100 subdirectories and put 1000 files into each subdirectory.
-or-
2) Concatenate all the files into one 100,000 line file. Index this file
using the "ONELINE" doctype (discussed in Section 2.3).
The second solution is strongly preferred.
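A minimal sketch of that second solution (the paths and index name here are
hypothetical):
sti-gw% cat /local/text/tinyfiles/* > /local/text/example3/biglist
sti-gw% Iindex -d /local/indexes/example3/EX3 -t ONELINE /local/text/example3/biglist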
Do not put text or index files into a network-mounted filesystem. These
files will be hit hard and hit often, and can bring a network to its knees.
Don't let vendor claims of caching performance fool you either: Isearch
and AFS (or DFS) do not get along well (the files are accessed in ways generally
guaranteed to *not* be cached). So far, experiments with using CacheFS to
back up accesses to NFS 3.0 mounted partitions haven't been very promising
either. Cheap SCSI disks are below 30 cents per megabyte now. Consider
dedicating two spindles, preferably on separate SCSI controllers, to Isearch
alone and you'll see a lot better performance.
2.2 Running the Indexer.
Iindex takes several command line options. Here is a list of the flags along
with a brief discussion of each:
-d (X) This option specifies the name of the index. The index name doesn't
have to have anything to do with the textbase, but it's a good idea
to make it descriptive. For example, if you have a company phone
book you want to make searchable, it would be a good idea to use a
name like "PHONENUMBERS" or "DIRECTORY" instead of a name like
"INDEX". Index names may have mixed upper and lower case, as long
as you use the cases consistently. The indexes "Phone" and "phone"
refer to entirely separate collections. The index names may also
(and very often will) contain path names to point to a directory.
An option like "-d /local/index/CompanyPhoneBook" is typical.
This option is required for every Iindex command.
-a This option means "add to an existing textbase". Use this if
you want to add a few files later on. Note that you should
use this sparingly, since it can be a little slow. Also avoid
adding files one at a time if possible. It's a lot better to add a
whole bunch at once like:
Iindex -d INDEXNAME -a newFile1 newFile2 newFile3
instead of adding the files one command at a time.
-m (X) This option tells Iindex how many megabytes of text to read and
index in memory at once. It doesn't tell Iindex how many megabytes
total to use, so this figure should generally be about 4 megabytes
less than the total amount of memory for an otherwise quiescent
system. If you're indexing 5.5 megabytes of text, use "-m 6".
It's a good idea from a performance standpoint to have at least as
much RAM available as the size of the textbase to be indexed.
Searching is almost independent of the amount of RAM available, but
indexing needs lots and lots of memory to go quickly. Note that
this means you can lease/borrow/steal time on a Cray CS6400
with 4 gigs of RAM for indexing and then use those indexes on
a modest SparcStation 5 for searching if you have to. We've done
it. If you're short on RAM for indexing, then there will be a lot
of disk activity while smaller indexes are merged.
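For example, to index while buffering 32 megabytes of text at a time (the
index name and paths here are hypothetical):
Iindex -d /local/indexes/BIG -m 32 /local/text/big/*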
-s (X) This option informs Iindex that you have multiple logical documents
per physical file, and that they are separated by something. Consider
this example file:
Hi, I'm the first
document.
###
Greetings to all from
the second document.
If you index with:
Iindex -d mySeparatorExample -s "###" exampleFile
then you'll have two logical documents. A search for "greetings" will
match the second paragraph, but not the first.
Use of the -s option can give you, in essence, a quick way to
fake having new doctypes.
-t (X) This option tells Iindex what doctype to use when indexing the files.
A doctype is a module that explains how to find logical documents
and subfields within those documents. It also has code to display
documents that it finds. For example, the "ONELINE" doctype tells
Iindex that the file consists of a bunch of single line logical
documents. The "PARA" doctype says that each paragraph is a document.
The "SGMLTAG" doctype tells Isearch how to find the "
"
fields and so forth inside an SGML document.
If you don't specify a doctype, then the "SIMPLE" doctype is
used by default. SIMPLE doesn't really do much; it assumes that
there is one document per file, no fields, and that presentation is
handled by just dumping the contents of the file.
For more information, see the files "dtconf.inf", "BSn.doc", and
"STI.doc" in the doctype directory.
-f (X) This option causes Iindex to read a list of file names to be indexed
from a file. For example:
ls /local/text/example2 > myListOfFiles
Iindex -d exampleIndex -f myListOfFiles
causes Iindex to read the file "myListOfFiles" and then index every
file it sees in there. This is very useful for huge lists of files.
By default, older Unix systems can't pass command lines more than
10240 characters long. If you have several thousand files to index,
the command line could quickly become too long.
(Note to the confused: No, you would never type a line that long.
But consider how big a command line can get through filename
expansion with wildcards. Try this:
echo /usr/man/*/* | wc
the wordcount utility reports that "echo /usr/man/*/*" expanded
into a line 122634 characters long.
(Subnote to the clueful: So, if that command expanded to 122K, how
could it be passed to "echo" so I could wordcount it? Simple.
I used Solaris 2.5, which allows 1 megabyte command lines.
Nonetheless, there are a lot of Ultrix boxes still in use so it's
worth noting.)).
-r This option says to recursively descend into subdirectories. If you
have a whole tree of text then you can use this option and just give
the top level directory name to index. Iindex will do the rest.
Do not use this option if you're using SCCS or RCS. Iindex will
blithely index the SCCS "s." files and will plunge into RCS directories
and will just generally not do what you expected. How to escape this
dilemma? Use "find". Find is your friend.
Iindex -d watchThisItsCool `find /local/text/deeptree -type f \! -name "s.*"`
will cause Iindex to index all files below "/local/text/deeptree"
except for those that begin with "s." (SCCS files).
If you're using the "sccs" convenience shell, then you'll want to
ignore "SCCS" directories.
Find is such a powerful command that you should spend a while getting
to know it better if you haven't already. It's especially useful
for maintaining collections of text by doing things like weeding out
old versions of files automatically.
-o (X) The -o option is used to pass information to specific doctypes.
Each doctype treats this option differently (if at all) so check
the documentation for your specific doctype before worrying about
this.
Once you've decided what options to use, let Iindex bang away creating indexes.
Don't be surprised if this takes a long time, especially if you have more
text than will fit into RAM.
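To put several of these options together, a complete indexing run might look
something like this (a sketch only; the paths, index name, and memory size
are hypothetical):
sti-gw% find /local/text/bigtree -type f -print > myListOfFiles
sti-gw% Iindex -d /local/indexes/BIGTREE -t SIMPLE -m 32 -f myListOfFiles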
2.3 Deciding on Doctypes.
Doctypes are modules that explain how to find logical documents and fields
within those documents. Each doctype also has code to display the documents it finds.
The default doctype is "SIMPLE". It tells Iindex that there is one logical
document per file, that there are no fields to index separately, and that
when displaying these documents they will just be dumped out without any
formatting. SIMPLE really is the simplest doctype you can have.
The other CNIDR-supplied doctype is "SGMLTAG". This doctype will make
one logical document per file, will index text occurring inside of
SGML-like markup pairs as fields, and will present the text by dumping it
out without a lot of special formatting. This doctype can be used to
index HTML web pages for making searchable WWW sites.
There are many other doctypes available. See the files "BSn.doc" and "STI.doc"
in the doctype directory for more information. New doctypes are being added
all the time, so look for other "*.doc" files in that directory.
3. Searching.
This section describes the "Isearch" command itself. Isearch is used to
search through the text collection after it has been indexed with Iindex.
3.1 Simple Searches.
The simplest searches in Isearch look for one word, or any of a list of words,
in any location within every file in the collection. Consider
searching for "poetry" or "rhyme" or "verse" in a collection of journal
articles:
Isearch -d MLAabstracts poetry rhyme verse
This will display a list of matching items. The items will have their
"headlines" displayed. To select one for viewing, enter the number of
the item and press "Return". To exit, press "Return" by itself.
Note that adding more search terms will generally cause more items to be
returned, rather than fewer. This is counterintuitive, but isn't the big
problem that it seems to be. Isearch has a very sophisticated ranking
technique that will try to decide what documents are most likely to be useful
and will return them at the top of the list. This ranking method actually
works better for longer queries than for short ones.
3.2 Searching in Subfields.
Consider the following HTML text:
<TITLE>Cool Page</TITLE>
This is my excellent Web Page.
If you index this text with the SGMLTAG doctype, then you can search for
words that occur anywhere in the document, or you can search for words that
only occur in the title (inside the <TITLE> and </TITLE> markers).
For example:
sti-gw% Iindex -d why -t sgmltag /local/text/testfile
Iindex 1.10
Building document list ...
Building database why:
Parsing files ...
Parsing /local/text/testfile ...
Indexing 7 words ...
Merging index ...
Database files saved to disk.
sti-gw% Isearch -d why web
Isearch 1.10
Searching database why:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/testfile
Cool Page
Select file #:
sti-gw% Isearch -d why title/web
Isearch 1.10
Searching database why:
0 document(s) matched your query, 0 document(s) displayed.
In the first example, we searched for "web", and lo and behold we found it. In
the second example we looked for "title/web", which means "find me occurrences
of 'web' inside the 'title' field". There aren't any of those, so Isearch says
there were no matching documents.
3.3 Boolean Searches.
This discussion of boolean searches assumes you have at least version 1.10.01.
If you don't, then the brief section on "rpn" queries will apply to you but not
the rest of this section. And if you aren't using at least 1.10.01 then I
strongly urge you to upgrade because you're living with too many bugs.
3.3.1 "Infix" Notation Boolean Searches
Infix notation is the old familiar notation you remember from Cobol, Fortran, C,
and TI calculators.
Reusing our example from section 3.2, we still have the Cool Page document.
To search for a document that contained both "cool" and "page", we could use
a query like:
sti-gw% Isearch -d why -infix cool and page
To search for a document with either cool or page (or even both) we can use:
sti-gw% Isearch -d why -infix cool or page
To search for documents that contain "cool" but not "page", we can say:
sti-gw% Isearch -d why -infix cool andnot page
Note that "andnot" is one word.
Note further that you can say "&&" instead of "and", "||" instead of "or", and
"&!" instead of "andnot". This is handy for recovering C programmers.
You can group things with parentheses to make more complicated queries, but
remember that most Unix shells attach special meaning to parentheses. Protect
them with quotation marks like this:
sti-gw% Isearch -d why -infix "(page or doc)" and "(cool or cold)"
3.3.2 "RPN" Notation Boolean Searches
RPN notation is the old familiar notation you remember from Forth, HP
calculators, and your compiler writing class.
You probably shouldn't be using RPN unless you know what you're doing, so
here's that last query from the previous section in rpn:
sti-gw% Isearch -d why -rpn page doc or cool cold or and
In a nutshell, you "push" search terms onto a stack and then feed enough
operators to that stack to pop them all off. What could be more obvious?
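As a gentler warm-up, the infix query "cool and page" from the previous
section becomes, in RPN:
sti-gw% Isearch -d why -rpn cool page and
The two terms are pushed onto the stack, and the "and" then pops both off.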
3.4 Notes on Ranking.
In sections above, we have alluded to a sophisticated ranking algorithm. We use
an algorithm developed by Gerard Salton for the SMART retrieval system.
Without going into great detail, both the search terms and the resulting
documents are scored. Documents that contain more instances of higher scoring
words get ranked higher.
What makes a high-scoring word? Basically, the fewer documents the word
occurs in, the more important it must be. This is often bogus, but more often
it's useful.
A high-scoring document is one that contains a lot of instances of search terms.
To combine the scores, the document and the search term scores are multiplied
for each matching search term. These products are totaled over the set of
all the search terms. Those who are good at math will recognize this as
a "dot product". Finally, the scores are normalized so the highest score is
1.0.
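In symbols (our notation, not anything Isearch prints), for a query Q and a
document d, the scheme described above amounts to roughly:
    score(d) = \sum_{t \in Q} w_t \cdot f_{t,d}
    rank(d)  = score(d) / \max_{d'} score(d')
where w_t is the weight of term t (higher for rarer terms) and f_{t,d} is the
number of times t matches in d.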
Generally speaking, using a search word that appears in nearly every document
won't affect the ranking of results. It will, however, slow down the query
tremendously so it still isn't a good idea.
While we're on the subject, it's possible to manually alter the search term
weightings, albeit crudely. Consider this case:
sti-gw% Isearch -d why cool:5 page
This tells Isearch to look for "cool" or "page", but give "cool" five times
the normal weight. The following example does the opposite:
sti-gw% Isearch -d why cool:-5 page
This tells Isearch to find documents that contain "page", but if they contain
"cool" then subtract the value of that match five times. In other words, "I
really want to see documents with "page" in them, but if they also contain
"cool" then rank them really low". This is more useful than boolean queries
with "andnot". Consider a set of documents on, say, databases. You might want
a query like:
sti-gw% Isearch -d databases OODBMS andnot RDBMS
Problem is, if there is a good article on OODBMSes that just happens to
say "Unlike braindamaged RDBMSes, our system can solve world hunger" then the
above query will skip that document. Using a negative term weight will
push that document down on the list, but at least it will still be there.
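Using the term weighting syntax shown above, the friendlier version of that
query would look something like:
sti-gw% Isearch -d databases OODBMS RDBMS:-5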
3.5 Wildcards.
Isearch supports a limited wildcard facility. You can append an asterisk to the
end of a search term to match any word that begins with a given prefix. For
example:
sti-gw% Isearch -d diseases chol*
will find any document with a word that begins with "chol", including
"cholesterol", and yes, even, "cholesteral". This is useful for finding
misspelled words in addition to its obvious uses.
3.6 Prefixes and Suffixes.
Isearch will allow you to slightly change the output when it prints a document.
The following query shows this:
sti-gw% Isearch -d why -prefix "<B>" -suffix "</B>" cool
Isearch 1.10
Searching database why:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/testfile
Cool Page
Select file #: 1
<TITLE><B>Cool</B> Page</TITLE>
This is my excellent Web Page.
The -prefix option lets you specify a string to be printed immediately
before a word that matches. The -suffix option is for a string to be printed
after the word. Note the SGML markup on either side of "Cool" in the above
example.
NOTE: If you're going to specify SGML or HTML on a command line, protect
it with quotation marks. Otherwise, the angle brackets will be interpreted as
file redirection operators, with completely unexpected results.
4. Maintaining Your Data.
After you index your collection and start searching in it, you'll eventually
have to update the data. This section describes how to add, remove, and
relocate your files. The command used for this is "Iutil".
4.1 Removing old Files.
Removing an old file from being considered in searches requires two steps.
The first step is to find the document "key" for the one to be removed:
sti-gw% Iutil -d huh -v
Iutil 1.10
DocType: [Key] (Start - End) File
(* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc
In this case, we used Iutil's -v option to display the table of documents.
The key for each document appears inside the brackets. We'll delete the second
document, key number 1839:
sti-gw% Iutil -d huh -del 1839
Iutil 1.10
Marking documents as deleted ...
1 document(s) marked as deleted.
To see the deletion:
sti-gw% Iutil -d huh -v
Iutil 1.10
DocType: [Key] (Start - End) File
(* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc *
The asterisk at the end of the last line means that entry has been marked as
deleted.
After deleting one or more documents, it's a good idea to force the deletion
with Iutil -c:
sti-gw% Iutil -d huh -c
Iutil 1.10
Cleaning up database (removing deleted documents) ...
1 document(s) were removed.
sti-gw% Iutil -d huh -v
Iutil 1.10
DocType: [Key] (Start - End) File
(* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
You should NEVER remove an actual data file from the collection until you
run Iutil -c to commit the changes. Otherwise, Isearch will display an error
message when you try to search.
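So a safe removal of a hypothetical file /local/text/old/obsolete.txt, whose
only document has key 42, would run in this order:
sti-gw% Iutil -d huh -del 42
sti-gw% Iutil -d huh -c
sti-gw% rm /local/text/old/obsolete.txt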
4.2 Adding new Files.
There are two ways to add a new file to an existing collection: either reindex
the whole collection from scratch or use incremental indexing. If you have
enough RAM to reindex the whole collection, you should probably do that. It's
actually faster to start from scratch than it is to add a few files piecemeal.
If you don't have that much RAM, then you have no choice.
Here's an example, adding the file /local/text/example1/COPYRIGHT:
sti-gw% Iindex -d huh -a /local/text/example1/COPYRIGHT
Iindex 1.10
Building document list ...
Adding to database huh:
Parsing files ...
Parsing /local/text/example1/COPYRIGHT ...
Indexing 5100 words ...
Merging index ...
Database files saved to disk.
As a general rule, you want to specify as many new files to add at once as
possible. Don't add files one at a time, even for as few as two,
because each run takes many minutes as it is.
And a note: You (generally) can't modify an existing data file and expect to get
correct results. You'll have to delete the old one and index the new one.
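In sketch form, an update is a delete followed by an add (the key and the
".new" file name here are hypothetical):
sti-gw% Iutil -d huh -del 1839
sti-gw% Iutil -d huh -c
sti-gw% Iindex -d huh -a /local/text/marc/demomarc.new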
4.3 Moving Files Around.
Sometimes you'll want to move files around in your directory structure, and the
last thing you want to have to do is reindex all of them. Here's how to
relocate the files from the above example:
sti-gw% Iutil -d huh -newpaths
Iutil 1.10
Scanning database for file paths ...
Enter new path or <Return> to leave unchanged:
Path=[/local/text/example1/]
> /export/home/escott
Done.
4.4 Viewing Information.
We've already seen "Iutil -v" to look at index files to see what they contain.
Here are some other options to Iutil to see different things:
sti-gw% Iutil -d huh -v
Iutil 1.10
DocType: [Key] (Start - End) File
(* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc
Pretty basic, right? The "-vi" option shows some summary information:
sti-gw% Iutil -d huh -vi
Iutil 1.10
Database name: /local/project/Isearch-1.10.02-nrn/bin/huh
Global document type: USMARC
Total number of documents: 2
Documents marked as deleted: 0
And the "-vf" option shows information about fields:
sti-gw% Iutil -d huh -vf
Iutil 1.10
The following fields are defined in this database:
001
008
010
LCCN
040
020
ISBN
043
[and so forth. The numbers above are also field names for the USMARC record
format.]
5. Performance Issues.
There are several performance issues with Isearch. They can be roughly divided
into "indexing concerns" and "search concerns". Indexing concerns only
pop up when files are indexed, but searching issues will show up every day.
5.1 Location of Files for Speed.
Location of files in your system is important for good speed. Good speed is,
in turn, essential for supporting a lot of simultaneous users.
In general, it's a good idea to locate the index and the data files on separate
disk drives. Do not make the mistake of putting them in separate partitions
on the same drive: this is worse than doing nothing. Generally, the disk usage
pattern will hit the index files and the data files in
short, alternating bursts. Near the conclusion of a search the index files
will be accessed sequentially, but until then the pattern appears essentially
random to the operating system. Putting the two sets of files on separate
spindles will help minimize the amount of head motion. This is most noticeable
for searching, but affects indexing to some degree. The biggest gains will
be for the largest collections. If you only have 100 megs of data or so, then
you'll be hard pressed to measure the gain.
In a related note: consider RAID. In particular, consider using RAID level 0
with a small granularity so you can get the most I/O operations per second.
Most RAID management software will let you crank the configuration toward
either "faster streaming speed" or "more operations per second" and you
definitely want the latter. The only time Isearch will stream a device is
during the result set generation phase of a search when the index file is
essentially streamed (and, incidentally, the textbase storage device is being
hit in a purely random fashion). These bursts don't last long, so most of
the time the small granule size is a winner. Since most Unixes will transfer
one page of data from a device at a minimum, use the "pagesize" command
and set your chunk size to that.
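For example (the number printed varies by machine and OS; this is just a
sketch):
sti-gw% pagesize
4096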
5.2 How Much RAM is Enough?
Answer: how much can you fit into the box?
Better answer: Even that may not be enough.
Indexing is pretty slow with Isearch. To speed it up, Iindex will try to use
lots of RAM. The "-m" option is used to specify how much RAM to allocate
for buffering data. Note that Iindex will use considerably more than what you
specify. On my little Sparc5 with 48 megs of RAM and an additional 16 megs of
swap space, I can usually only use "-m 6" even though I have considerably more
virtual memory available:
elvis% swap -s
total: 41304k bytes allocated + 8720k reserved = 50024k used, 15620k available
elvis% Iindex -d huh -m 7 -a /local/text/example1/COPYRIGHT
Iindex 1.10
Building document list ...
Adding to database huh:
Parsing files ...
Virtual memory exceeded in `new'
Granted, I'm running CDE/Motif, netscape, and a few goodies, but nothing really
serious. To index on this machine I usually shut down to single-user mode
and run Iindex from the console. It's ugly, but it's also a lot faster.
If you need to index a lot of data, you need a lot of RAM to make it go quickly.
Searching is another issue entirely. More RAM helps, essentially without limit,
but the reward isn't as great. The basic search algorithm runs in about 50
lines of code. The problem starts when that basic algorithm starts finding
matches. When it does, those matches are stored in a result set. These
result sets can get pretty big in a hurry. Imagine searching for "knee" in
a textbase on knee injuries: the "knee" result set will get huge. Booleans
won't help you, either. In fact, they make it worse. Imagine searching for
"knee" and "dislocation". You'll still have to drag around the monster
"knee" result set, and then you'll have to make a second one for "dislocation",
and then finally you'll have to create a third one that is the intersection of the two.
Isearch will delete the first two when the third one is created, but only after
the third one has been finished. It can add up in a hurry.
While the relational database people have spent years optimizing for
this situation, Isearch isn't that clever yet. Moral of the story:
make sure your users know to use "good quality" search terms. They'll
get better results and your office won't look like SIMM City.