`sidx`

Small in-memory inverted index with TF-IDF scoring. Create an index (standard word-token or character-bigram analyser), add documents with optional weighted metadata fields, and search with a free-text query. A search returns scored {id, score, document} hits sorted by relevance, capped at a configurable number of results.

Load with: use sidx

What this module does

sidx is a self-contained full-text search engine. It builds an inverted index mapping terms to documents with per-document term frequency and position information. At search time it applies a simplified TF-IDF score (term frequency × inverse document frequency × field weight) and returns hits sorted by score descending.
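
To make the scoring formula concrete, here is a minimal Python sketch of the product described above. It is not sidx's implementation: the function name tf_idf and the smoothed IDF variant log(1 + N/df) are assumptions, since this document does not state which IDF formula sidx uses.

```python
import math

def tf_idf(term_freq, doc_freq, total_docs, field_weight=1.0):
    """Score one term in one field of one document (hypothetical helper)."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    # Assumed IDF variant: smoothed so a term that occurs in every
    # document still contributes a positive score.
    idf = math.log(1 + total_docs / doc_freq)
    # term frequency x inverse document frequency x field weight
    return term_freq * idf * field_weight
```

Under this variant, rarer terms and more heavily weighted fields push a document up the ranking, which matches the behaviour described above.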

The standard analyser splits text on non-alphanumeric characters and lowercases the tokens. The "ngram" analyser additionally generates character bigrams from each token, which enables prefix and fuzzy matching.
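
The two analysers can be sketched in Python as follows. This is an illustration under assumptions, not sidx's code: the function names are hypothetical, only ASCII letters and digits count as alphanumeric, and the "ngram" variant is assumed to keep each whole token alongside its bigrams.

```python
import re

def standard_tokens(text):
    """Lowercase, split on runs of non-alphanumeric characters, drop empties."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def bigrams(token):
    """All overlapping two-character substrings of a token."""
    return [token[i:i + 2] for i in range(len(token) - 1)]

def ngram_tokens(text):
    """Assumed "ngram" behaviour: each token followed by its bigrams."""
    out = []
    for tok in standard_tokens(text):
        out.append(tok)
        out.extend(bigrams(tok))
    return out
```

With this scheme the query "hel" yields the bigrams "he" and "el", both of which also occur among the terms for "hello world", so a partial word can still match.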

Quick example

use sidx

ix = srceb()  # standard analyser

# Add documents (id, content, optional fields map)
ix = src8a(ix, "doc1", "ilusm is a small scripting language", nil)
ix = src8a(ix, "doc2", "ilusm stdlib has many modules", nil)
ix = src8a(ix, "doc3", "sorting algorithms and data structures",
    {title: "Algorithms"})

# Search - returns list of {id, score, document}
hits = srcva(ix, "ilusm modules", 10)
trl.ech(hits, \(h) prn(h.id + " " + str(h.score)))

# Character-bigram index (fuzzy)
fzix = srcab("ngram")
fzix = src8a(fzix, "a", "hello world", nil)
results = srcva(fzix, "hel", 5)

Functions

Index creation

srceb()

Creates a new index with the "standard" word-token analyser.

srcab(analyzer)

Creates a new index with the specified analyser. Pass "ngram" for character-bigram analysis, or nil for standard.

Indexing

src8a(ix, doc_id, content, fields)

Adds document doc_id, replacing any existing document with the same id. content is the main body text (weight 1.0). fields is an optional object of additional field values, or nil; each field's weight is looked up from ix.field_weights. Returns the updated index.
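
The add-or-replace behaviour can be sketched in Python with a positional inverted index. The data layout and the name add_document are assumptions for illustration; fields and field weights are left out, and the sketch updates the index in place before returning it, which may differ from how sidx produces the updated index.

```python
def add_document(index, doc_id, tokens):
    """Add or replace a document in a positional inverted index.

    Assumed layout:
      index["docs"]     -> {doc_id: [token, ...]}
      index["postings"] -> {term: {doc_id: [position, ...]}}
    """
    # Replace semantics: remove the old document's postings first.
    if doc_id in index["docs"]:
        for term in set(index["docs"][doc_id]):
            posting = index["postings"][term]
            del posting[doc_id]
            if not posting:
                del index["postings"][term]
    index["docs"][doc_id] = list(tokens)
    # Record every position; term frequency is the length of the list.
    for pos, term in enumerate(tokens):
        index["postings"].setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index
```

Clearing stale postings before re-adding keeps terms from an earlier version of the document from matching later queries.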

Search

srcva(ix, query, limit)

Tokenises query and computes TF-IDF scores for each document containing any query term. Returns up to limit results as a list of {id, score, document} objects, sorted by score descending.
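
The search step can be sketched in Python as below. The assumptions are loud ones: per-term scores are summed across query terms, the IDF variant is log(1 + N/df), ties are broken by id, and the document field of each hit is omitted. None of these details are confirmed by this document.

```python
import math

def search(postings, total_docs, query_terms, limit):
    """Rank documents containing any query term (hypothetical sketch).

    postings maps term -> {doc_id: term_frequency}.
    """
    scores = {}
    for term in query_terms:
        docs = postings.get(term, {})
        if not docs:
            continue  # unknown terms contribute nothing
        idf = math.log(1 + total_docs / len(docs))  # assumed IDF variant
        for doc_id, tf in docs.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; equal scores are ordered by id for stability.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"id": d, "score": s} for d, s in ranked[:limit]]
```

Applied to the three documents from the quick example, the query terms "ilusm" and "modules" would rank doc2 above doc1 under these assumptions, because doc2 contains both terms, while doc3 matches neither and is not returned.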

Notes

All data is held in memory - not suitable for large corpora.
Requires trl, txt, rex, mth, and tim.