Lyra v0.3.0 - What's new

Learn the new features of Lyra v0.3.0, the JavaScript-based full-text search engine.

Michele Riva
Nov 14, 2022


Table of contents

  • Video version
  • TL;DR
  • BREAKING: New returning documents shape
  • TF-IDF Ranking
  • Semi-schemaless
  • No reserved properties anymore
  • Internals
  • General fixes
  • Acknowledgements

Lyra just hit a new minor release!

Version 0.3.0 introduces some great new features, as well as general performance enhancements and bug fixes.

It also introduces a new breaking change, so make sure to check it out.

Video version

TL;DR

  • BREAKING: New returning documents shape
    Lyra now returns all the documents inside of { id: string, score: number, document: T }, where document contains the original document.

  • Token relevance
    Lyra search now ranks results based on search token relevance.

  • Semi-schemaless
    Lyra is now semi-schemaless: choose which properties you want to index, and skip describing large, complex schemas when some properties are not searchable.

  • No reserved properties anymore
    With older versions of Lyra, property names such as id were reserved. Now you can insert documents containing these properties without any problem.

  • Internals
    Lyra now exposes its internal methods to allow everyone to write their own integrations quickly.

  • General fixes
    New performance enhancements and bug fixes.

BREAKING: New returning documents shape

Before Lyra v0.3.0, the search function used to return the following object:

{
  elapsed: 300n,
  count: 1,
  hits: [
    {
      "id": "35026070-456125",
      "foo": "bar",
      ...
    }
  ]
}

Starting from v0.3.0, Lyra changes the way documents are returned, enriching the information contained in the hits property:

{
  elapsed: 300n,
  count: 1,
  hits: [
    {
      "id": "35026070-456125",
      "score": 0.04449919616347632,
      "document": {
        "foo": "bar",
        ...
      }
    }
  ]
}

You can now access the full, original document through the new document property of every element in hits.

The score property will tell you how relevant the search result is.
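If your code used to read documents straight out of hits, a small mapping step is usually enough to migrate. The sketch below does not call Lyra at all: it hardcodes a result object in the new shape (values copied from the example above) and shows how to unwrap it. The variable names are only illustrative.

```javascript
// Hypothetical search result in the v0.3.0 shape (not produced by a real search).
const result = {
  elapsed: 300n,
  count: 1,
  hits: [
    {
      id: "35026070-456125",
      score: 0.04449919616347632,
      document: { foo: "bar" },
    },
  ],
};

// Before v0.3.0 each hit was the document itself.
// From v0.3.0 on, unwrap the `document` property instead.
const documents = result.hits.map((hit) => hit.document);

// Keep the score next to each document when the ranking matters downstream.
const ranked = result.hits.map(({ score, document }) => ({ ...document, score }));

console.log(documents);
console.log(ranked);
```

Here documents has the same shape that hits had before v0.3.0, so it can be a drop-in replacement in existing code.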

TF-IDF Ranking

One new significant feature is that Lyra now sorts search results by relevance.

Starting from v0.3.0, Lyra implements the TF-IDF ranking algorithm to provide more relevant results while performing any kind of search.

Before v0.3.0, Lyra would sort all the search results by document ID (like in a FIFO queue). Now it sorts everything by relevance.

To explain this concept, let's pretend we have the following four documents:

  • {"id-01": "The quick brown fox jumps over the lazy dog"}
  • {"id-02": "I love my dog!"}
  • {"id-03": "This quick fox is jumping over a giraffe. What a fox!"}
  • {"id-04": "I love this brown fox. Fox is my favorite animal ever. Give me a fox"}

If we search for "quick fox", for example, we can clearly see that some results are more appropriate than others. Lyra decides which document is more relevant by performing a set operation on the matches and scoring them with the term frequency-inverse document frequency (TF-IDF) algorithm.

Every time you insert new data, Lyra will tokenize the new document, then, for each token, it will:

  • Count the number of times the token appears in a particular document
  • Calculate the TF value, that is to say, the term count divided by the number of words in a particular document
  • Count the number of documents in which the token appears
  • Calculate the DF, which is the document count divided by the total number of documents
  • Calculate the IDF, which is the logarithm of the inverse of DF

Every time a new search operation is performed, Lyra will use the data above to calculate how relevant a given document is to the search query by assigning it a score.

The results will always be sorted in descending order, from the most relevant to the least relevant.
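To make the steps above concrete, here is a minimal TF-IDF sketch applied to the four example documents. It is not Lyra's implementation: the tokenizer, the helper names (tokenizeSimple, tf, idf, rank), and the choice to sum TF × IDF over the query tokens are all assumptions made for illustration, and everything is computed at query time rather than at insertion.

```javascript
// Illustrative only: a tiny TF-IDF scorer, not Lyra's actual code.
const docs = {
  "id-01": "The quick brown fox jumps over the lazy dog",
  "id-02": "I love my dog!",
  "id-03": "This quick fox is jumping over a giraffe. What a fox!",
  "id-04": "I love this brown fox. Fox is my favorite animal ever. Give me a fox",
};

// Hypothetical tokenizer: lowercase words, punctuation dropped.
const tokenizeSimple = (text) => text.toLowerCase().match(/[a-z]+/g) ?? [];

const tokenized = Object.entries(docs).map(([id, text]) => [id, tokenizeSimple(text)]);
const totalDocs = tokenized.length;

// TF: occurrences of the token divided by the number of words in the document.
const tf = (token, words) => words.filter((w) => w === token).length / words.length;

// IDF: logarithm of the inverse of DF, where
// DF = documents containing the token / total number of documents.
const idf = (token) => {
  const docCount = tokenized.filter(([, words]) => words.includes(token)).length;
  return docCount === 0 ? 0 : Math.log(totalDocs / docCount);
};

// Score every document against the query, drop non-matches,
// and sort by descending relevance.
function rank(query) {
  const queryTokens = tokenizeSimple(query);
  return tokenized
    .map(([id, words]) => ({
      id,
      score: queryTokens.reduce((sum, t) => sum + tf(t, words) * idf(t), 0),
    }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score);
}

const ranking = rank("quick fox");
console.log(ranking.map((hit) => hit.id)); // [ 'id-03', 'id-01', 'id-04' ]
```

With this toy scoring, id-03 comes first because it contains both tokens and mentions "fox" twice, while id-02 is dropped because it contains neither token. The scores Lyra produces may differ, since its exact formula is not shown here.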

Learn more about TF-IDF here: learndatasci.com/glossary/tf-idf-term-frequ..

Semi-schemaless

When working on large datasets, it is common to have documents with a large number of properties, and maybe some of them are not even relevant for any search purpose.

Also, consider that currently Lyra, including v0.3.0, only performs search operations on strings.

With that being said, let's consider the following schema:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: {
    author: 'string',
    quote: 'string',
    favorite: 'boolean', // <-- unsearchable
    tags: 'string[]' // <-- unsupported type!
  }
})

Why does Lyra need to know that a given property is of a certain type if it is not searchable?

The main reason for Lyra to know types is that we're experimenting with the possibility of performing filtering operations depending on booleans, numbers, etc.

Starting from v0.3.0, it will no longer be necessary to list any non-searchable property as part of the Lyra schema.

In fact, it will be possible to rewrite the schema definition above as follows:

import { create } from '@lyrasearch/lyra'

const db = create({
  schema: {
    author: 'string',
    quote: 'string',
  }
})

and still be able to insert documents like:

{
  "author": "Rumi",
  "quote": "Patience is the key to joy",
  "isFavorite": true,
  "tags": ["inspirational", "deep"]
}

or even documents with different shapes:

[
  {
    "author": "Rumi",
    "quote": "Patience is the key to joy",
    "isFavorite": true,
    "tags": ["inspirational", "deep"]
  },
  {
    "author": "Rumi",
    "quote": "Grace comes to forgive and then forgive again",
    "score": 10,
    "link": null
  }
]

Of course, it will only be possible to perform search operations on known properties (in this case, author and quote), which will always need to be of type string, as stated in the schema definition.
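One way to picture the semi-schemaless behavior is as a filter applied at insertion time: only the properties declared in the schema are considered for indexing, and everything else is stored untouched. The sketch below is a standalone illustration, not Lyra's code. The helper name pickSearchable is hypothetical, and throwing a TypeError on a type mismatch is an assumption about how such a check could behave.

```javascript
// Illustrative only: how a semi-schemaless index might decide what to index.
const schema = { author: "string", quote: "string" };

const doc = {
  author: "Rumi",
  quote: "Patience is the key to joy",
  isFavorite: true,
  tags: ["inspirational", "deep"],
};

// Hypothetical helper: keep only the properties declared in the schema
// and check that each one matches its declared type.
function pickSearchable(schema, document) {
  const searchable = {};
  for (const [prop, type] of Object.entries(schema)) {
    if (typeof document[prop] !== type) {
      throw new TypeError(`Property "${prop}" must be of type ${type}`);
    }
    searchable[prop] = document[prop];
  }
  return searchable;
}

const indexed = pickSearchable(schema, doc);
console.log(Object.keys(indexed)); // [ 'author', 'quote' ]
```

The isFavorite and tags properties never reach the index in this sketch, but the original document stays intact and can still be returned in full.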

No reserved properties anymore

Before Lyra v0.3.0, some property names such as id were forbidden. For instance, inserting the following document would have caused an error:

{
  "id": "12939123", // <--- "id" was a reserved property name
  "foo": "bar",
  "favorite": true
}

Starting with v0.3.0, there will be no forbidden properties, so the document above will be totally fine.

Internals

Extending Lyra has just become easier than ever. Starting from v0.3.0, Lyra exposes some of its internals:

import {
  formatNanoseconds,
  getNanosecondsTime,
  intersectTokenScores,
  includes,
  boundedLevenshtein,
  tokenize
} from '@lyrasearch/lyra/dist/esm/internals'

Every exposed method comes with its own type definition.

Let's break them down:

  • formatNanoseconds: takes a BigInt as input and returns a human-readable string.
    import { formatNanoseconds } from '@lyrasearch/lyra/dist/esm/internals'
    formatNanoseconds(30000n) // "30μs"
    
  • getNanosecondsTime: gets the current time with nanoseconds-precision. Returns a BigInt.
    import { getNanosecondsTime } from '@lyrasearch/lyra/dist/esm/internals'
    getNanosecondsTime() // 1363500821581208n
    
  • intersectTokenScores: returns the intersection of N arrays.
    import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/internals'
    intersectTokenScores([
      [
        ["foo", 1],
        ["bar", 1],
        ["baz", 2],
      ],
      [
        ["foo", 4],
        ["quick", 10],
        ["brown", 3],
        ["bar", 2],
      ],
      [
        ["fox", 12],
        ["foo", 4],
        ["jumps", 3],
        ["bar", 6],
      ],
    ])
    // Result: [["foo", 9], ["bar", 9]]

  • includes: faster alternative to Array.prototype.includes.
    import { includes } from '@lyrasearch/lyra/dist/esm/internals'
    includes([10,20,30], 10) // true
    
  • boundedLevenshtein: computes the Levenshtein distance between two strings (a, b), returning early with -1 if the distance is greater than the given tolerance. It assumes that tolerance >= ||a| - |b|| >= 0.
    import { boundedLevenshtein } from '@lyrasearch/lyra/dist/esm/internals'
    boundedLevenshtein("moon", "lions", 3) // { isBounded: true, distance: 3 }
    
  • tokenize: tokenizes an input string:
    import { tokenize } from '@lyrasearch/lyra/dist/esm/internals'
    tokenize("hello, world!") // ["hello", "world"]
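To clarify what "the intersection of N arrays" means for intersectTokenScores, here is a standalone sketch that reproduces the result shown in the example above. It is not Lyra's implementation: the function name intersectScores is hypothetical, and the behavior (keep only the keys present in every array, summing their scores) is inferred from that single example.

```javascript
// Illustrative only: one way to intersect N arrays of [key, score] pairs.
// Keys missing from any array are dropped; scores of surviving keys are summed.
function intersectScores(arrays) {
  if (arrays.length === 0) return [];
  const [first, ...rest] = arrays;

  // Start from the first array, preserving its key order.
  const totals = new Map(first);

  for (const pairs of rest) {
    const current = new Map(pairs);
    for (const [key, score] of totals) {
      if (current.has(key)) {
        totals.set(key, score + current.get(key));
      } else {
        totals.delete(key); // not present everywhere, so not in the intersection
      }
    }
  }
  return [...totals];
}

const intersection = intersectScores([
  [["foo", 1], ["bar", 1], ["baz", 2]],
  [["foo", 4], ["quick", 10], ["brown", 3], ["bar", 2]],
  [["fox", 12], ["foo", 4], ["jumps", 3], ["bar", 6]],
]);
console.log(intersection); // [ [ 'foo', 9 ], [ 'bar', 9 ] ]
```

Only "foo" and "bar" appear in all three arrays, so they are the only keys that survive, each with a summed score of 9.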
    

General fixes

With #167 and #166, we introduced a good number of performance optimizations and cleaned up the code.

Acknowledgements

This release has been made possible by:

A special "thank you" to NearForm for sponsoring Lyra.
