meilisearch#

Warning

This page is not ready.

It still needs structuring and better wording.

TODO list#

There are a lot of files to read and understand. My current priority is to understand MatcherBuilder and how it is called.

  • meilisearch/src/search/mod.rs

  • milli/src/search/new/matches/mod.rs: definition of MatcherBuilder.

  • MatchingWords?

I would also need to document the notion of transaction. I think it is related to heed, and thus to LMDB.

Currently documenting (WIP)#

  • TODO documents:

    • execute_search

    • MatcherBuilder

execute_search: executes a query. Its parameters (a signature sketch follows this list):

  • ctx: &mut SearchContext

  • query: Option<&str>

  • terms_matching_strategy: TermsMatchingStrategy

  • scoring_strategy: ScoringStrategy

  • exhaustive_number_hits: bool

  • mut universe: RoaringBitmap

  • sort_criteria: &Option<Vec<AscDesc>>

  • distinct: &Option<String>

  • geo_strategy: geo_sort::Strategy

  • from: usize

  • length: usize

  • words_limit: Option<usize>

  • placeholder_search_logger: &mut dyn SearchLogger<PlaceholderQuery>

  • query_graph_logger: &mut dyn SearchLogger<QueryGraph>

  • time_budget: TimeBudget

  • ranking_score_threshold: Option<f64>

  • locales: Option<&Vec<Language>>
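
Putting the parameters above together, the signature looks roughly like this. This is a sketch reconstructed from the list above; the return type (PartialSearchResult) and the body are assumptions, not taken from these notes.

```rust
// Sketch only: parameter names/types come from the list above;
// the return type and the body are assumptions.
pub fn execute_search(
    ctx: &mut SearchContext,
    query: Option<&str>,
    terms_matching_strategy: TermsMatchingStrategy,
    scoring_strategy: ScoringStrategy,
    exhaustive_number_hits: bool,
    mut universe: RoaringBitmap,
    sort_criteria: &Option<Vec<AscDesc>>,
    distinct: &Option<String>,
    geo_strategy: geo_sort::Strategy,
    from: usize,
    length: usize,
    words_limit: Option<usize>,
    placeholder_search_logger: &mut dyn SearchLogger<PlaceholderQuery>,
    query_graph_logger: &mut dyn SearchLogger<QueryGraph>,
    time_budget: TimeBudget,
    ranking_score_threshold: Option<f64>,
    locales: Option<&Vec<Language>>,
) -> Result<PartialSearchResult> /* return type assumed */ {
    todo!()
}
```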

Hints:

  • To set the universe to all documents, use index.documents_ids(rtxn). If filtering is needed, use filtered_universe instead (see the sketch after this list).

  • logging:

    • DefaultSearchLogger

    • placeholder_search_logger?

    • query_graph_logger?
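
A minimal sketch of the universe hint above, assuming an open read transaction rtxn and a hypothetical filters local; filtered_universe's exact signature is not in these notes, so its arguments are elided.

```rust
// Sketch: build the universe of candidate document ids.
let universe: RoaringBitmap = if filters.is_some() {
    // restrict the candidates to documents matching the filter
    filtered_universe(/* index or ctx, &rtxn, &filters, ... */)? // arguments assumed
} else {
    // all internal document ids of the index
    index.documents_ids(&rtxn)?
};
```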

struct SearchContext#
  • index: &Index

  • txn: &RoTxn

  • db_cache: DatabaseCache

  • word_interner: DedupInterner<String>

  • phrase_interner: DedupInterner<Phrase>

  • term_interner: Interner<QueryTerm>

  • phrase_docids: PhraseDocIdsCache

  • restricted_fids: Option<RestrictedFids>
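
Putting the fields above together, the struct presumably looks roughly like this. Lifetimes and visibility are assumptions; any lifetime parameters on the field types are omitted.

```rust
// Sketch reconstructed from the field list above; lifetimes are assumptions.
pub struct SearchContext<'ctx> {
    pub index: &'ctx Index,
    pub txn: &'ctx RoTxn<'ctx>,
    pub db_cache: DatabaseCache,
    pub word_interner: DedupInterner<String>,
    pub phrase_interner: DedupInterner<Phrase>,
    pub term_interner: Interner<QueryTerm>,
    pub phrase_docids: PhraseDocIdsCache,
    pub restricted_fids: Option<RestrictedFids>,
}
```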

trait SearchLogger#

Generic over Q: RankingRuleQueryTrait.

  • initial_query: Logs the initial query

  • initial_universe: Logs the value of the initial set of all candidates

  • query_for_initial_universe: Logs the query that was used to compute the set of all candidates

  • ranking_rules: Logs the ranking rules used to perform the search query

  • start_iteration_ranking_rule: Logs the start of a ranking rule’s iteration

  • next_bucket_ranking_rule: Logs the end of the computation of a ranking rule bucket

  • skip_bucket_ranking_rule: Logs the skipping of a ranking rule bucket

  • end_iteration_ranking_rule: Logs the end of a ranking rule’s iteration

  • add_to_results: Logs the addition of document ids to the final results.

  • log_internal_state: Logs an internal state in the search algorithm

The empty struct DefaultSearchLogger implements this trait with empty (no-op) implementations.
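
A minimal sketch of that no-op logger; only the two method names shown come from the list above, and their parameter types are assumptions.

```rust
// Sketch: the no-op logger simply ignores everything it is given.
pub struct DefaultSearchLogger;

impl<Q: RankingRuleQueryTrait> SearchLogger<Q> for DefaultSearchLogger {
    fn initial_query(&mut self, _query: &Q) {}                   // parameter type assumed
    fn initial_universe(&mut self, _universe: &RoaringBitmap) {} // parameter type assumed
    // ... every other method of the trait gets an empty body as well
}
```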

RankingRuleQueryTrait is a closed trait, implemented only by two specific structs (sketches of the QueryGraph types and the tokenization steps follow this list):

  • PlaceholderQuery

  • QueryGraph

    • QueryNodeData enum

      • Term(LocatedQueryTermSubset) is a regular node representing a word or combination of words.

      • Deleted represents a node that was deleted.

      • Start: unique node representing the start of the query.

      • End: unique node representing the end of the query.

    • QueryNode

      • data: QueryNodeData

      • predecessors: SmallBitmap<QueryNode>

      • successors: SmallBitmap<QueryNode>

    • QueryGraph

      • root_node: Interned<QueryNode>: The index of the start node within self.nodes.

      • end_node: Interned<QueryNode>: The index of the end node within self.nodes.

      • nodes: FixedSizeInterner<QueryNode>: The list of all query nodes.

  • QueryGraph::from_query(ctx: &SearchContext, terms: &[LocatedQueryTerm])

    • To create terms, you first have to (see the sketch after these steps):

      1. tokbuilder = TokenizerBuilder::new()

      2. configure tokbuilder:

        • add the stop words

        • add separators

        • set up the words dict

      3. add locales with allow_list

      4. create the tokenizer

      5. invoke tokenizer.tokenize(query)

      6. use located_query_terms_from_tokens(ctx, tokens, words_limit)
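
The QueryGraph-related types above, gathered into one sketch. Field names and types come from the bullets; visibility, derives, and the doc comments are assumptions.

```rust
// Sketch reconstructed from the bullets above.
pub enum QueryNodeData {
    /// Regular node: a word or a combination of words.
    Term(LocatedQueryTermSubset),
    /// A node that was deleted from the graph.
    Deleted,
    /// Unique node representing the start of the query.
    Start,
    /// Unique node representing the end of the query.
    End,
}

pub struct QueryNode {
    pub data: QueryNodeData,
    pub predecessors: SmallBitmap<QueryNode>,
    pub successors: SmallBitmap<QueryNode>,
}

pub struct QueryGraph {
    /// Index of the start node within `nodes`.
    pub root_node: Interned<QueryNode>,
    /// Index of the end node within `nodes`.
    pub end_node: Interned<QueryNode>,
    /// All query nodes.
    pub nodes: FixedSizeInterner<QueryNode>,
}
```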
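
And a sketch of the tokenization steps themselves. Here stop_words, separators, dictionary, query, and words_limit are hypothetical locals (assumed to be fetched from the index settings and the search request beforehand); the exact charabia builder signatures and the return shape of QueryGraph::from_query are also assumptions.

```rust
// Sketch of steps 1-6 above plus the final QueryGraph::from_query call.
let mut tokbuilder = TokenizerBuilder::new(); // 1.
tokbuilder.stop_words(&stop_words);           // 2. stop words
tokbuilder.separators(&separators);           //    separators
tokbuilder.words_dict(&dictionary);           //    words dictionary
if let Some(locales) = locales {
    tokbuilder.allow_list(locales);           // 3. restrict language detection
}
let tokenizer = tokbuilder.build();           // 4.

let tokens = tokenizer.tokenize(query);       // 5.
let terms = located_query_terms_from_tokens(ctx, tokens, words_limit)?; // 6.

// Finally, build the query graph from the located terms.
let (graph, _) = QueryGraph::from_query(ctx, &terms)?; // return shape assumed
```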

What is a ranking rule?

What is Criterion? Its variants (a sketch follows this list):

  • Words

  • Typo

  • Attribute

  • Proximity

  • Exactness

  • Sort

  • Asc

  • Desc
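
A sketch of the Criterion enum from the variants listed above; whether Asc and Desc carry a payload (a field name, here a String) is an assumption.

```rust
// Sketch: the set of criteria listed above.
pub enum Criterion {
    Words,
    Typo,
    Attribute,
    Proximity,
    Exactness,
    Sort,
    Asc(String),  // ascending sort on a field (payload assumed)
    Desc(String), // descending sort on a field (payload assumed)
}
```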

get_ranking_rules_for_query_graph_search: Returns the list of initialized ranking rules to be used for a query graph search. Its parameters:

  • ctx: &SearchContext

  • sort_criteria: Option<Vec<AscDesc>>

  • geo_strategy: geo_sort::Strategy

  • terms_matching_strategy: TermsMatchingStrategy

It returns a list (Vec) of BoxRankingRule<QueryGraph>.
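
A signature sketch reconstructed from the parameters and return value above; the Result wrapper and the missing lifetimes are assumptions.

```rust
// Sketch only: parameters mirror the list above.
pub fn get_ranking_rules_for_query_graph_search(
    ctx: &SearchContext,
    sort_criteria: Option<Vec<AscDesc>>,
    geo_strategy: geo_sort::Strategy,
    terms_matching_strategy: TermsMatchingStrategy,
) -> Result<Vec<BoxRankingRule<QueryGraph>>> /* Result wrapper assumed */ {
    todo!()
}
```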

impl Index#
fn documents_ids#

Returns the set of internal document ids as a RoaringBitmap.

From Query API to Milli#

The endpoint to search a specific index is POST /indexes/<index_uid>/search (see the Meilisearch docs).

The files that define the routes:

  • meilisearch/src/routes/mod.rs: All the routes

  • meilisearch/src/routes/indexes/mod.rs: All the index routes

  • meilisearch/src/routes/indexes/search.rs: the configure function does the route configuration.

  • Two handler functions, one for GET and one for POST (see the sketch after this list):

    • search_with_url_query

    • search_with_post
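
A sketch of what that configure function presumably looks like with actix-web; only the two handler names come from these notes, and the resource path, guards, and extractors are assumptions.

```rust
use actix_web::web;

// Sketch: register the GET and POST handlers on the search resource.
pub fn configure(cfg: &mut web::ServiceConfig) {
    cfg.service(
        web::resource("") // actual path/prefix assumed to come from the parent scope
            .route(web::get().to(search_with_url_query))
            .route(web::post().to(search_with_post)),
    );
}
```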

The POST endpoint deserializes into the SearchQuery type; see below.

What is an index scheduler? It lives in a separate crate.

What is a SearchAggregator? It seems related to analytics.

The important function is perform_search(&index, query, search_kind, retrieve_vectors, index_scheduler.features()).

Both SearchQuery and perform_search live in the meilisearch/src/search/mod.rs module.

Inside perform_search (a flow sketch follows this list):

  • prepare_search: this function returns an interesting type, milli::Search.

  • search_from_kind(search_kind, search): at this level it simply dispatches to search.execute or search.execute_hybrid depending on search_kind.

  • make_hits: I suspect this function takes the document ids, builds the hits, and highlights the matching parts of the document fields.
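
A very rough flow sketch of perform_search based on the three steps above; only the function names and their roles come from these notes, so all argument lists are elided or hypothetical.

```rust
// Flow sketch only; not the real signatures.
let rtxn = index.read_txn()?;

// 1. Build a milli::Search from the HTTP-level SearchQuery.
let search = prepare_search(/* &index, &rtxn, &query, &search_kind, ... */);

// 2. Dispatch to search.execute() or search.execute_hybrid() depending on search_kind.
let search_result = search_from_kind(search_kind, search);

// 3. Turn the returned document ids into hits, highlighting the matching
//    parts of the document fields.
let hits = make_hits(/* &index, &rtxn, &query, search_result, ... */);
```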

So the important piece above seems to be milli::Search. Both the struct and its impl are defined in milli/src/search/mod.rs.

Let’s focus on execute(...):

  • create a search context.

  • filter the universe.

  • the important call: execute_search(...)

  • located_query_terms -> matching_words

For execute_search, see above.

Transactions#

The Index impl has the following functions related to transactions:

  • read_txn(&self) and static_read_txn(&self)

  • write_txn(&self)

These functions are thin wrappers around the env: heed::Env field.
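
A small usage sketch of those wrappers, following standard heed/LMDB semantics:

```rust
// Read transaction: a consistent snapshot of the index.
let rtxn = index.read_txn()?;
let ids = index.documents_ids(&rtxn)?; // read under the snapshot
drop(rtxn);

// Write transaction: LMDB allows a single writer at a time;
// changes only become visible to readers once committed.
let mut wtxn = index.write_txn()?;
// ... perform writes through the index using &mut wtxn ...
wtxn.commit()?;
```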

Everything related to transactions, sessions, and concurrency is thus handled at the LMDB level.

A good starting point is the LMDB documentation.