Supported Indexes & Metrics In Milvus

Table Of Contents:

  1. Supported Indexes
  2. Similarity Metrics

(1) Supported Indexes

  • Indexes are an organizational unit of data. You must declare the index type and similarity metric before you can search or query inserted entities.
  • If you do not specify an index type, Milvus performs a brute-force search by default.

Index Types:

  1. HNSW: HNSW is a graph-based index and is best suited for scenarios that have a high demand for search efficiency. There is also a GPU version GPU_CAGRA, thanks to Nvidia’s contribution.
  2. FLAT: FLAT is best suited for scenarios that seek perfectly accurate and exact search results on a small, million-scale dataset. There is also a GPU version GPU_BRUTE_FORCE.
  3. IVF_FLAT: IVF_FLAT is a quantization-based index and is best suited for scenarios that seek an ideal balance between accuracy and query speed. There is also a GPU version GPU_IVF_FLAT.
  4. IVF_SQ8: IVF_SQ8 is a quantization-based index and is best suited for scenarios that seek a significant reduction in disk, CPU, and GPU memory consumption as these resources are very limited.
  5. IVF_PQ: IVF_PQ is a quantization-based index and is best suited for scenarios that seek high query speed even at the cost of accuracy. There is also a GPU version GPU_IVF_PQ.
  6. SCANN: SCANN is similar to IVF_PQ in terms of vector clustering and product quantization. They differ in the implementation details of product quantization and in SCANN's use of SIMD (Single Instruction, Multiple Data) for efficient calculation.
  7. DiskANN: Based on Vamana graphs, DiskANN powers efficient searches within large datasets.
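As an illustrative sketch, an index of one of these types is declared as a set of parameters that is then attached to a vector field before searching. The field name "embedding", the collection name, and the parameter values below are hypothetical, not taken from any particular deployment:

```python
# Hypothetical HNSW index declaration for a vector field named "embedding".
# M and efConstruction are HNSW build parameters: M bounds the number of
# graph edges per node, efConstruction sets the candidate list size at
# build time (larger values trade build speed for recall).
index_params = {
    "index_type": "HNSW",    # one of the index types listed above
    "metric_type": "L2",     # similarity metric (see the next section)
    "params": {"M": 16, "efConstruction": 200},
}

# With a running Milvus instance, this would be applied via pymilvus, e.g.:
# from pymilvus import Collection
# collection = Collection("my_collection")
# collection.create_index(field_name="embedding", index_params=index_params)
```

Switching to another index type from the list is a matter of changing `index_type` and its build parameters (for example, IVF-family indexes take an `nlist` cluster count instead of `M`/`efConstruction`).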

(2) Similarity Metrics

  • In Milvus, similarity metrics are used to measure similarities among vectors.

  • Choosing a good distance metric helps improve classification and clustering performance significantly. Pick the similarity metric that matches the form of your input data for optimal performance.

  • The metrics that are widely used for floating point embeddings include:

  • Cosine: This metric is normalized IP, generally used in text similarity search (NLP).
  • Euclidean distance (L2): This metric is generally used in the field of computer vision (CV).
  • Inner product (IP): This metric is generally used in the field of natural language processing (NLP).

  • The metrics that are widely used for binary embeddings include:

  • Hamming: This metric is generally used in the field of natural language processing (NLP).
  • Jaccard: This metric is generally used in the field of molecular similarity search.
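To make these metrics concrete, here is a small self-contained sketch in plain Python (with illustrative vectors, not real embeddings) showing that cosine similarity is indeed inner product on normalized vectors, and how Hamming and Jaccard distances are computed over binary vectors:

```python
import math

def inner_product(a, b):
    # IP: sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine: IP divided by the product of the vector norms
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def normalize(v):
    # Scale a vector to unit length
    n = math.sqrt(inner_product(v, v))
    return [x / n for x in v]

def hamming(x, y):
    # Hamming: number of bit positions that differ
    return sum(xi != yi for xi, yi in zip(x, y))

def jaccard_distance(x, y):
    # Jaccard distance: 1 - |intersection| / |union| of the set bits
    inter = sum(1 for xi, yi in zip(x, y) if xi and yi)
    union = sum(1 for xi, yi in zip(x, y) if xi or yi)
    return 1.0 - inter / union

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
# Cosine equals IP once both vectors are normalized to unit length,
# which is why cosine is described as "normalized IP" above.
assert abs(cosine(a, b) - inner_product(normalize(a), normalize(b))) < 1e-9

x, y = [1, 0, 1, 1], [1, 1, 0, 1]
assert hamming(x, y) == 2            # bits differ at positions 1 and 2
assert jaccard_distance(x, y) == 0.5 # 2 shared set bits out of 4 total
```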
