Finesse Benchmark Observatory

Welcome to the Finesse Benchmark Observatory

This observatory visualizes the performance of various models across different context windows and hardware environments. Here's how to interpret the key metrics:

RSS (Robust Separation Score)

The RSS score quantifies a model's ability to maintain the semantic integrity of merged data. A higher RSS score indicates that the model excels at combining chunks of data into a longer sequence without losing the original meanings, effectively separating 'signal' from 'noise'. For a deeper understanding, refer to the RSS whitepaper.
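The exact definition lives in the RSS whitepaper; purely as an illustration of the idea of "separation", the sketch below scores a merged embedding by the margin between its average similarity to the chunks it should represent (signal) and to unrelated chunks (noise). The cosine helper and the margin formulation are assumptions for illustration, not the official RSS formula.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def separation_margin(merged: np.ndarray,
                      signal_vecs: list[np.ndarray],
                      noise_vecs: list[np.ndarray]) -> float:
    """Illustrative separation-style score (NOT the official RSS definition):
    higher means the merged vector stays closer to its source (signal) chunks
    than to unrelated (noise) chunks."""
    signal_sim = np.mean([cosine(merged, v) for v in signal_vecs])
    noise_sim = np.mean([cosine(merged, v) for v in noise_vecs])
    return signal_sim - noise_sim
```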

Latency (Time in milliseconds)

Latency measures the processing time, categorized by two distinct scenarios to reflect real-world usage patterns:

  • 'Full Time' (total_latency - Cold Start Scenario): This captures the complete end-to-end processing time, including both the initial embedding of raw text chunks (e.g., breaking down long contexts into smaller pieces and converting them to vectors) and the subsequent merging operation to combine them into a unified representation. It represents a cold start where no prior caching or pre-processing is assumed—ideal for evaluating the full computational cost from scratch.

  • 'Merging Only' (synthesis_latency - Warm Start Scenario): This focuses solely on the time required for the merging operation when the text chunks are already pre-embedded as vectors (e.g., from a cache or prior computation). It isolates the model's synthesis efficiency in combining existing embeddings, which is especially critical for sequence-merger models such as enzoescipy/sequence-merger-malgeum or enzoescipy/sequence-merger-sarang, where repeated merging on cached data is common in production.

Key Insight: Use 'Full Time' to assess overall resource demands in worst-case scenarios, and 'Merging Only' to benchmark the pure merging speed in optimized, cached workflows. This distinction helps compare models fairly across different deployment contexts.
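For concreteness, the two scenarios can be timed roughly as sketched below. The `embed_chunks` and `merge_embeddings` callables are hypothetical stand-ins for whatever the benchmark harness actually invokes, not the real finesse-benchmark API.

```python
import time

def measure_latencies(chunks, embed_chunks, merge_embeddings):
    """Return (total_latency_ms, synthesis_latency_ms) for one synthesis step."""
    # Cold start ('Full Time' / total_latency): embed raw text chunks, then merge.
    t0 = time.perf_counter()
    vectors = embed_chunks(chunks)       # raw text chunks -> per-chunk vectors
    merged = merge_embeddings(vectors)   # vectors -> single merged representation
    total_latency_ms = (time.perf_counter() - t0) * 1000

    # Warm start ('Merging Only' / synthesis_latency): vectors already cached,
    # so only the merging operation is timed.
    t1 = time.perf_counter()
    merged = merge_embeddings(vectors)
    synthesis_latency_ms = (time.perf_counter() - t1) * 1000

    return total_latency_ms, synthesis_latency_ms
```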

Note on Native Long-Context Embedders:

For models that natively handle long contexts (e.g., nomic-embed-text-v1.5), the concept of a separate 'Merging Only' step does not exist, as they process the entire text at once. In this benchmark, to maintain a consistent comparison framework, their Merging Only latency is reported as being identical to their Full Time latency, reflecting the end-to-end processing time for each synthesis step.
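When aggregating results, this convention amounts to a one-line normalization. The field names below (`total_latency`, `synthesis_latency`, `is_native_long_context`) are illustrative, not the actual result schema.

```python
def normalize_record(record: dict) -> dict:
    """Apply the reporting convention for native long-context embedders."""
    # No separate merge step exists, so 'Merging Only' is reported as 'Full Time'.
    if record.get("is_native_long_context"):
        record["synthesis_latency"] = record["total_latency"]
    return record
```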


Explore the trajectories to find models that balance high RSS with low latency for your specific needs.

Submission

To submit your benchmark to this leaderboard, send the .pt and .json files generated by the Python finesse-benchmark package. Please use this Colab Notebook.

Submission email: enzoescipy@gmail.com

Source

Submitted .json and .pt files are stored in a separate database: enzoescipy/finesse-benchmark-results. All data shown in this HF Space is loaded from that database.
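As a minimal sketch of how such results can be read back, assuming enzoescipy/finesse-benchmark-results is a Hugging Face dataset repository and using the standard huggingface_hub client (the Space's actual loading code may differ):

```python
import json
from huggingface_hub import list_repo_files, hf_hub_download

REPO_ID = "enzoescipy/finesse-benchmark-results"

# List every file in the dataset repo, then download and parse the JSON results.
files = list_repo_files(REPO_ID, repo_type="dataset")
results = []
for name in files:
    if name.endswith(".json"):
        path = hf_hub_download(repo_id=REPO_ID, filename=name, repo_type="dataset")
        with open(path, encoding="utf-8") as f:
            results.append(json.load(f))

print(f"Loaded {len(results)} benchmark submissions")
```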
