Finesse Benchmark Observatory
This observatory visualizes the performance of various models across different context windows and hardware environments. Here's how to interpret the key metrics:
RSS (Robust Separation Score)
The RSS score quantifies a model's ability to maintain the semantic integrity of merged data. A higher RSS score indicates that the model excels at combining chunks of data into a longer sequence without losing the original meanings, effectively separating 'signal' from 'noise'. For a deeper understanding, refer to the RSS whitepaper.
Latency (Time in milliseconds)
Latency measures the processing time, categorized by two distinct scenarios to reflect real-world usage patterns:
- 'Full Time' (`total_latency`, Cold Start Scenario): This captures the complete end-to-end processing time, including both the initial embedding of raw text chunks (e.g., breaking down long contexts into smaller pieces and converting them to vectors) and the subsequent merging operation that combines them into a unified representation. It represents a cold start where no prior caching or pre-processing is assumed, making it ideal for evaluating the full computational cost from scratch.
- 'Merging Only' (`synthesis_latency`, Warm Start Scenario): This focuses solely on the time required for the merging operation when the text chunks are already pre-embedded as vectors (e.g., from a cache or prior computation). It isolates the model's synthesis efficiency in combining existing embeddings, which is especially critical for sequence-merger models like enzoescipy/sequence-merger-malgeum or enzoescipy/sequence-merger-sarang, where repeated merging on cached data is common in production.
Key Insight: Use 'Full Time' to assess overall resource demands in worst-case scenarios, and 'Merging Only' to benchmark the pure merging speed in optimized, cached workflows. This distinction helps compare models fairly across different deployment contexts.
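As a rough illustration of the two scenarios, the measurement could be sketched as below. The `embedder` and `merger` callables are hypothetical placeholders, not the finesse-benchmark package's actual API.

```python
import time

def measure_latencies(chunks, embedder, merger):
    """Illustrative sketch of the two latency scenarios described above.
    `embedder` and `merger` are hypothetical callables supplied by the caller."""
    # Cold start: embed the raw chunks, then merge them.
    start = time.perf_counter()
    vectors = [embedder(chunk) for chunk in chunks]       # embedding step
    merged = merger(vectors)                              # merging step
    total_latency = (time.perf_counter() - start) * 1000  # 'Full Time', in ms

    # Warm start: the vectors are already cached, so only the merge is timed.
    start = time.perf_counter()
    merged = merger(vectors)
    synthesis_latency = (time.perf_counter() - start) * 1000  # 'Merging Only', in ms

    return total_latency, synthesis_latency
```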
Note on Native Long-Context Embedders:
For models that natively handle long contexts (e.g., nomic-embed-text-v1.5), the concept of a separate 'Merging Only' step does not exist, as they process the entire text at once. In this benchmark, to maintain a consistent comparison framework, their 'Merging Only' latency is reported as identical to their 'Full Time' latency, reflecting the end-to-end processing time for each synthesis step.
Explore the trajectories to find models that balance high RSS with low latency for your specific needs.
Submission
To submit your benchmark to this leaderboard, send the .pt and .json files generated by the Python finesse-benchmark package.
Please use this Colab Notebook.
Submission email: enzoescipy@gmail.com
Source
Submitted .json and .pt files are stored in a separate database: enzoescipy/finesse-benchmark-results. All data in this HF Space is loaded from that database.
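For reference, here is a minimal sketch of how those results could be pulled down locally, assuming the results repository is hosted as a Hugging Face dataset repo; the file layout inside it is whatever submitters provided.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Pull the submitted result files locally. Treating the results repo as a
# dataset repo is an assumption of this sketch.
local_dir = snapshot_download(
    repo_id="enzoescipy/finesse-benchmark-results",
    repo_type="dataset",
)

# Load every submitted .json result file found in the snapshot.
results = [json.loads(p.read_text()) for p in Path(local_dir).glob("**/*.json")]
print(f"Loaded {len(results)} benchmark submissions")
```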
Performance Summary
Prologue: Why did we embark on this journey? - Coffee and Memory
Welcome, everyone. First, I owe you an explanation of the problem I set out to solve. It's thoroughly detailed in my blog article. But it would be a shame to just leave it at that.
The Core Challenge: Memory in Massive Time-Series Data
The problem I wanted to tackle lies within Retrieval-Augmented Generation (RAG) for massive volumes of time-series text data. In simple terms, think of it as imbuing AI with memory using RAG.
The Coffee & Code Story
Imagine analyzing the chat logs of a developer who spent many nights coding over coffee. When bringing up these logs via RAG, a keyword like "coffee" cannot retrieve "the passion of development spent burning the midnight oil."
What's the first hurdle you'd encounter? Typical RAG simply extracts the 1st, 2nd, and 3rd most relevant documents and feeds them to the AI. However, approaching memory from a RAG perspective completely changes the scenario.
The Temporal Connection Problem
There are instances where contextually unrelated events, occurring at similar times, must be conveyed together to make sense. To force this connection, you'd have to dynamically embed adjacent logs using a sliding window and run RAG again, just to piece together the all-nighter next to the coffee.
It's convoluted. Truly.
"How on earth can we avoid the catastrophe of '8192 token sliding windows 100 times per conversation'?"
The Solution: Embedding Embeddings
I found the answer in an "embedder that embeds embeddings." I simply call it a sequence-merger.
What if we could extract meaning again from already embedded numerical vector clusters, and quickly combine them into a single new embedding vector? I believed the situation would change.
The details and technical breakthroughs are captured in the blog post linked above.
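To make the idea concrete before moving on, here is a toy sketch of what a sequence-merger looks like from the outside: a stack of embeddings in, one merged embedding out. The attention-plus-pooling body below is purely illustrative and is not the architecture of the actual enzoescipy sequence-merger models.

```python
import torch

class ToySequenceMerger(torch.nn.Module):
    """Illustrative stand-in for a sequence-merger: it takes pre-computed
    embeddings and returns a single merged embedding. The real models are
    learned; this toy only demonstrates the input/output shapes."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (num_chunks, dim) -> merged: (dim,)
        x = embeddings.unsqueeze(0)          # add a batch dimension
        mixed, _ = self.attn(x, x, x)        # let the chunks attend to each other
        return mixed.mean(dim=1).squeeze(0)  # pool into one vector

merger = ToySequenceMerger(dim=768)
chunk_vectors = torch.randn(5, 768)          # five pre-embedded chunks
merged_vector = merger(chunk_vectors)        # shape: (768,)
```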
"So, how do we evaluate the performance of this
sequence-merger?"
This is our topic, and this is where our journey truly begins.
The Challenge - What if you have only a 50-character memory?
First, what I want to do is create new embeddings from existing embeddings. So, naturally, we need to start by embedding some text.
The Input Interface: Constrained but Purposeful
Below, I've prepared a section where you can easily embed your desired sentences. When you try it, you'll notice a deliberate 50-character limit.
This isn't arbitrary; it simulates the current reality where most embedders usable on embedded devices are limited to roughly 512 tokens, approximately 2,000 characters.
The Realistic Scenario: Limited Memory, Ambitious Goals
"My primary embedding engine can only embed 50 characters. But I need to combine the meaning within these 5 embeddings to create a new, singular embedding."
That's the challenge.
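As a rough sketch of that setup, here is how five short probes could be embedded with an off-the-shelf small embedder. The model name below is an arbitrary stand-in, not the benchmark's primary embedding engine.

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Any small embedder works as a stand-in for the "50-character memory".
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

probes = [
    "Drank a third coffee at 2am.",
    "Fixed the off-by-one bug in the parser.",
    "Tests finally pass.",
    "Pushed the hotfix to production.",
    "Fell asleep at the desk.",
]
assert all(len(p) <= 50 for p in probes)  # respect the 50-character limit

embeddings = model.encode(probes)  # shape: (5, embedding_dim)
```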
Guidance: First Step into the Memory Lab
Ready to Begin?
Once you've entered your probes, feel free to click the Analyze button, and the embeddings will be generated instantly.
Visualizing the Magic
I've prepared a simple PCA-reduced graph for you to comfortably view the results.
Enjoy the visualization! See how your words transform into clusters of meaning in the vector space.
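For reference, a PCA reduction of this kind can be reproduced offline in a few lines, reusing the `embeddings` from the previous sketch. The plotting details are illustrative, not the Space's actual chart code.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce the high-dimensional embeddings to 2D for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)

labels = ["A", "B", "C", "D", "E"]
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y))
plt.title("Probe embeddings, PCA-reduced to 2D")
plt.show()
```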
The Heart of the Matter: Measuring Merger Performance - The A, AB, ABC… Trajectory
With the coordinates etched for the 5 texts, their embeddings are complete. Now to the main point:
"So, you claim to have built an encoder that combines embeddings; how do you measure its performance?"
The Practical Approach: Sequential Synthesis
I first considered what the most practical thing to do would be. The moment I built the sequence-merger, I thought:
Let's try combining them sequentially. If we have texts A, B, C, D, E, I wanted to see how the vector changes as we merge them one by one: A → AB → ABC → ABCD → ABCDE.
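As a rough illustration, that sequential synthesis could look like the sketch below, reusing the toy merger class and the `embeddings` from the earlier sketches; the actual finesse-benchmark pipeline is not shown here.

```python
import torch

# Sequential synthesis: A, AB, ABC, ABCD, ABCDE.
probe_vectors = torch.as_tensor(embeddings, dtype=torch.float32)
merger = ToySequenceMerger(dim=probe_vectors.shape[1])  # toy merger from the earlier sketch

trajectory = []
with torch.no_grad():
    for k in range(1, len(probe_vectors) + 1):
        trajectory.append(merger(probe_vectors[:k]))  # merge the first k probes

# trajectory[0] is A, trajectory[1] is AB, ..., trajectory[4] is ABCDE.
```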
Witness the Process in Action
So here it is! Below, I've prepared the Synthesis button.
When you click this button, my sequence-merger will activate and perform the synthesis in the order A → AB → ABC → ABCD → ABCDE.
At the same time, you'll be able to visually observe how the coordinates of the combined vector change during this A → AB → ... process.
Please, click the button.
Observe the trajectory of memory formation unfold before your eyes.
Validating the Synthesis: Beyond Visual Movement (The Cosine Similarity Heatmap)
Now, you've visually confirmed that it's possible to create synthetic embeddings using only the embeddings of five sentences, without recourse to the original texts!
Does Mobility Tell the Whole Story?
Did the synthetic vectors in the A → AB → ... process move well towards the average coordinates of the five original vectors?
However, judging the quality of synthesis results solely by this "mobility" isn't straightforward. Therefore, I adopted a concept widely used in RAG: cosine similarity.
Let's think about this. Naturally, AB, synthesized from A and B, should be similar to A and B. The AB embedding must inherit the meaning contained in both A and B.
Extending this, the ABCDE synthetic vector should be similar to A, B, C, D, and E.
Building the Heatmap
With this intuition, we can create a heatmap. Place the original embeddings A, B, C, D, E on the x-axis. On the y-axis, stack the merged embeddings: A, AB, ABC, ABCD, ABCDE.
At each intersection, for example, where C meets AB, we measure the cosine similarity between C and AB (these should have low similarity). And we record that value in the cell.
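Here is a minimal sketch of how that heatmap matrix could be assembled from the probe vectors and the synthesis trajectory of the previous sketches; the benchmark package itself may build it differently.

```python
import numpy as np
import torch.nn.functional as F

labels_x = ["A", "B", "C", "D", "E"]            # original probes (columns)
labels_y = ["A", "AB", "ABC", "ABCD", "ABCDE"]  # merged results (rows)

# heatmap[i, j] = cosine similarity between merged vector i and probe j
heatmap = np.zeros((len(trajectory), len(probe_vectors)))
for i, merged in enumerate(trajectory):
    for j, probe in enumerate(probe_vectors):
        heatmap[i, j] = F.cosine_similarity(merged, probe, dim=0).item()
```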
Can you imagine what shape this might take? A pattern that reveals the hidden truths of semantic merging.
Interactive Analysis: Unveiling the Staircase: The Pattern Emerges
Indeed! Typically, a heatmap with a staircase pattern emerges, trending upwards to the right.
I focused precisely on this staircase. Why a staircase shape?
Two Fundamental Principles
There are two primary reasons:
Reflection of Ingredients: As mentioned earlier, the synthetic result should reflect its ingredients. A, AB, ABC,... each have high cosine similarity with their respective ingredients, while having low cosine similarity with what they are not composed of.
Discrimination by Ingredients: However, there's another reason: ingredients must distinguish the composite result. For example, C should not resemble AB. But it should resemble ABC.
Reading the Heatmap: Two Perspectives
The first reason is reflected when reading the heatmap from left to right (Top-Down). The second reason is reflected when reading the heatmap from bottom to top (Bottom-Up).
Naming the Insights
I named the score reflecting the first reason the Top-Down (TD) score, and the score reflecting the second reason the Bottom-Up (BU) score.
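To make the two reading directions concrete, here is a minimal sketch that slices the heatmap from the earlier sketch both ways. The lower-triangular "ingredient" mask is an assumption that follows from the A, AB, ABC, ... ordering.

```python
import numpy as np

# Merged result i (row: A, AB, ABC, ...) contains probe j (column: A..E)
# exactly when j <= i, hence a lower-triangular boolean mask.
ingredient = np.tril(np.ones((5, 5), dtype=bool))

# Top-Down: read each merged result (row); its ingredient cells should score
# high and the remaining cells low.
row_abc = heatmap[2]
td_high, td_low = row_abc[ingredient[2]], row_abc[~ingredient[2]]

# Bottom-Up: read each probe (column); the merged results that contain it
# should score high and the ones that don't should score low.
col_c = heatmap[:, 2]
bu_high, bu_low = col_c[ingredient[:, 2]], col_c[~ingredient[:, 2]]
```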
You can directly experience the process by which these two scores are calculated. Interact with the controls to peel back the layers of this semantic tapestry.
Interactive Memory Analysis
What defines a 'good' memory? - The RSS Score
Move the slider and switch between TD and BU modes.
Doesn't it become clear what we are trying to quantify? Yes, to quantify this "distinctiveness," I followed these steps:
Step-by-Step Breakdown
Differentiate T1 and T2 Regions: T1 (High Similarity) and T2 (Low Similarity) regions are defined. For example, when examining the row for the ABC synthetic result, the cells for probes A, B, and C fall into T1, while the cells for D and E fall into T2.
Quantile of T1 Region: After sorting the cosine similarities in the T1 region, I take the 1st quartile (Q1), the value below which the lowest 25 percent of the T1 similarities lie.
Quantile of T2 Region: Similarly, for the T2 region, I take the 3rd quartile (Q3), the value above which the highest 25 percent of the T2 similarities lie.
Calculate Raw Separation Score: I then compute the value (Q1 of T1 - Q3 of T2), as sketched below.
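Here is a minimal sketch of that raw separation score, applied to the row and column slices from the previous sketch; the exact quartile convention used by the benchmark package is my assumption.

```python
import numpy as np

def raw_separation(t1_values, t2_values):
    """Raw separation score for one row (TD) or one column (BU):
    Q1 of the high-similarity region minus Q3 of the low-similarity region."""
    if len(t2_values) == 0:                  # e.g. the ABCDE row has no T2 cells
        return None
    q1_t1 = np.percentile(t1_values, 25)     # 1st quartile of T1
    q3_t2 = np.percentile(t2_values, 75)     # 3rd quartile of T2
    return q1_t1 - q3_t2

# Example: the TD split of the ABC row and the BU split of the C column.
print(raw_separation(td_high, td_low))
print(raw_separation(bu_high, bu_low))
```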
The Core Intention
The intention is clear: The cosine similarity of targets that should be retrieved by RAG (T1 region) must be distinctly higher than the cosine similarity of targets that should not be retrieved by RAG (T2 region).
Therefore, even the low 1st quartile scores in the T1 region should be significantly higher than the high 3rd quartile scores in the T2 region, to ensure proper retrieval in RAG.
This is the foundation of a truly robust memory system.
Final Score Report
Final RSS Score: The Robust Separation Score
Ultimately, by iterating through each cell, the average Top-Down score and the average Bottom-Up score are used to create the final RSS score.
The Mathematical Heart
The final formula is as follows:
rss = [(avg(TD) + avg(BU)) / 2 - abs(avg(TD) - avg(BU))]
The intention is that while both scores should be high for a good result, a penalty is applied if there's a skew towards one side.
Theoretically, this yields a score range of -2 to +2, which isn't very aesthetically pleasing. So the final score is multiplied by 500:
RSS = rss * 500
This is the single RSS score that appears on our benchmark.
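Putting the pieces together, a sketch of the full RSS computation over the toy heatmap might look like this, reusing `heatmap`, `ingredient`, and `raw_separation` from the earlier sketches. How the per-row and per-column scores are aggregated here is my assumption, not necessarily the package's exact iteration order.

```python
import numpy as np

# TD scores over every merged-result row, BU scores over every probe column.
td_scores = [raw_separation(heatmap[i][ingredient[i]], heatmap[i][~ingredient[i]])
             for i in range(5)]
bu_scores = [raw_separation(heatmap[:, j][ingredient[:, j]], heatmap[:, j][~ingredient[:, j]])
             for j in range(5)]

avg_td = np.mean([s for s in td_scores if s is not None])
avg_bu = np.mean([s for s in bu_scores if s is not None])

rss = (avg_td + avg_bu) / 2 - abs(avg_td - avg_bu)  # penalize skew between TD and BU
RSS = rss * 500                                     # scale to the reported range
print(f"RSS = {RSS:.1f}")
```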
An Unexpected Horizon
The Giants Among Us
There are giants out there, aren't there, who can digest the raw ABCDE text directly without needing the embedding synthesis steps, and suffer no ill effects. This evaluation method can be applied to them just the same.
The Ultimate Challenge
Ultimately, this benchmark is a challenge.
Can a mini 512-token embedder + sequence-merger win against an 8192-token (or even larger) embedder giant?
That's why I created this benchmark, adding a time-efficiency measurement alongside the RSS score.
A Final Invitation
Thank you for reading this long explanation. Oh, by the way, you can redraw these heatmaps and graphs yourself using various models below.
Please, enjoy the experience! Your exploration shapes the future of AI memory.
Model Configuration
Configure the models to be analyzed. Changing settings will reset all analysis data.