Hard negatives make retrieval training data more demanding by pairing each query with a candidate that the embedding model scores as similar to the query but that is not the labeled positive answer. Sentence Transformers includes a hard-negative mining helper for turning simple anchor-positive rows into triplets that contrastive and ranking losses can use.
The mine_hard_negatives() helper embeds the anchor column and candidate corpus, ranks nearby candidates, and returns a Dataset with the requested output shape. For a first pass, a compact anchor-positive dataset and a small public embedding model are enough to confirm that the mined negative column contains near misses rather than random unrelated text.
Hard-negative mining is still a data-review step, not an automatic labeler. A candidate can be close because it is genuinely relevant, so review a sample of rows before using the output with MultipleNegativesRankingLoss, triplet losses, or a reranker training run.
$ pip install "sentence-transformers[train]"
The mining helper returns a Hugging Face Dataset, so the training extra is the safest install target for dataset preparation work.
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

dataset = Dataset.from_dict(
    {
        "query": [
            "reset a user password",
            "enable two-factor authentication",
            "export audit logs",
            "restore a deleted project",
            "rotate an API token",
            "invite a new team member",
        ],
        "answer": [
            "Open the user profile, choose Reset password, and send a recovery email.",
            "Open account security, scan the authenticator QR code, and save backup codes.",
            "Open compliance reports, choose Audit logs, and export the CSV file.",
            "Open deleted projects, select the project, and click Restore.",
            "Open API tokens, revoke the old token, and create a replacement token.",
            "Open team settings, enter the email address, and send the invitation.",
        ],
    }
)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

mined = mine_hard_negatives(
    dataset=dataset,
    model=model,
    anchor_column_name="query",
    positive_column_name="answer",
    range_min=1,
    range_max=5,
    num_negatives=1,
    sampling_strategy="top",
    output_format="triplet",
    verbose=False,
)

print(mined)

for row in mined.select(range(min(3, len(mined)))):
    print(f"query: {row['query']}")
    print(f"positive: {row['answer']}")
    print(f"negative: {row['negative']}")
    print("---")
range_min=1 skips the nearest candidate after positives are excluded, which helps avoid near-duplicate answers in a tiny sample. Use range_max and num_negatives to widen or narrow the mined candidate set for a larger training corpus.
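The effect of the rank window can be sketched without the library. The snippet below is a simplified illustration, not the code inside mine_hard_negatives(): the function name pick_negatives and the similarity scores are made up for this example, and it only mirrors the idea of dropping the positive, ranking the rest, and slicing the ranks between range_min and range_max.

```python
def pick_negatives(scores, positive_idx, range_min=0, range_max=None, num_negatives=1):
    """Return candidate indices chosen from a rank window, most similar first.

    Illustrative only: a hypothetical helper, not the library's implementation.
    """
    # Rank every candidate except the labeled positive by descending similarity.
    ranked = sorted(
        (i for i in range(len(scores)) if i != positive_idx),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Keep ranks range_min .. range_max - 1, then take the top of that window.
    window = ranked[range_min:range_max]
    return window[:num_negatives]


# Made-up similarity scores between one query and six candidate answers.
scores = [0.91, 0.30, 0.22, 0.41, 0.58, 0.35]
positive_idx = 0  # index of the labeled answer for this query

print(pick_negatives(scores, positive_idx, range_min=0, range_max=5))
# [4]  nearest non-positive candidate
print(pick_negatives(scores, positive_idx, range_min=1, range_max=5))
# [3]  nearest candidate skipped
print(pick_negatives(scores, positive_idx, range_min=1, range_max=5, num_negatives=2))
# [3, 5]
```

Raising range_min moves the window away from the closest matches, which are the most likely to be unlabeled positives. Raising num_negatives takes more candidates from the same window.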
$ python mine_hard_negatives_example.py
Dataset({
    features: ['query', 'answer', 'negative'],
    num_rows: 6
})
query: reset a user password
positive: Open the user profile, choose Reset password, and send a recovery email.
negative: Open API tokens, revoke the old token, and create a replacement token.
---
query: enable two-factor authentication
positive: Open account security, scan the authenticator QR code, and save backup codes.
negative: Open API tokens, revoke the old token, and create a replacement token.
---
query: export audit logs
positive: Open compliance reports, choose Audit logs, and export the CSV file.
negative: Open the user profile, choose Reset password, and send a recovery email.
---
Use output_format="labeled-pair" or output_format="labeled-list" instead when the next training step expects labeled pairs or grouped lists rather than triplets.
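To make the difference between the three shapes concrete, the sketch below rebuilds pair-style and list-style rows from a triplet by hand. It is an illustration of the shapes only: the helper names and the keys "text", "texts", "label", and "labels" are assumptions chosen for this example, and the columns the library actually emits may be named differently, so inspect the returned Dataset before wiring it into a loss.

```python
def to_labeled_pairs(triplets):
    """Expand each triplet into one positive pair (label 1) and one negative pair (label 0)."""
    pairs = []
    for row in triplets:
        pairs.append({"query": row["query"], "text": row["answer"], "label": 1})
        pairs.append({"query": row["query"], "text": row["negative"], "label": 0})
    return pairs


def to_labeled_lists(triplets):
    """Group all candidates for a query into one row with a parallel list of labels."""
    return [
        {
            "query": row["query"],
            "texts": [row["answer"], row["negative"]],
            "labels": [1, 0],
        }
        for row in triplets
    ]


# One triplet in the same shape as the mined output shown earlier.
triplets = [
    {
        "query": "export audit logs",
        "answer": "Open compliance reports, choose Audit logs, and export the CSV file.",
        "negative": "Open the user profile, choose Reset password, and send a recovery email.",
    }
]

pairs = to_labeled_pairs(triplets)
lists = to_labeled_lists(triplets)

print(len(pairs), [p["label"] for p in pairs])
print(lists[0]["labels"])
```

Pair-style rows suit objectives that score one query-text pair at a time, while list-style rows keep every candidate for a query together.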
Remove rows where the negative text would also satisfy the query. Those false negatives teach the model to push away valid answers.
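One way to apply that review is to record the indices of the rows judged to be false negatives and drop them before training. The sketch below assumes the review is manual and works on plain dictionaries. The helper name drop_false_negatives and the second example row are made up for illustration and do not come from the mined output above.

```python
def drop_false_negatives(rows, false_negative_ids):
    """Keep only the rows whose index was not flagged during manual review."""
    flagged = set(false_negative_ids)
    return [row for i, row in enumerate(rows) if i not in flagged]


rows = [
    {
        "query": "export audit logs",
        "answer": "Open compliance reports, choose Audit logs, and export the CSV file.",
        "negative": "Open the user profile, choose Reset password, and send a recovery email.",
    },
    {
        # Hypothetical row: this negative would also answer the query.
        "query": "reset a user password",
        "answer": "Open the user profile, choose Reset password, and send a recovery email.",
        "negative": "Open account settings, choose Change password, and confirm by email.",
    },
]

# Indices flagged by a reviewer as false negatives.
reviewed_false_negatives = [1]

clean_rows = drop_false_negatives(rows, reviewed_false_negatives)
print(len(clean_rows))  # 1
print(clean_rows[0]["query"])  # export audit logs
```

With a Hugging Face Dataset, the same idea can be expressed with Dataset.filter and with_indices=True, passing a function that returns False for the flagged indices.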