Near-duplicate text can hide inside support tickets, product titles, FAQ questions, and content snippets that use slightly different wording. Sentence Transformers paraphrase mining embeds each text item and returns the highest-scoring pairs, which helps a reviewer find likely duplicates without comparing every row by hand.
The paraphrase_mining() helper computes embeddings with a SentenceTransformer model and returns a list of [score, id1, id2] triplets sorted from highest to lowest score. The IDs are positions in the input list, so keep the original text array or a separate record ID beside each sentence when the corpus comes from a database or export file.
A first pass works best on a small corpus where duplicate themes are easy to recognize by eye. Limit top_k and max_pairs during development, then adjust query_chunk_size or corpus_chunk_size when a larger corpus needs lower memory use.
$ python -m pip install --upgrade sentence-transformers
The first model load may download files from Hugging Face before printing results.
from sentence_transformers import SentenceTransformer, util

# Small trial corpus: two near-duplicate pairs plus unrelated sentences.
sentences = [
    "Reset a customer password from the account portal.",
    "Reset a user password in the account portal.",
    "Deploy the billing service to production.",
    "Release the billing service to production.",
    "Archive the weekly database backup.",
    "Bake sourdough bread after the dough rises.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

# Each result is a [score, id1, id2] triplet; the IDs index into sentences.
pairs = util.paraphrase_mining(
    model,
    sentences,
    show_progress_bar=False,
    batch_size=16,
    top_k=3,
    max_pairs=8,
)

print("Top paraphrase pairs:")
for rank, (score, first_id, second_id) in enumerate(pairs[:5], start=1):
    print(f"{rank}. score={score:.4f}")
    print(f" {first_id}: {sentences[first_id]}")
    print(f" {second_id}: {sentences[second_id]}")

# Sanity check: the two password reset sentences should rank in the top three.
expected_pair = frozenset({0, 1})
top_pairs = {frozenset({first_id, second_id}) for _, first_id, second_id in pairs[:3]}
if expected_pair not in top_pairs:
    raise SystemExit("password reset pair missing from the top paraphrase results")
print("verification: PASS password reset duplicate pair found")
Replace the inline list with exported records when moving beyond a trial run. Keep the source row ID next to each sentence, because paraphrase_mining() returns list positions rather than external record identifiers.
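One way to keep source row IDs aligned with list positions is to build two parallel lists while reading the export. The following is a minimal sketch rather than a fixed recipe: the CSV layout, the column names ticket_id and subject, and the helper names load_records and to_record_pairs are all illustrative assumptions, and the sample triplet uses a placeholder score instead of real model output.

```python
import csv
import io

# Hypothetical export; the column names ticket_id and subject are assumptions.
EXPORT = """ticket_id,subject
T-1001,Reset a customer password from the account portal.
T-1002,Reset a user password in the account portal.
T-1003,Deploy the billing service to production.
"""


def load_records(handle):
    """Return parallel lists of record IDs and sentences from a CSV export."""
    record_ids, sentences = [], []
    for row in csv.DictReader(handle):
        text = row["subject"].strip()
        if not text:
            continue  # skip blank rows so both lists stay the same length
        record_ids.append(row["ticket_id"])
        sentences.append(text)
    return record_ids, sentences


def to_record_pairs(pairs, record_ids):
    """Translate [score, id1, id2] list positions into external record IDs."""
    return [(score, record_ids[i], record_ids[j]) for score, i, j in pairs]


record_ids, sentences = load_records(io.StringIO(EXPORT))

# Placeholder triplet in the shape paraphrase_mining() returns; the score is made up.
example_pairs = [[0.9, 0, 1]]
print(to_record_pairs(example_pairs, record_ids))
```

Passing sentences to paraphrase_mining() and the resulting pairs to to_record_pairs() would then give a reviewer record IDs instead of list positions.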
$ python paraphrase_mining.py
Top paraphrase pairs:
1. score=0.9188
 0: Reset a customer password from the account portal.
 1: Reset a user password in the account portal.
2. score=0.8725
 2: Deploy the billing service to production.
 3: Release the billing service to production.
3. score=0.3629
 0: Reset a customer password from the account portal.
 2: Deploy the billing service to production.
4. score=0.3608
 0: Reset a customer password from the account portal.
 3: Release the billing service to production.
verification: PASS password reset duplicate pair found
The lower-scoring pairs are semantic neighbors rather than guaranteed duplicates. Set a review threshold after inspecting real corpus output instead of deleting or merging records from the score alone.
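A review threshold can be applied as a simple filter over the returned triplets. The sketch below reuses the four scores from the sample run; the cutoff of 0.75 and the helper name split_by_threshold are assumptions chosen for illustration, not recommended values, and a real cutoff should come from inspecting your own corpus.

```python
def split_by_threshold(pairs, threshold):
    """Split [score, id1, id2] triplets into review candidates and the rest."""
    review, rest = [], []
    for score, first_id, second_id in pairs:
        target = review if score >= threshold else rest
        target.append((score, first_id, second_id))
    return review, rest


# Triplets copied from the sample run shown earlier.
pairs = [[0.9188, 0, 1], [0.8725, 2, 3], [0.3629, 0, 2], [0.3608, 0, 3]]

REVIEW_THRESHOLD = 0.75  # assumed starting point; tune against real output

review, rest = split_by_threshold(pairs, REVIEW_THRESHOLD)
print("send to reviewer:", review)
print("ignored as neighbors:", rest)
```

The filter only builds a review queue. It does not delete or merge anything, which keeps the final decision with a person.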
pairs = util.paraphrase_mining(
    model,
    sentences,
    query_chunk_size=1000,
    corpus_chunk_size=20000,
    top_k=5,
    max_pairs=10000,
)
Lower query_chunk_size or corpus_chunk_size when memory use is too high. Increase max_pairs only when the review queue needs more candidate pairs.
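A rough back-of-the-envelope estimate can help when choosing chunk sizes. The sketch below assumes that each query chunk is scored against each corpus chunk as one dense matrix of 4-byte (float32) similarity values; that is an assumption about the transient score block only, and it ignores the stored embeddings, the model itself, and other overhead. The helper name score_block_bytes is hypothetical.

```python
def score_block_bytes(query_chunk_size, corpus_chunk_size, bytes_per_score=4):
    """Estimate the size of one query-chunk by corpus-chunk score matrix."""
    return query_chunk_size * corpus_chunk_size * bytes_per_score


# Chunk sizes from the call above.
block = score_block_bytes(1000, 20000)
print(f"approximate score block: {block / 1e6:.0f} MB")

# Halving either chunk size halves the estimate.
smaller = score_block_bytes(500, 20000)
print(f"with query_chunk_size=500: {smaller / 1e6:.0f} MB")
```

Under these assumptions the two chunk sizes multiply, so lowering either one reduces the peak size of the score block proportionally.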
$ rm paraphrase_mining.py