Near-duplicate text can hide inside support tickets, product titles, FAQ questions, and content snippets that use slightly different wording. Sentence Transformers paraphrase mining embeds each text item and returns the highest-scoring pairs, which helps a reviewer find likely duplicates without comparing every row by hand.
The paraphrase_mining() helper computes embeddings with a SentenceTransformer model and returns a list of [score, id1, id2] triplets, sorted from the highest score to the lowest. The IDs are positions in the input list, so keep the original text list or a separate record ID beside each sentence when the corpus comes from a database or export file.
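To make the return shape concrete, the sketch below scores every pair of hand-written toy vectors by cosine similarity and returns [score, id1, id2] triplets with the best score first. It is a brute-force illustration, not the library's implementation; the helper name toy_paraphrase_pairs and the 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_paraphrase_pairs(vectors):
    """Return [score, id1, id2] triplets for every pair, best score first.

    Hypothetical helper: it compares all pairs directly, which is only
    practical for tiny inputs.
    """
    triplets = [
        [cosine(vectors[i], vectors[j]), i, j]
        for i, j in combinations(range(len(vectors)), 2)
    ]
    return sorted(triplets, key=lambda t: t[0], reverse=True)


# Made-up 2-D vectors standing in for real sentence embeddings.
toy_vectors = [
    [1.0, 0.0],  # position 0
    [0.9, 0.1],  # position 1, close to position 0
    [0.0, 1.0],  # position 2, unrelated direction
]

for score, id1, id2 in toy_paraphrase_pairs(toy_vectors):
    print(f"{score:.3f} {id1} {id2}")
```

The id1 and id2 values are list positions, which is why the original list has to stay available for looking up the text behind each pair.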
A first pass works best on a small corpus where duplicate themes are easy to recognize by eye. Limit top_k and max_pairs during development, then adjust query_chunk_size or corpus_chunk_size when a larger corpus needs lower memory use.
Steps to run paraphrase mining with Sentence Transformers:
- Install Sentence Transformers in the active Python environment.
$ python -m pip install --upgrade sentence-transformers
The first model load may download files from Hugging Face before printing results.
Related: How to install Sentence Transformers with pip
- Create a paraphrase mining script with texts that contain known duplicate themes.
- paraphrase_mining.py
from sentence_transformers import SentenceTransformer, util

# Small trial corpus with two known duplicate themes and two unrelated items.
sentences = [
    "Reset a customer password from the account portal.",
    "Reset a user password in the account portal.",
    "Deploy the billing service to production.",
    "Release the billing service to production.",
    "Archive the weekly database backup.",
    "Bake sourdough bread after the dough rises.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

# Each result is a [score, id1, id2] triplet; the IDs index into sentences.
pairs = util.paraphrase_mining(
    model,
    sentences,
    show_progress_bar=False,
    batch_size=16,
    top_k=3,
    max_pairs=8,
)

print("Top paraphrase pairs:")
for rank, (score, first_id, second_id) in enumerate(pairs[:5], start=1):
    print(f"{rank}. score={score:.4f}")
    print(f" {first_id}: {sentences[first_id]}")
    print(f" {second_id}: {sentences[second_id]}")

# Sanity check: the password-reset pair must appear in the top three results.
expected_pair = frozenset({0, 1})
top_pairs = {frozenset({first_id, second_id}) for _, first_id, second_id in pairs[:3]}
if expected_pair not in top_pairs:
    raise SystemExit("password reset pair missing from the top paraphrase results")

print("verification: PASS password reset duplicate pair found")
Replace the inline list with exported records when moving beyond a trial run. Keep the source row ID next to each sentence, because paraphrase_mining() returns list positions rather than external record identifiers.
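One way to keep source row IDs beside the sentences is to load the export into two parallel lists and translate positions back afterwards. The sketch below assumes a CSV export with columns named ticket_id and text; those column names, the sample IDs, and the helper names are hypothetical, and the pairs list is a hand-written stand-in for real paraphrase_mining() output.

```python
import csv
import io

# Hypothetical export: the column names "ticket_id" and "text" are assumptions.
EXPORT = """ticket_id,text
T-1041,Reset a customer password from the account portal.
T-2203,Reset a user password in the account portal.
T-3310,Archive the weekly database backup.
"""


def load_records(handle):
    """Read the export into two parallel lists: record IDs and sentences."""
    record_ids, sentences = [], []
    for row in csv.DictReader(handle):
        record_ids.append(row["ticket_id"])
        sentences.append(row["text"])
    return record_ids, sentences


def to_record_pairs(pairs, record_ids):
    """Translate (score, position, position) triplets into record IDs."""
    return [(score, record_ids[i], record_ids[j]) for score, i, j in pairs]


record_ids, sentences = load_records(io.StringIO(EXPORT))

# Stand-in for the output of util.paraphrase_mining(model, sentences);
# the score is illustrative, not a measured value.
pairs = [[0.92, 0, 1]]

print(to_record_pairs(pairs, record_ids))
```

Because both lists are built in the same loop, position i in sentences always matches position i in record_ids, as long as neither list is reordered or filtered before mining.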
- Run the script.
$ python paraphrase_mining.py
Top paraphrase pairs:
1. score=0.9188
 0: Reset a customer password from the account portal.
 1: Reset a user password in the account portal.
2. score=0.8725
 2: Deploy the billing service to production.
 3: Release the billing service to production.
3. score=0.3629
 0: Reset a customer password from the account portal.
 2: Deploy the billing service to production.
4. score=0.3608
 0: Reset a customer password from the account portal.
 3: Release the billing service to production.
verification: PASS password reset duplicate pair found
- Check that the highest-scoring pair is the password-reset duplicate.
The lower-scoring pairs are semantic neighbors rather than guaranteed duplicates. Set a review threshold after inspecting real corpus output instead of deleting or merging records from the score alone.
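A review threshold can be applied as a simple filter over the returned triplets. The sketch below reuses the scores from the sample run; the 0.80 cutoff and the review_queue helper are assumptions for illustration, not recommended values.

```python
# Triplets copied from the sample run: [score, id1, id2].
pairs = [
    [0.9188, 0, 1],
    [0.8725, 2, 3],
    [0.3629, 0, 2],
    [0.3608, 0, 3],
]

# Assumed cutoff for illustration only; pick the real value after
# inspecting output from the actual corpus.
REVIEW_THRESHOLD = 0.80


def review_queue(pairs, threshold):
    """Keep pairs scoring at or above the threshold for human review."""
    return [pair for pair in pairs if pair[0] >= threshold]


for score, first_id, second_id in review_queue(pairs, REVIEW_THRESHOLD):
    print(f"review: {first_id} <-> {second_id} (score={score:.4f})")
```

The filter only builds a queue for a person to inspect; it does not decide which record in a pair should be kept.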
- Tune the mining limits before running a larger corpus.
pairs = util.paraphrase_mining(
    model,
    sentences,
    query_chunk_size=1000,
    corpus_chunk_size=20000,
    top_k=5,
    max_pairs=10000,
)
Lower query_chunk_size or corpus_chunk_size when memory use is too high. Increase max_pairs only when the review queue needs more candidate pairs.
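A rough way to reason about the chunk sizes is to estimate the score block produced when one query chunk is compared against one corpus chunk. The sketch below assumes one 4-byte float per (query, corpus) pair; that storage assumption and the score_block_megabytes helper are illustrative, and the estimate leaves out the embeddings themselves and any framework overhead.

```python
def score_block_megabytes(query_chunk_size, corpus_chunk_size, bytes_per_score=4):
    """Estimate the size of one query-chunk x corpus-chunk score block in MB.

    Assumption: one score per (query, corpus) pair, stored as a 4-byte float.
    """
    return query_chunk_size * corpus_chunk_size * bytes_per_score / 1_000_000


print(score_block_megabytes(1000, 20000))  # 80.0
print(score_block_megabytes(500, 20000))   # 40.0
```

Under these assumptions, halving either chunk size halves the block, which is why lowering query_chunk_size or corpus_chunk_size is the first adjustment to try when memory runs short.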
- Remove the trial script when the mining check is finished.
$ rm paraphrase_mining.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.