How to run translated sentence mining with Sentence Transformers

Translated sentence mining finds likely parallel sentence pairs when two corpora describe the same domain in different languages but do not share row-level alignment. Sentence Transformers can embed each language into a shared vector space, compare candidates in both directions, and surface probable translations for review or machine translation data preparation.

Bitext mining works best with a multilingual model trained for cross-language alignment. Sentence Transformers documents sentence-transformers/LaBSE as the strongest model choice for production bitext mining; the script below uses the smaller paraphrase-multilingual-MiniLM-L12-v2 model so that a local CPU trial run stays short.

A compact first pass keeps each sentence index beside the original text, searches nearest neighbors with util.semantic_search(), and uses a margin score to reduce false matches caused by sentences that are similar to many others. The margin here is a ratio: a pair's cosine similarity divided by the average similarity of the nearest neighbors around both sentences. Margin scores can therefore be greater than 1, so choose a cutoff by inspecting output from your own corpus rather than treating cosine similarity alone as the production decision.
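The ratio margin can be illustrated without a model. The following is a minimal sketch in plain Python, assuming a small hand-written similarity matrix; the helper names `top_k_mean` and `ratio_margin` and all the numbers are hypothetical and exist only for this illustration.

```python
def top_k_mean(scores, k):
    """Mean of the k largest values in a list of similarity scores."""
    best = sorted(scores, reverse=True)[:k]
    return sum(best) / len(best)


def ratio_margin(similarity, k):
    """Ratio-margin scores for a source-by-target similarity matrix."""
    rows = len(similarity)
    cols = len(similarity[0])
    # Average similarity of each source sentence to its k nearest targets.
    source_means = [top_k_mean(row, k) for row in similarity]
    # Average similarity of each target sentence to its k nearest sources.
    target_means = [
        top_k_mean([similarity[i][j] for i in range(rows)], k)
        for j in range(cols)
    ]
    return [
        [
            similarity[i][j] / ((source_means[i] + target_means[j]) / 2)
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Hypothetical cosine similarities: rows are sources, columns are targets.
similarity = [
    [0.90, 0.30, 0.20],
    [0.85, 0.80, 0.75],  # a "hub" sentence that is fairly close to everything
    [0.25, 0.20, 0.70],
]

margins = ratio_margin(similarity, k=2)
for row in margins:
    print([round(value, 3) for value in row])
```

In this toy matrix the hub row's 0.85 cosine ends up with a margin of about 1.0, while the 0.70 pair in the last row scores higher because it clearly stands out from its neighbors. That is the effect the margin is meant to capture.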

Steps to run translated sentence mining with Sentence Transformers:

Create the translated sentence mining script.

translated_sentence_mining.py

from sentence_transformers import SentenceTransformer
from sentence_transformers import util
 
 
source_sentences = [
    "Reset a customer password.",
    "Renew the TLS certificate.",
    "Export customer invoices.",
    "Deploy the billing service.",
    "Enable admin multi-factor authentication.",
]
 
target_sentences = [
    "Exportar facturas de clientes.",
    "Implementar el servicio de facturacion.",
    "Restablecer la contrasena de un cliente.",
    "Renovar el certificado TLS.",
    "Activar la autenticacion multifactor del administrador.",
]
 
expected_pairs = {
    0: 2,
    1: 3,
    2: 0,
    3: 1,
    4: 4,
}
 
model = SentenceTransformer(
    "sentence-transformers/"
    "paraphrase-multilingual-MiniLM-L12-v2",
    device="cpu",
)
 
source_embeddings = model.encode(
    source_sentences,
    convert_to_tensor=True,
    normalize_embeddings=True,
    show_progress_bar=False,
)
target_embeddings = model.encode(
    target_sentences,
    convert_to_tensor=True,
    normalize_embeddings=True,
    show_progress_bar=False,
)
 
k = min(4, len(target_sentences))
source_to_target = util.semantic_search(
    source_embeddings,
    target_embeddings,
    top_k=k,
    score_function=util.dot_score,
)
target_to_source = util.semantic_search(
    target_embeddings,
    source_embeddings,
    top_k=k,
    score_function=util.dot_score,
)
 
source_means = [
    sum(hit["score"] for hit in hits) / len(hits)
    for hits in source_to_target
]
target_means = [
    sum(hit["score"] for hit in hits) / len(hits)
    for hits in target_to_source
]
 
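candidates = []
for source_id, hits in enumerate(source_to_target):
    for hit in hits:
        target_id = hit["corpus_id"]
        # Embeddings are normalized, so the dot product equals cosine similarity.
        cosine = hit["score"]
        # Average neighborhood similarity around the source and the target.
        mean_score = (
            source_means[source_id]
            + target_means[target_id]
        ) / 2
        # Ratio margin: how far the pair stands out from its neighbors.
        margin = cosine / mean_score
        # Keep the pair only when the target's best match is this source.
        if target_to_source[target_id][0]["corpus_id"] == source_id:
            candidates.append((margin, cosine, source_id, target_id))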
 
candidates.sort(reverse=True)
 
print("Mined translated sentence pairs:")
for rank, result in enumerate(candidates, start=1):
    margin, cosine, source_id, target_id = result
    print(f"{rank}. margin={margin:.3f} cosine={cosine:.3f}")
    print(f"   source[{source_id}]: {source_sentences[source_id]}")
    print(f"   target[{target_id}]: {target_sentences[target_id]}")
 
found_pairs = {}
for _, _, source_id, target_id in candidates:
    # candidates is sorted best-first, so keep the top pair per source.
    found_pairs.setdefault(source_id, target_id)
missing = {
    source_id: target_id
    for source_id, target_id in expected_pairs.items()
    if found_pairs.get(source_id) != target_id
}
 
if missing:
    raise SystemExit(f"verification: FAIL missing pairs {missing}")
 
print("verification: PASS all expected translation pairs recovered")

Use a Python environment where Sentence Transformers is installed before running the script. Replace the inline lists with corpus rows or file records when moving beyond a trial run.
Related: How to install Sentence Transformers with pip
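As one way to move from inline lists to file records, the sketch below reads a corpus with one sentence per line and keeps each line number beside its text. The helper name `load_sentences`, the one-sentence-per-line layout, and the filename are assumptions made for illustration, not part of the original script.

```python
import tempfile
from pathlib import Path


def load_sentences(path):
    """Return (line_number, sentence) pairs, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.strip()
            if text:
                records.append((line_number, text))
    return records


# Demo with a throwaway file; in practice point this at your own corpus files.
with tempfile.TemporaryDirectory() as folder:
    source_path = Path(folder) / "source.en.txt"  # hypothetical filename
    source_path.write_text(
        "Reset a customer password.\n\nRenew the TLS certificate.\n",
        encoding="utf-8",
    )
    source_records = load_sentences(source_path)

# The texts go to model.encode(); the line numbers map results back to the file.
source_line_numbers = [line_number for line_number, _ in source_records]
source_sentences = [text for _, text in source_records]
print(source_line_numbers, source_sentences)
```

Keeping the original line number separate from the list position means a mined pair can still be traced to its file row after blank lines have been dropped.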

Run the mining script.

$ python translated_sentence_mining.py
Mined translated sentence pairs:
1. margin=2.125 cosine=0.965
   source[1]: Renew the TLS certificate.
   target[3]: Renovar el certificado TLS.
2. margin=1.951 cosine=0.832
   source[4]: Enable admin multi-factor authentication.
   target[4]: Activar la autenticacion multifactor del administrador.
3. margin=1.640 cosine=0.831
   source[2]: Export customer invoices.
   target[0]: Exportar facturas de clientes.
4. margin=1.574 cosine=0.842
   source[3]: Deploy the billing service.
   target[1]: Implementar el servicio de facturacion.
5. margin=1.316 cosine=0.530
   source[0]: Reset a customer password.
   target[2]: Restablecer la contrasena de un cliente.
verification: PASS all expected translation pairs recovered

The first model load may download files from Hugging Face before printing the mining result. The download and progress messages are not part of the mined-pair output.

Confirm that every source index maps to the expected target index.

The Spanish target list is intentionally shuffled, so matches such as source[0] with target[2] show that the script paired sentences by embedding similarity rather than by row order. Only source[4] and target[4] share the same index.
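The reverse-direction check in the script keeps every candidate whose target ranks that source first, so on a larger corpus a source index can in principle appear with more than one target. A common way to enforce one-to-one output is a greedy pass in score order. The sketch below assumes the script's (margin, cosine, source_id, target_id) tuple layout; the helper name `select_one_to_one` and the sample values are hypothetical.

```python
def select_one_to_one(candidates):
    """Greedily keep the best-scoring pair for each source and target index."""
    used_sources = set()
    used_targets = set()
    selected = []
    # Visit candidates from highest to lowest margin.
    for margin, cosine, source_id, target_id in sorted(candidates, reverse=True):
        if source_id in used_sources or target_id in used_targets:
            continue  # one side is already paired with a stronger match
        used_sources.add(source_id)
        used_targets.add(target_id)
        selected.append((margin, cosine, source_id, target_id))
    return selected


# Hypothetical candidates in the script's tuple layout.
candidates = [
    (1.90, 0.92, 0, 2),
    (1.20, 0.61, 0, 1),  # source 0 again, with a weaker margin
    (1.75, 0.88, 1, 0),
]

pairs = select_one_to_one(candidates)
for margin, cosine, source_id, target_id in pairs:
    print(source_id, target_id, margin)
```

The greedy pass is simple and fast but not globally optimal; it is a reasonable default when a high-margin pair should always win over a lower-margin conflict.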

Adjust the model and cutoff before mining a larger corpus.
```
model = SentenceTransformer(
    "sentence-transformers/LaBSE",
    device="cpu",
)
minimum_margin = 1.25
 
accepted_pairs = [
    (margin, source_id, target_id)
    for margin, cosine, source_id, target_id in candidates
    if margin >= minimum_margin
]
```
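LaBSE is the stronger production-oriented bitext mining model. After switching models, re-encode both sentence lists and rebuild the candidates, because scores from different models are not comparable. The cutoff depends on the corpus, language pair, and review tolerance, so inspect a labeled or manually reviewed sample before accepting pairs automatically.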
Related: How to choose a Sentence Transformers model for semantic search
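One way to ground the cutoff in a reviewed sample is to sweep a few thresholds and compare precision and recall at each. This is a minimal sketch assuming the script's (margin, cosine, source_id, target_id) tuple layout and a small set of manually confirmed pairs; the helper name `precision_recall`, the candidate values, and the labels are all hypothetical.

```python
def precision_recall(candidates, gold_pairs, minimum_margin):
    """Precision and recall of the pairs accepted at a given margin cutoff."""
    accepted = {
        (source_id, target_id)
        for margin, _, source_id, target_id in candidates
        if margin >= minimum_margin
    }
    correct = accepted & gold_pairs
    precision = len(correct) / len(accepted) if accepted else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical mined candidates and manually confirmed (source, target) pairs.
candidates = [
    (2.10, 0.95, 0, 0),
    (1.60, 0.84, 1, 1),
    (1.30, 0.70, 2, 3),  # not in the reviewed set: a false match
    (1.10, 0.66, 3, 2),
]
gold_pairs = {(0, 0), (1, 1), (3, 2)}

results = {}
for cutoff in (1.0, 1.25, 1.5):
    results[cutoff] = precision_recall(candidates, gold_pairs, cutoff)
    precision, recall = results[cutoff]
    print(f"cutoff={cutoff:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the cutoff in this sample removes the false match but also drops a correct low-margin pair, which is the trade-off to weigh against review tolerance.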

Remove the temporary mining script after moving the code into the alignment project.
```
$ rm translated_sentence_mining.py
```

Author: Mohd Shakir Zakaria