How to query nearest neighbors with SciPy KDTree

Nearest-neighbor queries find the closest rows in a coordinate or feature array to one or more target points. In SciPy, a KDTree indexes numeric point data so spatial lookups, embedding checks, and clustering prefilters do not need to compare every row manually.

scipy.spatial.KDTree stores the input as an n by m array, where each row is one point and each column is one coordinate or feature. Its query() method returns two aligned results: distances to the matched rows and indexes back into the original array.

A two-dimensional sensor dataset keeps the indexes and Euclidean distances easy to check by hand. Keep the source array unchanged after building the tree unless copy_data=True is used, because the tree may reference the original data buffer.

Steps to query nearest neighbors with SciPy KDTree:

  1. Create a Python script named nearest_neighbor_query.py.
    nearest_neighbor_query.py
    import numpy as np
    from scipy.spatial import KDTree
     
    np.set_printoptions(precision=3, suppress=True)
     
    points = np.array([
        [0.0, 0.0],
        [2.0, 3.0],
        [2.0, 4.0],
        [5.0, 4.0],
        [8.0, 2.0],
    ])
    labels = np.array(["depot", "sensor-a", "sensor-b", "sensor-c", "sensor-d"])
     
    queries = np.array([
        [2.2, 3.9],
        [7.4, 2.2],
    ])
     
    tree = KDTree(points)
    distances, indexes = tree.query(queries, k=2)
     
    for query_number, query in enumerate(queries, start=1):
        print(f"query {query_number}: {query}")
        for rank, (distance, index) in enumerate(
            zip(distances[query_number - 1], indexes[query_number - 1]),
            start=1,
        ):
            print(
                f"  {rank}: {labels[index]} index={index} "
                f"point={points[index]} distance={distance:.3f}"
            )
     
    manual_distance = np.linalg.norm(points[indexes[0, 0]] - queries[0])
    print(f"\nmanual check for query 1 nearest: {manual_distance:.3f}")
     
    bounded_distances, bounded_indexes = tree.query(
        queries[0],
        k=3,
        distance_upper_bound=1.0,
    )
    print("\nwithin distance 1.0 from query 1")
    print("distances:", bounded_distances)
    print("indexes:", bounded_indexes)

    KDTree and cKDTree are functionally identical in current SciPy. KDTree is the clearer default when old SciPy compatibility is not required.

  2. Run the script.
    $ python3 nearest_neighbor_query.py
    query 1: [2.2 3.9]
      1: sensor-b index=2 point=[2. 4.] distance=0.224
      2: sensor-a index=1 point=[2. 3.] distance=0.922
    query 2: [7.4 2.2]
      1: sensor-d index=4 point=[8. 2.] distance=0.632
      2: sensor-c index=3 point=[5. 4.] distance=3.000
    
    manual check for query 1 nearest: 0.224
    
    within distance 1.0 from query 1
    distances: [0.224 0.922   inf]
    indexes: [2 1 5]
  3. Match each returned index to the original point array.

    indexes[0, 0] is 2, so the closest row for the first query is points[2], labeled sensor-b.

  4. Confirm the nearest distance with the manual check line.

    The first query is 0.2 units away in x and 0.1 units away in y from sensor-b, so the Euclidean distance sqrt(0.2**2 + 0.1**2) rounds to 0.224.

  5. Use k for the number of neighbors to keep for each query point.

    When k=1, SciPy squeezes the neighbor dimension from the result. Use k=[1] if downstream code should keep a one-column result.

  6. Use distance_upper_bound when far matches should be treated as missing.

    The bounded query asks for three neighbors within distance 1.0. The third result prints inf with index 5 because the tree has five input rows and no third neighbor met the bound.

  7. Remove the demo script when it was only created for the check.
    $ rm nearest_neighbor_query.py