Large HDFS-to-HDFS or object-store copies need a job that can split work across the cluster. DistCp runs a MapReduce job for distributed copying, which makes it better suited than a local shell copy for multi-gigabyte directories and cross-cluster migrations.
The source and destination URIs determine which filesystems DistCp reads from and writes to. Use fully qualified paths, list the source before starting the job, and inspect the job counters afterward so that an accidental overwrite or a missing source is caught early.
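A pre-flight check along these lines can catch a relative or missing path before any job is submitted. This is a minimal sketch: the function names `is_qualified_uri` and `preflight` are hypothetical, and it assumes the `hdfs` client is on the PATH and can reach both filesystems.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of Hadoop.

# Succeeds only when the argument has a scheme and a non-empty authority,
# for example hdfs://nn1.example.net/data/events.
is_qualified_uri() {
  case "$1" in
    *://[!/]*/*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Refuses to continue unless both URIs are explicit and the source exists.
# $1 = source URI, $2 = destination URI
preflight() {
  is_qualified_uri "$1" || { echo "source is not a full URI: $1" >&2; return 1; }
  is_qualified_uri "$2" || { echo "destination is not a full URI: $2" >&2; return 1; }
  # `hdfs dfs -test -e` exits 0 when the path exists.
  hdfs dfs -test -e "$1" || { echo "source not found: $1" >&2; return 1; }
}

# Usage:
#   preflight hdfs://nn1.example.net/data/events hdfs://nn2.example.net/archive/events \
#     && echo "ready to copy"
```

Rejecting unqualified paths up front avoids a copy that silently resolves against whichever default filesystem the client happens to be configured with.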
DistCp can copy between HDFS and compatible object stores, but object-store rename and consistency behavior can differ from HDFS. Start with the simplest copy and add options such as -update, -delete, or a -bandwidth limit only when the job requires them.
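Once the plain copy works, incremental options can be layered on. The sketch below assumes a hypothetical wrapper named `run_distcp` and a `DRY_RUN` variable, neither of which is part of Hadoop. The flags `-update`, `-delete`, `-bandwidth`, and `-m` are standard DistCp options, but the values shown are placeholders to tune per cluster.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run_distcp() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hadoop distcp $*"
  else
    hadoop distcp "$@"
  fi
}

# Usage (values are placeholders):
#   -update      copy only files that are missing or differ at the destination
#   -delete      remove destination files absent from the source
#                (only valid together with -update or -overwrite)
#   -bandwidth   cap each map task, in MB per second
#   -m           upper bound on the number of simultaneous map tasks
#
# DRY_RUN=1 run_distcp -update -delete -bandwidth 50 -m 20 \
#   hdfs://nn1.example.net/data/events hdfs://nn2.example.net/archive/events
```

With `-update` or `-overwrite`, DistCp copies the contents of the source directory into the destination rather than the source directory itself, so the destination layout is worth checking on a small path before enabling `-delete`.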
Steps to copy data with Hadoop DistCp:
- List the source path before starting the copy.
$ hdfs dfs -ls hdfs://nn1.example.net/data/events
Found 2 items
drwxr-xr-x - analytics data 0 2026-06-17 02:11 hdfs://nn1.example.net/data/events/day=2026-06-16
drwxr-xr-x - analytics data 0 2026-06-17 02:12 hdfs://nn1.example.net/data/events/day=2026-06-17
- Create the destination parent directory when needed.
$ hdfs dfs -mkdir -p hdfs://nn2.example.net/archive
- Run DistCp with explicit source and destination URIs.
$ hadoop distcp hdfs://nn1.example.net/data/events hdfs://nn2.example.net/archive/events
INFO tools.DistCp: DistCp job-id: job_1720000000000_0042
INFO mapreduce.Job: map 100% reduce 0%
INFO mapreduce.Job: Job job_1720000000000_0042 completed successfully
- Check the YARN application for a successful final state.
$ yarn application -status application_1720000000000_0042
Final-State : SUCCEEDED
Tracking-URL : http://rm01.example.net:8088/proxy/application_1720000000000_0042/
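- Compare the copied directory size with the source. The first column is the logical size and should match the same command run against the source; the second column is the space consumed across all replicas.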
$ hdfs dfs -du -s -h hdfs://nn2.example.net/archive/events
90.5 G 181.0 G hdfs://nn2.example.net/archive/events
- Inspect the destination listing.
$ hdfs dfs -ls hdfs://nn2.example.net/archive/events
Found 2 items
drwxr-xr-x - analytics data 0 2026-06-17 03:11 hdfs://nn2.example.net/archive/events/day=2026-06-16
drwxr-xr-x - analytics data 0 2026-06-17 03:12 hdfs://nn2.example.net/archive/events/day=2026-06-17
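
The verification steps above can be scripted roughly as follows. This is a sketch under stated assumptions: the helper names (`job_to_app_id`, `final_state`, `logical_bytes`, `same_size`) are hypothetical, the `hdfs` and `yarn` clients are assumed to be on the PATH, and the parsing relies on the output formats shown in the steps.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not part of Hadoop.

# A MapReduce job ID and its YARN application ID share the same numeric
# suffix, so job_1720000000000_0042 maps to application_1720000000000_0042.
job_to_app_id() {
  echo "application_${1#job_}"
}

# Reads `yarn application -status <id>` output on stdin and prints the
# value of the Final-State field.
final_state() {
  awk -F':' '/Final-State/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Prints the logical size in bytes (first column of `hdfs dfs -du -s`).
# Omitting -h keeps the value exact so it can be compared as a string.
logical_bytes() {
  hdfs dfs -du -s "$1" | awk '{ print $1 }'
}

# Succeeds when source and destination report the same logical size.
same_size() {
  [ "$(logical_bytes "$1")" = "$(logical_bytes "$2")" ]
}

# Usage:
#   app_id=$(job_to_app_id job_1720000000000_0042)
#   yarn application -status "$app_id" | final_state
#   same_size hdfs://nn1.example.net/data/events hdfs://nn2.example.net/archive/events \
#     && echo "sizes match"
```

Equal byte counts are a coarse check rather than proof of identical contents, so the sketch compares only the first column and ignores the replica-inclusive second column, which legitimately differs when the two clusters use different replication factors.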
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.