Logstash is a tool designed to collect, process, and ship data from a variety of sources. It can parse and transform structured data formats such as CSV, allowing for seamless ingestion into systems like Elasticsearch. CSV, a common format for tabular data storage, is widely used for exchanging structured data in business and technical environments. Logstash's flexibility in parsing CSV files makes it useful for real-time data pipelines and batch data processing.

Logstash works by defining pipelines made up of input, filter, and output stages. The csv filter plugin parses each record into named fields according to the declared column names and separator. Other filters, such as mutate, can then convert field types or perform further transformations before the processed data is sent to a destination. This configuration-driven approach lets structured data like CSV be tailored to different requirements, whether for further analysis or database storage.
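
A minimal sketch of that three-stage structure, using the stdin and stdout plugins so it can be tried without any files (the plugin choices here are illustrative, not part of the CSV pipeline built below):

    input {
      # read events typed on the console
      stdin { }
    }
    filter {
      # parsing and transformation filters go here
    }
    output {
      # print each processed event
      stdout { codec => rubydebug }
    }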

Logstash can handle both single CSV files and directories containing multiple CSV files. It can tail files for continuous, near-real-time ingestion or read existing files once as a batch. This flexibility makes it well suited to large datasets and continuously updated CSV sources. The processed data can then be shipped to various destinations, such as Elasticsearch, for indexing, visualization, or querying.
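
The steps below assume a small CSV file with a header row whose fields match the filter configuration, for example (illustrative sample rows chosen to match the output shown later):

    id,name,age,city
    1,John Doe,28,New York
    2,Jane Smith,34,Los Angeles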

Steps to configure Logstash for CSV data processing:

  1. Open the Logstash configuration file for editing.
  2. Define the input section.
    input {
      file {
        path => "/path/to/your/sample.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
      }
    }

    Replace /path/to/your/sample.csv with the actual path to your CSV file. Setting sincedb_path => "/dev/null" stops Logstash from persisting its read position, so together with start_position => "beginning" the file is re-read from the start on every run.
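
    To ingest a whole directory of CSV files rather than a single file, the file input accepts glob patterns in path. A hedged sketch (the directory name is a placeholder):

    input {
      file {
        # A glob pattern picks up every .csv file in the directory
        path => "/path/to/your/csv_dir/*.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
      }
    }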

  3. Configure the CSV filter.
    filter {
      csv {
        separator => ","
        columns => ["id", "name", "age", "city"]
      }
    }

    Ensure the columns option corresponds to the header fields of your CSV file. The separator defines the delimiter used in the CSV file, which is a comma by default.
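
    Note that the csv filter parses the header row like any other line, so the header itself can show up as a spurious event. One way to discard it is a conditional drop on the parsed values, sketched below (newer versions of the csv filter also offer options such as skip_header for this):

    filter {
      csv {
        separator => ","
        columns => ["id", "name", "age", "city"]
      }
      # The header row parses with the literal column names as values; drop it
      if [id] == "id" {
        drop { }
      }
    }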

  4. Apply field type conversion.
    filter {
      mutate {
        convert => {
          "id" => "integer"
          "age" => "integer"
        }
      }
    }

    The mutate filter converts the id and age fields from strings to integers. Modify the field names and conversions as necessary for your data.
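
    The csv filter also keeps the original line in the message field. If only the parsed columns are needed downstream, a hedged extension of the same mutate block can drop it:

    filter {
      mutate {
        convert => {
          "id" => "integer"
          "age" => "integer"
        }
        # Discard the raw CSV line once it has been parsed into fields
        remove_field => ["message"]
      }
    }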

  5. Set up the output section.
    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "csv_data"
      }
      stdout {
        codec => rubydebug
      }
    }

    The Elasticsearch output sends the data to an Elasticsearch instance at localhost:9200. The stdout block is for debugging purposes and prints the output to the console.
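
    If the cluster requires authentication, or you prefer one index per day, the elasticsearch output accepts further options. A hedged sketch (the credentials and index pattern are placeholders):

    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        # Daily index derived from the event timestamp
        index => "csv_data-%{+YYYY.MM.dd}"
        user => "elastic"
        password => "changeme"
      }
    }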

  6. Run Logstash with the configuration.
    $ logstash -f /path/to/your/logstash.conf
    {
                "id" => 1,
              "name" => "John Doe",
               "age" => 28,
              "city" => "New York",
          "@version" => "1",
        "@timestamp" => "2023-10-01T12:00:00.000Z",
              "path" => "/path/to/your/sample.csv",
              "host" => "localhost"
    }

    Replace /path/to/your/logstash.conf with the actual path to your Logstash configuration file. Running this command processes the CSV data according to your configuration and prints the output to the console.
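
    Before running the pipeline against real data, the configuration can be syntax-checked with Logstash's standard command-line flags, for example:

    # validate the configuration without starting the pipeline
    $ logstash -f /path/to/your/logstash.conf --config.test_and_exit

    # start the pipeline and reload it automatically when the config file changes
    $ logstash -f /path/to/your/logstash.conf --config.reload.automatic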

  7. Verify the data in Elasticsearch.
    $ curl -X GET "localhost:9200/csv_data/_search?pretty"
    {
      "hits" : {
        "total" : 3,
        "hits" : [
          {
            "_source" : {
              "id" : 1,
              "name" : "John Doe",
              "age" : 28,
              "city" : "New York"
            }
          },
          {
            "_source" : {
              "id" : 2,
              "name" : "Jane Smith",
              "age" : 34,
              "city" : "Los Angeles"
            }
          }
        ]
      }
    }

    Use the curl command to query the csv_data index in Elasticsearch and verify that the CSV data has been ingested successfully.
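
    To double-check that every row arrived, or to search on a specific field, the standard Elasticsearch APIs can be used, for example:

    # count the documents in the index
    $ curl -X GET "localhost:9200/csv_data/_count?pretty"

    # search for a specific value in the city field
    $ curl -X GET "localhost:9200/csv_data/_search?pretty" -H 'Content-Type: application/json' -d '{ "query": { "match": { "city": "New York" } } }'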
