Log Aggregation using the BEK (Beats, Elasticsearch and Kibana) stack and Ingest APIs
There are quite a few articles and resources available on the internet about building a log aggregation pipeline with the ELK stack. For those unfamiliar, ELK stands for Elasticsearch, Logstash and Kibana. (It could just as well be called the BELK stack, since Filebeat is almost universally used to ship log data from files and feed it to Logstash for further processing.)
Quite recently, I had an opportunity to set up a log aggregation pipeline for some of our DEV and QA environments. Until then, these environments had simply been ignored, because expensive alternatives like Splunk et al. were reserved for the more important UAT and Production environments.
Having previously done such a setup on an earlier version of the Elastic stack, I honestly thought the only way to do this was the good old Filebeat → Logstash → Elasticsearch → Kibana route.
But I ran into some hurdles when it came to shipping logs from my Beats daemons to Logstash. The culprit was my Logstash setup, which ran on an Amazon EC2 instance where the only ingress available was good old TCP over port 80. The communication between Filebeat and Logstash, however, happens over the Lumberjack protocol on its own port. For security reasons, all other ports were blocked unless explicitly requested to be opened.
Well, being the lazy bum that I am, I decided to look for alternate routes rather than getting additional ports opened (which meant dealing with the security team and a lot of explaining). I started toying with the idea of running a reverse proxy like Traefik on the EC2 instance and routing the Filebeat traffic to Logstash via port 80. But I soon realised this would not work: as mentioned earlier, Beats talks to Logstash over the Lumberjack protocol, not plain HTTP, so an HTTP reverse proxy cannot simply forward it.
This is when I stumbled upon another option: Filebeat can ship data directly to Elasticsearch over HTTP(S) and skip Logstash altogether. Now, some of you might ask: what about the log filtering and parsing that Logstash used to provide? This is where the Ingest Node feature of Elasticsearch comes in handy.
Ingest nodes (and their APIs) are essentially pre-processors that run over every document before it is indexed in Elasticsearch. Many processors are available, but we are interested in the Grok processor, which lets us transform unstructured log data into a structured format and makes the information much easier to search. This covers most of the functionality Logstash provides for basic log aggregation. For folks unfamiliar with grok, check this site for more information.
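To make this concrete, here is a hypothetical unstructured log line, the grok pattern used later in this article, and the structured fields the processor would extract from it (field names like client and duration come from the pattern itself):
55.3.244.1 GET /index.html 15824 0.043
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
{
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}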
TL;DR;
Setup
1. Elasticsearch and Kibana — You can install these on an Amazon EC2 instance or any other physical or virtual machine. The distributions ship in several packaging formats; use whichever is convenient for you (I usually end up with the .zip). The hardware requirements depend entirely on your usage and how much data you need to keep in the stack. For fairly small workloads, and assuming you clean out the logs every 4–5 days, an m4.xlarge or m4.2xlarge EC2 instance (or its equivalent) with around 20 GB of storage should be good enough. You can find additional hardware recommendations here. Alternatively, you can explore the Elasticsearch service on AWS; I eventually went with the managed service purely for the simplicity and scalability it provides. Configuration for these two components is minimal, and almost none if you use the default ports.
2. Filebeat — You will need Filebeat installed at every location where logs are generated (in my case, the Dev and QA boxes). Download the Filebeat package matching your platform and update filebeat.yml with the configuration for your log files. Follow the steps mentioned here, except step 3, which is only required when setting up Filebeat for Logstash. You can also skip the steps to set up the Kibana dashboard if you have no use for it at this point. For configuring Filebeat with Elasticsearch, follow this link. Your filebeat.yml should look somewhat like this if you followed the instructions properly.
filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /path/to/log
  fields:
    environment: "Dev-101"

output.elasticsearch:
  hosts: ["10.20.30.40:9200"]
  protocol: https
  path: /elasticsearch
  index: "application-logs-%{+yyyy.MM.dd}"

setup.template.name: "application-logs"
setup.template.pattern: "application-logs-*"
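Before moving on, it is worth sanity-checking the file. Assuming you are on Filebeat 6.x (the filebeat.prospectors syntax used above), the bundled test commands will validate the configuration and the connection to Elasticsearch:
./filebeat test config -c filebeat.yml
./filebeat test output -c filebeat.yml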
A few points to note…
Custom Fields — Don't forget to add your custom fields, like fields.environment in the example above. These help you differentiate log data coming from different locations.
Multiline — You can handle multiline log messages from within filebeat.yml itself. This is quite useful and practically required for many log formats. You can find examples and instructions for setting it up here; a minimal sketch is also shown after these notes.
Load index template — One of the configuration steps involves loading an index template into Elasticsearch. If the Elasticsearch output is enabled in filebeat.yml, Filebeat loads the template automatically, but if you want a different index name you have to change the value in a few places in the configuration. In the example above we use "application-logs-*" as the template pattern, with a date-suffixed index name to match.
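As promised in the multiline note above, here is a minimal sketch of the relevant prospector settings, assuming log entries that start with a date stamp (for example 2018-04-01) and continuation lines, such as stack traces, that do not; adjust the pattern to your own log format:
- type: log
  paths:
    - /path/to/log
  multiline.pattern: '^\d{4}-\d{2}-\d{2}'
  multiline.negate: true
  multiline.match: after
With negate set to true and match set to after, any line that does not start with a date is appended to the previous line, so an entire stack trace is shipped as a single event.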
3. Ingest Grok Processor — The last piece of the puzzle is adding a Grok processor to structure our log data. Elasticsearch provides simple REST APIs for configuring ingest processors. Using the Put Pipeline API you can create a new pipeline and define a Grok processor inside it. (A pipeline is nothing but a series of processors, executed in the order they are defined.)
PUT _ingest/pipeline/my_pipeline
{
  "description": "My pipeline for log data",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"],
        "trace_match": true
      }
    }
  ]
}
You can have multiple grok expressions in the "patterns" field, but the processor stops at the first match and uses that pattern to structure your data. If you do have multiple patterns, I also suggest setting the trace_match property so you can identify which expression matched a given message. Use the Simulate Pipeline API to test your processors against sample message data.
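For instance, a simulate request against the pipeline defined above could look like the following; the message value is just a sample line that matches our pattern:
POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "55.3.244.1 GET /index.html 15824 0.043"
      }
    }
  ]
}
The response shows each document as it would look after the pipeline runs, including the extracted client, method, request, bytes and duration fields.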
Once your pipeline is working, there is one more setting to update in your filebeat.yml: add the name of the pipeline created above, as shown below, alongside the earlier configuration.
output.elasticsearch:
  ...
  pipeline: my_pipeline
Go ahead and start Elasticsearch and Kibana, then run Filebeat using the command below…
sudo ./filebeat -e -c filebeat.yml -d "*"
As soon as it starts, you should see data being sent to Elasticsearch in batches. Filebeat has a very low CPU and memory footprint, so you do not have to worry about letting it run as a background process on your machines.
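If you are running Filebeat from an extracted archive rather than as a service installed from a deb/rpm package, one simple way to keep it running after you log out is plain nohup (the log file name here is just a suggestion):
nohup ./filebeat -c filebeat.yml > filebeat.out 2>&1 &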
You can optionally install the Elasticsearch Head Chrome plugin to inspect the indices and documents created in Elasticsearch.
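If you prefer the command line to a browser plugin, the _cat API gives a quick view of the indices Filebeat has created; adjust the host, protocol and path to match your output.elasticsearch settings:
curl -XGET "http://10.20.30.40:9200/_cat/indices/application-logs-*?v"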
Finally, open up Kibana and follow the instructions mentioned here to configure an index pattern. The index pattern you need to create matches the index name you set in filebeat.yml (application-logs-* in our example).
Excellent! You are all set to discover and search your log data in a single place rather than hunting across multiple locations.
That's it, folks! Add comments or send me suggestions on how to improve this article.
Additional reading — Performance comparison between using Ingest API and Logstash