Data Analytics

Abstract

Data Analytics

Authors

Walter Fan

Status

WIP

Updated

2024-08-21

Overview

There are two kinds of data for data analytics:

  1. Logs

  2. Metrics

The requirements are:

  1. stable - high availability

  2. fast - low latency from source to destination

  3. long retention and large storage capacity

  4. easy to learn

  5. provides an API

  6. rich analytics and visualization tools

ELK + K

Elasticsearch + Logstash + Kibana + Kafka

Data flow:

Logstash -> Kafka -> Elasticsearch -> Kibana
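
As a rough illustration of the Kafka -> Elasticsearch hop, the sketch below turns raw log lines (as they might arrive from a Kafka topic) into the newline-delimited JSON body that the Elasticsearch _bulk API accepts. The log line format, the index name app-logs, and the helper names are assumptions made for this example; in the pipeline above, Logstash or a dedicated consumer would normally do this work.

```python
import json
import re

# Assumed log format for this sketch: "<ISO timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_log_line(line):
    """Parse one raw log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_bulk_body(lines, index="app-logs"):
    """Build an NDJSON body for the Elasticsearch _bulk API.

    Every document is preceded by an action line, and the whole body
    must end with a newline.
    """
    out = []
    for line in lines:
        doc = parse_log_line(line)
        if doc is None:
            continue  # skip lines that do not match the assumed format
        out.append(json.dumps({"index": {"_index": index}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    sample = [
        "2024-08-21T10:00:00Z INFO service started",
        "malformed line",
    ]
    print(to_bulk_body(sample), end="")
```

The resulting body would be sent with a POST to the cluster's _bulk endpoint using the Content-Type application/x-ndjson; that step is left out so the sketch runs without a cluster.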

FIG + K

Filebeat + InfluxDB + Grafana + Kafka

Data flow:

Filebeat -> Kafka -> InfluxDB -> Grafana
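
For the metrics path, a consumer between Kafka and InfluxDB has to turn each event into InfluxDB line protocol (measurement, optional tags, fields, timestamp). The sketch below shows one way that conversion might look. The function names and the sample measurement are hypothetical, and the escaping covers only the common cases (commas, spaces, and equals signs in keys and tags; quotes in string fields).

```python
def _escape_key(value):
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(value):
    """Escape commas and spaces in a measurement name."""
    return str(value).replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value):
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(
        to_line_protocol(
            "cpu",
            {"host": "server01", "region": "us"},
            {"usage": 0.64, "cores": 8},
            1724198400000000000,
        )
    )
```

Tags and fields are sorted by key here only to make the output deterministic. Grafana would then query the stored measurement directly from InfluxDB.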

Pinot

Pinot is a real-time distributed OLAP datastore, purpose-built to provide ultra low-latency analytics, even at extremely high throughput. It can ingest directly from streaming data sources - such as Apache Kafka and Amazon Kinesis - and make the events available for querying instantly. It can also ingest from batch data sources such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.

At the heart of the system is a columnar store, with several indexing and pre-aggregation techniques that keep query latency low. This makes Pinot a strong fit for user-facing real-time analytics. Pinot is also a good choice for other analytical use cases, such as internal dashboards, anomaly detection, and ad-hoc data exploration.
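
Pinot is queried with SQL, typically by sending a JSON body to a broker's /query/sql endpoint. The sketch below only builds such a request so that it runs without a cluster. The broker address (localhost:8099), the table page_views, and its country column are assumptions for illustration, not part of any deployment described here.

```python
import json
import urllib.request


def build_pinot_query(sql, broker_url="http://localhost:8099"):
    """Build (but do not send) a POST request for a Pinot broker's SQL endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{broker_url}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical table and column, for illustration only.
SQL = (
    "SELECT country, COUNT(*) "
    "FROM page_views "
    "GROUP BY country "
    "ORDER BY COUNT(*) DESC "
    "LIMIT 10"
)

if __name__ == "__main__":
    request = build_pinot_query(SQL)
    print(request.get_method(), request.full_url)
    # To run the query against a live broker:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

An aggregation like this one is the kind of query that a dashboard or user-facing feature would issue repeatedly, which is where the columnar layout and indexes matter most.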

Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, and Hive to safely work with the same tables, at the same time.