TimeSeries DataStores

'cuz lost time is never found again

by AbhishekKr / @abionic

what will we discuss today
  • What can you do with it? Why TSDB?
  • What makes a (good) TSDB?
  • Existing Solutions.

elevator pitch

A system focussed on data storage, optimized
for time-based queries.
Some of the largest datasets have strong time components...
like stock market data, server logs, weather data, or even just the temperature in the server room.

TimeSeries Databases

  • Not a unique problem; any DB can be made to work.
  • VividCortex reached 332k metrics/sec over 3 MySQL nodes.
  • It is writing a new TSDB, Catena (800k/sec in beta).
  • Focussed solutions exist to handle scale/queries optimally.
  • It's like a BigData problem with "pre-structured" data.

Analytics

Some analyses are simple

(image-courtesy: stackoverflow::analytics)
But some need correlation of time-series data

(image-courtesy: Spurious Correlations)
Common Time-Series Data for Analysis

(image-courtesy: /var/log)
Interesting Time-Series Data

(image-courtesy: histography.io)

Forecasting

Some series just seem random,
but are actually predictable.

(image-courtesy: dilbert)
Not all predictions are accurate.

(image-courtesy: xkcd)
But with enough data, they can be near perfect.

(image-courtesy: xkcd)
Popular Time-Series Data Forecasting.

(image-courtesy: Yahoo! Finance)
Critical Time-Series Data Forecasting.

(image-courtesy: Google Weather)
Critical Time-Series Data Forecasting.

(image-courtesy: Environment Canada)

Why?

Many kinds of analysis require keeping track of
multiple factors over a period of time.
Like...

Why?

  • Device Performance Analytics

Example: Finding patterns of specific time periods
when resource load is higher or lower.
Manage infrastructure costs
by scaling an elastic cloud based on those patterns.

Why?

  • Decision's impact via Survey Trends

Example: What marketing decisions were taken
at what time?
State of the target customer class's economy.
Any impact of influencing data on sales.

Why?

  • Predicting herd mentality in Stock Exchange

Example: Which public company related event
had what impact?
Just the general trend in competitors' stock health
correlated with your own.

Why?

  • Map Medical IoT monitoring with regular health checks

Example: User's average heartbeat
correlated with exercise done.
Warnings based on old health issues
with the current blood pressure trend.

Why?

  • Intrusion Detection Systems

Example: Seasonality of user requests
and trend of traffic increase.
Significant anomalies in these
can be used by an IDS to predict attacks.
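
A minimal sketch of the anomaly idea, assuming a plain z-score over per-minute request counts. The function name, the threshold and the sample data are made up for illustration; a real IDS would also model seasonality and trend, which this does not.

```python
from statistics import mean, stdev

def find_anomalies(counts, threshold=2.0):
    """Return indexes of values more than `threshold` sample
    standard deviations away from the mean of `counts`."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# Hypothetical requests/minute with one obvious spike.
requests_per_minute = [100, 102, 98, 101, 99, 103, 97, 100, 900, 101]
print(find_anomalies(requests_per_minute))
```

With short windows the z-score is bounded (for 10 points it can never exceed about 2.85), so the threshold has to be chosen with the window size in mind.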

Daily examples could be
  • stock tick information from global stock exchanges
  • precious metal prices captured periodically
  • weather details at a specific lat/long at periodic intervals
  • continuous sensor feeds from manufacturing machines, oil rigs, solar panels, etc.

TimeSeries information is not necessarily different, but it is
  • the volume and speed aspect of data
  • the sparseness of the information

that make it challenging to store in traditional stores.

To analyze the data based on the time dimension,

keep arrival time of each feed and

optimize queries by it.

What makes a TimeSeries DataStore?

What?

Storage and Retrieval of Primary Data Points indexed by their TimeStamps.
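
A toy sketch of "indexed by timestamp", assuming an in-memory list kept sorted by time. `TinySeries` and its methods are hypothetical names, not taken from any of the stores discussed later.

```python
import bisect

class TinySeries:
    """Keeps points sorted by timestamp so range queries are binary searches."""

    def __init__(self):
        self._times = []
        self._values = []

    def append(self, ts, value):
        # Insert in timestamp order, even when points arrive out of order.
        i = bisect.bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._values.insert(i, value)

    def between(self, start, end):
        """Return (ts, value) pairs with start <= ts < end."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

series = TinySeries()
for ts, value in [(30, "c"), (10, "a"), (20, "b")]:
    series.append(ts, value)
print(series.between(10, 30))
```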

What makes it better?

What more?

  • Consolidated Data Points

sum, avg, min, max, endpoints, or a function specific to the type of data
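
A minimal sketch of consolidation, assuming fixed-width time buckets and numeric values. The `consolidate` helper and its output keys are illustrative names only.

```python
def consolidate(points, bucket_seconds):
    """Roll up (timestamp, value) points into fixed-width buckets.

    Returns {bucket_start: {sum, avg, min, max, first, last}}.
    """
    buckets = {}
    for ts, value in sorted(points, key=lambda p: p[0]):
        start = ts - (ts % bucket_seconds)
        buckets.setdefault(start, []).append(value)
    return {
        start: {
            "sum": sum(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "first": vals[0],   # endpoints of the bucket
            "last": vals[-1],
        }
        for start, vals in buckets.items()
    }

points = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(consolidate(points, bucket_seconds=60))
```
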
What more?

  • Consistency and Durability suited to the target domain.

Not every domain needs it. It matters for life-impacting surveys, monetary transactions or any important prediction.
What more?

  • Scalable and Performant to fit the required scenarios.

circular-buffer OR big-data || 100s to millions of records/sec
What more?

  • Compressed Contiguous (old) Data Blobs

in wide-row formats, data persisted as blobs compresses better
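
One way contiguous old data can be packed into a blob, sketched under assumptions: timestamps are delta-encoded and then passed through zlib. Delta encoding is a technique swapped in here for illustration; the slide does not say which scheme any particular store uses. `pack_block` and `unpack_block` are hypothetical names.

```python
import struct
import zlib

def pack_block(timestamps):
    """Delta-encode integer timestamps, then compress them as one blob."""
    if not timestamps:
        return zlib.compress(b"")
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    raw = struct.pack(f"<{len(deltas)}q", *deltas)  # signed 64-bit, little endian
    return zlib.compress(raw)

def unpack_block(blob):
    """Reverse pack_block: decompress, then rebuild absolute timestamps."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    out, current = [], 0
    for d in deltas:
        current += d
        out.append(current)
    return out

stamps = list(range(1445000000, 1445000000 + 10 * 1000, 10))  # one point per 10s
blob = pack_block(stamps)
print(len(stamps) * 8, "bytes raw ->", len(blob), "bytes packed")
```

Regularly spaced points turn into a run of identical deltas, which is why contiguous blocks compress so well.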
What more?

  • Reusable (BigData) Analytics Toolset

if using HBase/Cassandra backends, can plug in existing data-crunching mammoths
What more?

  • Non-Blocking Backups

timeseries keep coming at 'continuous (ir)regular intervals of time'
What more?

  • Auto (or default) managed load-balancing.

scaling up and down needs to be seamless; remember, the data stream keeps coming
Popular Types
  • Relational Database (special schemas)
  • NoSQL Databases (epoch indexes)
  • NoSQL Databases (wide tables)
  • Column Oriented Databases
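
A rough sketch of the "wide tables" idea, assuming one row per metric per hour with each point stored in a column named by its offset inside that hour. The key layout and the `wide_row_cell` helper are illustrative assumptions, not the exact schema of any specific database.

```python
def wide_row_cell(metric, epoch_seconds, row_span=3600):
    """Map a point to (row_key, column) in a wide-row layout.

    All points of one metric within the same `row_span` share a row;
    the column is the offset in seconds from the start of that row.
    """
    base = epoch_seconds - (epoch_seconds % row_span)
    return f"{metric}:{base}", epoch_seconds - base

rows = {}
for ts, value in [(7205, 0.4), (7265, 0.7), (10805, 0.9)]:
    row_key, column = wide_row_cell("cpu.load", ts)
    rows.setdefault(row_key, {})[column] = value
print(rows)
```

A time-range query then touches only the few rows whose base timestamps overlap the range, instead of scanning one row per point.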

Existing Solutions


majority of these are open source
and I'm biased
:)
RRDTool

One of the earliest and most popular TimeSeries DataStores.

Has persistence, in-memory caching & concurrent tasks.

A circular-buffer based store. Bad at Sparse Metrics.

No partition, replication or atomic integrity.
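
A toy sketch of a circular-buffer store, assuming one fixed-size archive with one slot per time step. It only illustrates the principle; the class name and layout are hypothetical and RRDTool's real file format is more involved.

```python
class RoundRobinArchive:
    """Fixed number of slots; new data overwrites the oldest slot."""

    def __init__(self, slots, step):
        self.slots = slots
        self.step = step              # seconds covered by one slot
        self.values = [None] * slots  # size never grows, even if data is sparse
        self.stamps = [None] * slots

    def _slot(self, ts):
        slot_ts = ts - (ts % self.step)
        return slot_ts, (slot_ts // self.step) % self.slots

    def update(self, ts, value):
        slot_ts, i = self._slot(ts)
        self.values[i] = value
        self.stamps[i] = slot_ts

    def fetch(self, ts):
        # Check the stored timestamp: the slot may hold newer, wrapped-around data.
        slot_ts, i = self._slot(ts)
        return self.values[i] if self.stamps[i] == slot_ts else None

archive = RoundRobinArchive(slots=3, step=10)
for ts, value in [(0, "a"), (10, "b"), (20, "c"), (30, "d")]:
    archive.update(ts, value)
print(archive.fetch(0), archive.fetch(30))
```

Every slot is allocated up front whether or not a point ever arrives, which is one way to see why this design wastes space on sparse metrics.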

Graphite

Carbon: Twisted powered metrics processing daemon.

Whisper: time-series db library based on RRD principles.

Timestamp value is verified against its position during retrieval.

Multi-Archive Storage and Retrieval Behavior.
File per time-series.
Doesn't scale well: more series means more file descriptors.
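
A simplified sketch of multi-archive retrieval, assuming archives are listed finest-first as (seconds per point, number of points) and that a query is served from the finest archive whose retention still covers the requested start time. This is an illustration of the principle, not Whisper's actual code; `pick_archive` is a hypothetical helper.

```python
def pick_archive(archives, now, from_time):
    """Choose the finest archive whose retention reaches back to `from_time`.

    archives: list of (seconds_per_point, points), finest resolution first.
    Falls back to the coarsest archive when nothing reaches back far enough.
    """
    age = now - from_time
    for seconds_per_point, points in archives:
        retention = seconds_per_point * points
        if retention >= age:
            return seconds_per_point, points
    return archives[-1]

# Hypothetical retention: 10s for 1 hour, 1min for 1 day, 1h for 30 days.
archives = [(10, 360), (60, 1440), (3600, 720)]
now = 1_000_000
print(pick_archive(archives, now, now - 1800))   # recent: finest archive
print(pick_archive(archives, now, now - 7200))   # older: coarser archive
```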

OpenTSDB

Runs on Hadoop and HBase. Highly Scalable.

Since v2.0, provides a good plug-in architecture.

Involves a lot of moving parts (Hadoop, HBase, Zookeeper).
All need to be managed.

DownSampling for graphs; not to feed into calculations.

KairosDB

A kind of rewrite of OpenTSDB (not a fork) that runs on Cassandra. Highly Scalable.

Keeps data and presentation of data separate.

InfluxDB

Series of Measurements + Unique Tagset.
Data points have fields and a timestamp in nanosecond epoch.
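
A small sketch of that data model, assuming a series is identified by its measurement name plus its sorted tag set. The string layout and the `series_key` helper are illustrative assumptions, not a claim about how InfluxDB stores keys internally.

```python
def series_key(measurement, tags):
    """Build a canonical key: same measurement + same tags = same series."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_part}" if tag_part else measurement

# A hypothetical data point: fields carry the values, tags identify the series.
point = {
    "measurement": "cpu",
    "tags": {"region": "us", "host": "a"},
    "fields": {"user": 12.5, "system": 3.1},
    "timestamp_ns": 1445000000 * 10**9,  # nanoseconds since the epoch
}
print(series_key(point["measurement"], point["tags"]))
```

Sorting the tags makes the key independent of the order in which a client happens to send them.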

No external dependencies. Ordered k/v.
Started with LevelDB, then RocksDB.
Defaults to BoltDB currently (v0.9.1, I think).

WAL to help BoltDB manage its memory swiftly.

Over HTTP. Useful SQL-ish language for data query.

Had Protobufs, now raw bytes.

Druid

Lambda Architecture

Netflix's Atlas

Near real-time graphing for operational insights at scale.

Predictable Alerting (like a lot less traffic than predicted)

Netflix handles more than 1TB analytics data/day with it.

In-memory (complete for 6hrs, roll-ups for 2 weeks). Pain.

Persists raw data in S3. Uses Hive to process old data.

Prometheus

A Service Monitoring System with built-in TSDB
by SoundCloud.

Has a query language, alerting and visualization.

Data model like OpenTSDB's. Metric names,
labelled with key-values.

Can tune how much data is handled in RAM and on disk (LevelDB).

Blueflood

By Rackspace Cloud Monitoring Team
for RealTime Analytics.

Auto-purges, not ideal for Batch Tasks on old data.

Uses Cassandra for data storage.
Optional support of Zookeeper and ElasticSearch.

Kdb+ (commercial)

Columnar High Performance DB.
Built-in array language 'q' to work directly on data.
Can be used for streaming, real-time and historical data.

OLTP from 100 thousand to 1 million records/second/cpu.
OLAP from 1 million to 100 million records/second/cpu.

Popular in Financial Sectors.
Customers: Goldman Sachs, JP Morgan, Deutsche Bank, etc.
Also in Utilities, Telecom, Pharmaceuticals, Oil-n-Gas sectors.

SaaS model over Kdb+ at TimeSeries.guru

32-bit version free for Dev/PoC tasks, not commercial use. (1GB RAM)

SiteWhere (community edition)

It runs on MongoDB or Hadoop/HBase.

Provides 'Complex Event Processing' via Siddhi.

Provides search and analytics via Apache Solr.

Connect devices with MQTT, AMQP, Stomp, other protocols.

SaaS; IoT focussed; REST registration; Arduino and Android

tempoiq

legacy: tempodb, @gigaom; (commercial)

Focussed on IoT sensor data
for analysis, dashboarding and reporting.

Connect anything with a flexible event data model, HTTPS, MQTT

something I was working on

MomentDB/GoShare

  • TimeSeries arranged as NameSpaced Keys
    httpd:ERROR:2015:10:16:54:45:34 = yada | 2015:10:16:54:57:34:httpd:ERROR = nada
  • HTTP and ZeroMQ support for now

W.I.P.
  • Optional Datastore Layer (just k/v or namespaced)
  • Distributed+Optimized Store
    (Shards, Replicators, Buffers, Compression)
  • Delegated DownSampler and Predicting Engines
  • MQTT support; also MsgPack -or- Cap'n Proto
It's about TIME.

Questions

- Ranking of TimeSeries DBEngines
- feedback/contributions 'MomentDB/GoShare'
- this presentation @quick link WIP