Bulk Loading Time-Series Data @ TempoDB
(Cross-posted to blog.tempo-db.com)
In addition to our REST API and language-specific client libraries, we now offer the ability to bulk import data by uploading CSVs. The intent of this feature is to support the initial load of large amounts of historical data (many millions or billions of data points). By sending us CSVs instead of writing every point through the API, customers avoid having to build and monitor a large one-time ingest job, and the problem is reduced to CSV generation.
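To make that concrete, here is a minimal sketch of what CSV generation might look like on the customer side. The column layout (series key, ISO-8601 timestamp, value) is purely illustrative, as are the DataPoint class and file path; the format our importer actually accepts is documented in the Support Center.

```scala
import java.io.PrintWriter

// A hypothetical data point: a series key, an ISO-8601 timestamp, and a value.
case class DataPoint(seriesKey: String, timestamp: String, value: Double)

object CsvExport {
  // Write one comma-separated row per data point. This column layout is
  // illustrative only; see the Support Center for the exact format the
  // bulk importer expects.
  def writeCsv(points: Seq[DataPoint], path: String): Unit = {
    val out = new PrintWriter(path)
    try points.foreach(p => out.println(s"${p.seriesKey},${p.timestamp},${p.value}"))
    finally out.close()
  }
}

// Example usage:
// CsvExport.writeCsv(Seq(DataPoint("thermostat.1", "2013-01-01T00:00:00Z", 21.5)), "points.csv")
```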
How It Works

Behind the scenes, we leverage the immutability of the underlying data files in HBase (our main time-series data store), as well as some of the distribution primitives offered by our Hadoop cluster. We use HDFS (a distributed file system), MapReduce (Hadoop's distributed batch-job framework), and Scala to transform the provided CSVs into HFiles, the file format HBase uses internally. After generating these HFiles, it's just a matter of directing HBase to add them to its collection of data files, which is a fast and efficient process.
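As a rough illustration, the generation step might look like the MapReduce job sketched below. This is a minimal sketch rather than our production pipeline: the CSV column layout, the table name ("timeseries"), the column family ("d"), and the row-key scheme (series key plus timestamp) are all assumptions, and the HFileOutputFormat/HTable classes are the older HBase MapReduce helpers.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Parses one CSV row into an HBase KeyValue. The column layout, column
// family ("d"), and row-key scheme are illustrative assumptions.
class CsvToKeyValueMapper
    extends Mapper[LongWritable, Text, ImmutableBytesWritable, KeyValue] {

  override def map(offset: LongWritable, line: Text,
      ctx: Mapper[LongWritable, Text, ImmutableBytesWritable, KeyValue]#Context): Unit = {
    val cols = line.toString.split(",")
    if (cols.length == 3) {
      val Array(seriesKey, ts, value) = cols
      val rowKey = Bytes.toBytes(s"$seriesKey:$ts")
      val kv = new KeyValue(rowKey, Bytes.toBytes("d"), Bytes.toBytes("v"),
                            Bytes.toBytes(value.toDouble))
      ctx.write(new ImmutableBytesWritable(rowKey), kv)
    }
  }
}

object GenerateHFilesJob {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val job = Job.getInstance(conf, "csv-to-hfiles")
    job.setJarByClass(classOf[CsvToKeyValueMapper])
    job.setMapperClass(classOf[CsvToKeyValueMapper])
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    job.setInputFormatClass(classOf[TextInputFormat])

    FileInputFormat.addInputPath(job, new Path(args(0)))   // CSVs on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // HFile output dir

    // Sets up the reducer, partitioner, and total ordering so the generated
    // HFiles line up with the target table's existing regions.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "timeseries"))

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```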
Immutability

HBase has an interesting property that permits us to create and load this data out-of-band: its underlying data files are immutable. While the data in HBase is itself mutable (you can update records as you see fit), the HFiles that hold the data never change (with the exception of compactions, which can be ignored here). HBase reconciles the two by searching HFiles in reverse chronological order for relevant values, so updates (in more recent HFiles) are found and returned instead of older, out-dated values. For our data load, by telling HBase to adopt the generated files into its set of store files, we're effectively doing a single bulk write of the entire data set to the database. In reality, due to the distributed nature of HBase, it isn't literally a single bulk write but one per region server (each server holds a portion of the data), but the general idea is the same.
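That final "adopt the files" step can be driven with HBase's LoadIncrementalHFiles tool (also exposed on the command line as completebulkload). The table name below is a hypothetical placeholder matching the sketch above.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

object CompleteBulkLoad {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    // Hands the finished HFiles (args(0) is the MapReduce output dir) to the
    // region servers. Each server adopts the files falling in its key range;
    // no data is rewritten, so the step is fast regardless of data volume.
    val loader = new LoadIncrementalHFiles(conf)
    loader.doBulkLoad(new Path(args(0)), new HTable(conf, "timeseries"))
  }
}
```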
Distribution and Fault Tolerance

At a fundamental level, we work hard to run only horizontally scalable systems, and part of that is the understanding that although all sorts of things can and will go wrong, the system as a whole needs to keep running. For a large data set (say, 10 billion data points), the batch job to process the CSVs and generate HFiles can take hours or days, and over a span like that, hitting an error and starting over would be very painful. To address this, we leverage the built-in fault-tolerance and distribution mechanisms provided by Hadoop's MapReduce framework. This system, coupled with HDFS, means that every step of the process can survive the usual server, network, and other random failures, and the job as a whole will recover.
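Much of that resilience comes from MapReduce's per-task retries and speculative execution. The snippet below shows the kind of knobs involved; the values are illustrative rather than our production settings, and the property names are the Hadoop 2.x (MRv2) spellings.

```scala
import org.apache.hadoop.conf.Configuration

// Illustrative settings only; the framework retries failed tasks by default.
val conf = new Configuration()
conf.setInt("mapreduce.map.maxattempts", 8)        // retry a failed map task up to 8 times
conf.setInt("mapreduce.reduce.maxattempts", 8)     // same for reduce tasks
conf.setBoolean("mapreduce.map.speculative", true) // re-run straggler tasks on another node
```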
This feature is new, and (at the moment) we accept a fairly limited CSV format. More information on how to bulk import your data can be found in our Support Center. We're eager for feedback on how you'd like to see it fit into your process. Drop us a note at support@tempo-db.com to discuss the specifics of your workflow.