Notes on Cassandra batches

htsjk.com, 2019-06-06 22:05

https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/

However, a lot of people are used to databases where explicit batching is a performance improvement. If you do this in Cassandra, you are very likely to see performance drop. You'd end up with some code like this (the code to build a single bound statement has been extracted out to a helper method):

Looks good, right? Surely this means we get to send all our inserts in one go and the database can handle them in one storage action? Well, put simply, no. Cassandra is a distributed database; no single node can handle this type of insert, even if you had a single replica per partition.


Cassandra is a distributed database: using a BATCH only improves efficiency when all the data in it belongs to the same partition.
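One way to act on this is to group the writes by partition key before deciding what, if anything, to batch. The following is a minimal sketch in plain Java with no driver involved; the `Event` class and the assumption that `customerId` is the partition key are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionGrouping {
    /** Hypothetical row; customerId is ASSUMED to be the partition key. */
    static final class Event {
        final String customerId;
        final String payload;

        Event(String customerId, String payload) {
            this.customerId = customerId;
            this.payload = payload;
        }
    }

    /** Groups events by partition key, preserving first-seen order of keys. */
    static Map<String, List<Event>> groupByPartition(List<Event> events) {
        Map<String, List<Event>> groups = new LinkedHashMap<>();
        for (Event e : events) {
            groups.computeIfAbsent(e.customerId, k -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("alice", "login"));
        events.add(new Event("bob", "login"));
        events.add(new Event("alice", "purchase"));

        // Each group touches exactly one partition, so it is a candidate
        // for a single-partition batch; different groups stay separate.
        for (Map.Entry<String, List<Event>> g : groupByPartition(events).entrySet()) {
            System.out.println(g.getKey() + " -> " + g.getValue().size() + " statement(s)");
        }
    }
}
```

Each resulting group could then be sent as its own batch, or as individual asynchronous inserts, instead of one large batch spanning many partitions.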

Cassandra anti-pattern: Logged batches

I've previously blogged about other anti-patterns:

Distributed joins
Unlogged batches

This post is similar to the unlogged batches post but is instead about logged batches.
We'll again go through an example Java application.

The good news is that the common misuse is virtually the same as in the last article on unlogged batches, so you know what not to do. The bad news is that if you do happen to misuse them, the result is even worse!

Let's see why. Logged batches are used to ensure that all the statements will eventually succeed. Cassandra achieves this by first writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over.

Now that sounds like a lot of work. So if you try to use logged batches as a performance improvement, you'll be very disappointed! For a logged batch with 8 insert statements (equally distributed) in an 8-node cluster, it will look something like this:

The coordinator has to do a lot more work than any other node in the cluster. Whereas if we were to just do them as regular inserts, it would look like this:

A nice even workload.
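To put rough numbers on the two pictures, here is a toy model in Java. It is not a measurement of Cassandra: it only counts the requests each node has to coordinate, and it assumes (hypothetically) that each regular insert is routed straight to a different node, with round-robin standing in for token-aware routing.

```java
import java.util.Arrays;

public class CoordinatorLoad {
    /** Toy model: requests coordinated per node for ONE logged batch. */
    static int[] loggedBatch(int nodes, int statements, int coordinator) {
        int[] coordinated = new int[nodes];
        coordinated[coordinator] += 2;          // batch log copies sent to two other nodes
        coordinated[coordinator] += statements; // every statement fans out from this node
        return coordinated;
    }

    /** Toy model: regular inserts, ASSUMED to be spread evenly across nodes. */
    static int[] individualInserts(int nodes, int statements) {
        int[] coordinated = new int[nodes];
        for (int i = 0; i < statements; i++) {
            coordinated[i % nodes] += 1; // round-robin stands in for token-aware routing
        }
        return coordinated;
    }

    public static void main(String[] args) {
        System.out.println("logged batch:    " + Arrays.toString(loggedBatch(8, 8, 0)));
        System.out.println("regular inserts: " + Arrays.toString(individualInserts(8, 8)));
    }
}
```

In this model the batch piles all ten requests onto one coordinator, while the regular inserts leave every node with a single request to handle.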

So when would you want to use logged batches?

Short answer: consistent denormalisation. In most cases you won't want to use them, because they are a performance hit. However, for tables that hold denormalised copies of the same data, you can use a logged batch to make sure that both statements eventually succeed. Let's go back to our customer event table from the previous post but also add a customer events by staff id table:
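As a rough illustration of the batch itself, the sketch below builds the CQL text for a logged batch that writes the same event to both tables. The table and column names (`customer_events`, `customer_events_by_staff_id`, `customer_id`, `staff_id`, `time`, `event_type`) are assumptions made for this example, not the schema from the original post, and the helper only assembles a string rather than talking to a cluster.

```java
import java.util.Arrays;
import java.util.List;

public class LoggedBatchCql {
    /** Wraps CQL statements in a batch; plain BEGIN BATCH is logged by default. */
    static String loggedBatch(List<String> statements) {
        StringBuilder cql = new StringBuilder("BEGIN BATCH\n");
        for (String s : statements) {
            cql.append("  ").append(s).append(";\n");
        }
        return cql.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        // Hypothetical tables: the same event, keyed once by customer and once by staff.
        String byCustomer = "INSERT INTO customer_events "
                + "(customer_id, time, staff_id, event_type) VALUES (?, ?, ?, ?)";
        String byStaff = "INSERT INTO customer_events_by_staff_id "
                + "(staff_id, time, customer_id, event_type) VALUES (?, ?, ?, ?)";

        System.out.println(loggedBatch(Arrays.asList(byCustomer, byStaff)));
    }
}
```

The batch here is small and deliberate: two statements that must stay in step, rather than a bulk-loading mechanism.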