Cassandra data writes: the CommitLog

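Cassandra syncs the commit log to disk in one of two ways.

References: Cassandra-Architecture CommitLog; Cassandra persistence - Durability

One is to set commitlog_sync to periodic, the periodic mode; the other is batch.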

The default (Cassandra 1.2.19/3.0.0) is periodic, with a sync period of 10000 ms:

#commitlog_sync: batch
#commitlog_sync_batch_window_in_ms: 50
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

In periodic mode there is a potential risk of losing data. Let's look at how the two modes are implemented. The rough call order is:

StorageProxy. ->WritePerformer.apply()->counterWriteTask()/sendToHintedEndpoints()->(CounterMutation/Mutation).apply()->Mutation.apply()->Keyspace.apply()->CommitLog.instance.add(mutation). The part to focus on is CommitLog.instance.add(mutation).

CommitLog.instance.add(mutation)

public ReplayPosition add(Mutation mutation)
{
    assert mutation != null;

    long size = Mutation.serializer.serializedSize(mutation, MessagingService.current_version);

    long totalSize = size + ENTRY_OVERHEAD_SIZE;
    if (totalSize > MAX_MUTATION_SIZE)
    {
        throw new IllegalArgumentException(String.format("Mutation of %s bytes is too large for the maxiumum size of %s",
                                                         totalSize, MAX_MUTATION_SIZE));
    }

    Allocation alloc = allocator.allocate(mutation, (int) totalSize);
    try
    {
        ICRC32 checksum = CRC32Factory.instance.create();
        final ByteBuffer buffer = alloc.getBuffer();
        BufferedDataOutputStreamPlus dos = new DataOutputBufferFixed(buffer);

        // checksummed length
        dos.writeInt((int) size);
        checksum.update(buffer, buffer.position() - 4, 4);
        buffer.putInt(checksum.getCrc());

        int start = buffer.position();
        // checksummed mutation
        Mutation.serializer.serialize(mutation, dos, MessagingService.current_version);
        checksum.update(buffer, start, (int) size);
        buffer.putInt(checksum.getCrc());
    }
    catch (IOException e)
    {
        throw new FSWriteError(e, alloc.getSegment().getPath());
    }
    finally
    {
        alloc.markWritten();
    }
    executor.finishWriteFor(alloc);
    return alloc.getReplayPosition();
}
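
The record layout that add() writes (length, CRC of the length, serialized mutation, running CRC) can be tried outside Cassandra. The following is a minimal sketch, assuming java.util.zip.CRC32 in place of Cassandra's ICRC32 and a plain byte array in place of a serialized Mutation. FramedEntry, frame and read are hypothetical names for illustration, not Cassandra APIs.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class FramedEntry {
    // 4-byte length + 4-byte CRC of the length + 4-byte trailing CRC,
    // matching the three int writes visible in add() (assumed overhead).
    static final int ENTRY_OVERHEAD = 12;

    // Writes: [length][crc(length)][payload][crc(length + payload)]
    static ByteBuffer frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + ENTRY_OVERHEAD);
        CRC32 crc = new CRC32();

        // checksummed length
        buf.putInt(payload.length);
        crc.update(buf.array(), 0, 4);
        buf.putInt((int) crc.getValue());

        // checksummed payload; the same CRC object keeps accumulating,
        // so the trailing value covers the length bytes as well
        buf.put(payload);
        crc.update(payload, 0, payload.length);
        buf.putInt((int) crc.getValue());

        buf.flip();
        return buf;
    }

    // Reads one entry back and rejects it if either checksum does not match.
    static byte[] read(ByteBuffer framed) {
        ByteBuffer buf = framed.duplicate();
        CRC32 crc = new CRC32();

        int length = buf.getInt();
        crc.update(ByteBuffer.allocate(4).putInt(length).array(), 0, 4);
        if ((int) crc.getValue() != buf.getInt())
            throw new IllegalStateException("corrupt length");

        byte[] payload = new byte[length];
        buf.get(payload);
        crc.update(payload, 0, length);
        if ((int) crc.getValue() != buf.getInt())
            throw new IllegalStateException("corrupt payload");
        return payload;
    }
}
```

Checking the length CRC before allocating the payload array means a corrupted length field is caught before it can drive a bogus allocation, which is presumably why the length gets its own checksum.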

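This method only writes into the buffer; nothing is flushed to disk here. This is where the two modes mentioned earlier, periodic and batch, come in. The key call is executor.finishWriteFor(alloc), which calls maybeWaitForSync(), an abstract method implemented in BatchCommitLogService and PeriodicCommitLogService.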

public void finishWriteFor(Allocation alloc)
{
    maybeWaitForSync(alloc);
    written.incrementAndGet();
}
protected abstract void maybeWaitForSync(Allocation alloc);

The implementation in BatchCommitLogService:

protected void maybeWaitForSync(CommitLogSegment.Allocation alloc)
{
    // wait until record has been safely persisted to disk
    pending.incrementAndGet();
    alloc.awaitDiskSync(commitLog.metrics.waitingOnCommit);
    pending.decrementAndGet();
}

// In CommitLogSegment; alloc.awaitDiskSync() ends up here
void waitForSync(int position, Timer waitingOnCommit)
{
    while (lastSyncedOffset < position)
    {
        WaitQueue.Signal signal = waitingOnCommit != null ?
                                  syncComplete.register(waitingOnCommit.time()) :
                                  syncComplete.register();
        if (lastSyncedOffset < position)
            signal.awaitUninterruptibly();
        else
            signal.cancel();
    }
}

As long as lastSyncedOffset < position, the writer keeps waiting until lastSyncedOffset >= position, that is, until the buffer region belonging to the current alloc has been flushed.
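
The batch-mode wait can be illustrated with plain Java monitors. This is a minimal sketch, assuming synchronized/wait/notifyAll in place of Cassandra's WaitQueue; SyncGate and its methods are hypothetical names for illustration only.

```java
class SyncGate {
    // Highest offset known to be on disk (guarded by this object's monitor).
    private long lastSyncedOffset = 0;

    // Writer side: block until the sync thread has covered `position`.
    synchronized void awaitSynced(long position) throws InterruptedException {
        while (lastSyncedOffset < position) {
            wait();
        }
    }

    // Sync side: publish the new durable offset and wake every waiting writer.
    synchronized void markSynced(long offset) {
        lastSyncedOffset = Math.max(lastSyncedOffset, offset);
        notifyAll();
    }

    // Small demo: a writer stays blocked until the sync side catches up.
    static boolean writerBlocksUntilSynced() {
        SyncGate gate = new SyncGate();
        Thread writer = new Thread(() -> {
            try {
                gate.awaitSynced(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.setDaemon(true);
        writer.start();
        try {
            writer.join(200);                      // give the writer time to block
            boolean blockedBefore = writer.isAlive();
            gate.markSynced(10);                   // the "flush" reaches offset 10
            writer.join(5000);
            return blockedBefore && !writer.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The condition is re-checked in a loop after every wake-up, which mirrors the double check of lastSyncedOffset < position around signal.awaitUninterruptibly() in the original code.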

The implementation in PeriodicCommitLogService; the key here is waitForSyncToCatchUp():

protected void maybeWaitForSync(CommitLogSegment.Allocation alloc)
{
    if (waitForSyncToCatchUp(Long.MAX_VALUE))
    {
        // wait until periodic sync() catches up with its schedule
        long started = System.currentTimeMillis();
        pending.incrementAndGet();
        while (waitForSyncToCatchUp(started))
        {
            WaitQueue.Signal signal = syncComplete.register(commitLog.metrics.waitingOnCommit.time());
            if (waitForSyncToCatchUp(started))
                signal.awaitUninterruptibly();
            else
                signal.cancel();
        }
        pending.decrementAndGet();
    }
}

waitForSyncToCatchUp()

private boolean waitForSyncToCatchUp(long started)
{
    return started > lastSyncedAt + blockWhenSyncLagsMillis;
}

Here blockWhenSyncLagsMillis is 1.5 times commitlog_sync_period_in_ms:

blockWhenSyncLagsMillis = (int) (DatabaseDescriptor.getCommitLogSyncPeriod() * 1.5);
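
To make the threshold concrete, the check can be isolated as a pure function. This is a minimal sketch under the assumption that only the arithmetic matters; PeriodicGate and mustWait are hypothetical names, not part of Cassandra.

```java
class PeriodicGate {
    final long blockWhenSyncLagsMillis;

    PeriodicGate(long syncPeriodMillis) {
        // Same 1.5x factor as in the original assignment.
        this.blockWhenSyncLagsMillis = (long) (syncPeriodMillis * 1.5);
    }

    // A writer that started at writeStartedAt only waits when the last sync
    // began more than blockWhenSyncLagsMillis before it.
    boolean mustWait(long writeStartedAt, long lastSyncedAt) {
        return writeStartedAt > lastSyncedAt + blockWhenSyncLagsMillis;
    }
}
```

With the default period of 10000 ms the threshold is 15000 ms, so a write that starts up to 15 seconds after the last sync began returns without waiting.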

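Why 1.5 times? My understanding is that it assumes a flush takes about 0.5 of a commitlog_sync_period. That is not guaranteed, though: a flush may take more or less than that. This is where data can potentially be lost. If a flush really does take longer than 0.5 of a commitlog_sync_period, there is no guarantee that data whose write has already completed has actually reached disk. More generally, as the code above shows, a periodic-mode write returns immediately whenever the last sync is recent enough, so it never waits for its own data to be synced.
The actual flush code is in the start() method of AbstractCommitLogService: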

long syncStarted = System.currentTimeMillis();
commitLog.sync(shutdown);
lastSyncedAt = syncStarted;
syncComplete.signalAll();

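commitLog.sync()->segment.sync()->write(startMarker, sectionEnd). write is implemented in CompressedSegment and MemoryMappedSegment, and both end in a force-to-disk call (channel.force() for the compressed segment; the memory-mapped segment forces its mapped buffer).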
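
The difference between writing and forcing can be seen with the plain NIO API. This is a minimal sketch, assuming an ordinary temporary file rather than a commit log segment; ForceDemo, appendAndForce and demo are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ForceDemo {
    // Appends data to the file, then asks the OS to push it to the storage device.
    static long appendAndForce(Path file, byte[] data) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                channel.write(buf);   // may only reach the OS page cache
            }
            channel.force(true);      // flush file content and metadata
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes five bytes to a fresh temporary file and returns the resulting size.
    static long demo() {
        try {
            Path file = Files.createTempFile("commitlog-sketch", ".log");
            try {
                return appendAndForce(file, "hello".getBytes(StandardCharsets.UTF_8));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Until force() returns, a completed write() may still live only in memory, which is the window that the periodic mode leaves open between syncs.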

posted @ 2015-05-26 16:11  东岸往事