因Compaction的实现细节比较繁琐,故分为两部分进行源码剖析,本篇为下篇,主讲compaction的实际work及后续工作。
再回到DBImpl::BackgroundCompaction:
void DBImpl::BackgroundCompaction() { mutex_.AssertHeld(); if (imm_ != NULL) { CompactMemTable(); return; } Compaction* c; bool is_manual = (manual_compaction_ != NULL); InternalKey manual_end; if (is_manual) { ManualCompaction* m = manual_compaction_; c = versions_->CompactRange(m->level, m->begin, m->end); m->done = (c == NULL); if (c != NULL) { manual_end = c->input(0, c->num_input_files(0) - 1)->largest; } Log(options_.info_log, "Manual compaction at level-%d from %s .. %s; will stop at %s\n", m->level, (m->begin ? m->begin->DebugString().c_str() : "(begin)"), (m->end ? m->end->DebugString().c_str() : "(end)"), (m->done ? "(end)" : manual_end.DebugString().c_str())); } else { c = versions_->PickCompaction(); } Status status; if (c == NULL) { // Nothing to do } else if (!is_manual && c->IsTrivialMove()) { // Move file to next level assert(c->num_input_files(0) == 1); FileMetaData* f = c->input(0, 0); c->edit()->DeleteFile(c->level(), f->number); c->edit()->AddFile(c->level() + 1, f->number, f->file_size, f->smallest, f->largest); status = versions_->LogAndApply(c->edit(), &mutex_); if (!status.ok()) { RecordBackgroundError(status); } VersionSet::LevelSummaryStorage tmp; Log(options_.info_log, "Moved #%lld to level-%d %lld bytes %s: %s\n", static_cast(f->number), c->level() + 1, static_cast(f->file_size), status.ToString().c_str(), versions_->LevelSummary(&tmp)); } else { CompactionState* compact = new CompactionState(c); status = DoCompactionWork(compact); if (!status.ok()) { RecordBackgroundError(status); } CleanupCompaction(compact); c->ReleaseInputs(); DeleteObsoleteFiles(); } delete c; if (status.ok()) { // Done } else if (shutting_down_.Acquire_Load()) { // Ignore compaction errors found during shutting down } else { Log(options_.info_log, "Compaction error: %s", status.ToString().c_str()); } if (is_manual) { ManualCompaction* m = manual_compaction_; if (!status.ok()) { m->done = true; } if (!m->done) { // We only compacted part of the requested range. Update *m // to the range that is left to be compacted. m->tmp_storage = manual_end; m->begin = &m->tmp_storage; } manual_compaction_ = NULL; } }
Compaction上篇已经把minor compaction(CompactMemTable)、手动合并前期(CompactRange)、自动合并前期(PickCompaction)等一一分析完,对于major compaction对应的手动和自动均已构造好Compaction对象,让我们看后续,有个!is_manual && c->IsTrivialMove()条件分支:
bool Compaction::IsTrivialMove() const { // Avoid a move if there is lots of overlapping grandparent data. // Otherwise, the move could create a parent file that will require // a very expensive merge later on. return (num_input_files(0) == 1 && num_input_files(1) == 0 && TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes); }
在非手动且满足IsTrivialMove的条件下是不需要进行文件合并的,可以直接推至下一层。
下面看真正进行compaction work的分支,基于Compaction对象构造CompactionState,来看DoCompactionWork:
Status DBImpl::DoCompactionWork(CompactionState* compact) { const uint64_t start_micros = env_->NowMicros(); int64_t imm_micros = 0; // Micros spent doing imm_ compactions Log(options_.info_log, "Compacting %d@%d + %d@%d files", compact->compaction->num_input_files(0), compact->compaction->level(), compact->compaction->num_input_files(1), compact->compaction->level() + 1); assert(versions_->NumLevelFiles(compact->compaction->level()) > 0); assert(compact->builder == NULL); assert(compact->outfile == NULL); if (snapshots_.empty()) { compact->smallest_snapshot = versions_->LastSequence(); } else { compact->smallest_snapshot = snapshots_.oldest()->number_; } // Release mutex while we're actually doing the compaction work mutex_.Unlock(); Iterator* input = versions_->MakeInputIterator(compact->compaction); input->SeekToFirst(); Status status; ParsedInternalKey ikey; std::string current_user_key; bool has_current_user_key = false; SequenceNumber last_sequence_for_key = kMaxSequenceNumber; for (; input->Valid() && !shutting_down_.Acquire_Load(); ) { // Prioritize immutable compaction work if (has_imm_.NoBarrier_Load() != NULL) { const uint64_t imm_start = env_->NowMicros(); mutex_.Lock(); if (imm_ != NULL) { CompactMemTable(); bg_cv_.SignalAll(); // Wakeup MakeRoomForWrite() if necessary } mutex_.Unlock(); imm_micros += (env_->NowMicros() - imm_start); } Slice key = input->key(); if (compact->compaction->ShouldStopBefore(key) && compact->builder != NULL) { status = FinishCompactionOutputFile(compact, input); if (!status.ok()) { break; } } // Handle key/value, add to state, etc. bool drop = false; if (!ParseInternalKey(key, &ikey)) { // Do not hide error keys current_user_key.clear(); has_current_user_key = false; last_sequence_for_key = kMaxSequenceNumber; } else { if (!has_current_user_key || user_comparator()->Compare(ikey.user_key, Slice(current_user_key)) != 0) { // First occurrence of this user key current_user_key.assign(ikey.user_key.data(), ikey.user_key.size()); has_current_user_key = true; last_sequence_for_key = kMaxSequenceNumber; } if (last_sequence_for_key <= compact->smallest_snapshot) { // Hidden by an newer entry for same user key drop = true; // (A) } else if (ikey.type == kTypeDeletion && ikey.sequence <= compact->smallest_snapshot && compact->compaction->IsBaseLevelForKey(ikey.user_key)) { // For this user key: // (1) there is no data in higher levels // (2) data in lower levels will have larger sequence numbers // (3) data in layers that are being compacted here and have // smaller sequence numbers will be dropped in the next // few iterations of this loop (by rule (A) above). // Therefore this deletion marker is obsolete and can be dropped. drop = true; } last_sequence_for_key = ikey.sequence; } #if 0 Log(options_.info_log, " Compact: %s, seq %d, type: %d %d, drop: %d, is_base: %d, " "%d smallest_snapshot: %d", ikey.user_key.ToString().c_str(), (int)ikey.sequence, ikey.type, kTypeValue, drop, compact->compaction->IsBaseLevelForKey(ikey.user_key), (int)last_sequence_for_key, (int)compact->smallest_snapshot); #endif if (!drop) { // Open output file if necessary if (compact->builder == NULL) { status = OpenCompactionOutputFile(compact); if (!status.ok()) { break; } } if (compact->builder->NumEntries() == 0) { compact->current_output()->smallest.DecodeFrom(key); } compact->current_output()->largest.DecodeFrom(key); compact->builder->Add(key, input->value()); // Close output file if it is big enough if (compact->builder->FileSize() >= compact->compaction->MaxOutputFileSize()) { status = FinishCompactionOutputFile(compact, input); if (!status.ok()) { break; } } } input->Next(); } if (status.ok() && shutting_down_.Acquire_Load()) { status = Status::IOError("Deleting DB during compaction"); } if (status.ok() && compact->builder != NULL) { status = FinishCompactionOutputFile(compact, input); } if (status.ok()) { status = input->status(); } delete input; input = NULL; CompactionStats stats; stats.micros = env_->NowMicros() - start_micros - imm_micros; for (int which = 0; which < 2; which++) { for (int i = 0; i < compact->compaction->num_input_files(which); i++) { stats.bytes_read += compact->compaction->input(which, i)->file_size; } } for (size_t i = 0; i < compact->outputs.size(); i++) { stats.bytes_written += compact->outputs[i].file_size; } mutex_.Lock(); stats_[compact->compaction->level() + 1].Add(stats); if (status.ok()) { status = InstallCompactionResults(compact); } if (!status.ok()) { RecordBackgroundError(status); } VersionSet::LevelSummaryStorage tmp; Log(options_.info_log, "compacted to: %s", versions_->LevelSummary(&tmp)); return status; }
关于Snapshot和Version部分在本篇仅强调作用不会展开,后续在单独的文章中进行详细分析。
在做真正的compaction work时不应该阻塞DB的读写操作,所以先解锁;接着基于构造好的compaction调用VersionSet::MakeInputIterator生成迭代器TwoLevelIterator(用于非0level sstable文件的数据读取),对于0level直接生成加载到table cache的sstable迭代器,以及用于merge全部文件的MergingIterator,此处实现非常巧妙,依靠迭代器设计简洁的完成merge工作。
再就是基于MergingIterator做merge的过程,输入是全部待merge文件的数据迭代器,每个文件可看作有序list,实际可看作多路归并排序,每次迭代current_(包装的对SSTable进行遍历的IteratorWrapper)指向通过FindSmallest对每个sstable有序列表当前迭代位置进行遍历比较后返回的对应sstable迭代器(比较是InternalKeyComparator),当输入文件比较多时这里可以用最小堆进行优化,而不用当前可能会耗时的顺序遍历。
然后对当前的key做分析处理以及输出判断,调用ParseInternalKey正确解析后做drop或输出判断:
1> 后续key与前面重复且前一sequence(随后更新)不超过最小快照则丢弃
2> 当key打删除标记且当前sequence不超过最小快照并且未可能在grandparent以上层出现则丢弃
除上面两种情况外的key均正常输出。
需要注意的是在上面每次迭代都优先判断当前是否需要做minor compaction。
下面则对合并结果的key进行build输出,若当前已build的数据大小超过输出阈值(MaxOutputFileSize)则及时写文件输出并加入TableCache作为有效性验证(FinishCompactionOutputFile)。
对本次Compaction的读写数据情况进行统计记录,最后调用InstallCompactionResults记录本次Compaction的文件增删情况,生成新的版本信息并写入。
至此实质性的compaction work完成。
再回到DBImpl::BackgroundCompaction,做DBImpl::CleanupCompaction(异常处理以及从pending_outputs_中删除已经成功完成compaction的文件编号)和DBImpl::DeleteObsoleteFiles(删除非存活的sstable及其它废弃文件)。
若本次compaction是手动,则需要清除手动标记。