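To make porting to other platforms and custom development easier, leveldb defines platform-independent interfaces for the runtime environment, file system, and threading, and wraps them in the Env class. It ships a default derived class with POSIX semantics (PosixEnv); users can use it as is or supply their own implementation.
First, the basic interface that Env provides: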
class Env {
 public:
  Env() { }
  virtual ~Env();

  // The result of Default() belongs to leveldb and must never be deleted.
  static Env* Default();

  // Create a brand new sequentially-readable file with the specified name.
  // On success, stores a pointer to the new file in *result and returns OK.
  // On failure stores NULL in *result and returns non-OK.  If the file does
  // not exist, returns a non-OK status.
  //
  // The returned file will only be accessed by one thread at a time.
  virtual Status NewSequentialFile(const std::string& fname,
                                   SequentialFile** result) = 0;

  // Create a brand new random access read-only file with the
  // specified name.  On success, stores a pointer to the new file in
  // *result and returns OK.  On failure stores NULL in *result and
  // returns non-OK.  If the file does not exist, returns a non-OK
  // status.
  //
  // The returned file may be concurrently accessed by multiple threads.
  virtual Status NewRandomAccessFile(const std::string& fname,
                                     RandomAccessFile** result) = 0;

  // Create an object that writes to a new file with the specified
  // name.  Deletes any existing file with the same name and creates a
  // new file.  On success, stores a pointer to the new file in
  // *result and returns OK.  On failure stores NULL in *result and
  // returns non-OK.
  //
  // The returned file will only be accessed by one thread at a time.
  virtual Status NewWritableFile(const std::string& fname,
                                 WritableFile** result) = 0;

  // Create an object that either appends to an existing file, or
  // writes to a new file (if the file does not exist to begin with).
  // On success, stores a pointer to the new file in *result and
  // returns OK.  On failure stores NULL in *result and returns
  // non-OK.
  //
  // The returned file will only be accessed by one thread at a time.
  //
  // May return an IsNotSupportedError error if this Env does
  // not allow appending to an existing file.  Users of Env (including
  // the leveldb implementation) must be prepared to deal with
  // an Env that does not support appending.
  virtual Status NewAppendableFile(const std::string& fname,
                                   WritableFile** result);

  // Returns true iff the named file exists.
  virtual bool FileExists(const std::string& fname) = 0;

  // Store in *result the names of the children of the specified directory.
  // The names are relative to "dir".
  // Original contents of *results are dropped.
  virtual Status GetChildren(const std::string& dir,
                             std::vector<std::string>* result) = 0;

  // Delete the named file.
  virtual Status DeleteFile(const std::string& fname) = 0;

  // Create the specified directory.
  virtual Status CreateDir(const std::string& dirname) = 0;

  // Delete the specified directory.
  virtual Status DeleteDir(const std::string& dirname) = 0;

  // Store the size of fname in *file_size.
  virtual Status GetFileSize(const std::string& fname, uint64_t* file_size) = 0;

  // Rename file src to target.
  virtual Status RenameFile(const std::string& src,
                            const std::string& target) = 0;

  // Lock the specified file.  Used to prevent concurrent access to
  // the same db by multiple processes.  On failure, stores NULL in
  // *lock and returns non-OK.
  //
  // On success, stores a pointer to the object that represents the
  // acquired lock in *lock and returns OK.  The caller should call
  // UnlockFile(*lock) to release the lock.  If the process exits,
  // the lock will be automatically released.
  //
  // If somebody else already holds the lock, finishes immediately
  // with a failure.  I.e., this call does not wait for existing locks
  // to go away.
  //
  // May create the named file if it does not already exist.
  virtual Status LockFile(const std::string& fname, FileLock** lock) = 0;

  // Release the lock acquired by a previous successful call to LockFile.
  // REQUIRES: lock was returned by a successful LockFile() call
  // REQUIRES: lock has not already been unlocked.
  virtual Status UnlockFile(FileLock* lock) = 0;

  // Arrange to run "(*function)(arg)" once in a background thread.
  //
  // "function" may run in an unspecified thread.  Multiple functions
  // added to the same Env may run concurrently in different threads.
  // I.e., the caller may not assume that background work items are
  // serialized.
  virtual void Schedule(void (*function)(void* arg), void* arg) = 0;

  // Start a new thread, invoking "function(arg)" within the new thread.
  // When "function(arg)" returns, the thread will be destroyed.
  virtual void StartThread(void (*function)(void* arg), void* arg) = 0;

  // *path is set to a temporary directory that can be used for testing. It may
  // or may not have just been created. The directory may or may not differ
  // between runs of the same process, but subsequent calls will return the
  // same directory.
  virtual Status GetTestDirectory(std::string* path) = 0;

  // Create and return a log file for storing informational messages.
  virtual Status NewLogger(const std::string& fname, Logger** result) = 0;

  // Returns the number of micro-seconds since some fixed point in time. Only
  // useful for computing deltas of time.
  virtual uint64_t NowMicros() = 0;

  // Sleep/delay the thread for the prescribed number of micro-seconds.
  virtual void SleepForMicroseconds(int micros) = 0;

 private:
  // No copying allowed
  Env(const Env&);
  void operator=(const Env&);
};
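For its file I/O patterns, leveldb defines three basic file interfaces: SequentialFile, RandomAccessFile, and WritableFile. SequentialFile is for sequential reads, such as reading log files and the MANIFEST file. RandomAccessFile is for random read-only access, such as reading sst files. WritableFile is for sequential writes and can be created in two ways, NewWritableFile and NewAppendableFile. The former deletes any existing file with the same name before creating a new one; the latter appends to an existing file of that name, or writes a new file if none exists. Normal log writing uses the former. When leveldb restarts after an abnormal shutdown, the last log file, which may be reused, is opened with the latter so that writing continues at its end.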
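The difference between the two WritableFile factories can be sketched with plain POSIX open flags. This is only an illustration, assuming a POSIX system; OpenTruncating, OpenAppending, and WriteAndClose are hypothetical helper names, not leveldb functions, and the real PosixEnv may open its files differently.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical helper: opens `path` for writing and discards any previous
// contents, which is the NewWritableFile behaviour described above.
int OpenTruncating(const std::string& path) {
  assert(!path.empty());
  return ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
}

// Hypothetical helper: opens `path` for writing, keeps existing contents and
// positions every write at the end, like NewAppendableFile.
int OpenAppending(const std::string& path) {
  assert(!path.empty());
  return ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
}

// Writes `data` through `fd`, closes it, and returns the resulting file size,
// or -1 on any error.
long WriteAndClose(int fd, const std::string& data) {
  if (fd < 0) return -1;
  ssize_t written = ::write(fd, data.data(), data.size());
  struct stat st;
  bool ok = written == static_cast<ssize_t>(data.size()) &&
            ::fstat(fd, &st) == 0;
  ::close(fd);
  return ok ? static_cast<long>(st.st_size) : -1;
}
```

Writing through OpenAppending grows an existing file, while writing through OpenTruncating always starts again from an empty file.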
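NewRandomAccessFile can instantiate one of two classes. If the number of files currently mapped with mmap is still under the limit, it prefers to create a PosixMmapReadableFile; otherwise it creates a PosixRandomAccessFile. The former is more efficient.
Note the read interfaces defined by SequentialFile and RandomAccessFile, respectively: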
virtual Status Read(size_t n, Slice* result, char* scratch) = 0;
virtual Status Read(uint64_t offset, size_t n, Slice* result, char* scratch) const = 0;
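Compared with a traditional read call there is one extra parameter: result returns the data that was actually read, and scratch is a buffer the implementation may read into. Take care when using it: result may point into scratch, so scratch must stay valid for as long as result is in use. The same pattern appears in many places in leveldb.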
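One way to see why Read takes both result and scratch is to compare an implementation that copies into scratch with one that can hand back a pointer to memory it already owns. The sketch below is a simplified illustration: View is a hypothetical stand-in for leveldb's Slice, and MemFile is an invented in-memory file, not a leveldb class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical stand-in for leveldb::Slice: a non-owning (pointer, length) view.
struct View {
  const char* data;
  size_t size;
};

// Hypothetical in-memory "file" contrasting two ways to fill in *result.
class MemFile {
 public:
  explicit MemFile(std::string contents) : contents_(std::move(contents)) {}

  // Copying style (what a pread-based file has to do): bytes are copied into
  // the caller's scratch buffer and *result points into scratch.
  void ReadCopying(size_t offset, size_t n, View* result, char* scratch) const {
    assert(result != nullptr && scratch != nullptr);
    n = Available(offset, n);
    if (n > 0) std::memcpy(scratch, contents_.data() + offset, n);
    *result = View{scratch, n};
  }

  // Zero-copy style (what an mmap-based file can do): scratch is ignored and
  // *result points straight at the file's own memory.
  void ReadZeroCopy(size_t offset, size_t n, View* result,
                    char* /*scratch*/) const {
    assert(result != nullptr);
    n = Available(offset, n);
    *result = View{n > 0 ? contents_.data() + offset : contents_.data(), n};
  }

 private:
  // Number of bytes that can be read at `offset`, capped at n.
  size_t Available(size_t offset, size_t n) const {
    if (offset >= contents_.size()) return 0;
    return std::min(n, contents_.size() - offset);
  }

  std::string contents_;
};
```

The caller treats both cases the same way: it only reads through result, and keeps scratch alive until it is done with result.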
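Env also defines basic interfaces for file attributes, directory operations, threads, and file locks. The rest of this post walks through the parts of PosixEnv that are tricky or carefully thought out.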
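PosixRandomAccessFile implements RandomAccessFile. Its Read is built on the pread system call rather than the more familiar lseek + read (or fread). pread does not modify the file descriptor's own offset; it positions and reads in a single call, so concurrent reads from multiple threads are safe.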
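A minimal sketch of a positioned read with pread, assuming a POSIX system. ReadAt is a hypothetical helper name; the EINTR retry loop is a choice made for this sketch and is not taken from leveldb.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

#include <string>

// Reads up to n bytes starting at `offset` into scratch using pread(2).
// Returns the number of bytes read (fewer than n near end of file),
// or -1 on error. The file descriptor's own offset is left untouched.
ssize_t ReadAt(int fd, off_t offset, size_t n, char* scratch) {
  assert(scratch != nullptr);
  ssize_t r;
  do {
    r = ::pread(fd, scratch, n, offset);
  } while (r < 0 && errno == EINTR);  // retry if interrupted by a signal
  return r;
}
```

Because no shared file position is involved, several threads can call ReadAt on the same descriptor with different offsets without extra locking.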
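PosixWritableFile implements WritableFile. Its Sync is worth a look. It first calls SyncDirIfManifest: if the file being written is a new MANIFEST file, the directory that contains it is synced so that the new directory entry reaches disk promptly. It then calls fdatasync rather than fsync. fdatasync flushes the file data but skips metadata that is not needed to read that data back, such as access and modification times; metadata that is needed, such as a changed file size, is still flushed. File data and metadata are usually stored in different places on disk, so fsync typically costs at least two I/O operations where fdatasync can often get by with one. The choice is made purely for performance.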
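The sync sequence described above can be sketched as follows, assuming Linux, where fdatasync is available. SyncParentDir and WriteDurably are hypothetical helpers written for this sketch; they sync the directory unconditionally, whereas the text says leveldb does so only for MANIFEST files.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

#include <string>

// fsyncs the directory containing `path` so that a newly created directory
// entry is durable. Returns true on success.
bool SyncParentDir(const std::string& path) {
  assert(!path.empty());
  std::string::size_type sep = path.rfind('/');
  std::string dir = ".";
  if (sep == 0) dir = "/";
  else if (sep != std::string::npos) dir = path.substr(0, sep);
  int fd = ::open(dir.c_str(), O_RDONLY);
  if (fd < 0) return false;
  bool ok = ::fsync(fd) == 0;
  ::close(fd);
  return ok;
}

// Hypothetical helper: creates `path`, writes `data`, then syncs the
// directory entry followed by the file data.
bool WriteDurably(const std::string& path, const std::string& data) {
  assert(!path.empty());
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  bool ok = ::write(fd, data.data(), data.size()) ==
            static_cast<ssize_t>(data.size());
  ok = ok && SyncParentDir(path);   // make the new file name durable
  ok = ok && ::fdatasync(fd) == 0;  // flush data, skip non-essential metadata
  ::close(fd);
  return ok;
}
```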
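PosixMmapReadableFile also implements RandomAccessFile. To give fast random access to read-only files it is built on mmap, which is more efficient than pread because a read becomes a plain memory access.
PosixEnv holds a MmapLimiter member that caps the number of files read through mmap, 1000 at most by default, so that mappings do not consume too much virtual memory. MmapLimiter must be safe for concurrent use by multiple threads, so it is protected by a mutex.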
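The limiter and the fallback from mmap to pread can be sketched as a mutex-protected counter. This is a simplified illustration, not leveldb's MmapLimiter: SlotLimiter, ReadBackend, and ChooseBackend are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical limiter: hands out at most `max_slots` slots at a time.
class SlotLimiter {
 public:
  explicit SlotLimiter(int max_slots) : available_(max_slots) {
    assert(max_slots >= 0);
  }

  // Takes one slot if any is left; returns false when the limit is reached.
  bool Acquire() {
    std::lock_guard<std::mutex> guard(mu_);
    if (available_ <= 0) return false;
    --available_;
    return true;
  }

  // Returns a slot previously obtained with Acquire().
  void Release() {
    std::lock_guard<std::mutex> guard(mu_);
    ++available_;
  }

 private:
  std::mutex mu_;
  int available_;
};

enum class ReadBackend { kMmap, kPread };

// Prefers mmap while slots remain, otherwise falls back to pread.
ReadBackend ChooseBackend(SlotLimiter* limiter) {
  assert(limiter != nullptr);
  return limiter->Acquire() ? ReadBackend::kMmap : ReadBackend::kPread;
}
```

In this sketch an mmap-backed file would call Release() when it is closed, which frees the slot for a later file.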
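PosixLockTable keeps the names of locked files in a set; insertions into and removals from the set are guarded by a mutex.
The file itself is locked with fcntl, using a flock struct that covers the whole file.
The LockFile and UnlockFile methods take care of creating and destroying the PosixFileLock object.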
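The combination of an in-process lock table and an fcntl lock can be sketched as below, assuming a POSIX system. fcntl record locks are owned by the process, so a second lock attempt from the same process would not be refused; the table catches that case. LockTable, TryLockFile, and ReleaseLock are hypothetical names that return a raw file descriptor instead of a PosixFileLock object.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

#include <mutex>
#include <set>
#include <string>

// Applies or clears an exclusive fcntl lock on the whole file.
bool LockOrUnlock(int fd, bool lock) {
  struct flock f;
  std::memset(&f, 0, sizeof(f));
  f.l_type = lock ? F_WRLCK : F_UNLCK;
  f.l_whence = SEEK_SET;
  f.l_start = 0;
  f.l_len = 0;  // 0 means "through end of file", i.e. the whole file
  return ::fcntl(fd, F_SETLK, &f) == 0;  // F_SETLK fails instead of waiting
}

// Tracks file names locked by this process.
class LockTable {
 public:
  bool Insert(const std::string& fname) {
    std::lock_guard<std::mutex> guard(mu_);
    return locked_.insert(fname).second;
  }
  void Remove(const std::string& fname) {
    std::lock_guard<std::mutex> guard(mu_);
    locked_.erase(fname);
  }

 private:
  std::mutex mu_;
  std::set<std::string> locked_;
};

// Returns a locked file descriptor, or -1 if the file is already locked.
int TryLockFile(LockTable* table, const std::string& fname) {
  assert(table != nullptr);
  if (!table->Insert(fname)) return -1;  // already held within this process
  int fd = ::open(fname.c_str(), O_RDWR | O_CREAT, 0644);
  if (fd < 0 || !LockOrUnlock(fd, true)) {
    if (fd >= 0) ::close(fd);
    table->Remove(fname);
    return -1;
  }
  return fd;
}

// Releases a lock obtained from TryLockFile.
void ReleaseLock(LockTable* table, const std::string& fname, int fd) {
  assert(table != nullptr);
  LockOrUnlock(fd, false);
  ::close(fd);
  table->Remove(fname);
}
```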
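Schedule creates the background thread on first use and puts the task to be run (a BGItem) into a queue (std::deque); BGThread takes tasks off the queue and executes them.
StartThread, in turn, creates a new thread and runs the requested function by passing a StartThreadState to StartThreadWrapper.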
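The queue-plus-worker arrangement behind Schedule can be sketched with standard C++ threading primitives. This is a simplified illustration rather than leveldb's code: BackgroundQueue is a hypothetical class, it uses std::thread instead of raw pthreads, and it adds a destructor that drains the queue so the example terminates cleanly.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

class BackgroundQueue {
 public:
  // Runs any remaining items, then stops and joins the worker thread.
  ~BackgroundQueue() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      shutting_down_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) worker_.join();
  }

  // Queues (*function)(arg); the worker thread is started lazily on first use.
  void Schedule(void (*function)(void*), void* arg) {
    assert(function != nullptr);
    std::lock_guard<std::mutex> guard(mu_);
    if (!started_) {
      started_ = true;
      worker_ = std::thread(&BackgroundQueue::Run, this);
    }
    queue_.push_back(Item{function, arg});
    cv_.notify_one();
  }

 private:
  struct Item {
    void (*function)(void*);
    void* arg;
  };

  // Worker loop: pops items in FIFO order and runs them outside the lock.
  void Run() {
    std::unique_lock<std::mutex> lock(mu_);
    for (;;) {
      cv_.wait(lock, [this] { return shutting_down_ || !queue_.empty(); });
      if (queue_.empty()) return;  // only reached once shutdown was requested
      Item item = queue_.front();
      queue_.pop_front();
      lock.unlock();
      item.function(item.arg);
      lock.lock();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Item> queue_;
  bool started_ = false;
  bool shutting_down_ = false;
  std::thread worker_;
};
```

With a single worker, items run in the order they were scheduled, which is a property of this sketch rather than a guarantee of the Env interface.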
Default creates the default PosixEnv singleton; the singleton is implemented with pthread_once.
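A minimal sketch of a pthread_once-based singleton, assuming a POSIX threads implementation. DemoEnv, InitDefaultEnv, and DefaultEnv are hypothetical names standing in for PosixEnv and Env::Default.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for the environment class.
class DemoEnv {};

static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static DemoEnv* g_default_env = nullptr;

// Runs exactly once, no matter how many threads race into DefaultEnv().
static void InitDefaultEnv() {
  g_default_env = new DemoEnv;  // intentionally never deleted
}

// Returns the process-wide instance, creating it on first use.
DemoEnv* DefaultEnv() {
  int rc = pthread_once(&g_once, InitDefaultEnv);
  assert(rc == 0);
  (void)rc;  // avoid an unused-variable warning when NDEBUG is defined
  return g_default_env;
}
```

Never deleting the instance matches the comment on Env::Default(), which says the result belongs to leveldb and must never be deleted.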
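Overall, Env, and PosixEnv in particular, targets the file system and threading facilities of systems such as Linux, choosing among the relevant system calls and C library interfaces to get an efficient implementation.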