This article walks through the nydus[1] source code to understand its workflow. Ongoing code changes may cause the code to diverge from what is recorded here.

1. Environment Setup

    git clone https://github.com/dragonflyoss/image-service.git
    cd image-service
    make

The compiled binaries are placed in the target directory; by default a debug build is produced.

As shown, the project produces three binaries: nydusctl (a command-line tool), nydusd (the main nydus program, which runs as a daemon), and nydus-image (a tool for working with nydus image files).

    all: build

    # Targets that are exposed to developers and users.
    build: .format
        ${CARGO} build $(CARGO_COMMON) $(CARGO_BUILD_FLAGS)
        # Cargo will skip checking if it is already checked
        ${CARGO} clippy $(CARGO_COMMON) --workspace $(EXCLUDE_PACKAGES) --bins --tests -- -Dwarnings

    .format:
        ${CARGO} fmt -- --check

When running make to build the project, cargo fmt -- --check is executed first to check the code formatting.

The nydus version used in this article:

    ./target/debug/nydusd --version

2. Understanding the Code Flow

The project's entry functions live under the src/bin directory:

These correspond to the generated binaries nydusctl, nydusd, and nydus-image. Let's start with the most important part: nydusd.

Nydusd is a daemon running in user space. It can be managed by nydus-snapshotter and is mainly responsible for handling I/O requests delivered through FUSE; when the requested data is not in the local cache, it fetches it from a backend (registry, OSS, localfs).

Command to start nydusd:

    mkdir /rafs_mnt
    ./target/debug/nydusd fuse --thread-num 4 --mountpoint /rafs_mnt --apisock api_sock

2.1 Entry Function

src/bin/nydusd/main.rs

First, argument values are extracted from the command line and logging is enabled.

Next, the subcommand is parsed. nydusd has three subcommands: singleton, fuse, and virtiofs:

For each subcommand, the corresponding arguments (the subcommand's entries within args) are extracted. fuse runs nydusd as a dedicated FUSE server, virtiofs runs it as a dedicated virtiofs server, and singleton runs it as a global daemon that can serve blobcache/fscache/fuse/virtio-fs at the same time.

2.2 FUSE Subcommand Startup Flow

    process_default_fs_service(subargs, bti, apisock, true)?;

    // function declaration
    fn process_default_fs_service(
        args: SubCmdArgs,      // extracted subcommand arguments
        bti: BuildTimeInfo,    // build-time information
        apisock: Option<&str>, // api socket path
        is_fuse: bool,         // whether this is a fuse filesystem
    ) -> Result<()> { /* body omitted for brevity */ }

This function initializes the default filesystem service.

First, a mount command is generated from three parameters:

virtual_mnt is the directory to mount at.

(1) When shared_dir is not empty

    let cmd = FsBackendMountCmd {
        fs_type: nydus::FsBackendType::PassthroughFs,
        source: shared_dir.to_string(),
        config: "".to_string(),
        mountpoint: virtual_mnt.to_string(),
        prefetch_files: None,
    };

(2) When bootstrap is not empty (only the rafs filesystem is used)

It checks whether the localfs-dir argument was passed. If so, the configuration is generated from it; otherwise the config argument must be provided. In addition, the prefetch_files list is parsed:

    let config = match args.value_of("localfs-dir") {
        Some(v) => {
            format!(
                r###"
                {{
                    "device": {{
                        "backend": {{
                            "type": "localfs",
                            "config": {{
                                "dir": {:?},
                                "readahead": true
                            }}
                        }},
                        "cache": {{
                            "type": "blobcache",
                            "config": {{
                                "compressed": false,
                                "work_dir": {:?}
                            }}
                        }}
                    }},
                    "mode": "direct",
                    "digest_validate": false,
                    "iostats_files": false
                }}
                "###,
                v, v
            )
        }
        None => match args.value_of("config") {
            Some(v) => std::fs::read_to_string(v)?,
            None => {
                let e = DaemonError::InvalidArguments(
                    "both --config and --localfs-dir are missing".to_string(),
                );
                return Err(e.into());
            }
        },
    };

    let prefetch_files: Option<Vec<String>> = args
        .values_of("prefetch-files")
        .map(|files| files.map(|s| s.to_string()).collect());

    let cmd = FsBackendMountCmd {
        fs_type: nydus::FsBackendType::Rafs,
        source: b.to_string(),
        config,
        mountpoint: virtual_mnt.to_string(),
        prefetch_files,
    };
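As an illustration, if nydusd were started with --localfs-dir /tmp/blobs (a hypothetical path), the format! template above would expand to roughly this JSON ({:?} renders the path with quotes):

```json
{
  "device": {
    "backend": {
      "type": "localfs",
      "config": { "dir": "/tmp/blobs", "readahead": true }
    },
    "cache": {
      "type": "blobcache",
      "config": { "compressed": false, "work_dir": "/tmp/blobs" }
    }
  },
  "mode": "direct",
  "digest_validate": false,
  "iostats_files": false
}
```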

After the mount command cmd has been generated, a new vfs instance is created from the opts parameter.

    let vfs = fuse_backend_rs::api::Vfs::new(opts);
    let vfs = Arc::new(vfs);

2.3 The Vfs Struct

    /// A union fs that combines multiple backend file systems.
    pub struct Vfs {
        next_super: AtomicU8,
        root: PseudoFs,
        // mountpoints maps from pseudo fs inode to mounted fs mountpoint data
        mountpoints: ArcSwap<HashMap<u64, Arc<MountPointData>>>,
        // superblocks keeps track of all mounted file systems
        superblocks: ArcSuperBlock,
        opts: ArcSwap<VfsOptions>,
        initialized: AtomicBool,
        lock: Mutex<()>,
    }

When a new Vfs instance is created:

    impl Vfs {
        /// Create a new vfs instance
        pub fn new(opts: VfsOptions) -> Self {
            Vfs {
                // the next available pseudo index
                next_super: AtomicU8::new((VFS_PSEUDO_FS_IDX + 1) as u8),
                // mountpoints, a HashMap
                mountpoints: ArcSwap::new(Arc::new(HashMap::new())),
                // superblocks, an array
                superblocks: ArcSwap::new(Arc::new(vec![None; MAX_VFS_INDEX])),
                // root, a PseudoFs instance
                root: PseudoFs::new(),
                // the options passed in
                opts: ArcSwap::new(Arc::new(opts)),
                // lock
                lock: Mutex::new(()),
                // whether it has been initialized
                initialized: AtomicBool::new(false),
            }
        }
        ...
    }

next_super is initialized to 1. The 64-bit inode number is split into two parts: the top 8 bits mark which mounted backend filesystem the inode belongs to, and the remaining 56 bits are left to the backend filesystem, with a maximum value of VFS_MAX_INO:

    /// Maximum inode number supported by the VFS for backend file system
    pub const VFS_MAX_INO: u64 = 0xff_ffff_ffff_ffff;

    // The 64bit inode number for VFS is divided into two parts:
    // 1. an 8-bit file-system index, to identify mounted backend file systems.
    // 2. the left bits are reserved for backend file systems, and it's limited to VFS_MAX_INO.
    const VFS_INDEX_SHIFT: u8 = 56;
    const VFS_PSEUDO_FS_IDX: VfsIndex = 0;
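The index/inode split can be illustrated with a small standalone sketch (not nydus code; the helper names hash_inode and split_inode are made up for illustration):

```rust
// Illustrative sketch: how a 64-bit VFS inode number is composed from an
// 8-bit filesystem index and a 56-bit backend inode number.
const VFS_INDEX_SHIFT: u8 = 56;
const VFS_MAX_INO: u64 = 0xff_ffff_ffff_ffff;

// Pack a filesystem index and a backend inode into one VFS inode number.
fn hash_inode(fs_idx: u8, ino: u64) -> u64 {
    assert!(ino <= VFS_MAX_INO);
    ((fs_idx as u64) << VFS_INDEX_SHIFT) | ino
}

// Recover the (fs index, backend inode) pair from a VFS inode number.
fn split_inode(vfs_ino: u64) -> (u8, u64) {
    ((vfs_ino >> VFS_INDEX_SHIFT) as u8, vfs_ino & VFS_MAX_INO)
}

fn main() {
    let v = hash_inode(1, 42);
    assert_eq!(split_inode(v), (1, 42));
}
```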

The root field of the Vfs struct has type PseudoFs:

    pub struct PseudoFs {
        // the next available inode
        next_inode: AtomicU64,
        // the root inode, a pointer to a PseudoInode
        root_inode: Arc<PseudoInode>,
        // inodes, a HashMap
        inodes: ArcSwap<HashMap<u64, Arc<PseudoInode>>>,
        lock: Mutex<()>, // Write protect PseudoFs.inodes and PseudoInode.children
    }

The PseudoInode type:

    struct PseudoInode {
        // this inode's number
        ino: u64,
        // the parent's inode number
        parent: u64,
        // list of children (pointers to PseudoInode)
        children: ArcSwap<Vec<Arc<PseudoInode>>>,
        name: String,
    }
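As a rough illustration of how the pseudo filesystem ties these structs together, here is a toy model (not nydus code; the names are invented) that registers a root inode and one child for a mountpoint such as /rafs_mnt:

```rust
use std::collections::HashMap;

// Toy model of the PseudoFs inode table: each path component of a
// mountpoint becomes a PseudoInode keyed by its inode number.
struct PseudoInode {
    ino: u64,
    parent: u64,
    name: String,
    children: Vec<u64>,
}

// Build a tiny pseudo tree: "/" (ino 1) with one child "rafs_mnt" (ino 2),
// mimicking what mounting at /rafs_mnt would register.
fn build_tree() -> HashMap<u64, PseudoInode> {
    let mut inodes = HashMap::new();
    inodes.insert(1, PseudoInode { ino: 1, parent: 1, name: "/".to_string(), children: vec![2] });
    inodes.insert(2, PseudoInode { ino: 2, parent: 1, name: "rafs_mnt".to_string(), children: vec![] });
    inodes
}

fn main() {
    let inodes = build_tree();
    // walking from a child back to the root follows the parent field
    assert_eq!(inodes[&2].name, "rafs_mnt");
    assert_eq!(inodes[&inodes[&2].parent].name, "/");
    assert_eq!(inodes[&1].children, vec![2]);
}
```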

A diagram of how the Vfs struct in nydus is composed:

Back to the flow after the vfs instance is created. Next, the daemon_id and supervisor arguments are obtained (they are needed for live upgrade/failover).

Then a NydusDaemon is created from the mount command.

2.4 The NydusDaemon for FUSE

When is_fuse is true, the daemon is created:

(1) Get the number of fuse server threads;

(2) Get the value of the mountpoint argument;

(3) Create the daemon:

    let daemon = {
        fusedev::create_fuse_daemon(
            mountpoint, // mountpoint path
            vfs,        // the vfs instance created above
            supervisor,
            daemon_id,
            threads,    // number of threads
            apisock,    // api socket path
            args.is_present("upgrade"),
            !args.is_present("writable"),
            p,          // failover-policy
            mount_cmd,  // mount command
            bti,
        )
        .map(|d| {
            info!("Fuse daemon started!");
            d
        })
        .map_err(|e| {
            error!("Failed in starting daemon: {}", e);
            e
        })?
    };
    DAEMON_CONTROLLER.set_daemon(daemon);

fusedev::create_fuse_daemon 函数中,主要的逻辑如下:

(1)创建两个 channel

    let (trigger, events_rx) = channel::<DaemonStateMachineInput>();
    let (result_sender, result_receiver) = channel::<DaemonResult<()>>();

A channel is used for inter-thread communication; it returns a sender and a receiver. For example, in (trigger, events_rx), trigger is the sender and events_rx is the receiver.
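A minimal standalone sketch of this pattern (using std::sync::mpsc directly, not the nydus code), in which a spawned thread blocks on recv() like the state-machine loop does:

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // The spawned thread plays the role of the state-machine loop blocking
    // on recv(); the main thread plays the role of the trigger.
    let (trigger, events_rx) = channel::<&'static str>();
    let handle = thread::spawn(move || {
        // recv() blocks until a message arrives.
        let event = events_rx.recv().unwrap();
        assert_eq!(event, "Mount");
    });
    trigger.send("Mount").unwrap();
    handle.join().unwrap();
}
```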

(2) Create the Service instance:

    let service = FusedevFsService::new(vfs, &mnt, supervisor.as_ref(), fp, readonly)?;

    impl FusedevFsService {
        fn new(
            vfs: Arc<Vfs>,
            mnt: &Path,
            supervisor: Option<&String>,
            fp: FailoverPolicy,
            readonly: bool,
        ) -> Result<Self> {
            // create the session with FUSE
            let session = FuseSession::new(mnt, "rafs", "", readonly).map_err(|e| eother!(e))?;
            let upgrade_mgr = supervisor
                .as_ref()
                .map(|s| Mutex::new(UpgradeManager::new(s.to_string().into())));
            Ok(FusedevFsService {
                vfs: vfs.clone(),
                conn: AtomicU64::new(0),
                failover_policy: fp,
                session: Mutex::new(session),
                server: Arc::new(Server::new(vfs)),
                upgrade_mgr,
                backend_collection: Default::default(),
                inflight_ops: Mutex::new(Vec::new()),
            })
        }
        ...
    }

(3) Create the Daemon instance:

    let daemon = Arc::new(FusedevDaemon {
        bti,
        id,
        supervisor,
        threads_cnt, // number of threads
        state: AtomicI32::new(DaemonState::INIT as i32),
        result_receiver: Mutex::new(result_receiver),
        request_sender: Arc::new(Mutex::new(trigger)),
        service: Arc::new(service),
        state_machine_thread: Mutex::new(None),
        fuse_service_threads: Mutex::new(Vec::new()),
    });

FusedevFsService::new() calls FuseSession::new() to create the session used to communicate with the kernel FUSE module; at this point nothing is mounted and no connection has been established yet.

FuseSession::new() comes from the external fuse-backend-rs[2] crate; the corresponding code is as follows:

The created session instance is stored in the session field of the FusedevFsService struct, wrapped in a Mutex so access is mutually exclusive.

The created service becomes the service field of the FusedevDaemon struct, wrapped in an Arc to allow shared access across threads.

2.5 The nydusd State Machine

machine is an instance of the DaemonStateMachineContext struct. It stores the daemon's PID, a pointer to the daemon instance, and the channels used to receive requests and return results for inter-thread communication.

    let machine = DaemonStateMachineContext::new(daemon.clone(), events_rx, result_sender);

The nydusd state machine maintains nydusd's state; the transition policy is as follows:

    state_machine! {
        derive(Debug, Clone)
        pub DaemonStateMachine(Init)

        // Init means nydusd has just started and may have been configured,
        // but it has not yet negotiated capabilities with the kernel, nor
        // tried to set up a fuse session by mounting /dev/fuse (for the
        // fusedev backend).
        Init => {
            Mount => Ready,
            Takeover => Ready[Restore],
            Stop => Die[StopStateMachine],
        },
        // Ready means nydusd is ready and the fuse session has been
        // created. The state can transition to Running or Die.
        Ready => {
            Start => Running[StartService],
            Stop => Die[Umount],
            Exit => Die[StopStateMachine],
        },
        // Running means nydusd has successfully prepared everything it
        // needs to serve as a userspace fuse filesystem; however, the
        // necessary capability negotiation, handled via fuse-rs, may not
        // have completed yet.
        Running => {
            Stop => Ready[TerminateService],
        },
    }
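Ignoring the output actions, the transition table above can be approximated by a plain match. This is a simplified sketch, not the code generated by the state_machine! macro:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum State { Init, Ready, Running, Die }

#[derive(Debug, Clone, Copy)]
enum Input { Mount, Takeover, Start, Stop, Exit }

// Simplified transition table; the real code also carries output actions
// such as StartService, Umount, and StopStateMachine.
fn next(state: State, input: Input) -> Option<State> {
    match (state, input) {
        (State::Init, Input::Mount) | (State::Init, Input::Takeover) => Some(State::Ready),
        (State::Init, Input::Stop) => Some(State::Die),
        (State::Ready, Input::Start) => Some(State::Running),
        (State::Ready, Input::Stop) | (State::Ready, Input::Exit) => Some(State::Die),
        (State::Running, Input::Stop) => Some(State::Ready),
        _ => None, // invalid transition
    }
}

fn main() {
    assert_eq!(next(State::Init, Input::Mount), Some(State::Ready));
    assert_eq!(next(State::Ready, Input::Start), Some(State::Running));
    assert_eq!(next(State::Running, Input::Stop), Some(State::Ready));
}
```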

The machine.kick_state_machine() method starts the state-machine thread.

    let machine_thread = machine.kick_state_machine()?;

The thread is named state_machine and can be seen with top -Hp NYDUSD_PID:

It is an infinite loop that receives messages from a channel. (Where are the messages sent from?)

    self.request_receiver.recv()

recv() blocks until a DaemonStateMachineInput message arrives and saves it in the event variable; self.sm.consume(&event) processes each event, performs the corresponding action, and updates the state to its new value.

After processing, a status message is sent back through the result_sender channel. (Who receives it?)

Then a log line is printed with the previous state, the new state, the input, and the output.

Log output about the state machine printed when nydusd starts:

Where do the messages received by the state-machine thread come from? To answer that we go back to where the channels were created:

The sending end paired with request_receiver is trigger, and the receiving end paired with result_sender is result_receiver; both are stored in the daemon:

    let daemon = Arc::new(FusedevDaemon {
        ...
        result_receiver: Mutex::new(result_receiver),
        request_sender: Arc::new(Mutex::new(trigger)),
        ...
    });

Both channels are used in the on_event function:

    impl DaemonStateMachineSubscriber for FusedevDaemon {
        fn on_event(&self, event: DaemonStateMachineInput) -> DaemonResult<()> {
            self.request_sender
                .lock()
                .unwrap()
                .send(event)
                .map_err(|e| DaemonError::Channel(format!("send {:?}", e)))?;

            self.result_receiver
                .lock()
                .expect("Not expect poisoned lock!")
                .recv()
                .map_err(|e| DaemonError::Channel(format!("recv {:?}", e)))?
        }
    }

Thus, the state_machine thread receives messages from nydusd over the channel and changes state accordingly. For example, for the stop operation:

2.5.1 FUSE: Starting the Service

As mentioned above, the state_machine thread changes nydusd's state. For the StartService event it runs d.start() and, on success, sets the daemon's state to RUNNING via set_state(DaemonState::RUNNING).

    let r = match action {
        Some(a) => match a {
            StartService => d.start().map(|r| {
                d.set_state(DaemonState::RUNNING);
                r
            }),
            ...
        },
        _ => Ok(()),
    };

Different daemon types implement d.start() differently. For FusedevDaemon, start() looks like this:

    fn start(&self) -> DaemonResult<()> {
        info!("start {} fuse servers", self.threads_cnt);
        for _ in 0..self.threads_cnt {
            let waker = DAEMON_CONTROLLER.alloc_waker();
            self.kick_one_server(waker)
                .map_err(|e| DaemonError::StartService(format!("{:?}", e)))?;
        }
        Ok(())
    }

This spawns as many threads as threads_cnt specifies. DAEMON_CONTROLLER.alloc_waker() merely clones a reference to DAEMON_CONTROLLER.waker:

    pub fn alloc_waker(&self) -> Arc<Waker> {
        self.waker.clone()
    }
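The cheapness of alloc_waker() comes from Arc semantics: cloning an Arc only bumps a reference count, as this small standalone example shows:

```rust
use std::sync::Arc;

fn main() {
    // Both handles point at the same allocation; cloning only increments
    // the strong reference count.
    let waker = Arc::new(String::from("waker"));
    let copy = Arc::clone(&waker);
    assert_eq!(Arc::strong_count(&waker), 2);
    assert!(Arc::ptr_eq(&waker, &copy));
}
```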

kick_one_server(waker) is a method of the FusedevDaemon struct:

    fn kick_one_server(&self, waker: Arc<Waker>) -> Result<()> {
        let mut s = self.service.create_fuse_server()?;
        let inflight_op = self.service.create_inflight_op();
        let thread = thread::Builder::new()
            .name("fuse_server".to_string())
            .spawn(move || {
                if let Err(err) = s.svc_loop(&inflight_op) {
                    warn!("fuse server exits with err: {:?}, exiting daemon", err);
                    if let Err(err) = waker.wake() {
                        error!("fail to exit daemon, error: {:?}", err);
                    }
                }
                // Notify the daemon controller that one working thread has exited.
                Ok(())
            })
            .map_err(DaemonError::ThreadSpawn)?;
        self.fuse_service_threads.lock().unwrap().push(thread);
        Ok(())
    }

The kick_one_server method spawns a thread named fuse_server; successfully started threads are stored in FusedevDaemon.fuse_service_threads.

2.5.2 The FUSE Server Threads (Handling FUSE Requests)

Before the thread is spawned, a fuse server and the inflight operations are created. create_fuse_server() is a method implemented on FusedevFsService:

    fn create_fuse_server(&self) -> Result<FuseServer> {
        FuseServer::new(self.server.clone(), self.session.lock().unwrap().deref())
    }

create_fuse_server() instantiates the server via FuseServer::new(). Of its arguments, self.server.clone() is a cloned reference to the server, and self.session.lock().unwrap().deref() is a dereference of the session; the return value is a FuseServer instance.

    fn new(server: Arc<Server<Arc<Vfs>>>, se: &FuseSession) -> Result<FuseServer> {
        let ch = se.new_channel().map_err(|e| eother!(e))?;
        Ok(FuseServer { server, ch })
    }

Before the FuseServer instance is created, a fuse channel is first created via FuseSession's new_channel() method and stored in the FuseServer instance.

FuseSession is a struct from fuse-backend-rs; its new_channel() method creates a new channel:

The FuseChannel::new() method:

create_inflight_op() is also a method implemented on FusedevFsService; the returned inflight_op is added to the struct's inflight_ops:

    fn create_inflight_op(&self) -> FuseOpWrapper {
        let inflight_op = FuseOpWrapper::default();
        // "Not expected poisoned lock"
        self.inflight_ops.lock().unwrap().push(inflight_op.clone());
        inflight_op
    }

FuseOpWrapper::default() initializes a FuseOpWrapper, which is then appended to self.inflight_ops.

Once the fuse server and the inflight operations have been created, the fuse_server thread is started. Its main processing logic is the s.svc_loop(&inflight_op) method:

    fn svc_loop(&mut self, metrics_hook: &dyn MetricsHook) -> Result<()> {
        // Given error EBADF, it means kernel has shut down this session.
        let _ebadf = Error::from_raw_os_error(libc::EBADF);
        loop {
            // fetch a FUSE request via the channel (epoll)
            if let Some((reader, writer)) = self.ch.get_request().map_err(|e| {
                warn!("get fuse request failed: {:?}", e);
                Error::from_raw_os_error(libc::EINVAL)
            })? {
                if let Err(e) =
                    self.server
                        .handle_message(reader, writer.into(), None, Some(metrics_hook))
                {
                    match e {
                        fuse_backend_rs::Error::EncodeMessage(_ebadf) => {
                            return Err(eio!("fuse session has been shut down"));
                        }
                        _ => {
                            error!("Handling fuse message, {}", DaemonError::ProcessQueue(e));
                            continue;
                        }
                    }
                }
            } else {
                info!("fuse server exits");
                break;
            }
        }
        Ok(())
    }

This is an infinite loop. self.ch.get_request() is a method of the FuseChannel struct in fuse-backend-rs; it fetches fuse requests from the fuse kernel module through the channel (communicating over the /dev/fuse file descriptor).

The returned reader and writer are passed to handle_message(), together with metrics_hook for collecting metrics. self.server.handle_message() handles each fuse request and is likewise a method implemented on Server in fuse-backend-rs:

fuse-backend-rs implements a method for each Opcode:

    let res = match in_header.opcode {
        x if x == Opcode::Lookup as u32 => self.lookup(ctx),
        x if x == Opcode::Forget as u32 => self.forget(ctx), // No reply.
        x if x == Opcode::Getattr as u32 => self.getattr(ctx),
        x if x == Opcode::Setattr as u32 => self.setattr(ctx),
        x if x == Opcode::Readlink as u32 => self.readlink(ctx),
        x if x == Opcode::Symlink as u32 => self.symlink(ctx),
        x if x == Opcode::Mknod as u32 => self.mknod(ctx),
        x if x == Opcode::Mkdir as u32 => self.mkdir(ctx),
        x if x == Opcode::Unlink as u32 => self.unlink(ctx),
        x if x == Opcode::Rmdir as u32 => self.rmdir(ctx),
        x if x == Opcode::Rename as u32 => self.rename(ctx),
        x if x == Opcode::Link as u32 => self.link(ctx),
        x if x == Opcode::Open as u32 => self.open(ctx),
        x if x == Opcode::Read as u32 => self.read(ctx),
        x if x == Opcode::Write as u32 => self.write(ctx),
        x if x == Opcode::Statfs as u32 => self.statfs(ctx),
        x if x == Opcode::Release as u32 => self.release(ctx),
        x if x == Opcode::Fsync as u32 => self.fsync(ctx),
        x if x == Opcode::Setxattr as u32 => self.setxattr(ctx),
        x if x == Opcode::Getxattr as u32 => self.getxattr(ctx),
        x if x == Opcode::Listxattr as u32 => self.listxattr(ctx),
        x if x == Opcode::Removexattr as u32 => self.removexattr(ctx),
        x if x == Opcode::Flush as u32 => self.flush(ctx),
        x if x == Opcode::Init as u32 => self.init(ctx),
        x if x == Opcode::Opendir as u32 => self.opendir(ctx),
        x if x == Opcode::Readdir as u32 => self.readdir(ctx),
        x if x == Opcode::Releasedir as u32 => self.releasedir(ctx),
        x if x == Opcode::Fsyncdir as u32 => self.fsyncdir(ctx),
        x if x == Opcode::Getlk as u32 => self.getlk(ctx),
        x if x == Opcode::Setlk as u32 => self.setlk(ctx),
        x if x == Opcode::Setlkw as u32 => self.setlkw(ctx),
        x if x == Opcode::Access as u32 => self.access(ctx),
        x if x == Opcode::Create as u32 => self.create(ctx),
        x if x == Opcode::Bmap as u32 => self.bmap(ctx),
        x if x == Opcode::Ioctl as u32 => self.ioctl(ctx),
        x if x == Opcode::Poll as u32 => self.poll(ctx),
        x if x == Opcode::NotifyReply as u32 => self.notify_reply(ctx),
        x if x == Opcode::BatchForget as u32 => self.batch_forget(ctx),
        x if x == Opcode::Fallocate as u32 => self.fallocate(ctx),
        x if x == Opcode::Readdirplus as u32 => self.readdirplus(ctx),
        x if x == Opcode::Rename2 as u32 => self.rename2(ctx),
        x if x == Opcode::Lseek as u32 => self.lseek(ctx),
        #[cfg(feature = "virtiofs")]
        x if x == Opcode::SetupMapping as u32 => self.setupmapping(ctx, vu_req),
        #[cfg(feature = "virtiofs")]
        x if x == Opcode::RemoveMapping as u32 => self.removemapping(ctx, vu_req),
        // Group requests which don't need a reply together
        x => match x {
            x if x == Opcode::Interrupt as u32 => {
                self.interrupt(ctx);
                Ok(0)
            }
            x if x == Opcode::Destroy as u32 => {
                self.destroy(ctx);
                Ok(0)
            }
            _ => ctx.reply_error(io::Error::from_raw_os_error(libc::ENOSYS)),
        },
    };

Each of those methods calls self.fs.xxx() to do the actual work; take mkdir as an example:

What does fs refer to? From the Server struct definition, fs is anything implementing the FileSystem + Sync traits:

    /// Fuse Server to handle requests from the Fuse client and vhost user master.
    pub struct Server<F: FileSystem + Sync> {
        fs: F,
        vers: ArcSwap<ServerVersion>,
    }

Remember when the FuseServer was created?

    struct FuseServer {
        server: Arc<Server<Arc<Vfs>>>,
        ch: FuseChannel,
    }

    impl FuseServer {
        fn new(server: Arc<Server<Arc<Vfs>>>, se: &FuseSession) -> Result<FuseServer> {
            let ch = se.new_channel().map_err(|e| eother!(e))?;
            Ok(FuseServer { server, ch })
        }
        ...
    }

In the FuseServer struct, server has type Arc<Server<Arc<Vfs>>>, where Server is the struct above; therefore fs has type Arc<Vfs>.

fuse-backend-rs implements the FileSystem trait for Vfs:

The fuse_server threads can be seen with top -Hp NYDUSD_PID:

Log output:

2.5.3 FUSE: Terminating the Service

When the state machine receives the TerminateService event, it first runs d.interrupt(), then waits for the threads to finish, and finally sets the state:

    TerminateService => {
        d.interrupt();
        let res = d.wait_service();
        if res.is_ok() {
            d.set_state(DaemonState::READY);
        }
        res
    }

The interrupt() method:

    fn interrupt(&self) {
        let session = self
            .service
            .session
            .lock()
            .expect("Not expect poisoned lock.");
        if let Err(e) = session.wake().map_err(DaemonError::SessionShutdown) {
            error!("stop fuse service thread failed: {:?}", e);
        }
    }

The wait_service() method:

    fn wait_service(&self) -> DaemonResult<()> {
        loop {
            let handle = self.fuse_service_threads.lock().unwrap().pop();
            if let Some(handle) = handle {
                handle
                    .join()
                    .map_err(|e| {
                        DaemonError::WaitDaemon(
                            *e.downcast::<Error>()
                                .unwrap_or_else(|e| Box::new(eother!(e))),
                        )
                    })?
                    .map_err(DaemonError::WaitDaemon)?;
            } else {
                // No more handles to wait
                break;
            }
        }
        Ok(())
    }

2.5.4 The FUSE Umount Operation

The Umount event behaves almost exactly like TerminateService, except that the connection to the fuse kernel module is closed before d.interrupt() runs:

    Umount => d.disconnect().map(|r| {
        // Always interrupt fuse service loop after shutdown connection to kernel.
        // In case that kernel does not really shutdown the session due to some reasons
        // causing service loop keep waiting of `/dev/fuse`.
        d.interrupt();
        d.wait_service()
            .unwrap_or_else(|e| error!("failed to wait service {}", e));
        // at least all fuse thread stopped, no matter what error each thread got
        d.set_state(DaemonState::STOPPED);
        r
    }),

The d.disconnect() method that closes the connection:

    fn disconnect(&self) -> DaemonResult<()> {
        self.service.disconnect()
    }

It ultimately calls the session.umount() method:

    fn disconnect(&self) -> DaemonResult<()> {
        let mut session = self.session.lock().expect("Not expect poisoned lock.");
        session.umount().map_err(DaemonError::SessionShutdown)?;
        session.wake().map_err(DaemonError::SessionShutdown)?;
        Ok(())
    }

The umount implementation in fuse-backend-rs:

    /// Destroy a fuse session.
    pub fn umount(&mut self) -> Result<()> {
        if let Some(file) = self.file.take() {
            if let Some(mountpoint) = self.mountpoint.to_str() {
                fuse_kern_umount(mountpoint, file)
            } else {
                Err(SessionFailure("invalid mountpoint".to_string()))
            }
        } else {
            Ok(())
        }
    }

In addition, there are the Restore and StopStateMachine events:

    Restore => {
        let res = d.restore();
        if res.is_ok() {
            d.set_state(DaemonState::READY);
        }
        res
    }
    StopStateMachine => {
        d.set_state(DaemonState::STOPPED);
        Ok(())
    }

When the daemon's state becomes STOPPED, the loop ends:

    if d.get_state() == DaemonState::STOPPED {
        break;
    }

That concludes the state machine.

Back in create_fuse_daemon: at this point the daemon object has been created and the state-machine thread has been started; the thread handle is stored in the daemon.

2.6 Mounting the FUSE Filesystem

If this is neither a live upgrade nor a failover, a mount request is issued to the FUSE kernel module:

    // 1. api_sock exists, but this is neither a live upgrade nor a failover
    // 2. api_sock does not exist
    if (api_sock.as_ref().is_some() && !upgrade && !is_crashed(&mnt, api_sock.as_ref().unwrap())?)
        || api_sock.is_none()
    {
        if let Some(cmd) = mount_cmd {
            daemon.service.mount(cmd)?;
        }
        daemon.service.session.lock().unwrap()
            .mount()
            .map_err(|e| eother!(e))?;
        daemon.on_event(DaemonStateMachineInput::Mount)
            .map_err(|e| eother!(e))?;
        daemon.on_event(DaemonStateMachineInput::Start)
            .map_err(|e| eother!(e))?;
        daemon.service.conn
            .store(calc_fuse_conn(mnt)?, Ordering::Relaxed);
    }

If mount_cmd is not None, the backend filesystem is mounted via daemon.service.mount(cmd):

    // NOTE: This method is not thread-safe, however, it is acceptable as
    // mount/umount/remount/restore_mount is invoked from single thread in FSM
    fn mount(&self, cmd: FsBackendMountCmd) -> DaemonResult<()> {
        if self.backend_from_mountpoint(&cmd.mountpoint)?.is_some() {
            return Err(DaemonError::AlreadyExists);
        }
        let backend = fs_backend_factory(&cmd)?;
        let index = self.get_vfs().mount(backend, &cmd.mountpoint)?;
        info!("{} filesystem mounted at {}", &cmd.fs_type, &cmd.mountpoint);
        self.backend_collection().add(&cmd.mountpoint, &cmd)?;

        // Add mounts opaque to UpgradeManager
        if let Some(mut mgr_guard) = self.upgrade_mgr() {
            upgrade::add_mounts_state(&mut mgr_guard, cmd, index)?;
        }
        Ok(())
    }

First, self.backend_from_mountpoint(&cmd.mountpoint) checks whether the given path is already mounted; if it is, an error is returned.

backend_from_mountpoint calls Vfs's get_rootfs method, which first resolves the given path to an inode and then checks whether that inode exists in the mountpoints HashMap:

    /// Get the mounted backend file system alongside the path if there's one.
    pub fn get_rootfs(&self, path: &str) -> VfsResult<Option<Arc<BackFileSystem>>> {
        // Serialize mount operations. Do not expect poisoned lock here.
        let _guard = self.lock.lock().unwrap();
        let inode = match self.root.path_walk(path).map_err(VfsError::PathWalk)? {
            Some(i) => i,
            None => return Ok(None),
        };

        if let Some(mnt) = self.mountpoints.load().get(&inode) {
            Ok(Some(self.get_fs_by_idx(mnt.fs_idx).map_err(|e| {
                VfsError::NotFound(format!("fs index {}, {:?}", mnt.fs_idx, e))
            })?))
        } else {
            // Pseudo fs dir inode exists, but that no backend is ever mounted
            // is a normal case.
            Ok(None)
        }
    }

Then fs_backend_factory(&cmd) produces the filesystem backend; its return value is a struct implementing the BackendFileSystem + Sync + Send traits.

Inside fs_backend_factory, the prefetch file list is validated first:

Then the backend is instantiated according to the fs_type passed in; two types are currently supported:

    pub enum FsBackendType {
        Rafs,
        PassthroughFs,
    }

2.6.1 Initializing the RAFS Backend

First, the config passed in via cmd is parsed, and from the given bootstrap file path a reader is opened for the filesystem metadata (read from the bootstrap) and bound to the bootstrap variable. Then a rafs instance is created from the configuration, the mount path, and the bootstrap reader:

    FsBackendType::Rafs => {
        let rafs_config = RafsConfig::from_str(cmd.config.as_str())?;
        let mut bootstrap = <dyn RafsIoRead>::from_file(&cmd.source)?;
        let mut rafs = Rafs::new(rafs_config, &cmd.mountpoint, &mut bootstrap)?;
        rafs.import(bootstrap, prefetch_files)?;
        info!("RAFS filesystem imported");
        Ok(Box::new(rafs))
    }

The rafs instance is created via Rafs::new(rafs_config, &cmd.mountpoint, &mut bootstrap).

First, the storage configuration storage_conf is prepared and a RafsSuper instance is created from the conf parameter. Creating the RafsSuper only initializes configuration, including the RafsMode (Direct or Cached). Then sb.load(r) loads the RAFS superblock from the bootstrap. RAFS v5 and v6 are loaded differently; the try_load_v6 method:

    pub(crate) fn try_load_v6(&mut self, r: &mut RafsIoReader) -> Result<bool> {
        let end = r.seek_to_end(0)?;
        r.seek_to_offset(0)?;

        // create a RafsV6SuperBlock instance
        let mut sb = RafsV6SuperBlock::new();
        // read the RAFS v6 superblock
        // offset 1024, length 128
        if sb.load(r).is_err() {
            return Ok(false);
        }
        if !sb.is_rafs_v6() {
            return Ok(false);
        }
        sb.validate(end)?;
        // fill in the RAFS superblock meta information
        self.meta.version = RAFS_SUPER_VERSION_V6;
        self.meta.magic = sb.magic();
        self.meta.meta_blkaddr = sb.s_meta_blkaddr;
        self.meta.root_nid = sb.s_root_nid;

        // create a RafsV6SuperBlockExt instance
        let mut ext_sb = RafsV6SuperBlockExt::new();
        // read the RAFS v6 extended superblock
        // offset 1024 + 128, length 256
        ext_sb.load(r)?;
        ext_sb.validate(end)?;
        // fill in the RAFS superblock meta information
        self.meta.chunk_size = ext_sb.chunk_size();
        self.meta.blob_table_offset = ext_sb.blob_table_offset();
        self.meta.blob_table_size = ext_sb.blob_table_size();
        self.meta.chunk_table_offset = ext_sb.chunk_table_offset();
        self.meta.chunk_table_size = ext_sb.chunk_table_size();
        self.meta.inodes_count = sb.inodes_count();
        self.meta.flags = RafsSuperFlags::from_bits(ext_sb.flags())
            .ok_or_else(|| einval!(format!("invalid super flags {:x}", ext_sb.flags())))?;
        info!("rafs superblock features: {}", self.meta.flags);

        // fill in the prefetch table information in the superblock meta
        self.meta.prefetch_table_entries = ext_sb.prefetch_table_size() / size_of::<u32>() as u32;
        self.meta.prefetch_table_offset = ext_sb.prefetch_table_offset();
        trace!(
            "prefetch table offset {} entries {} ",
            self.meta.prefetch_table_offset,
            self.meta.prefetch_table_entries
        );

        match self.mode {
            // in Direct mode, a DirectSuperBlockV6 instance must also be
            // created and loaded
            RafsMode::Direct => {
                let mut sb_v6 = DirectSuperBlockV6::new(&self.meta);
                sb_v6.load(r)?;
                self.superblock = Arc::new(sb_v6);
                Ok(true)
            }
            RafsMode::Cached => Err(enosys!("Rafs v6 does not support cached mode")),
        }
    }

After the RAFS superblock is loaded, the blob information is retrieved and the rafs instance is created:

    pub fn new(conf: RafsConfig, id: &str, r: &mut RafsIoReader) -> RafsResult<Self> {
        let storage_conf = Self::prepare_storage_conf(&conf)?;
        let mut sb = RafsSuper::new(&conf).map_err(RafsError::FillSuperblock)?;
        sb.load(r).map_err(RafsError::FillSuperblock)?;
        // after obtaining the super block, get the blob info (BlobInfo) from it
        let blob_infos = sb.superblock.get_blob_infos();
        // from the configuration and the blob info entries,
        // create a BlobDevice instance
        let device =
            BlobDevice::new(&storage_conf, &blob_infos).map_err(RafsError::CreateDevice)?;

        // create the rafs instance
        let rafs = Rafs {
            id: id.to_string(),
            device, // BlobDevice
            ios: metrics::FsIoStats::new(id),
            sb: Arc::new(sb),
            initialized: false, // not initialized yet
            digest_validate: conf.digest_validate,
            fs_prefetch: conf.fs_prefetch.enable, // prefetch support
            amplify_io: conf.amplify_io,
            prefetch_all: conf.fs_prefetch.prefetch_all,
            xattr_enabled: conf.enable_xattr, // enable xattr
            i_uid: geteuid().into(), // uid
            i_gid: getegid().into(), // gid
            i_time: SystemTime::now()
                .duration_since(SystemTime::UNIX_EPOCH)
                .unwrap()
                .as_secs(),
        };

        // Rafs v6 does must store chunk info into local file cache. So blob cache is required
        if rafs.metadata().is_v6() {
            if conf.device.cache.cache_type != "blobcache" {
                return Err(RafsError::Configure(
                    "Rafs v6 must have local blobcache configured".to_string(),
                ));
            }
            if conf.digest_validate {
                return Err(RafsError::Configure(
                    "Rafs v6 doesn't support integrity validation yet".to_string(),
                ));
            }
        }

        rafs.ios.toggle_files_recording(conf.iostats_files);
        rafs.ios.toggle_access_pattern(conf.access_pattern);
        rafs.ios
            .toggle_latest_read_files_recording(conf.latest_read_files);
        Ok(rafs)
    }

The layout of rafs filesystem metadata (v6, for example) within the bootstrap file is defined in detail in rafs/src/metadata/layout/v6.rs:

    /// EROFS metadata slot size.
    pub const EROFS_INODE_SLOT_SIZE: usize = 1 << EROFS_INODE_SLOT_BITS;
    /// EROFS logical block size.
    pub const EROFS_BLOCK_SIZE: u64 = 1u64 << EROFS_BLOCK_BITS;
    /// EROFS plain inode.
    pub const EROFS_INODE_FLAT_PLAIN: u16 = 0;
    /// EROFS inline inode.
    pub const EROFS_INODE_FLAT_INLINE: u16 = 2;
    /// EROFS chunked inode.
    pub const EROFS_INODE_CHUNK_BASED: u16 = 4;
    /// EROFS device table offset.
    pub const EROFS_DEVTABLE_OFFSET: u16 =
        EROFS_SUPER_OFFSET + EROFS_SUPER_BLOCK_SIZE + EROFS_EXT_SUPER_BLOCK_SIZE;

    pub const EROFS_I_VERSION_BIT: u16 = 0;
    pub const EROFS_I_VERSION_BITS: u16 = 1;
    pub const EROFS_I_DATALAYOUT_BITS: u16 = 3;

    // Offset of EROFS super block.
    pub const EROFS_SUPER_OFFSET: u16 = 1024;
    // Size of EROFS super block.
    pub const EROFS_SUPER_BLOCK_SIZE: u16 = 128;
    // Size of extended super block, used for rafs v6 specific fields
    const EROFS_EXT_SUPER_BLOCK_SIZE: u16 = 256;
    // Magic number for EROFS super block.
    const EROFS_SUPER_MAGIC_V1: u32 = 0xE0F5_E1E2;
    // Bits of EROFS logical block size.
    const EROFS_BLOCK_BITS: u8 = 12;
    // Bits of EROFS metadata slot size.
    const EROFS_INODE_SLOT_BITS: u8 = 5;
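From these constants, some layout offsets follow directly. The quick sanity check below (not nydus code) reproduces the offsets cited in the try_load_v6 comments: superblock at 1024 with length 128, extended superblock at 1152 with length 256:

```rust
// Derived EROFS/RAFS v6 layout values, computed from the constants above.
const EROFS_SUPER_OFFSET: u16 = 1024;
const EROFS_SUPER_BLOCK_SIZE: u16 = 128;
const EROFS_EXT_SUPER_BLOCK_SIZE: u16 = 256;
const EROFS_BLOCK_BITS: u8 = 12;
const EROFS_INODE_SLOT_BITS: u8 = 5;

fn main() {
    // logical block size and metadata slot size
    assert_eq!(1u64 << EROFS_BLOCK_BITS, 4096);
    assert_eq!(1usize << EROFS_INODE_SLOT_BITS, 32);
    // extended superblock immediately follows the 128-byte superblock
    let ext_sb_offset = EROFS_SUPER_OFFSET + EROFS_SUPER_BLOCK_SIZE;
    assert_eq!(ext_sb_offset, 1152);
    // device table follows the extended superblock
    let devtable_offset = ext_sb_offset + EROFS_EXT_SUPER_BLOCK_SIZE;
    assert_eq!(devtable_offset, 1408);
}
```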

After the rafs instance is created, rafs.import(bootstrap, prefetch_files) initializes it (importing the bootstrap and prefetch information):

    /// Import an rafs bootstrap to initialize the filesystem instance.
    pub fn import(
        &mut self,
        r: RafsIoReader,
        prefetch_files: Option<Vec<PathBuf>>,
    ) -> RafsResult<()> {
        if self.initialized {
            return Err(RafsError::AlreadyMounted);
        }
        if self.fs_prefetch {
            // Device should be ready before any prefetch.
            self.device.start_prefetch();
            self.prefetch(r, prefetch_files);
        }
        self.initialized = true;
        Ok(())
    }

Its main job is to start the prefetch thread. self.prefetch(r, prefetch_files) takes two arguments: r, the reader for the bootstrap file, and prefetch_files, the list of files to prefetch that was already read from the bootstrap:

    fn prefetch(&self, reader: RafsIoReader, prefetch_files: Option<Vec<PathBuf>>) {
        let sb = self.sb.clone();
        let device = self.device.clone();
        let prefetch_all = self.prefetch_all;
        let root_ino = self.root_ino();

        let _ = std::thread::spawn(move || {
            Self::do_prefetch(root_ino, reader, prefetch_files, prefetch_all, sb, device);
        });
    }

In do_prefetch, the device state for each blob is first set to allow prefetching, and then data is prefetched according to prefetch_files.


self.prefetch(r, prefetch_files)方法中,开启了预取线程:

  1. fn prefetch(&self, reader: RafsIoReader, prefetch_files: Option<Vec<PathBuf>>) {
  2. let sb = self.sb.clone();
  3. let device = self.device.clone();
  4. let prefetch_all = self.prefetch_all;
  5. let root_ino = self.root_ino();
  6. let _ = std::thread::spawn(move || {
  7. Self::do_prefetch(root_ino, reader, prefetch_files, prefetch_all, sb, device);
  8. });
  9. }

线程中运行do_prefetch方法,按 chunk 粒度进行预取:

fn do_prefetch(
    root_ino: u64,
    mut reader: RafsIoReader, // bootstrap 对应的 reader
    prefetch_files: Option<Vec<PathBuf>>,
    prefetch_all: bool,
    sb: Arc<RafsSuper>,
    device: BlobDevice,
) {
    // First do range based prefetch for rafs v6.
    if sb.meta.is_v6() {
        // 生成 BlobPrefetchRequest,按 chunk 为粒度的请求
        let mut prefetches = Vec::new();
        for blob in sb.superblock.get_blob_infos() {
            let sz = blob.prefetch_size();
            if sz > 0 {
                let mut offset = 0;
                while offset < sz {
                    // 按 chunk 为粒度生成请求
                    let len = cmp::min(sz - offset, RAFS_DEFAULT_CHUNK_SIZE);
                    prefetches.push(BlobPrefetchRequest {
                        blob_id: blob.blob_id().to_owned(),
                        offset,
                        len,
                    });
                    offset += len;
                }
            }
        }
        if !prefetches.is_empty() {
            // 通过 device 的 prefetch 进行预取
            device.prefetch(&[], &prefetches).unwrap_or_else(|e| {
                warn!("Prefetch error, {:?}", e);
            });
        }
    }

    let fetcher = |desc: &mut BlobIoVec, last: bool| {
        if desc.size() as u64 > RAFS_MAX_CHUNK_SIZE
            || desc.len() > 1024
            || (last && desc.size() > 0)
        {
            trace!(
                "fs prefetch: 0x{:x} bytes for {} descriptors",
                desc.size(),
                desc.len()
            );
            device.prefetch(&[desc], &[]).unwrap_or_else(|e| {
                warn!("Prefetch error, {:?}", e);
            });
            desc.reset();
        }
    };

    let mut ignore_prefetch_all = prefetch_files
        .as_ref()
        .map(|f| f.len() == 1 && f[0].as_os_str() == "/")
        .unwrap_or(false);

    // Then do file based prefetch based on:
    // - prefetch list passed in by user
    // - or file prefetch list in metadata
    let inodes = prefetch_files.map(|files| Self::convert_file_list(&files, &sb));
    let res = sb.prefetch_files(&device, &mut reader, root_ino, inodes, &fetcher);
    match res {
        Ok(true) => ignore_prefetch_all = true,
        Ok(false) => {}
        Err(e) => info!("No file to be prefetched {:?}", e),
    }

    // Last optionally prefetch all data
    if prefetch_all && !ignore_prefetch_all {
        let root = vec![root_ino];
        let res = sb.prefetch_files(&device, &mut reader, root_ino, Some(root), &fetcher);
        if let Err(e) = res {
            info!("No file to be prefetched {:?}", e);
        }
    }
}

生成预取请求列表后,通过device的prefetch方法进行预取:

/// Try to prefetch specified blob data.
pub fn prefetch(
    &self,
    io_vecs: &[&BlobIoVec],
    prefetches: &[BlobPrefetchRequest],
) -> io::Result<()> {
    for idx in 0..prefetches.len() {
        // 根据 blob_id 获取 blob 信息
        if let Some(blob) = self.get_blob_by_id(&prefetches[idx].blob_id) {
            // 通过 blob 的 prefetch 方法进行预取
            let _ = blob.prefetch(blob.clone(), &prefetches[idx..idx + 1], &[]);
        }
    }
    for io_vec in io_vecs.iter() {
        if let Some(blob) = self.get_blob_by_iovec(io_vec) {
            // Prefetch errors are ignored.
            let _ = blob
                .prefetch(blob.clone(), &[], &io_vec.bi_vec)
                .map_err(|e| {
                    error!("failed to prefetch blob data, {}", e);
                });
        }
    }
    Ok(())
}

根据 blob_id获取 blob 后,调用prefetch方法:

fn prefetch(
    &self,
    blob_cache: Arc<dyn BlobCache>,
    prefetches: &[BlobPrefetchRequest],
    bios: &[BlobIoDesc],
) -> StorageResult<usize> {
    // Handle blob prefetch request first, it may help performance.
    for req in prefetches {
        // 生成异步预取请求消息
        let msg = AsyncPrefetchMessage::new_blob_prefetch(
            blob_cache.clone(),
            req.offset as u64,
            req.len as u64,
        );
        // 将请求消息通过 channel 传递给 worker
        let _ = self.workers.send_prefetch_message(msg);
    }

    // Then handle fs prefetch
    let max_comp_size = self.prefetch_batch_size();
    let mut bios = bios.to_vec();
    bios.sort_by_key(|entry| entry.chunkinfo.compressed_offset());
    self.metrics.prefetch_unmerged_chunks.add(bios.len() as u64);
    BlobIoMergeState::merge_and_issue(
        &bios,
        max_comp_size,
        max_comp_size as u64 >> RAFS_MERGING_SIZE_TO_GAP_SHIFT,
        |req: BlobIoRange| {
            // 生成异步预取请求消息
            let msg = AsyncPrefetchMessage::new_fs_prefetch(blob_cache.clone(), req);
            let _ = self.workers.send_prefetch_message(msg);
        },
    );
    Ok(0)
}
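BlobIoMergeState::merge_and_issue 的核心思想是:把按 compressed_offset 排序后的相邻请求合并成更大的区间,以减少对 backend 的请求次数。下面是一个极简的示意实现(与 nydus 实际代码无关,阈值语义为假设):

```rust
// 把已按起始偏移排序的 (offset, len) 请求合并:
// 若下一个请求与当前区间末尾的间隙不超过 max_gap,
// 且合并后的区间不超过 max_size,则并入当前区间。
fn merge_ranges(reqs: &[(u64, u64)], max_size: u64, max_gap: u64) -> Vec<(u64, u64)> {
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for &(off, len) in reqs {
        if let Some(last) = merged.last_mut() {
            let last_end = last.0 + last.1;
            if off >= last_end
                && off - last_end <= max_gap
                && (off + len - last.0) <= max_size
            {
                last.1 = off + len - last.0; // 扩展当前区间
                continue;
            }
        }
        merged.push((off, len));
    }
    merged
}

fn main() {
    // 0..100 与 120..180 的间隙为 20 <= 32,可合并;800 处间隙过大,单独成段
    let reqs = [(0, 100), (120, 60), (800, 50)];
    let out = merge_ranges(&reqs, 4096, 32);
    assert_eq!(out, vec![(0, 180), (800, 50)]);
    println!("{:?}", out);
}
```

允许一定的「间隙」是典型的读放大换请求数的取舍:多读一点无用数据,换取一次更大的顺序读。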

接收预取消息并进行处理的函数:

async fn handle_prefetch_requests(mgr: Arc<AsyncWorkerMgr>, rt: &Runtime) {
    // Max 1 active requests per thread.
    mgr.prefetch_sema.add_permits(1);
    while let Ok(msg) = mgr.prefetch_channel.recv().await {
        mgr.handle_prefetch_rate_limit(&msg).await;
        let mgr2 = mgr.clone();
        match msg {
            AsyncPrefetchMessage::BlobPrefetch(blob_cache, offset, size) => {
                let token = Semaphore::acquire_owned(mgr2.prefetch_sema.clone())
                    .await
                    .unwrap();
                if blob_cache.is_prefetch_active() {
                    rt.spawn_blocking(move || {
                        let _ = Self::handle_blob_prefetch_request(
                            mgr2.clone(),
                            blob_cache,
                            offset,
                            size,
                        );
                        drop(token);
                    });
                }
            }
            AsyncPrefetchMessage::FsPrefetch(blob_cache, req) => {
                let token = Semaphore::acquire_owned(mgr2.prefetch_sema.clone())
                    .await
                    .unwrap();
                if blob_cache.is_prefetch_active() {
                    rt.spawn_blocking(move || {
                        let _ = Self::handle_fs_prefetch_request(mgr2.clone(), blob_cache, req);
                        drop(token)
                    });
                }
            }
            AsyncPrefetchMessage::Ping => {
                let _ = mgr.ping_requests.fetch_add(1, Ordering::Relaxed);
            }
            AsyncPrefetchMessage::RateLimiter(_size) => {}
        }
        mgr.prefetch_inflight.fetch_sub(1, Ordering::Relaxed);
    }
}

目前,有两种预取的方法:Blob 模式和 Fs 模式。

(1) Blob 模式预取

对应的处理函数为handle_blob_prefetch_request

fn handle_blob_prefetch_request(
    mgr: Arc<AsyncWorkerMgr>,
    cache: Arc<dyn BlobCache>,
    offset: u64,
    size: u64,
) -> Result<()> {
    trace!(
        "storage: prefetch blob {} offset {} size {}",
        cache.blob_id(),
        offset,
        size
    );
    if size == 0 {
        return Ok(());
    }

    // 获取 blob object
    if let Some(obj) = cache.get_blob_object() {
        // 获取 (offset, offset + size) 范围内的内容
        if let Err(e) = obj.fetch_range_compressed(offset, size) {
            warn!(
                "storage: failed to prefetch data from blob {}, offset {}, size {}, {}, will try resend",
                cache.blob_id(),
                offset,
                size,
                e
            );
            ASYNC_RUNTIME.spawn(async move {
                let mut interval = interval(Duration::from_secs(1));
                interval.tick().await;
                // 如果失败,重新发起预取消息
                let msg = AsyncPrefetchMessage::new_blob_prefetch(cache.clone(), offset, size);
                let _ = mgr.send_prefetch_message(msg);
            });
        }
    } else {
        warn!("prefetch blob range is not supported");
    }
    Ok(())
}

其中,主要的处理函数为obj.fetch_range_compressed(offset, size)

fn fetch_range_compressed(&self, offset: u64, size: u64) -> Result<()> {
    let meta = self.meta.as_ref().ok_or_else(|| einval!())?;
    let meta = meta.get_blob_meta().ok_or_else(|| einval!())?;
    let mut chunks = meta.get_chunks_compressed(offset, size, self.prefetch_batch_size())?;
    if let Some(meta) = self.get_blob_meta_info()? {
        chunks = self.strip_ready_chunks(meta, None, chunks);
    }
    if chunks.is_empty() {
        Ok(())
    } else {
        self.do_fetch_chunks(&chunks, true)
    }
}

meta.get_chunks_compressed方法用于获取包含(offset, offset + size)范围的chunk列表:

pub fn get_chunks_compressed(
    &self,
    start: u64,
    size: u64,
    batch_size: u64,
) -> Result<Vec<Arc<dyn BlobChunkInfo>>> {
    let end = start.checked_add(size).ok_or_else(|| {
        einval!(einval!(format!(
            "get_chunks_compressed: invalid start {}/size {}",
            start, size
        )))
    })?;
    if end > self.state.compressed_size {
        return Err(einval!(format!(
            "get_chunks_compressed: invalid end {}/compressed_size {}",
            end, self.state.compressed_size
        )));
    }
    let batch_end = if batch_size <= size {
        end
    } else {
        std::cmp::min(
            start.checked_add(batch_size).unwrap_or(end),
            self.state.compressed_size,
        )
    };
    self.state
        .get_chunks_compressed(start, end, batch_end, batch_size)
}
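其中 batch_end 的取值逻辑可以单独抽出来验证:当 batch_size 大于本次请求的 size 时,预取窗口会扩大到 batch_size(但不越过 blob 的 compressed_size),以减少小请求的数量。下面是一个等价的独立示意(不做溢出处理之外的校验,compressed_size 为假设值):

```rust
// 计算预取窗口的结束位置:
// - batch_size <= size 时,只取请求本身的范围 [start, start+size)
// - 否则把窗口扩大到 batch_size,但不能越过 blob 的压缩数据末尾
fn batch_end(start: u64, size: u64, batch_size: u64, compressed_size: u64) -> u64 {
    let end = start + size;
    if batch_size <= size {
        end
    } else {
        std::cmp::min(start.checked_add(batch_size).unwrap_or(end), compressed_size)
    }
}

fn main() {
    // 请求 4KB,batch 1MB,blob 共 16MB:窗口扩大到 1MB
    assert_eq!(batch_end(0, 4096, 1 << 20, 16 << 20), 1 << 20);
    // 请求本身已大于 batch:窗口即请求末尾
    assert_eq!(batch_end(0, 2 << 20, 1 << 20, 16 << 20), 2 << 20);
    // 窗口不能越过 blob 末尾
    assert_eq!(batch_end(15 << 20, 4096, 2 << 20, 16 << 20), 16 << 20);
    println!("ok");
}
```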

BlobMetaChunkArray::V2版本的self.state.get_chunks_compressed方法实际的处理函数内容如下:

fn _get_chunks_compressed<T: BlobMetaChunkInfo>(
    state: &Arc<BlobMetaState>,
    chunk_info_array: &[T],
    start: u64,
    end: u64,
    batch_end: u64,
    batch_size: u64,
) -> Result<Vec<Arc<dyn BlobChunkInfo>>> {
    let mut vec = Vec::with_capacity(512);
    let mut index = Self::_get_chunk_index_nocheck(chunk_info_array, start, true)?;
    let entry = Self::get_chunk_entry(state, chunk_info_array, index)?;

    // Special handling of ZRan chunks
    if entry.is_zran() {
        let zran_index = entry.get_zran_index();
        let pos = state.zran_info_array[zran_index as usize].in_offset();
        let mut zran_last = zran_index;

        while index > 0 {
            let entry = Self::get_chunk_entry(state, chunk_info_array, index - 1)?;
            if !entry.is_zran() {
                return Err(einval!(
                    "inconsistent ZRan and non-ZRan chunk information entries"
                ));
            } else if entry.get_zran_index() != zran_index {
                // reach the header chunk associated with the same ZRan context.
                break;
            } else {
                index -= 1;
            }
        }

        let mut vec = Vec::with_capacity(128);
        for entry in &chunk_info_array[index..] {
            entry.validate(state)?;
            if !entry.is_zran() {
                return Err(einval!(
                    "inconsistent ZRan and non-ZRan chunk information entries"
                ));
            }
            if entry.get_zran_index() != zran_last {
                let ctx = &state.zran_info_array[entry.get_zran_index() as usize];
                if ctx.in_offset() + ctx.in_size() as u64 - pos > batch_size
                    && entry.compressed_offset() > end
                {
                    return Ok(vec);
                }
                zran_last = entry.get_zran_index();
            }
            vec.push(BlobMetaChunk::new(index, state));
        }
        return Ok(vec);
    }

    vec.push(BlobMetaChunk::new(index, state));
    let mut last_end = entry.compressed_end();
    if last_end >= batch_end {
        Ok(vec)
    } else {
        while index + 1 < chunk_info_array.len() {
            index += 1;
            let entry = Self::get_chunk_entry(state, chunk_info_array, index)?;
            // Avoid read amplify if next chunk is too big.
            if last_end >= end && entry.compressed_end() > batch_end {
                return Ok(vec);
            }
            vec.push(BlobMetaChunk::new(index, state));
            last_end = entry.compressed_end();
            if last_end >= batch_end {
                return Ok(vec);
            }
        }
        Err(einval!(format!(
            "entry not found index {} chunk_info_array.len {}",
            index,
            chunk_info_array.len(),
        )))
    }
}

获取包含的chunks之后,通过self.strip_ready_chunks方法剔除其中已经处于 ready 状态(本地已有数据)的chunk,避免重复预取:

fn strip_ready_chunks(
    &self,
    meta: Arc<BlobMetaInfo>,
    old_chunks: Option<&[Arc<dyn BlobChunkInfo>]>,
    mut extended_chunks: Vec<Arc<dyn BlobChunkInfo>>,
) -> Vec<Arc<dyn BlobChunkInfo>> {
    if self.is_zran {
        let mut set = HashSet::new();
        for c in extended_chunks.iter() {
            if !matches!(self.chunk_map.is_ready(c.as_ref()), Ok(true)) {
                set.insert(meta.get_zran_index(c.id()));
            }
        }

        let first = old_chunks.as_ref().map(|v| v[0].id()).unwrap_or(u32::MAX);
        let mut start = 0;
        while start < extended_chunks.len() {
            let id = extended_chunks[start].id();
            if id == first || set.contains(&meta.get_zran_index(id)) {
                break;
            }
            start += 1;
        }

        let last = old_chunks
            .as_ref()
            .map(|v| v[v.len() - 1].id())
            .unwrap_or(u32::MAX);
        let mut end = extended_chunks.len() - 1;
        while end > start {
            let id = extended_chunks[end].id();
            if id == last || set.contains(&meta.get_zran_index(id)) {
                break;
            }
            end -= 1;
        }

        assert!(end >= start);
        if start == 0 && end == extended_chunks.len() - 1 {
            extended_chunks
        } else {
            extended_chunks[start..=end].to_vec()
        }
    } else {
        while !extended_chunks.is_empty() {
            let chunk = &extended_chunks[extended_chunks.len() - 1];
            if matches!(self.chunk_map.is_ready(chunk.as_ref()), Ok(true)) {
                extended_chunks.pop();
            } else {
                break;
            }
        }
        extended_chunks
    }
}

然后,通过self.do_fetch_chunks(&chunks, true)方法获取chunks的数据:

fn do_fetch_chunks(&self, chunks: &[Arc<dyn BlobChunkInfo>], prefetch: bool) -> Result<()> {
    // Validate input parameters.
    assert!(!chunks.is_empty());
    if chunks.len() > 1 {
        for idx in 0..chunks.len() - 1 {
            assert_eq!(chunks[idx].id() + 1, chunks[idx + 1].id());
        }
    }

    // Get chunks not ready yet, also marking them as in-flight.
    let bitmap = self
        .chunk_map
        .as_range_map()
        .ok_or_else(|| einval!("invalid chunk_map for do_fetch_chunks()"))?;
    let chunk_index = chunks[0].id();
    let count = chunks.len() as u32;
    let pending = match bitmap.check_range_ready_and_mark_pending(chunk_index, count)? {
        None => return Ok(()),
        Some(v) => v,
    };

    let mut status = vec![false; count as usize];
    let (start_idx, end_idx) = if self.is_zran {
        for chunk_id in pending.iter() {
            status[(*chunk_id - chunk_index) as usize] = true;
        }
        (0, pending.len())
    } else {
        let mut start = u32::MAX;
        let mut end = 0;
        for chunk_id in pending.iter() {
            status[(*chunk_id - chunk_index) as usize] = true;
            start = std::cmp::min(*chunk_id - chunk_index, start);
            end = std::cmp::max(*chunk_id - chunk_index, end);
        }
        if end < start {
            return Ok(());
        }
        (start as usize, end as usize)
    };

    let start_chunk = &chunks[start_idx];
    let end_chunk = &chunks[end_idx];
    let (blob_offset, blob_end, blob_size) =
        self.get_blob_range(&chunks[start_idx..=end_idx])?;
    trace!(
        "fetch data range {:x}-{:x} for chunk {}-{} from blob {:x}",
        blob_offset,
        blob_end,
        start_chunk.id(),
        end_chunk.id(),
        chunks[0].blob_index()
    );

    // 从 backend 读取数据
    match self.read_chunks_from_backend(
        blob_offset,
        blob_size,
        &chunks[start_idx..=end_idx],
        prefetch,
    ) {
        Ok(mut bufs) => {
            if self.is_compressed {
                let res =
                    Self::persist_cached_data(&self.file, blob_offset, bufs.compressed_buf());
                for idx in start_idx..=end_idx {
                    if status[idx] {
                        self.update_chunk_pending_status(chunks[idx].as_ref(), res.is_ok());
                    }
                }
            } else {
                for idx in start_idx..=end_idx {
                    let mut buf = match bufs.next() {
                        None => return Err(einval!("invalid chunk decompressed status")),
                        Some(Err(e)) => {
                            for idx in idx..=end_idx {
                                if status[idx] {
                                    bitmap.clear_range_pending(chunks[idx].id(), 1)
                                }
                            }
                            return Err(e);
                        }
                        Some(Ok(v)) => v,
                    };
                    if status[idx] {
                        if self.dio_enabled {
                            self.adjust_buffer_for_dio(&mut buf)
                        }
                        self.persist_chunk_data(chunks[idx].as_ref(), buf.as_ref());
                    }
                }
            }
        }
        Err(e) => {
            for idx in 0..chunks.len() {
                if status[idx] {
                    bitmap.clear_range_pending(chunks[idx].id(), 1)
                }
            }
            return Err(e);
        }
    }

    if !bitmap.wait_for_range_ready(chunk_index, count)? {
        if prefetch {
            return Err(eio!("failed to read data from storage backend"));
        }
        // if we are in on-demand path, retry for the timeout chunks
        for chunk in chunks {
            match self.chunk_map.check_ready_and_mark_pending(chunk.as_ref()) {
                Err(e) => return Err(eio!(format!("do_fetch_chunks failed, {:?}", e))),
                Ok(true) => {}
                Ok(false) => {
                    info!("retry for timeout chunk, {}", chunk.id());
                    let mut buf = alloc_buf(chunk.uncompressed_size() as usize);
                    self.read_chunk_from_backend(chunk.as_ref(), &mut buf)
                        .map_err(|e| {
                            self.update_chunk_pending_status(chunk.as_ref(), false);
                            eio!(format!("read_raw_chunk failed, {:?}", e))
                        })?;
                    if self.dio_enabled {
                        self.adjust_buffer_for_dio(&mut buf)
                    }
                    self.persist_chunk_data(chunk.as_ref(), &buf);
                }
            }
        }
    }
    Ok(())
}

其中self.read_chunks_from_backend方法实现从 backend 读取数据:

fn read_chunks_from_backend<'a, 'b>(
    &'a self,
    blob_offset: u64,
    blob_size: usize,
    chunks: &'b [Arc<dyn BlobChunkInfo>],
    prefetch: bool,
) -> Result<ChunkDecompressState<'a, 'b>>
where
    Self: Sized,
{
    // Read requested data from the backend by altogether.
    let mut c_buf = alloc_buf(blob_size);
    let start = Instant::now();
    let nr_read = self
        .reader()
        .read(c_buf.as_mut_slice(), blob_offset)
        .map_err(|e| eio!(e))?;
    if nr_read != blob_size {
        return Err(eio!(format!(
            "request for {} bytes but got {} bytes",
            blob_size, nr_read
        )));
    }
    let duration = Instant::now().duration_since(start).as_millis();
    debug!(
        "read_chunks_from_backend: {} {} {} bytes at {}, duration {}ms",
        std::thread::current().name().unwrap_or_default(),
        if prefetch { "prefetch" } else { "fetch" },
        blob_size,
        blob_offset,
        duration
    );
    let chunks = chunks.iter().map(|v| v.as_ref()).collect();
    Ok(ChunkDecompressState::new(blob_offset, self, chunks, c_buf))
}

self.reader().read方法是对 backend 的抽象,每个请求失败后会重试retry_count次:

fn read(&self, buf: &mut [u8], offset: u64) -> BackendResult<usize> {
    let mut retry_count = self.retry_limit();
    let begin_time = self.metrics().begin();
    loop {
        match self.try_read(buf, offset) {
            Ok(size) => {
                self.metrics().end(&begin_time, buf.len(), false);
                return Ok(size);
            }
            Err(err) => {
                if retry_count > 0 {
                    warn!(
                        "Read from backend failed: {:?}, retry count {}",
                        err, retry_count
                    );
                    retry_count -= 1;
                } else {
                    self.metrics().end(&begin_time, buf.len(), true);
                    ERROR_HOLDER
                        .lock()
                        .unwrap()
                        .push(&format!("{:?}", err))
                        .unwrap_or_else(|_| error!("Failed when try to hold error"));
                    return Err(err);
                }
            }
        }
    }
}
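这种「失败后最多重试 N 次、成功立即返回」的模式可以抽象成一个独立的小例子(纯示意,与 nydus 的 trait 定义无关):

```rust
// 通用的带重试操作:op 失败时最多重试 retry_limit 次,
// 成功立即返回,重试额度耗尽后返回最后一次错误。
fn read_with_retry<T, E, F>(mut op: F, retry_limit: u32) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut retry_count = retry_limit;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                if retry_count > 0 {
                    retry_count -= 1; // 还有重试额度,继续
                } else {
                    return Err(e);
                }
            }
        }
    }
}

fn main() {
    // 模拟一个前两次失败、第三次成功的 backend
    let mut attempts = 0;
    let res: Result<u32, &str> = read_with_retry(
        || {
            attempts += 1;
            if attempts < 3 { Err("timeout") } else { Ok(42) }
        },
        5,
    );
    assert_eq!(res, Ok(42));
    assert_eq!(attempts, 3);

    // 重试额度不足时返回最后一次错误
    let mut n = 0;
    let res: Result<u32, &str> = read_with_retry(
        || {
            n += 1;
            Err("down")
        },
        2,
    );
    assert_eq!(res, Err("down"));
    assert_eq!(n, 3); // 1 次原始请求 + 2 次重试
    println!("ok");
}
```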

不同 backend 的try_read方法实现不同,目前,nydus分别实现了localfs、registry和OSS三种 backend。

(2) Fs 模式预取

对应的处理函数为handle_fs_prefetch_request

fn handle_fs_prefetch_request(
    mgr: Arc<AsyncWorkerMgr>,
    cache: Arc<dyn BlobCache>,
    req: BlobIoRange,
) -> Result<()> {
    let blob_offset = req.blob_offset;
    let blob_size = req.blob_size;
    trace!(
        "storage: prefetch fs data from blob {} offset {} size {}",
        cache.blob_id(),
        blob_offset,
        blob_size
    );
    if blob_size == 0 {
        return Ok(());
    }

    // Record how much prefetch data is requested from storage backend.
    // So the average backend merged request size will be prefetch_data_amount/prefetch_mr_count.
    // We can measure merging possibility by this.
    mgr.metrics.prefetch_mr_count.inc();
    mgr.metrics.prefetch_data_amount.add(blob_size);

    if let Some(obj) = cache.get_blob_object() {
        obj.prefetch_chunks(&req)?;
    } else {
        cache.prefetch_range(&req)?;
    }
    Ok(())
}

Fs 模式的预取分两种情况。(1)能获取到 blob object 时,调用prefetch_chunks:

fn prefetch_chunks(&self, range: &BlobIoRange) -> Result<()> {
    let chunks_extended;
    let mut chunks = &range.chunks;
    if let Some(v) = self.extend_pending_chunks(chunks, self.prefetch_batch_size())? {
        chunks_extended = v;
        chunks = &chunks_extended;
    }

    let mut start = 0;
    while start < chunks.len() {
        // Figure out the range with continuous chunk ids, be careful that `end` is inclusive.
        let mut end = start;
        while end < chunks.len() - 1 && chunks[end + 1].id() == chunks[end].id() + 1 {
            end += 1;
        }
        self.do_fetch_chunks(&chunks[start..=end], true)?;
        start = end + 1;
    }
    Ok(())
}

准备好chunks后,也是调用了do_fetch_chunks方法,和 Blob 模式相同。
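上面按连续 chunk id 切分区间的内层循环,可以抽成一个独立的示意实现来验证:

```rust
// 把一组(已排序的)chunk id 切分成若干段,每段内 id 连续递增。
// 与 prefetch_chunks 中按连续 id 分段、逐段 fetch 的逻辑一致。
fn split_continuous_runs(ids: &[u32]) -> Vec<Vec<u32>> {
    let mut runs = Vec::new();
    let mut start = 0;
    while start < ids.len() {
        let mut end = start;
        while end < ids.len() - 1 && ids[end + 1] == ids[end] + 1 {
            end += 1;
        }
        runs.push(ids[start..=end].to_vec());
        start = end + 1;
    }
    runs
}

fn main() {
    let runs = split_continuous_runs(&[3, 4, 5, 9, 10, 20]);
    // 连续段 [3,4,5]、[9,10] 与单独的 [20] 各成一段,分别发起一次 fetch
    assert_eq!(runs, vec![vec![3, 4, 5], vec![9, 10], vec![20]]);
    println!("{:?}", runs);
}
```

分段的意义在于 do_fetch_chunks 要求传入的 chunk id 严格连续(函数开头有 assert 校验),只有连续区间才能合并成一次对 backend 的顺序读。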

(2)如果没有 blob object,则使用cache.prefetch_range(&req)方法:

fn prefetch_range(&self, range: &BlobIoRange) -> Result<usize> {
    let mut pending = Vec::with_capacity(range.chunks.len());
    if !self.chunk_map.is_persist() {
        let mut d_size = 0;
        for c in range.chunks.iter() {
            d_size = std::cmp::max(d_size, c.uncompressed_size() as usize);
        }
        let mut buf = alloc_buf(d_size);

        for c in range.chunks.iter() {
            if let Ok(true) = self.chunk_map.check_ready_and_mark_pending(c.as_ref()) {
                // The chunk is ready, so skip it.
                continue;
            }
            // For digested chunk map, we must check whether the cached data is valid because
            // the digested chunk map cannot persist readiness state.
            let d_size = c.uncompressed_size() as usize;
            match self.read_file_cache(c.as_ref(), &mut buf[0..d_size]) {
                // The cached data is valid, set the chunk as ready.
                Ok(_v) => self.update_chunk_pending_status(c.as_ref(), true),
                // The cached data is invalid, queue the chunk for reading from backend.
                Err(_e) => pending.push(c.clone()),
            }
        }
    } else {
        for c in range.chunks.iter() {
            if let Ok(true) = self.chunk_map.check_ready_and_mark_pending(c.as_ref()) {
                // The chunk is ready, so skip it.
                continue;
            } else {
                pending.push(c.clone());
            }
        }
    }

    let mut total_size = 0;
    let mut start = 0;
    while start < pending.len() {
        // Figure out the range with continuous chunk ids, be careful that `end` is inclusive.
        let mut end = start;
        while end < pending.len() - 1 && pending[end + 1].id() == pending[end].id() + 1 {
            end += 1;
        }

        let (blob_offset, _blob_end, blob_size) = self.get_blob_range(&pending[start..=end])?;
        match self.read_chunks_from_backend(blob_offset, blob_size, &pending[start..=end], true)
        {
            Ok(mut bufs) => {
                total_size += blob_size;
                if self.is_compressed {
                    let res = Self::persist_cached_data(
                        &self.file,
                        blob_offset,
                        bufs.compressed_buf(),
                    );
                    for c in pending.iter().take(end + 1).skip(start) {
                        self.update_chunk_pending_status(c.as_ref(), res.is_ok());
                    }
                } else {
                    for idx in start..=end {
                        let buf = match bufs.next() {
                            None => return Err(einval!("invalid chunk decompressed status")),
                            Some(Err(e)) => {
                                for chunk in &mut pending[idx..=end] {
                                    self.update_chunk_pending_status(chunk.as_ref(), false);
                                }
                                return Err(e);
                            }
                            Some(Ok(v)) => v,
                        };
                        self.persist_chunk_data(pending[idx].as_ref(), &buf);
                    }
                }
            }
            Err(_e) => {
                // Clear the pending flag for all chunks in processing.
                for chunk in &mut pending[start..=end] {
                    self.update_chunk_pending_status(chunk.as_ref(), false);
                }
            }
        }
        start = end + 1;
    }
    Ok(total_size)
}

明确需要获取的数据 range 后,直接调用read_chunks_from_backend从 backend 读取内容。

2.6.2 初始化 PassthroughFs backend

创建 fs 配置信息实例,根据配置信息创建 PassthroughFs 实例:

let fs_cfg = Config {
    root_dir: cmd.source.to_string(),
    do_import: false,
    writeback: true,
    no_open: true,
    xattr: true,
    ..Default::default()
};
// TODO: Passthrough Fs needs to enlarge rlimit against host. We can exploit `MountCmd`
// `config` field to pass such a configuration into here.
let passthrough_fs =
    PassthroughFs::<()>::new(fs_cfg).map_err(DaemonError::PassthroughFs)?;
passthrough_fs
    .import()
    .map_err(DaemonError::PassthroughFs)?;
info!("PassthroughFs imported");
Ok(Box::new(passthrough_fs))

创建 PassthroughFs 实例:

/// Create a Passthrough file system instance.
pub fn new(cfg: Config) -> io::Result<PassthroughFs<S>> {
    // Safe because this is a constant value and a valid C string.
    let proc_self_fd_cstr = unsafe { CStr::from_bytes_with_nul_unchecked(PROC_SELF_FD_CSTR) };
    // 打开 /proc/self/fd 文件
    let proc_self_fd = Self::open_file(
        libc::AT_FDCWD,
        proc_self_fd_cstr,
        libc::O_PATH | libc::O_NOFOLLOW | libc::O_CLOEXEC,
        0,
    )?;

    Ok(PassthroughFs {
        inode_map: InodeMap::new(),
        next_inode: AtomicU64::new(fuse::ROOT_ID + 1),
        handle_map: HandleMap::new(),
        next_handle: AtomicU64::new(1),
        mount_fds: MountFds::new(),
        proc_self_fd,
        writeback: AtomicBool::new(false),
        no_open: AtomicBool::new(false),
        no_opendir: AtomicBool::new(false),
        killpriv_v2: AtomicBool::new(false),
        no_readdir: AtomicBool::new(cfg.no_readdir),
        perfile_dax: AtomicBool::new(false),
        cfg,
        phantom: PhantomData,
    })
}

passthrough_fs.import() 初始化文件系统。

/// Initialize the Passthrough file system.
pub fn import(&self) -> io::Result<()> {
    let root = CString::new(self.cfg.root_dir.as_str()).expect("CString::new failed");

    let (file_or_handle, st, ids_altkey, handle_altkey) = Self::open_file_or_handle(
        self.cfg.inode_file_handles,
        libc::AT_FDCWD,
        &root,
        &self.mount_fds,
        |fd, flags, _mode| {
            let pathname = CString::new(format!("{}", fd))
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            Self::open_file(self.proc_self_fd.as_raw_fd(), &pathname, flags, 0)
        },
    )
    .map_err(|e| {
        error!("fuse: import: failed to get file or handle: {:?}", e);
        e
    })?;

    // Safe because this doesn't modify any memory and there is no need to check the return
    // value because this system call always succeeds. We need to clear the umask here because
    // we want the client to be able to set all the bits in the mode.
    unsafe { libc::umask(0o000) };

    // Not sure why the root inode gets a refcount of 2 but that's what libfuse does.
    self.inode_map.insert(
        fuse::ROOT_ID,
        InodeData::new(
            fuse::ROOT_ID,
            file_or_handle,
            2,
            ids_altkey,
            st.get_stat().st_mode,
        ),
        ids_altkey,
        handle_altkey,
    );
    Ok(())
}

初始化 backend 文件系统完成。

回到daemon.service.mount(cmd)方法。接下来,通过self.get_vfs().mount(backend, &cmd.mountpoint)方法挂载 backend 文件系统:

/// Mount a backend file system to path
pub fn mount(&self, fs: BackFileSystem, path: &str) -> VfsResult<VfsIndex> {
    let (entry, ino) = fs.mount().map_err(VfsError::Mount)?;
    if ino > VFS_MAX_INO {
        fs.destroy();
        return Err(VfsError::InodeIndex(format!(
            "Unsupported max inode number, requested {} supported {}",
            ino, VFS_MAX_INO
        )));
    }

    // Serialize mount operations. Do not expect poisoned lock here.
    let _guard = self.lock.lock().unwrap();
    if self.initialized() {
        let opts = self.opts.load().deref().out_opts;
        fs.init(opts).map_err(|e| {
            VfsError::Initialize(format!("Can't initialize with opts {:?}, {:?}", opts, e))
        })?;
    }
    let index = self.allocate_fs_idx().map_err(VfsError::FsIndex)?;
    self.insert_mount_locked(fs, entry, index, path)
        .map_err(VfsError::Mount)?;
    Ok(index)
}

首先,通过fs.mount()方法获取 backend 文件系统root inode的entry和支持的最大inode号,对于 RAFS:

impl BackendFileSystem for Rafs {
    fn mount(&self) -> Result<(Entry, u64)> {
        let root_inode = self.sb.get_inode(self.root_ino(), self.digest_validate)?;
        self.ios.new_file_counter(root_inode.ino());
        let e = self.get_inode_entry(root_inode);
        // e 为 root inode 的 entry,第二个参数是支持的最大 inode 值
        Ok((e, self.sb.get_max_ino()))
    }
    ...
}

然后,通过self.allocate_fs_idx()方法分配可用的index:

由于nydus通过index区分不同的pseudofs文件系统(具体来说,长度为 64 位的 inode 中前 8 位),因此,最多可以有 256 个pseudofs文件系统。

接下来,通过self.insert_mount_locked(fs, entry, index, path)方法挂载path,并且将index和新建pseudofsentry关联起来:

fn insert_mount_locked(
    &self,
    fs: BackFileSystem,
    mut entry: Entry,
    fs_idx: VfsIndex,
    path: &str,
) -> Result<()> {
    // The visibility of mountpoints and superblocks:
    // superblock should be committed first because it won't be accessed until
    // a lookup returns a cross mountpoint inode.
    let mut superblocks = self.superblocks.load().deref().deref().clone();
    let mut mountpoints = self.mountpoints.load().deref().deref().clone();
    // 挂载 path,得到 inode
    let inode = self.root.mount(path)?;
    let real_root_ino = entry.inode;
    // 根据 index 对 inode 进行 hash
    entry.inode = self.convert_inode(fs_idx, entry.inode)?;

    // 如果已经存在 mountpoint,先设置为 None
    // Over mount would invalidate previous superblock inodes.
    if let Some(mnt) = mountpoints.get(&inode) {
        superblocks[mnt.fs_idx as usize] = None;
    }
    superblocks[fs_idx as usize] = Some(Arc::new(fs));
    self.superblocks.store(Arc::new(superblocks));
    trace!("fs_idx {} inode {}", fs_idx, inode);

    let mountpoint = Arc::new(MountPointData {
        fs_idx,
        ino: real_root_ino,
        root_entry: entry,
        _path: path.to_string(),
    });
    // 将新的 mount 添加到 self.mountpoints
    mountpoints.insert(inode, mountpoint);
    self.mountpoints.store(Arc::new(mountpoints));
    Ok(())
}

其中,self.root.mount(path)方法负责创建pseudofs的路径节点:如果path对应的节点已经存在,则直接返回其 inode,否则逐级创建新节点:

// mount creates path walk nodes all the way from root
// to @path, and returns pseudo fs inode number for the path
pub fn mount(&self, mountpoint: &str) -> Result<u64> {
    let path = Path::new(mountpoint);
    if !path.has_root() {
        error!("pseudo fs mount failure: invalid mount path {}", mountpoint);
        return Err(Error::from_raw_os_error(libc::EINVAL));
    }

    let mut inodes = self.inodes.load();
    let mut inode = &self.root_inode;

    'outer: for component in path.components() {
        trace!("pseudo fs mount iterate {:?}", component.as_os_str());
        match component {
            Component::RootDir => continue,
            Component::CurDir => continue,
            Component::ParentDir => inode = inodes.get(&inode.parent).unwrap(),
            Component::Prefix(_) => {
                error!("unsupported path: {}", mountpoint);
                return Err(Error::from_raw_os_error(libc::EINVAL));
            }
            Component::Normal(path) => {
                let name = path.to_str().unwrap();
                // Optimistic check without lock.
                for child in inode.children.load().iter() {
                    if child.name == name {
                        inode = inodes.get(&child.ino).unwrap();
                        continue 'outer;
                    }
                }
                ...
                // 没找到对应 name 的 node,新建
                let new_node = self.create_inode(name, inode);
                inodes = self.inodes.load();
                inode = inodes.get(&new_node.ino).unwrap();
            }
        }
    }
    // Now we have all path components exist, return the last one
    Ok(inode.ino)
}

self.convert_inode(fs_idx, entry.inode)方法将 backend 文件系统的 inode 根据 index 进行偏移,避免不同文件系统的 inode 相互冲突:

    // 1. The root inode of the pseudo fs is not hashed.
    // 2. Since the index is always greater than 0, pseudo fs inodes are
    //    unaffected (they are hashed as well).
    // 3. All other inodes are hashed as (index << 56 | inode).
    fn convert_inode(&self, fs_idx: VfsIndex, inode: u64) -> Result<u64> {
        // Do not hash negative dentry
        if inode == 0 {
            return Ok(inode);
        }
        if inode > VFS_MAX_INO {
            return Err(Error::new(
                ErrorKind::Other,
                format!(
                    "Inode number {} too large, max supported {}",
                    inode, VFS_MAX_INO
                ),
            ));
        }
        let ino: u64 = ((fs_idx as u64) << VFS_INDEX_SHIFT) | inode;
        trace!(
            "fuse: vfs fs_idx {} inode {} fuse ino {:#x}",
            fs_idx,
            inode,
            ino
        );
        Ok(ino)
    }
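The packing scheme can be verified with a small standalone sketch. The constants below are assumptions mirroring the "index << 56" comment; the real VFS_INDEX_SHIFT and VFS_MAX_INO live in fuse-backend-rs:

```rust
// Assumption: VFS_INDEX_SHIFT = 56, as implied by the "index << 56" comment.
const VFS_INDEX_SHIFT: u64 = 56;
const VFS_MAX_INO: u64 = (1u64 << VFS_INDEX_SHIFT) - 1;

// Simplified convert_inode: pack the fs index into the top byte of the inode.
fn convert_inode(fs_idx: u8, inode: u64) -> Result<u64, String> {
    if inode == 0 {
        return Ok(0); // negative dentry, not hashed
    }
    if inode > VFS_MAX_INO {
        return Err(format!("inode {} too large", inode));
    }
    Ok(((fs_idx as u64) << VFS_INDEX_SHIFT) | inode)
}

fn main() {
    // fs index 1, inode 2 -> 0x0100_0000_0000_0002
    assert_eq!(convert_inode(1, 2).unwrap(), 0x0100_0000_0000_0002);
    // inodes too large for the low 56 bits are rejected
    assert!(convert_inode(1, u64::MAX).is_err());
    println!("ok");
}
```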

This completes mounting the backend filesystem.

After the filesystem backend (for example, the RAFS backend) has been prepared according to mount_cmd, the next step is to mount it via FUSE. The daemon.service.session.lock().unwrap().mount() call is a method of the FuseSession struct in fuse-backend-rs[2]:

Inside fuse_kern_mount, once the required arguments are prepared, the mount function from the nix crate is called, which ultimately invokes the mount function in libc.

Next, two events, Mount and Start, are sent to the state-machine thread; the state transitions are as follows:

When the state transitions to StartService, the d.start() method analyzed above is executed, and the state is finally changed to RUNNING:

    StartService => d.start().map(|r| {
        d.set_state(DaemonState::RUNNING);
        r
    }),
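The event flow can be illustrated with a tiny state machine. This is a standalone sketch of the idea, not nydusd's actual state-machine types:

```rust
// Illustrative only: a Mount event readies the daemon, a Start event
// (the StartService transition, i.e. d.start() succeeding) moves it to Running.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DaemonState { Init, Ready, Running }

#[derive(Debug, Clone, Copy)]
enum DaemonEvent { Mount, Start }

fn handle(state: DaemonState, event: DaemonEvent) -> DaemonState {
    match (state, event) {
        (DaemonState::Init, DaemonEvent::Mount) => DaemonState::Ready,
        (DaemonState::Ready, DaemonEvent::Start) => DaemonState::Running,
        (s, _) => s, // events that don't apply leave the state unchanged
    }
}

fn main() {
    let mut st = DaemonState::Init;
    for ev in [DaemonEvent::Mount, DaemonEvent::Start] {
        st = handle(st, ev);
    }
    assert_eq!(st, DaemonState::Running);
    println!("{:?}", st);
}
```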

While running, nydusd has 8 threads in total. So far, 6 of them have been started (the number of fuse_server threads is configurable); next, two more threads are started: nydus-http-server and api-server.

Finally, the major and minor numbers of the mount point are obtained and stored in the metadata.
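On Linux, major/minor numbers are packed into a dev_t. As a sketch of how they can be decoded, the helpers below reproduce the bit layout used by glibc's makedev()/major()/minor() macros (illustrative, not the nydusd code):

```rust
// dev_t packing per glibc: major occupies bits 8..20 and 32..44,
// minor occupies bits 0..8 and 12..32.
fn makedev(major: u64, minor: u64) -> u64 {
    ((major & 0xfff) << 8)
        | ((major & !0xfffu64) << 32)
        | (minor & 0xff)
        | ((minor & !0xffu64) << 12)
}

fn major(dev: u64) -> u64 {
    ((dev >> 8) & 0xfff) | ((dev >> 32) & 0xffff_f000)
}

fn minor(dev: u64) -> u64 {
    (dev & 0xff) | ((dev >> 12) & 0xffff_ff00)
}

fn main() {
    // Round-trip a device number like 253:3 (a typical device-mapper volume).
    let dev = makedev(253, 3);
    assert_eq!(major(dev), 253);
    assert_eq!(minor(dev), 3);
    println!("major={} minor={}", major(dev), minor(dev));
}
```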

After the create_fuse_daemon() method completes, on success it prints a log message like the following:

References

[1] nydus: https://github.com/dragonflyoss/image-service.git

[2] fuse-backend-rs: https://github.com/cloud-hypervisor/fuse-backend-rs
