Xv6-Lab-1
Operating System C Linux中文在下面👇
Introduction
xv6 is an educational operating system used in MIT’s course S6.081 on operating systems. It is a simplified OS that runs only on x86 architecture. This blog aims to document the process of completing Lab 1 and highlight some key concepts.
Sleep
The first task is to implement the sleep
command in xv6. This is not a system call but a bash command in xv6. The implementation involves calling the sleep()
system call.
This task is relatively simple but is noteworthy for revealing the system call process in xv6. The header file exposes the interface for the sleep()
system call. By looking at the sys_sleep()
function in the sysproc.c
file, one can see the specific implementation of the system call. The connection between the user-space function and the kernel function is established through a piece of RISC-V assembly:
.global sleep
sleep:
li a7, SYS_sleep
ecall
ret
This assembly code places the system call number SYS_sleep
into the a7
register, then invokes the ecall <system_call_num>
instruction to trigger the system call.
The sleep()
system call also involves the concept of ticks. Ticks are a crucial topic in OS. In xv6, ticks represent the interval between two clock interrupts. In Linux, ticks are set as a macro definition, often defaulting to 100, meaning the clock ticks 100 times per second, analogous to frequency in physics. When performing benchmarks, functions like times()
return time in ticks. Increasing the tick rate improves clock precision but also raises observation overhead. Thus, programmers need to choose an appropriate tick rate for their applications, commonly referred to as Software Clock. For high-precision observations, Windows uses the QPC (QueryPerformanceCounter) function, while Linux uses clock_gettime()
, with the former providing microsecond precision and the latter nanosecond precision.
Ping Pong
This exercise helped me understand the basic usage of pipes.
An easy way to remember that the input end comes first is that file descriptor 0 is standard input, and file descriptor 1 is standard output.
https://www.gnu.org/software/libc/manual/html_node/Creating-a-Pipe.html
The file descriptors generated by pipes follow the same rules as the default file descriptors provided by bash to each process it starts. In other words, the first file descriptor returned by a pipe is the input end, and the second is the output end. The default file descriptors for a process are 0, 1, and 2, representing standard input, standard output, and standard error output, respectively.
To ensure proper process scheduling, the parent process needs to sleep()
after writing data to allow the child process to execute first. Otherwise, the parent process might read the data meant for the child process. Besides using sleep()
, the parent process could theoretically use other IPC synchronization mechanisms. This is necessary in systems with lower scheduling frequencies.
“If no data is available, a read on a pipe waits for either data to be written or for all file descriptors referring to the write end to be closed;” (Cox et al., p. 16) (pdf)
Regarding how to close a pipe, the pipe system call provides a convenient IPC mechanism: the sender closes the write end, the receiver reads until EOF, then the receiver closes the read end, completing the exit. This operation requires the receiver to close the write end beforehand, ensuring that only the sending process has a reference to the sending end before it closes the write end. Otherwise, the receiver will block on the read
operation. It’s a good habit to close unnecessary file descriptors.
Additionally, we need to note that after fork()
, the child process copies the parent process’s file descriptor table and shares the file offset. Do not attempt to write data to the pipe from the child process, as this will cause data confusion:
“Although fork copies the file descriptor table, each underlying file offset is shared between parent and child. Consider this example:” (Cox et al., p. 15) (pdf)
Primes
This exercise can be seen as a supplement and advancement of the pipe exercise. In this exercise, the main process writes objects to be filtered into the pipe one by one, and the child process sieves out composite numbers, retaining primes in the child process context. This allows for determining if incoming numbers are prime. Child processes are created only when necessary, maximizing resource efficiency.
This exercise requires using recursive functions, and I learned how to let the compiler skip the recursive dead loop check. This technique involves adding __attribute__((noreturn))
to the function declaration. The experimental environment has strict compilation checks, and compilation fails without this annotation:
~/Doc/xv6-labs-2023 on main !2 make qemu ✔ base at 15:09:11
riscv64-unknown-elf-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -gdwarf-2 -DSOL_UTIL -DLAB_UTIL -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie -c -o user/primes.o user/primes.c
user/primes.c: In function 'child_routine':
user/primes.c:6:5: error: infinite recursion detected [-Werror=infinite-recursion]
6 | int child_routine(const int left_read_pipe, const int left_write_pipe)
| ^~~~~~~~~~~~~
user/primes.c:47:17: note: recursive call
47 | child_routine(right_pipes[0], right_pipes[1]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [user/primes.o] Error 1
Pipes are superior to files for two reasons:
- Pipes are in-memory buffers, theoretically faster.
- Pipes have no size limit, whereas files do.
- Pipes handle garbage collection automatically, while files often need manual deletion (except for those in
/tmp
). - Pipes allow producers and consumers to process data in parallel, which files do not.
Find
This exercise involves implementing the find
command, a file system exercise. The find
command searches for files in the file system. This exercise covers basic file system operations, such as reading directories, reading files, and comparing file names.
What exactly is a directory? According to the developer documentation:
In an ext4 filesystem, a directory is more or less a flat file that maps an arbitrary byte string (usually ASCII) to an inode number on the filesystem.
https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html
Through this experiment, I understood the Unix system’s dirent
structure and file name comparison. The dirent
structure represents a directory entry, containing the file name, file type, file inode, etc. It stands for Directory Entry. File name comparison is done using the strcmp()
function, a standard C library function for comparing two strings.
The find
command skips files with an inode value of 0 because these are unused directory entries and do not need to be displayed. I also learned about the new C memory manipulation function memmove()
, similar to strcpy()
.
The struct dirent
in real systems is much more complex than in xv6 because real systems need to support more functions, such as file permissions, file size, file creation time, etc. Real file directories are not necessarily stored linearly and may be tree-structured or mixed with hash tables and hash calculations to improve search efficiency. For instance, ext3 uses an htree hash tree as the data structure for file directories. According to the Linux developer documentation:
A linear array of directory entries isn’t great for performance, so a new feature was added to ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name.
https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html
Xargs
This is an exercise requiring meticulous pointer manipulation. The goal is to implement the xargs
command, which passes standard input content as arguments to another command. This involves pipes, file descriptor operations, process creation, etc.
The approach is to split the input string from standard input and assemble it with new arguments passed through argv
, then forward it to the child process using exec()
. Therefore, before forwarding, we need to allocate a sufficiently large space to store the possible arguments and commands.
How large should this space be? The experiment author did not specify clearly (the author’s description of MAXARG
is ambiguous). In theory, xv6 Bash has limits on the length and number of arguments. I found from documentation:
And as an additional limit since 2.6.23, one argument must not be longer than MAX_ARG_STRLEN (131072).
https://www.in-ulm.de/~mascheck/various/argmax/#maximum_number
Different systems have varying limits on the length and number of arguments. I eventually used char __argv[100][MAXARG]
for argument storage, allowing for 100 arguments, each up to MAXARG
bytes long.
Debugging
While working on the xargs
experiment, I encountered issues with the exec()
function failing. Using printf()
for kernel debugging was challenging due to the need to modify the exec()
implementation. I attempted to attach gdb for further debugging.
Kernel debugging is interesting because the kernel runs in a virtual environment or directly on hardware, meaning gdb cannot directly attach to the kernel process. Thus, remote debugging through gdb’s stub is required. The following explanation clarifies the basics of kernel debugging:
Remote debugging is a very important technique for kernel development in general: the basic idea is that the main debugger (GDB in this case) runs separately from the program being debugged (the xv6 kernel atop QEMU) - they could be on completely separate machines, in fact.
https://web.archive.org/web/20190308091152/http://zoo.cs.yale.edu:80/classes/cs422/2011/lec/l2-hw
Why are fork()
and exec()
separate?
Although fork()
and exec()
are often used together, they are two different system calls. The main reason is to facilitate I/O redirection in the parent process. Additionally, fork()
uses copy-on-write technology to avoid unnecessary memory copying.
简介
xv6是mit s6.081操作系统课程使用的教学操作系统,是一个简化的,只能在x86上运行的操作系统。本博客旨在记录完成Lab1的过程以及一些关键知识点。
Sleep
第一个任务是实现xv6的sleep
指令。这不是一个系统调用,而是xv6的bash指令。实现方式是调用sleep()
系统调用。
本任务相对简单,值得玩味的是该任务揭示了xv6的系统调用调用过程:头文件暴露系统调用sleep()
的接口。查看sysproc.c
文件中的sys_sleep()
函数,可以看到系统调用的具体实现。而连接用户态函数和内核函数的是一段RISC-V汇编:
.global sleep
sleep:
li a7, SYS_sleep
ecall
ret
这段汇编代码的作用是将系统调用号SYS_sleep
放入寄存器a7
,然后调用ecall <system_call_number>
指令,触发系统调用。
sleep()
系统调用还涉及ticks这一概念。ticks在OS中是一个绕不开的话题,在xv6中代表两次时钟中断之间的间隔。在Linux中,ticks会被设置为一个宏定义,往往默认值为100,这意味着时钟每秒钟滴答100下,可以理解为物理里的频率。在做基准测试时,当程序员使用诸如times()
这样的函数时,会返回以ticks为单位的时间。当ticks被调高时,时钟精度增加,但观测开销也随之水涨船高。所以程序员需要为自己的应用程序选择合适的ticks值,这一般被称为Software Clock。如果需要进行高精度观测,在Windows下往往会使用QPC(QueryPerformanceCounter)函数,而在Linux下会使用clock_gettime()
函数,前者提供微秒级别的精度,后者能提供纳秒级别精度。
Ping Pong
这个练习帮助我理解了管道的基本用法。
An easy way to remember that the input end comes first is that file descriptor 0 is standard input, and file descriptor 1 is standard output.
https://www.gnu.org/software/libc/manual/html_node/Creating-a-Pipe.html
管道产生的文件描述符和每个由bash启动的进程自带的默认文件描述符的规则保持了一致性。换句话说,管道返回的第一个文件描述符是输入端,第二个文件描述符是输出端。进程的默认文件描述符是0,1,2,分别代表标准输入,标准输出,标准错误输出。
为了保证进程调度的正确性,父进程在写入数据后需要sleep()
,让子进程先执行。否则会将本应该由子进程读的数据读走。父进程除了用sleep()
,理论上还可以用其他的IPC同步机制。在调度频率较低的系统中,这么做是必要的。
“If no data is available, a read on a pipe waits for either data to be written or for all file descriptors referring to the write end to be closed;” (Cox 等, p. 16) (pdf)
关于如何关闭管道,管道系统调用提供了一个便捷的IPC机制,即发送方关闭写端,接收方读到EOF,然后接收方关闭读端,这样就完成了退出。这一操作有一个前提条件,就是接收方需要提前关闭写端,这样在发送方关闭写端之前,只有发送进程对发送端有引用,否则会造成接收方阻塞在read
操作上。养成关闭不需要的文件描述符是一个好习惯。
此外,我们还需注意,fork()
完后,子进程将拷贝父进程的文件描述符表,且共享文件偏移量。不要试图从子进程中往管道中写数据,这样会造成数据混乱:
“Although fork copies the file descriptor table, each underlying file offset is shared between parent and child. Consider this example:” (Cox 等, p. 15) (pdf)
Primes
本练习可以看做是管道练习的补充和进阶。在这个练习中,主进程将需要筛选的对象逐个写入管道,子进程逐个筛去合数,将质数保存在子进程的上下文中,这样在下一个数到来的时候,可以进行判断其是否为质数。子进程只在必要的时候创建,这样便可以尽最大可能节约系统资源。
本练习需要使用递归函数,我学到了如何让编译器跳过函数的递归死循环检查。这个技巧是在函数声明的时候加上__attribute__((noreturn))
。实验环境设置了严格的编译检查,如果不加这个注释,编译无法通过:
~/Doc/xv6-labs-2023 on main !2 make qemu ✔ base at 15:09:11
riscv64-unknown-elf-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -gdwarf-2 -DSOL_UTIL -DLAB_UTIL -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie -c -o user/primes.o user/primes.c
user/primes.c: In function 'child_routine':
user/primes.c:6:5: error: infinite recursion detected [-Werror=infinite-recursion]
6 | int child_routine(const int left_read_pipe, const int left_write_pipe)
| ^~~~~~~~~~~~~
user/primes.c:47:17: note: recursive call
47 | child_routine(right_pipes[0], right_pipes[1]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make: *** [user/primes.o] Error 1
管道优于文件,其原因有二:
- 管道是内存中的缓冲区,理论上速度更快。
- 管道没有大小上限,文件有大小上限。
- 管道自动进行垃圾回收,文件往往需要手动删除(放在
/tmp
下除外) - 管道允许生产者和消费者并行处理数据,文件不允许。
Find
这个练习是一个文件系统练习,需要实现find
指令。find
指令的功能是在文件系统中查找文件。这个练习涉及到了文件系统的基本操作,如读取目录,读取文件,文件名比较等。
究竟什么是目录?引用开发者文档的解释:
In an ext4 filesystem, a directory is more or less a flat file that maps an arbitrary byte string (usually ASCII) to an inode number on the filesystem.
https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html
通过这个实验,我理解了Unix系统的dirent
结构体,以及文件名的比较。dirent
结构体是一个目录项,包含了文件名,文件类型,文件inode等信息。是Directory Entry的缩写。而文件名的比较是通过strcmp()
函数实现的,这个函数是C语言的标准库函数,用于比较两个字符串是否相等。
对于inode值为0的文件,find
指令会跳过。这是因为inode为0的文件是未使用的目录项,不需要显示。我还学到了新的C语言内存操作函数memmove()
,这个函数类似于strcpy()
。
真实系统的struct dirent
结构体比xv6中的要复杂得多,因为真实系统中的文件系统要支持更多的功能,比如文件权限,文件大小,文件创建时间等。实际的文件目录不一定是线性存储的,可能是树状结构,甚至掺杂了哈希表以及哈希计算以提高查找效率,比如ext3中就引入了htree哈希树作为表示文件目录的数据结构。参考Linux开发者文档:
A linear array of directory entries isn’t great for performance, so a new feature was added to ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name.
https://www.kernel.org/doc/html/latest/filesystems/ext4/directory.html
Xargs
不得不说这是一个需要细腻指针操作的练习。。。这个练习的目的是实现xargs
指令,这个指令的功能是将标准输入的内容作为参数传递给另一个指令。这个练习涉及到了管道,文件描述符的操作,进程的创建等。
本实验的思路是对从标准输入的字符串进行分割后和argv
传入的新参数进行组装,然后用exec()
转发给子进程执行。所以在转发前,我们需要开辟一块足够大的空间存放可能的参数和指令。
这块内存需要开辟多大?实验的作者并没有明确规定(实验作者对MAXARG
的描述充满歧义。。。),理论上xv6 Bash的参数长度和参数个数是有上限的。我查阅文档发现:
And as additional limit since 2.6.23, one argument must not be longer than MAX_ARG_STRLEN (131072).
https://www.in-ulm.de/~mascheck/various/argmax/#maximum_number
不同系统对于参数的长度和个数都有不同的限制。我最后使用char __argv[100][MAXARG];
作为参数的存储空间,这样可以存储100个参数,每个参数的长度为MAXARG
字节。
关于调试
我在做xargs
实验的时候,遇到了exec()
函数执行失败的问题。单纯使用printf()
对内核进行调试让我吃尽了苦头,因为修改exec()
函数的实现。我尝试attach gdb进行进一步调试。
调试内核是有趣的,因为内核需要运行在一个虚拟环境中,甚至直接运行在硬件之上,这意味着gdb无法直接attach到内核进程上。所以我们需要使用gdb的远程调试功能,通过gdb的stub来调试内核。下面这段话阐述了内核调试的基本原理:
Remote debugging is a very important technique for kernel development in general: the basic idea is that the main debugger (GDB in this case) runs separately from the program being debugged (the xv6 kernel atop QEMU) - they could be on completely separate machines, in fact.
https://web.archive.org/web/20190308091152/http://zoo.cs.yale.edu:80/classes/cs422/2011/lec/l2-hw
为什么fork()
和exec()
是分开的?
虽然fork()
和exec()
往往一起使用,但是它们是两个不同的系统调用。主要原因是方便在父进程中完成I/O重定向。此外,fork()
使用了copy-on-write技术,避免了无效的内存拷贝。