Development notes regarding VND. Original document by David van Moolenbroek.


DESIGN DECISIONS

As simple as the VND driver implementation looks, several important decisions
had to be made in the design process. These decisions are listed here.

Multiple instances instead of a single instance: The decision to spawn a
separate driver instance for each VND unit was not ideologically inspired, but
rather based on a practical issue. Namely, users may reasonably expect to be
able to set up a VND using a backing file that resides on a file system hosted
on another VND. If one single driver instance were to host both VND units, its
implementation would have to perform all its backcalls to VFS asynchronously,
so as to be able to process another incoming request that was initiated as part
of such an ongoing backcall. As of this writing, MINIX3 does not support any
form of asynchronous I/O, and even that would not be sufficient: the asynchrony
would have to extend even to the close(2) call that takes place during device
unconfiguration, as this call could trigger I/O to another VND device.
Ultimately, using one driver instance per VND unit avoids these complications
altogether, and makes nesting possible up to a maximum depth equal to the
number of VFS threads. Of course, this comes at the cost of having more VND
driver processes; in order to avoid this cost in the common case, driver
instances are dynamically started and stopped by vndconfig(8).

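The nesting argument above can be illustrated with a small model in C. This is
a sketch only, not part of the VND driver: the unit structure, its fields, and
request_completes() are hypothetical names, and the model assumes that every
driver instance handles one request at a time and that every nesting level
occupies one VFS thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INSTANCES 16

/* Hypothetical description of one VND unit. */
struct unit {
	int instance;     /* index of the driver instance serving this unit */
	int backing_unit; /* unit hosting the backing file's file system, or -1 */
};

/*
 * Model a request to unit 'u'.  Serving it requires a synchronous backcall
 * that may in turn generate a request to the unit hosting the backing file.
 * Return false if a needed instance is already busy (deadlock) or if the
 * nesting depth exceeds the number of VFS threads (a simplifying assumption).
 */
bool
request_completes(const struct unit *units, size_t nunits, int u,
	unsigned int vfs_threads)
{
	bool busy[MAX_INSTANCES] = { false };
	unsigned int depth = 0;

	while (u != -1) {
		assert(u >= 0 && (size_t)u < nunits);
		assert(units[u].instance >= 0 &&
		    units[u].instance < MAX_INSTANCES);

		if (++depth > vfs_threads)
			return false;	/* no VFS thread left for this level */
		if (busy[units[u].instance])
			return false;	/* instance blocked in its own backcall */

		busy[units[u].instance] = true;
		u = units[u].backing_unit;
	}
	return true;
}
```

With both units in one instance, a request to the outer unit can never
complete, because the instance is still busy when the nested request arrives.
With one instance per unit, the same request completes as long as enough VFS
threads are available.
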
copyfd(2) instead of openas(2): Compared to the NetBSD interface, the MINIX3
VND API requires that the user program configuring a device pass in a file
descriptor in the vnd_ioctl structure instead of a pointer to a path name.
While binary compatibility with NetBSD would be impossible anyway (MINIX3
cannot support pointers in IOCTL data structures), providing a path name buffer
would be closer to what NetBSD does. There are two reasons behind the choice to
pass in a file descriptor instead. First, performing an open(2)-like call as
a driver backcall is tricky in terms of avoiding deadlocks in VFS, since it
would by nature violate the VFS locking order. On top of that, special
provisions would have to be added to support opening a file in the context of
another process so that chrooted processes would be supported, for example.
In contrast, copying a file descriptor to a remote process is relatively easy
because there is only one potential deadlock case to cover - that of the given
file descriptor identifying the VFS filp object used to control the very same
device - and VFS need only implement a procedure that very much resembles
sending a file descriptor across a UNIX domain socket. Second, since passing a
file descriptor is effectively passing an object capability, it is easier to
improve the isolation of the VND drivers in the future, as described below.

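For comparison, the UNIX domain socket mechanism that the copyfd(2) procedure
is said to resemble looks roughly as follows. This is a sketch of standard
SCM_RIGHTS descriptor passing, not of the MINIX3 copyfd(2) implementation;
send_fd() and recv_fd() are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for exactly one file descriptor. */
union fd_ctl {
	struct cmsghdr hdr;	/* forces suitable alignment */
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor 'fd' over UNIX domain socket 'sock'; 0 on success. */
int
send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of regular data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Receive one descriptor from 'sock'; return it, or -1 on failure. */
int
recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with a new descriptor referring to the same open file,
without ever handling a path name. This is the capability-like property that
the second reason above relies on.
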
No separate control device: The driver uses the same minor (block) device for
configuration and for actual (whole-disk) I/O, instead of exposing a separate
device that exists only for the purpose of configuring the device. The reason
for this is that such a control device simply does not fit the NetBSD
opendisk(3) API. While MINIX3 may at some point implement support for NetBSD's
notion of raw devices, such raw devices are still expected to support I/O, and
that means they cannot be control-only. In this regard, it should be mentioned
that the entire VND infrastructure relies on block caches being invalidated
properly upon (un)configuration of VND units, and that such invalidation
(through the REQ_FLUSH file system request) is currently initiated only by
closing block devices. Support for configuration or I/O through character
devices would thus require more work on that side first. In any case, the
primary downside of not having a separate control device is that handling
access permissions on device open is a bit of a hack in order to keep the
MINIX3 userland happy.


FUTURE IMPROVEMENTS

Currently, the VND driver instances are run as root solely because the
copyfd(2) call requires root. Obviously, non-root user processes should never
be able to copy file descriptors from arbitrary processes, and thus, some
security check is required there. However, an access control list for VFS calls
would be a much better solution: in that case, VND driver processes can be
given exclusive rights to the use of the copyfd(2) call, while they can be
given a normal driver UID at the same time.

In MINIX3's dependability model, drivers are generally not considered to be
malicious. However, the VND case is interesting because it is possible to
isolate individual driver instances to the point of actual "least authority".
The copyfd(2) call currently allows any file descriptor to be copied, but it
would be possible to extend the scheme to let user processes (and vndconfig(8)
in particular) mark the file descriptors that may be the target of a copyfd(2)
call. One of several schemes may be implemented in VFS for this purpose. For
example, each process could be allowed to mark one of its file descriptors as
"copyable" using a new VFS call, and VFS would then allow copyfd(2) only on a
"copyable" file descriptor from a process blocked on a call to the driver that
invoked copyfd(2). This approach precludes hiding a VND driver behind another
driver such as RAID or FBD, but more sophisticated approaches can solve that as
well. Regardless of the scheme, the end result would be a situation where the
VND drivers are strictly limited to operating on the resources given to them.

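As an illustration, the example scheme above could be checked along the
following lines. This is a sketch under stated assumptions, not existing VFS
code: the proc_state structure, its fields, and copyfd_allowed() are
hypothetical, and endpoints are modeled as plain nonnegative integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PROCS  4	/* size of the (hypothetical) process table */
#define NO_FD     (-1)	/* process has not marked any descriptor */
#define NO_DRIVER (-1)	/* process is not blocked on any driver */

/* Hypothetical per-process state that VFS would have to track. */
struct proc_state {
	int copyable_fd;	/* the one descriptor marked "copyable" */
	int blocked_on;		/* driver this process is blocked on */
};

/*
 * Decide whether 'driver' may copy descriptor 'fd' from process 'proc'.
 * The copy is allowed only if the process marked exactly this descriptor
 * as copyable and is currently blocked on a call to this very driver.
 */
bool
copyfd_allowed(const struct proc_state *procs, int driver, int proc, int fd)
{
	assert(procs != NULL);

	if (proc < 0 || proc >= NR_PROCS || driver < 0 || fd < 0)
		return false;

	return procs[proc].copyable_fd == fd &&
	    procs[proc].blocked_on == driver;
}
```

A process that is blocked on an intermediate driver such as FBD fails the
blocked_on check for the VND driver, which is the limitation noted above.
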
Note that copyfd(2) was originally called dupfrom(2), and was then extended to
copy file descriptors *to* remote processes as well. The latter is not as
security-sensitive, but may have to be restricted in a similar way. If this is
not possible, copyfd(2) can always be split into multiple calls.