Enable block operations for more than 1 block #325
Comments
The following comments from #315 are also relevant to this discussion:
Regarding atomicity guarantees and the "ideal block size" at the solo5_block API level: I'd have to study the problem in depth, so I can't give an informed opinion right now. Concentrating instead on the core of the issue that could be addressed immediately, which is allowing block I/O in multiples of the block size per request: in order to do that, the following needs to be done:
Summary: we can enable I/O at more than one block, aligned to the block size, at the block layer, but the above points need to be resolved. Regarding atomicity guarantees, I don't think enabling this would make things any "worse" (or more ill-defined) than they are now. However, someone needs to do the above work. I don't have time for this in the foreseeable future, but I am happy to review patches.
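(For illustration only, here is a minimal sketch of what such a multi-block read could look like from the unikernel side. It assumes the solo5.h block API of `solo5_block_acquire()`/`solo5_block_read()`, and it assumes that `solo5_block_read()` would accept any size that is a multiple of `block_size`, which is exactly the change under discussion, not current behaviour. The device name "storage" and the 16-block request size are arbitrary placeholder values.)

```c
/* Sketch only: assumes solo5_block_read() accepts any request size that
 * is a multiple of info.block_size (the proposed change, not current
 * behaviour). "storage" and BLOCKS_PER_REQ are placeholder values. */
#include "solo5.h"

#define BLOCKS_PER_REQ 16

static void read_sequentially(void)
{
    solo5_handle_t handle;
    struct solo5_block_info info;
    static uint8_t buf[BLOCKS_PER_REQ * 512];     /* assumes 512-byte blocks */

    if (solo5_block_acquire("storage", &handle, &info) != SOLO5_R_OK)
        return;

    size_t req = BLOCKS_PER_REQ * info.block_size;
    if (req > sizeof buf)
        return;

    /* One hypercall (hvt) or syscall (spt) per BLOCKS_PER_REQ blocks,
     * instead of one per block. */
    for (solo5_off_t off = 0; off + req <= info.capacity; off += req) {
        if (solo5_block_read(handle, off, buf, req) != SOLO5_R_OK)
            break;
        /* ... consume buf ... */
    }
}
```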
I've thought about this a bit more, especially the implications for
cannot be expressed using libseccomp. However, we can express rules that verify:
So, if we were to define the "maximum I/O unit" as, say,
@ricarkol What do you think? What I'm essentially saying is that we punt on trying for a 100% correct solution for now. (See also my comments about hand-written BPF at #352 (comment).)
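(As a concrete illustration of what is and isn't expressible, here is a hedged libseccomp sketch; it is not spt's actual tender code. Since a cross-argument check like "offset + length <= capacity" cannot be expressed in a seccomp filter, the sketch instead pins pread64 to the block device's file descriptor, caps the length at a fixed maximum I/O unit, and caps the offset so that a maximum-sized request cannot run past the capacity. The 8 KiB unit and the function name are placeholders.)

```c
/* Sketch, not spt's actual code: the rules that *can* be expressed with
 * libseccomp, given that "offset + length <= capacity" itself cannot. */
#include <seccomp.h>
#include <stdint.h>

#define MAX_IO_UNIT (8 * 1024)   /* placeholder "maximum I/O unit" */

int install_block_filter(int blk_fd, uint64_t capacity)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return -1;

    /* Allow pread64(blk_fd, buf, len, off) only when
     *   len <= MAX_IO_UNIT  and  off <= capacity - MAX_IO_UNIT,
     * which together imply off + len <= capacity. Note this is exactly
     * the "last X kB become inaccessible" trade-off discussed below.
     * pwrite64 would get an identical rule. */
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(pread64), 3,
                              SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)blk_fd),
                              SCMP_A2(SCMP_CMP_LE, MAX_IO_UNIT),
                              SCMP_A3(SCMP_CMP_LE, capacity - MAX_IO_UNIT));
    if (rc == 0)
        rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}
```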
@mato I think that sounds like a reasonable solution. I'm not too fond of the idea of having to deal with BPF macro code here either. Some other semi-related points:
@ricarkol Making the accessible amount smaller works well for some cases (preventing the allocation of extra extents; preventing attempts to write past block device limits, which would result in an error anyway), but it works very poorly in others (particularly when the unikernel expects an 8K
I think the latter is less intuitive than the former and likely to result in more bug reports, even if this is clearly a trade-off; the proper solution would be to generate the correct BPF code that we all agree should be there.
I agree with @cfcs that "wasting" the last X kB seems like asking for trouble; it's very counter-intuitive. Also, I expect the seccomp issue to be fixed properly at some future point, so I would not want to then change the behaviour. Regarding the actual value of X (the maximum I/O unit): I think we should actually make it part of the API (by adding it to
Regarding the actual "maximum I/O unit" value for hvt and spt, any preferences? Based on your experiments, 4k or 8k seems reasonable.
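(A sketch of what surfacing that value through the API might look like, shown next to the existing fields of struct solo5_block_info; the max_io_unit field is purely hypothetical and does not exist in solo5.h today.)

```c
/* HYPOTHETICAL sketch only: illustrates making the "maximum I/O unit"
 * queryable by the unikernel instead of an implicit tender-side constant. */
struct solo5_block_info {
    solo5_off_t capacity;     /* Capacity of the block device, in bytes. */
    solo5_off_t block_size;   /* Minimum I/O unit (block size), in bytes. */
    solo5_off_t max_io_unit;  /* HYPOTHETICAL: largest size accepted by
                                 solo5_block_read()/_write(); a multiple of
                                 block_size, e.g. 4 KiB or 8 KiB. */
};
```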
@g2p Any comments on this, especially regarding an optimum "maximum I/O unit" size for wodan?
Wodan would ideally use a much larger I/O size of up to 4 MiB (the size of an SSD's erase block).
@g2p The thing is, as things stand today, all our APIs involve copying. So, ignoring the issues with seccomp, a large (e.g. 4 MiB) block size is not practical, as it would need at least that much fixed buffer space in the Solo5 layer (i.e. per unikernel). If you want to pack large numbers of unikernels on a single server, those numbers would quickly add up. Now, 512 bytes is obviously too small, which is why I suggested a compromise of 4k or 8k.
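(To put rough numbers on that, as a back-of-the-envelope calculation rather than figures from the thread: with a fixed per-unikernel I/O buffer, 1,000 unikernels at 4 MiB each would pin about 4 GiB of host memory for block buffers alone, whereas at 8 KiB each the total is about 8 MiB.)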
#528 improves the situation in that, at least, we now have an argument which allows us to specify how large the chunk is. However, as @mato pointed out, a question remains about the
I merged #528 because it addresses some performance issues with our access to block devices, but I will not consider this issue solved. Moreover, we should definitely go deeper on this now, because it breaks the conservative design of Solo5 a bit.
Hi, I think with #528 we're doing great. The seccomp rules set up allow pread64/pwrite64 only on the specific file descriptor (that was never questioned), with the length (ARG2) being block_basic.block_size and the offset (ARG3) being <= capacity - block_size. Thus we can never read beyond the end of the file (especially with the additional check in block_attach that the file size is actually a multiple of the block size).
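(For reference, a minimal sketch of the kind of attach-time check described above; the real block_attach code differs in structure and error handling, and BLOCK_SIZE is a placeholder.)

```c
/* Sketch of the attach-time sanity check described above. */
#include <sys/stat.h>
#include <stdint.h>

#define BLOCK_SIZE 512   /* placeholder block size */

static int check_backing_file(int fd, uint64_t *capacity)
{
    struct stat st;

    if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode))
        return -1;
    /* Refuse files whose size is zero or not a multiple of the block
     * size; otherwise the "offset <= capacity - block_size" seccomp
     * rule and the API's view of capacity would disagree. */
    if (st.st_size == 0 || (st.st_size % BLOCK_SIZE) != 0)
        return -1;
    *capacity = (uint64_t)st.st_size;
    return 0;
}
```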
This discussion started in PR #315
The issue is that the Solo5 interface is limited to one block at a time, and the block is too small. The problem with that is that it's inefficient, as it usually involves too many exits for hvt, or too many syscalls for spt. The following experiment quantifies this: the time in seconds to read a 1 GB file sequentially for an increasing block size, on both hvt and spt (*).
spt is already pretty fast at 512, but it can be 4x faster by increasing the block size to 8192. The point of this experiment is not necessarily to increase the block size, but to allow multi-block requests instead. Filesystems already try to perform I/O in larger blocks (usually 4096 bytes); it would be nice not to have to split them.
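(For a rough sense of scale, as back-of-the-envelope arithmetic rather than data from the experiment: reading 1 GB in 512-byte requests means roughly two million hypercalls or syscalls, while 8,192-byte requests cut that to roughly 131,000, a 16x reduction in the number of exits.)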
(*) Details of the experiment. This is the unikernel: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-solo5-blksize-test-c. This is how the block size was changed: https://gist.github.com/ricarkol/b66c899edd96fd7f8fb2fbaeabad0694#file-blk-size-diff. The host was Ubuntu 18.04.1 running on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz. The disk was stored as a file on an SSD formatted with ext4 (caches were not dropped before each experiment).